Renaming files with Asian characters

/dev/stderr · 07-19-2010, 01:58 AM

Hi,

I have a kind of strange problem that I haven't been able to resolve for a couple of days now so I thought I'd ask to see if anyone else had come across this.

I have a bunch of files that I need to rename, which is ordinarily a pretty easy task. The problem here is that the file names contain Chinese or Japanese characters (sorry for my ignorance, I can't tell the difference).

i.e. [$$$$$$$$].SOMETHING BLAH BLAH.ext

where the "$$$$" stand in for the Chinese characters.

The problem is that neither sed nor perl seems to handle the Chinese characters correctly, so a regular expression like 's/^[*.]//', which I would normally expect to work, doesn't.

From what I have read so far I believe these characters are double encoded UTF-8 (not 100% sure) which could be the problem.

So far I've tried numerous different regexes, as well as playing around with convmv to see if I could convert the filenames to just single encoded characters, but I've had no luck.

Has anyone else come across this? I don't really want to rename 100+ files by hand.

Cheers,
/dev/stderr

David the H. · 07-19-2010, 06:09 AM

Yes, it can be quite difficult to work with foreign language filenames on the command line. For CJK, each character is usually encoded in 2-4 bytes, and unless you have the appropriate input method (IM) and the knowledge to use it, the names are almost impossible to type directly.

First of all, sed does work with Unicode. However, the example pattern you gave above is wrong. The "match anything" wildcard is ".*" (not "*."), and it shouldn't go inside brackets: brackets define a character class, so "[*.]" matches just one literal "*" or ".". So...

Code:

sed 's/^.*SOMETHING/SOMETHING/'

...should strip all the characters in front of "SOMETHING".
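
For the name format in the original question, a pattern along these lines might be closer to what was intended. This is only a sketch: strip_tag is a made-up helper name, and it assumes the names are UTF-8 and that the part to drop is always a leading "[...]." tag.

```shell
# Hypothetical helper: print a file name with a leading "[...]." tag removed.
# Assumes UTF-8 names. LC_ALL=C makes sed match byte by byte, which is safe
# here because "]" and "." never occur inside a UTF-8 multi-byte sequence.
strip_tag() {
    # ^\[     a literal "[" at the start of the name
    # [^]]*   any run of bytes that are not "]"
    # ]\.     the closing "]" followed by a literal "."
    printf '%s\n' "$1" | LC_ALL=C sed 's/^\[[^]]*]\.//'
}
```

Calling strip_tag '[日本語].SOMETHING BLAH BLAH.ext' should print just "SOMETHING BLAH BLAH.ext". Names without the tag pass through unchanged.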

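Second, try using globbing to grab the file by the part of the name you can type. The "*" has to stay outside the quotes, otherwise the shell treats it as a literal character and never expands it. This form also only works when exactly one file matches.

Code:

mv *"SOMETHING BLAH BLAH.ext" "new_file_name.txt"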
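
With 100+ files, the same globbing idea can be put in a loop so that the CJK part never has to be typed at all. The following is a rough sketch, not a tested tool: strip_tags_in_dir is a hypothetical name, and it assumes every file to rename starts with a "[...]." tag as in the original question. Try it on a copy of the directory first.

```shell
# Hypothetical helper: remove a leading "[...]." tag from every matching
# file name in the directory given as the first argument.
strip_tags_in_dir() {
    dir=$1
    for f in "$dir"/\[*\].*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base=${f##*/}                    # file name without the directory
        new=${base#*\].}                 # drop the shortest prefix ending in "]."
        [ -n "$new" ] || continue        # nothing left after the tag
        if [ -e "$dir/$new" ]; then      # never overwrite an existing file
            echo "skipping $base: $new already exists" >&2
            continue
        fi
        mv -- "$f" "$dir/$new"
    done
}
```

Running strip_tags_in_dir . in the directory holding the files would rename "[日本語].SOMETHING BLAH BLAH.ext" to "SOMETHING BLAH BLAH.ext". The glob and the ${base#...} expansion are handled by the shell itself, so sed isn't involved.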

Now a more complicated method. The uniname command (available in the uniutils package) will show you the multi-byte encodings of the individual characters, which you can then use in the shell, with a bit of trickery.

Code:

$ ls *file1.txt
日本語file1.txt

$ ls *file1.txt |uniname -bnpu
character  encoded as     glyph
        0   E6 97 A5       日
        1   E6 9C AC       本
        2   E8 AA 9E       語
        3   66             f
        4   69             i
        5   6C             l
        6   65             e
        7   31             1
        8   2E             .
        9   74             t
       10   78             x
       11   74             t
       12   0A

$ mv $'\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9Efile1.txt' new_file_name.txt

$'' is a bash quoting form (ANSI-C quoting) which expands various escape sequences into their actual characters. Hexadecimal byte codes like the ones above can be used in it in the form "\xNN".

This is cumbersome though, and only really good for scripting purposes. I'm only including it here for completeness.
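
If uniname isn't installed, the standard od command can show the same bytes. A minimal sketch, with hex_bytes being a made-up helper name:

```shell
# Hypothetical helper: print the bytes of a string as space-separated hex pairs.
hex_bytes() {
    # -An drops the offset column, -v disables od's "*" shorthand for repeated
    # lines, -tx1 prints one byte per hex pair.
    # The command substitution is unquoted on purpose: word splitting
    # collapses od's padding and line breaks into single words.
    set -- $(printf '%s' "$1" | od -An -v -tx1)
    echo "$*"
}
```

For example, hex_bytes '日' should print "e6 97 a5", matching the uniname listing above. Each pair can then be written as \xNN inside $''.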

One option I recommend is to use the qmv command from the renameutils package. qmv will load all the filenames given to it into a text editor, which you can then edit by hand. When you save the file, it will automatically rename all the files at once.

The following uses my personal set-up for qmv, which places the from and to names on sequential lines. You can set it up as an alias if you want. The editor used is determined by your $EDITOR environment variable, or by the -e option.

Code:

$ ls *
日本語file1.txt
本日語file2.txt
語日本file3.txt
語本日file4.txt

$ qmv -v -f sc -o separate,indicator1='f|',indicator2='t|' *

#outputs to nano:

f|\346\227\245\346\234\254\350\252\236file1.txt
t|\346\227\245\346\234\254\350\252\236file1.txt

f|\346\234\254\346\227\245\350\252\236file2.txt
t|\346\234\254\346\227\245\350\252\236file2.txt

f|\350\252\236\346\227\245\346\234\254file3.txt
t|\350\252\236\346\227\245\346\234\254file3.txt

f|\350\252\236\346\234\254\346\227\245file4.txt
t|\350\252\236\346\234\254\346\227\245file4.txt

Unfortunately though, it doesn't seem to want to send them to the editor as plain Unicode strings, so all you get are byte codes. Notice that in this case the bytes are displayed in octal form instead of hexadecimal, but the basic concept is the same. Just edit the "t" lines (but don't touch the "f" lines), then save and close the editor.
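
The \NNN sequences in that listing are octal escapes, which printf understands in its format string, so a line can be turned back into readable text to check which file it is. This is just a sketch: decode_octal is a hypothetical helper, and I'm only assuming the escaping looks like the sample above. It also won't cope with names that start with "-".

```shell
# Hypothetical helper: expand \NNN octal escapes back into the actual bytes.
decode_octal() {
    # Double every "%" so printf does not treat it as a conversion specifier
    fmt=$(printf '%s' "$1" | sed 's/%/%%/g')
    # The string is used as the format on purpose: that is where printf
    # interprets \NNN octal escapes
    printf "$fmt"
    echo
}
```

For example, decode_octal '\346\227\245\346\234\254\350\252\236file1.txt' should print "日本語file1.txt" on a UTF-8 terminal. The single quotes matter, since they stop the shell from touching the backslashes.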

Finally, there are other bulk renamers out there (mostly GUI), such as krename and pyrenamer.