Substance: Text Encoding Conversion

I recently learnt about text encoding and was motivated to write a simple program that converts MP3 tags in batches (the tags of most of my Chinese songs were not encoded in UTF-8, the standard across many platforms nowadays). I will first list the essentials of text encoding and conversion and then talk a bit about the program I wrote.

What are Encoding, Decoding and Conversion?
1. Characters, i.e. symbols, need to be stored in a (binary) physical representation on the computer. The mapping from symbols to the physical representation is called encoding, and the inverse mapping is called decoding. For example, when you open a text file from your hard drive, the program decodes the file content to know what to draw on the screen, and when you save the text file, the program encodes the content into the file. As you might imagine, there are many encoding schemes, or codecs, out there. This creates a problem when a program reads a file with a codec different from the one used to save it.

2. Conversion is a mapping from one physical representation to another such that the decoded text of the output is the same as the decoded text of the input. Writing a conversion directly is tedious: the two encodings may follow very different structures, so the mapping has to be constructed for each pair and is hard to automate. (If you are lucky, someone has written it before you.) Then Unicode comes to the rescue. Unicode pins down the identity of symbols, the abstract things that we humans really care about, by assigning a code point to each linguistic symbol in every major language in the world. So now we can simplify the conversion between any pair of encodings by first decoding the binary to Unicode (decoders exist for most encodings in common use) and then encoding the result into the desired encoding. Python provides very good Unicode support, and Python 3's strings are Unicode (handy for handling file names).

3. It is hard for a program to determine what encoding a file is saved in given only its binary content. There are various conventions for communicating this information, e.g. the charset declaration in an HTTP response and the BOM that Windows prefixes to txt files. Another solution is to agree on one encoding large enough to accommodate most, if not all, linguistic symbols, so that every program from now on can assume this encoding: it is like writing everything down in the same language. The standard now is UTF-8, so you usually want to convert text in other encodings to UTF-8 for compatibility with the latest software.
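
To make the three points above concrete, here is a minimal Python 3 sketch of encoding, a mismatched decode, and conversion through Unicode. The sample string and the helper name `convert` are my own choices for illustration, not something taken from the program discussed below.

```python
# Illustrative sketch only; `convert` is a hypothetical helper name.

def convert(raw, source, target="utf-8"):
    """Re-encode bytes by going through Unicode:
    decode with the source codec, then encode with the target codec."""
    return raw.decode(source).encode(target)


text = "你好"                       # a Unicode string (a sequence of code points)
gbk_bytes = text.encode("gbk")     # encoding: symbols -> bytes
utf8_bytes = text.encode("utf-8")  # same symbols, different bytes

# Decoding with the wrong codec does not fail here (Latin-1 accepts any byte);
# it silently yields different symbols.
garbled = gbk_bytes.decode("latin-1")

print(gbk_bytes != utf8_bytes)                  # the two encodings differ
print(garbled == text)                          # the mismatched decode is not the original text
print(convert(gbk_bytes, "gbk") == utf8_bytes)  # conversion via Unicode
```

The mismatched decode is the dangerous case: nothing raises an error, so the wrong text can be saved back to disk without anyone noticing.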

What does my program do?

As you would expect, it converts MP3 tags encoded in <encoding> into UTF-8. You need to supply your guess for <encoding> (I will talk briefly about auto-detecting the encoding later). By default, the guess is GBK, the most common encoding in Chinese songs' MP3 tags. This is the culprit that caused all the mess in displaying song information on mobile devices. For a list of acceptable codecs, look here: http://docs.python.org/library/codecs.html#standard-encodings. One cool feature of my program is that it converts the tags character by character and preserves well-formed UTF-8 characters. This design has two advantages. First, I observed that some tags mix characters in UTF-8 with characters in another encoding: when the other encoding lacked a needed character, that character was encoded in UTF-8 instead. Converting character by character handles this easily. Second, and more importantly, it makes it safe to run the program multiple times over the same files, because content already converted to proper UTF-8 is left unchanged.
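
The character-by-character idea could look roughly like the sketch below. This is not the code from my gist: the names `convert_mixed` and `_decode_one` are hypothetical, and the sketch assumes the tag is available as raw bytes and that the guessed codec is stateless with characters of at most four bytes. At each position it first tries to read one well-formed UTF-8 character, and only falls back to the guessed codec if that fails.

```python
# Hypothetical sketch of mixed-encoding conversion; not the actual gist code.

def _decode_one(raw, pos, encoding):
    """Try to decode exactly one character starting at raw[pos].

    Returns (character, bytes_consumed), or None if no prefix of up to
    four bytes decodes cleanly with the given encoding.
    """
    for width in range(1, 5):
        chunk = raw[pos:pos + width]
        try:
            return chunk.decode(encoding), len(chunk)
        except UnicodeDecodeError:
            continue
    return None


def convert_mixed(raw, guess="gbk"):
    """Decode bytes that mix well-formed UTF-8 with text in the guessed codec."""
    out = []
    pos = 0
    while pos < len(raw):
        # Well-formed UTF-8 wins; otherwise fall back to the guessed codec.
        hit = _decode_one(raw, pos, "utf-8") or _decode_one(raw, pos, guess)
        if hit is None:
            out.append("\ufffd")  # undecodable byte: mark it and move on
            pos += 1
        else:
            char, width = hit
            out.append(char)
            pos += width
    return "".join(out)


mixed = "你好".encode("gbk") + b" - " + "世界".encode("utf-8")
once = convert_mixed(mixed)
twice = convert_mixed(once.encode("utf-8"))
print(once == "你好 - 世界", once == twice)
```

One caveat of giving UTF-8 priority: a few byte pairs in GBK also happen to be well-formed UTF-8, so a sketch like this can occasionally misread a genuine GBK character. Running it a second time on its own UTF-8 output changes nothing, which is the property that makes repeated runs safe.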

Using my program

To use this utility (you will need Python and the mutagen library), download and save the program from here: https://gist.github.com/2578542. Then, in your command prompt or shell, type:

python mp3_tag.py [<dir>]

where <dir> is the directory of MP3 files (the current directory is used if <dir> is missing). The program prints its output to the console, and a log file is created too.

A note for Windows users

I used mutagen to read and save MP3 tags. It reads many different MP3 tag formats but saves all output tags uniformly in ID3v2.4, which works fine on mobile devices and various modern MP3 players but is not supported by the Windows 7 file explorer or Windows Media Player. (They only support tag versions up to ID3v2.3.) So after conversion, you will see many "?" in the MP3 tags there. To solve this problem, you will need to convert the tags to ID3v2.3 with another tool. I used iTunes to do that: right click on the highlighted song(s), choose "Convert ID3 Tags...", and then select ID3v2.3.

How to detect the encoding?

This is a nontrivial problem because it takes more than a program to determine whether the decoded text makes sense. The funny symbols you get from using the wrong codec are perfectly normal in a different language, so it takes a lot of intelligence to tell whether you used the correct codec. Some modern browsers provide some capability to detect the encoding of a page. One idea is to check whether the decoded text falls within the range of frequent characters of some language. But this only catches some wrong guesses, since a page might contain a wider mix of symbols than you assumed. In the context of MP3 tags, suppose Big5 and GBK both decode the same bytes but into very different texts: assuming that songs use only frequent characters is not a good basis for choosing between them, because some song names do use rare characters.

One solution I thought of is to use search engines and select the codec whose decoded text gives the most results on the web. One problem is that many songs online were wrongly decoded in the same way as the one sitting on your computer, so wrongly decoded text often yields many more results than you would expect.

The other solution I thought of (and actually used to fix conversion errors) is SoundHound, a very cool app that can reverse search songs: it listens to a song and returns the song information. (This method could also help you recover the names of forgotten songs on your computer.) Right now, it is a bit silly and time-consuming to do this for a lot of songs, as it must be done by hand (unless you reverse engineer SoundHound). I am hoping that SoundHound will release an API soon so this can be automated. (Besides, one could use its API to make a Sing Something!)
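
Going back to the frequent-character idea, here is a rough sketch of what such a heuristic might look like for choosing between GBK and Big5. Everything in it is an assumption of mine for illustration: the function names are hypothetical, "frequent" is approximated by membership in GB2312 for GBK and by the level-1 block of Big5 (lead bytes 0xA4 to 0xC6) for Big5, and ties go to the first candidate. It has exactly the weakness described above: a song name with rare characters can fool it.

```python
# Hypothetical frequency heuristic; a sketch, not a reliable detector.

def _in_gb2312(ch):
    """Treat a character as frequent for GBK if the smaller GB2312 set has it."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


def _in_big5_level1(ch):
    """Treat a character as frequent for Big5 if it sits in the level-1 block."""
    try:
        return 0xA4 <= ch.encode("big5")[0] <= 0xC6
    except UnicodeEncodeError:
        return False


COMMON = {"gbk": _in_gb2312, "big5": _in_big5_level1}


def score(raw, codec):
    """Fraction of non-ASCII decoded characters that look frequent.

    Returns None if the bytes do not decode with this codec at all.
    """
    try:
        text = raw.decode(codec)
    except UnicodeDecodeError:
        return None
    chars = [c for c in text if ord(c) > 0x7F]
    if not chars:
        return 1.0  # pure ASCII tells us nothing; do not penalise it
    is_common = COMMON.get(codec, lambda ch: True)
    return sum(is_common(c) for c in chars) / len(chars)


def guess_encoding(raw, candidates=("gbk", "big5")):
    """Return the best-scoring candidate codec, or None if none decodes."""
    scored = [(c, score(raw, c)) for c in candidates]
    scored = [(c, s) for c, s in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])[0]


print(guess_encoding("你好世界".encode("gbk")))
```

At best this narrows down the guess; the result still needs a human (or SoundHound) to confirm it.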

Coda

Regarding the source code, I do not provide any warranty, but you are free to do whatever you like with my code. It edits some tags of your MP3 files, specifically only the album, artist, album artist, performer, and genre fields (you can modify the code to edit more or fewer fields), so you might want to test it on a few duplicates before running it on your entire music library.

5/02/2012
