You’re on Mac OS X (somewhere around 10.7.5) and you’re using the sed command to replace characters from the Latin-1 or Windows-1252 character encoding with their UTF-8 equivalents. Unfortunately, you get an error like the following:
sed: 1: "s/#/’/g ": RE error: illegal byte sequence
Luckily you’re not alone!
This happened to me while working on HamDecks, a small project that creates Mnemosyne decks to help you study for the Amateur Radio Operator exams using questions from the official ARRL question pools. The source question pool files (Technician, General, Extra) have some problems, though: they contain a lot of characters in strange or exotic encodings that could not be imported into Mnemosyne. That’s how I got myself into this whole mess in the first place.
The Stack Overflow link above makes two suggestions:
Your Mileage May Vary, but neither of those suggestions worked for me. So what did work then?
Once again, we will visit our system locale settings.
Here’s what worked for the HamDecks project:
Instead of just prefixing the sed command with LANG=C, we prefix it with LANG=C LC_ALL=C. I’m not saying this is a silver bullet, just that it worked for me and might work for you too.
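To make that concrete, here is a minimal sketch of the idea. It assumes the offending character is byte 0x92 (the Windows-1252 right single quote); the helper name fix_quotes and the sample input are made up for illustration and are not taken from HamDecks. It uses the standard LC_ALL variable, which overrides every other locale setting, so sed treats its input as plain bytes.

```shell
# Hypothetical helper: replace the Windows-1252 byte 0x92 with the UTF-8
# right single quote (bytes E2 80 99). Reads stdin, writes stdout.
fix_quotes() {
  # LANG=C LC_ALL=C puts sed in the C locale, so it works on raw bytes
  # instead of rejecting 0x92 as an illegal UTF-8 sequence.
  LANG=C LC_ALL=C sed "s/$(printf '\222')/$(printf '\342\200\231')/g"
}

# Made-up sample input: "don" + byte 0x92 + "t"
printf 'don\222t\n' | fix_quotes
```

The printf calls build the raw bytes from octal escapes, so nothing non-ASCII has to be typed into the terminal or the script.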
Background: I run this server through Slicehost, and I enjoy their service immensely. When you set up your first server, or rebuild an existing server, you get a very minimal GNU/Linux system installed. For obvious reasons, I like this a lot too.
The problem: Both the first time I built this server and most recently when I rebuilt it with Jaunty Jackalope, the system locales weren’t configured. I understand why this is done, and it doesn’t bother me that it happens. What frustrated me a little was how hard it was to find out how to set my locale properly.
How do you know if your locales aren’t correctly defined? On my Jaunty Jackalope system I see messages like this:
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
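If you would rather check than wait for warnings, something like the following sketch may help. It relies only on the standard locale -a listing; the helper name has_locale is my own invention. Note that glibc tends to list en_US.UTF-8 as en_US.utf8, so the comparison ignores case and hyphens.

```shell
# Hypothetical helper: succeeds if the named locale shows up in `locale -a`.
has_locale() {
  # Normalize the requested name: drop hyphens, lowercase everything.
  want=$(printf '%s' "$1" | tr -d '-' | tr '[:upper:]' '[:lower:]')
  # Normalize the system's list the same way and look for an exact line match.
  locale -a 2>/dev/null | tr -d '-' | tr '[:upper:]' '[:lower:]' | grep -qxF "$want"
}

if has_locale en_US.UTF-8; then
  echo "en_US.UTF-8 is available"
else
  echo "en_US.UTF-8 has not been generated"
fi
```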
I tried running dpkg-reconfigure locales, but that had no effect. Searching the Internet for the messages above turned up a couple of possible solutions, but none of them looked like anything I was interested in. I’m a firm believer that if the Internet tells me to run a command with more than a couple of options, it may work, but there is probably an easier, less cryptic solution. For example:
localedef -v -c -i en_US -f UTF-8 en_US.UTF-8
No way I’m running that. I instead searched for “slicehost locale” and found this article: Ubuntu Hardy setup. I enjoy this much more:
locale-gen en_US.UTF-8
update-locale LANG=en_US.UTF-8
Turns out that update-locale is a Debian/Ubuntu-specific command. It updates your system’s default locale settings file. I had checked for one before running it and found that none existed yet on my system. After running those two commands above I found one had been created with “LANG=en_US.UTF-8” in it. It’s possible that running update-locale could have been all I needed to do to begin with.
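If you want to confirm what ended up in that file, a small sketch like this one can print the configured value. It assumes the file lives at /etc/default/locale, which is where I understand Debian and Ubuntu keep it; the helper name default_lang is made up, and you can pass a different path as the first argument.

```shell
# Hypothetical helper: print the LANG value from a locale defaults file.
# Assumption: the file is /etc/default/locale unless a path is given.
default_lang() {
  # Match a LANG=... line, with or without double quotes around the value,
  # and print only the value.
  sed -n 's/^LANG="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "${1:-/etc/default/locale}" 2>/dev/null
}

default_lang || echo "no default locale file found"
```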
I hope this helps someone else who’s had this problem, whether before or for the first time.
Update: 2013-05-25: This post has reached more parts of the Internet than I ever expected when I wrote it 4 years ago. Thanks to everyone who linked back instead of just copying and pasting the solution directly.
These days I’m running Fedora on Linode. And all is well.