The Problem

You're on Mac OS X (somewhere around 10.7.5) and you're using the sed command to replace characters from the Latin-1 or Windows-1252 character encoding with their UTF-8 equivalents. Unfortunately, you get an error like the following:

> sed: 1: "s/#/’/g": RE error: illegal byte sequence

Luckily you're not alone!

This happened to me while working on HamDecks, a small project that creates Mnemosyne decks to help you study for the Amateur Radio Operator exams using questions from the official ARRL question pools. The source question pool files (Technician, General, Extra) have some problems, though: they contain a lot of characters in strange or exotic encodings that could not be imported into Mnemosyne. That's how I got myself into this whole mess in the first place.

Options

The Stack Overflow thread on this error makes two suggestions:

  1. Use the iconv utility

  2. Use a Perl one-liner

Your mileage may vary, but neither of those suggestions worked for me. So what did work?

Potential Solution

Once again, we will visit our system locale settings.

Here's what worked for the HamDecks project:

https://gist.github.com/6685995

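Instead of just prefixing the sed command with LANG=C, we prefix it with LANG=C LC_ALL=C. LC_ALL overrides every other locale setting, so sed stops trying to interpret the input as multibyte characters and treats it as plain bytes. I'm not saying this is a silver bullet, just that it worked for me and might work for you too.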
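As a rough sketch of what such a command can look like: the actual HamDecks script lives in the gist, so the sample file, the output file, and the byte being replaced here are all made up for illustration. The override variable is spelled LC_ALL, which is the standard POSIX name.

```shell
# Hypothetical sample file containing a Windows-1252
# right single quote (byte 0x92, written here as octal \222).
printf 'don\222t panic\n' > sample.txt

# The byte to replace and its UTF-8 equivalent (U+2019 is E2 80 99,
# or \342\200\231 in octal).
old=$(printf '\222')
new=$(printf '\342\200\231')

# Prefixing sed with LANG=C LC_ALL=C forces the C locale for this one
# command, so sed works on raw bytes instead of multibyte characters.
LANG=C LC_ALL=C sed "s/$old/$new/g" sample.txt > fixed.txt
```

Because the variables are set as a prefix, they apply only to that single sed invocation and leave the rest of your shell session's locale alone.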