[TriLUG] Email: Character Encoding

Cristóbal Palmer cristobalpalmer at gmail.com
Mon Jul 30 18:54:15 EDT 2012


Alan (and other list members),

Jack has already covered the most sane (if oversimplified) answer for
most cases (i.e., use UTF-8 everywhere), but I thought I'd chime in with
a little game I like to call Figure Out How My Name Got Abused. If
stuff looks profoundly weird because you're reading this in the digest
and our digest system is idiotic and execrable, I suggest you look in
the web archive (link in the footer), since that seems to do the nice
thing. But back to the game. The easiest/quickest way for me to play
the game is with a Python shell. We'll start by creating a unicode
object that's just the accented o (a.k.a. Latin small letter o with
acute, a.k.a. U+00F3, a.k.a. Unicode code point 00F3). We'll then try to
reproduce some of the ways I've actually seen my name show up in
correspondence (usually from automated systems, but sometimes from
human interlocutors using software that didn't do the
nice/ideal/correct thing).

$ python
Python 2.7.2 (default, Feb 6 2012, 21:42:35)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> S = u'ó'
>>> type(S)
<type 'unicode'>
>>> print S # bad idea generally, but here gives the correct small o with acute
ó
>>> print S.encode('iso-8859-1') # question mark: the lone byte 0xF3 is not valid UTF-8 for my terminal
?
>>> print S.encode('utf8').decode('iso-8859-1') # capital A with a tilde to the third power
ó
>>> print S.encode('utf8').decode('windows-1252') # capital A with a tilde to the third power
ó
>>> print S.encode('utf8').decode('cp500') # capital C followed by an interpunct, only because EBCDIC is LOL
C·
>>> print S.encode('utf8').decode('mac_cyrillic') # square root of greater than or equal to
√≥
>>> print S.encode('utf8').decode('koi8_r') # Tse followed by Yo
цЁ
>>> print S.encode('utf8').decode('shift_jis') # halfwidth katakana letter te followed by halfwidth katakana letter u
ﾃｳ
>>> S.encode('utf8').decode('quoted-printable') # no =XX escapes to decode, so the UTF-8 bytes come back unchanged
'\xc3\xb3'
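
For anyone playing along on Python 3, here is a rough sketch of the same
game. It assumes Python 3, where every str is already unicode and print is
a function; misread() and WRONG_CODECS are names made up for illustration,
and the codec list is just the one from the session above. It prints
escaped forms and Unicode character names rather than the raw characters,
so it doesn't depend on what your terminal can display.

```python
# Python 3 sketch of the game; the session above is Python 2.7.
import unicodedata

S = '\u00f3'  # LATIN SMALL LETTER O WITH ACUTE

# Codecs taken from the session above.
WRONG_CODECS = ['iso-8859-1', 'windows-1252', 'cp500',
                'mac_cyrillic', 'koi8_r', 'shift_jis']


def misread(text, wrong_codec):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode('utf8').decode(wrong_codec)


results = {codec: misread(S, codec) for codec in WRONG_CODECS}

for codec, mangled in results.items():
    # Name each resulting character so the damage is readable anywhere.
    names = ' + '.join(unicodedata.name(ch) for ch in mangled)
    print('%-13s %-16s %s' % (codec, ascii(mangled), names))

# The damage is reversible as long as the wrong codec mapped every byte to
# a distinct character: re-encode with the wrong codec, decode as UTF-8.
repaired = results['iso-8859-1'].encode('iso-8859-1').decode('utf8')
```

The last line is the usual way to recover from this kind of mangling when
you can guess which wrong codec was applied; repaired should compare equal
to S.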

My personal favorite is the mac_cyrillic one. Other ways I've seen my
name barfed up include:

Crist??bal
Cristbal
Crist__bal
Crist�_bal
Crist��bal
CristÌ_bal
CristÃ_bal

See if you can produce some of these with your own version of my game.
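
As a head start, here is a sketch (again assuming Python 3) of plausible
routes to four of the forms listed above. These are guesses at what the
offending systems might have done, not a diagnosis of any particular one,
and the variable names are invented for illustration.

```python
import re

NAME = 'Crist\u00f3bal'
utf8_bytes = NAME.encode('utf8')  # b'Crist\xc3\xb3bal'

# UTF-8 bytes fed to an ASCII-only decoder: each of the two undecodable
# bytes becomes U+FFFD, the replacement character.
two_replacements = utf8_bytes.decode('ascii', errors='replace')

# UTF-8 misread as Latin-1 (two characters), then forced into ASCII:
# each non-ASCII character becomes a question mark.
mojibake = utf8_bytes.decode('iso-8859-1')
two_question_marks = mojibake.encode('ascii', errors='replace').decode('ascii')

# Same misreading, but with a "sanitizer" that swaps anything non-ASCII
# for an underscore.
two_underscores = re.sub(r'[^\x00-\x7f]', '_', mojibake)

# Non-ASCII characters dropped outright.
dropped = NAME.encode('ascii', errors='ignore').decode('ascii')

for label, value in [('replacement chars', two_replacements),
                     ('question marks', two_question_marks),
                     ('underscores', two_underscores),
                     ('dropped', dropped)]:
    print('%-18s %s' % (label, ascii(value)))
```

The remaining forms (a single accented capital followed by an underscore)
are left as the exercise.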

Let it be known that if I interact with you or your business, and it's
clear to me that your webapp/business/whatever has been modified since
2010, and my name comes back from your webapp/business/whatever as,
for example, capital A with a tilde to the third power, I will
consider you and/or your business idiotic and execrable. There's just
no excuse for this anymore. I could have made excuses in 2009. They
would have been bad excuses, but I could have made them. I just can't
make them today. And in case you think I'm being mean, Joel Spolsky
was angry back in 2003.

Further Reading:

* http://www.joelonsoftware.com/articles/Unicode.html
* http://farmdev.com/talks/unicode/
* http://docs.python.org/library/codecs.html
* http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
* http://www.columbia.edu/~fdc/utf8/ (with special thanks to Kevin Otte)

Thanks,
-- 
Cristóbal Palmer
cmpalmer.org


