2018-11-13

Mojibake blues

UTF-8 encoding conversions are a never-ending source of grievance and agony. Lately I have been
fighting with UTF-8 in Python 2.7. I needed to export data from Oracle and MS SQL Server databases
and write the data to CSV files. Dead simple, except for those foreign non-English letters. Of course
the normal CSV writer did not work.
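
For anyone who has not met it, this is roughly how such distortion (mojibake) arises. A minimal sketch with a made-up sample word, assuming the text is encoded as UTF-8 and then wrongly read back as Latin-1; it is not taken from the actual export:

```python
# Hypothetical sample text; escapes keep the source file pure ASCII.
original = u"Sm\u00f6rg\u00e5sbord"

# Correct step: encode the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Wrong step (on purpose): read those bytes back with the wrong codec.
garbled = utf8_bytes.decode("latin-1")

# Undoing the wrong step and decoding properly restores the text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

Here `garbled` comes out as SmÃ¶rgÃ¥sbord: each non-ASCII letter has turned into two junk characters. Reversing the two steps recovers the original.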
After quite some googling I found the UnicodeWriter class, which I could use as a replacement for
the normal CSV writer. UnicodeWriter promised to solve all my cute foreign letter troubles. It is the
writerow method that is supposed to magically solve all encoding problems.
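import codecs
import csv
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    # Constructor as in the UnicodeWriter recipe from the Python 2.7 csv docs
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        # Assumes every cell in the row is a unicode object
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)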


Well, it did not. All my foreign characters were as distorted as with the regular CSV writer. I
googled a thousand different solutions that promised to solve my problem; none did. I also found many
requests for help along the lines of ‘This drives me bonkers, I have tried lots of suggested options,
none works.’ Knowing you are not alone makes you feel better, but it does not solve your problem.
After hours of googling and testing I came up with a solution of my own that worked for me. I had to
replace the first line of code in the writerow method so that only values that really are unicode
objects get encoded and everything else is passed through unchanged:
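def writerow(self, row):
    # Encode only unicode objects; pass every other value through as-is
    self.writer.writerow([s.encode("utf-8", "ignore") if isinstance(s, unicode) else s
                          for s in row])
    # Fetch UTF-8 output from the queue ...
    data = self.queue.getvalue()
    data = data.decode("utf-8")
    # ... and reencode it into the target encoding
    data = self.encoder.encode(data)
    # write to the target stream
    self.stream.write(data)
    # empty queue
    self.queue.truncate(0)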
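
The per-cell rule in that first line can be pulled out into a small helper, which makes it easier to try on sample values. This is only a sketch: `encode_cell` and `text_type` are names made up here, not part of UnicodeWriter, and the version check is an assumption added so the snippet also runs on Python 3, where `str` is the text type.

```python
import sys

# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
# The name `unicode` is only evaluated when running on Python 2.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def encode_cell(value, encoding="utf-8"):
    """Encode text to bytes; return any other value unchanged."""
    if isinstance(value, text_type):
        # "ignore" silently drops characters that cannot be encoded.
        return value.encode(encoding, "ignore")
    return value


# Hypothetical row mixing text, a number, a NULL and raw bytes,
# the kind of mix a database export might hand over.
row = [u"Malm\u00f6", 42, None, b"raw"]
encoded = [encode_cell(v) for v in row]
print(encoded)
```

Only the text cell is touched; the number, the `None` and the byte string come back exactly as they went in, so none of them is asked to do an `.encode()` it cannot handle.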

The thing that really annoys me is that you (or at least I) always have to do trial-and-error
debugging to solve UTF-8 encoding problems. You can read any number of UTF-8 descriptions and any
number of ‘how to solve UTF-8 encodings’ guides; in the end you have to resort to trial and error. I
know my solution will not solve all Python 2.7 UTF-8 encoding problems. I’m happy if it solves one
problem and can maybe be an inspiration for others to solve their UTF-8 problems.
