2018-11-30

Next generation Business Intelligence

In this commercial for SAP Business Warehouse generation one-zillion-42, they claim, as they did in the previous one-zillion-41 generations, that now, for the first time, data can be combined to make astonishing revelations never possible before. Give them a few more years and they will claim the same thing for the one-zillion-43rd time. To be fair, all the other big guys do the same. When will these super-duper data warehouses, business warehouses, data lakes and landing zones be on par with my 2006 generation 2 Data Warehouse? I suspect it will take some time. All these data warehouses share one intrinsic problem: the data model, you know, those extended snowflake Kimball super-duper cubes, which make it complex to develop apps, slow to retrieve and very slow to update the information. Have you ever followed the data from the source systems to a useful application in the monolithic data warehouse systems of the big guys? If not, do that, and then ask yourself: can this not be done simpler, faster and more cost-effectively? Addressing that may create the real next generation of Business Intelligence. But still the big guys keep patching data models conceived in the 1970s.
   

2018-11-13

Mojibake blues

UTF-8 encoding conversions are a never-ending source of grievance and agony. Lately I have been fighting
with UTF-8 in Python 2.7. I needed to export data from Oracle and MS SQL Server databases and write the
data to CSV files. Dead simple, except for those foreign non-English letters. Of course the normal CSV
writer did not work.
After quite some googling I found the UnicodeWriter class, which I could use as a replacement for the
normal CSV writer. UnicodeWriter promised to solve all my cute foreign letter troubles. It is the
writerow method that is supposed to magically solve all encoding problems.
import csv, codecs, cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Constructor from the standard UnicodeWriter recipe in the Python 2 csv docs:
        # rows are first written to an in-memory queue, then re-encoded to the stream.
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
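
For context, here is roughly how such a writer is meant to be plugged in as a drop-in replacement for
csv.writer. The sample rows and file name below are made-up placeholders, not my actual Oracle / SQL Server export:

# -*- coding: utf-8 -*-
# Made-up sample data standing in for rows fetched from a database cursor.
rows = [
    [u'1', u'Åsa', u'Göteborg'],
    [u'2', u'Žofia', u'Košice'],
]

f = open('export.csv', 'wb')                  # binary mode for the Python 2 csv module
writer = UnicodeWriter(f, encoding='utf-8')   # instead of csv.writer(f)
writer.writerow([u'id', u'name', u'city'])
for row in rows:
    writer.writerow(row)
f.close()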


Well, UnicodeWriter did not keep that promise. All my foreign characters were just as distorted as with the
regular CSV writer. I googled a thousand different solutions which promised to solve my problem; none did.
I also found many requests for help: 'This drives me bonkers, I have tried lots of suggested options, none
works.' Knowing you are not alone makes you feel better, but it does not solve your problem. After hours of
googling and testing I came up with a solution of my own that worked for me. I had to replace the first
line of code in the writerow method:
def writerow(self, row):
    # Only unicode objects are encoded to UTF-8 (characters that cannot be encoded
    # are ignored); anything else, e.g. plain str or numbers, is passed through
    # unchanged and left for csv.writer to handle.
    self.writer.writerow([s.encode("utf-8", 'ignore') if str(type(s)) == "<type 'unicode'>" else s for s in row])
    data = self.queue.getvalue()
    data = data.decode("utf-8")
    data = self.encoder.encode(data)
    self.stream.write(data)
    self.queue.truncate(0)
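
In plain words, the change means that only values which really are unicode objects get encoded to UTF-8
bytes; plain str values (which are already bytes) and non-string values are handed to the underlying
csv.writer untouched. A note on style rather than substance: isinstance(s, unicode) would be the more
idiomatic way to write the same type test, but the comparison against the type's string representation
does the job.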

The thing that really annoys me is that you (or at least I) always have to do trial-and-error debugging to
solve UTF-8 encoding problems. You can read any number of UTF-8 descriptions and any number
of 'how to solve UTF-8 encodings' guides; in the end you have to resort to trial-and-error debugging. I know my
solution will not solve all Python 2.7 UTF-8 encoding problems. I am happy if it solves one problem and
maybe can be an inspiration for others to solve their UTF-8 problems.