The good news is that I am still on holiday. These days I am playing with some Python code that reads data from some of our systems and performs some analysis before the new fiscal year officially starts.
It started as a small project and a relaxing activity. It was not strictly work; I was playing with data and Python. I love coding, and I always build something when I have free time.
I ended up with five thousand lines of code.
This specific exercise was an ETL (Extract, Transform, Load). Two main systems were involved: Salesforce and Google Drive. I wanted everything in a SQLite database so I could wrangle the data with pandas and NumPy.
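The "load" end of that pipeline really can be a one-liner with pandas. This is only a minimal sketch of the idea; the table and column names here are made up, not my actual schema:

```python
import sqlite3

import pandas as pd

# Pretend this DataFrame came out of the extract/transform steps.
extracted = pd.DataFrame({"id": [1, 2], "stage": ["Closed Won", "Open"]})

with sqlite3.connect(":memory:") as con:
    # to_sql creates the table and inserts the rows in one call.
    extracted.to_sql("opportunities", con, if_exists="replace", index=False)
    count = con.execute("SELECT COUNT(*) FROM opportunities").fetchone()[0]

print(count)  # 2
```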
Extracting data was easy. Both Salesforce and Google Drive have very well-documented APIs. The transformation was tricky: every system has its own way of representing data, and date and time handling in particular is always a massive pain across systems. The load was a breeze.
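A concrete flavor of the date-time pain: different APIs hand back ISO 8601 timestamps with different offsets, or a trailing `Z`, or no timezone at all. A small normalizer, sketched here with my own assumptions spelled out in the comments, keeps the database consistent:

```python
from datetime import datetime, timezone


def to_utc_iso(raw: str) -> str:
    """Parse an ISO 8601 timestamp and return it as UTC ISO text."""
    # fromisoformat rejects a trailing 'Z' before Python 3.11,
    # so rewrite it as an explicit offset first.
    dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption: naive timestamps are already UTC. This is a policy
        # choice you must document, not something the data tells you.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


print(to_utc_iso("2023-06-30T18:30:00+02:00"))  # 2023-06-30T16:30:00+00:00
print(to_utc_iso("2023-06-30T16:30:00Z"))       # 2023-06-30T16:30:00+00:00
```

Storing everything as UTC text means SQLite string comparison and sorting just work.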
Finally, I made it. My database was loaded with data.
I ran the data-integrity tests, and something was wrong. After a SQL join, I expected 3036 rows and got 3108. It took me an hour to find the culprits: I had forgotten to disallow duplicates in a database field, and there was duplicated data in Google Drive.
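Both sides of that bug are easy to reproduce in miniature. A `GROUP BY ... HAVING` query surfaces the keys that will inflate a join, and a `UNIQUE` constraint would have made the load fail fast instead of failing quietly. Table and column names below are illustrative, not my actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (drive_id TEXT, title TEXT)")
con.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("abc", "Q1 report"), ("abc", "Q1 report (copy)"), ("xyz", "Notes")],
)

# Find the keys that appear more than once -- each one multiplies join rows.
dupes = con.execute(
    "SELECT drive_id, COUNT(*) FROM docs GROUP BY drive_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('abc', 2)]

# With UNIQUE declared up front, the duplicate insert raises immediately.
con.execute("CREATE TABLE docs_strict (drive_id TEXT UNIQUE, title TEXT)")
con.execute("INSERT INTO docs_strict VALUES ('abc', 'Q1 report')")
try:
    con.execute("INSERT INTO docs_strict VALUES ('abc', 'Q1 report (copy)')")
except sqlite3.IntegrityError:
    print("duplicate rejected")
```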
I could have deleted the duplicated data in Google Drive, but I did not want to do it. I am not the only one accessing and using that data.
I modified my code to cope with the duplicated data, and it almost doubled the size of my codebase. I could not simply discard a duplicate; I had to merge the data in a table with 32 different fields, each with its own requirements.
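One way to express per-field merge rules compactly is a pandas `groupby().agg()` with a rule per column. This toy version, with invented field names and only a couple of the 32 columns, shows the shape of the approach rather than my actual logic:

```python
import pandas as pd

# Two rows share a drive_id but disagree field by field.
df = pd.DataFrame({
    "drive_id":   ["abc", "abc", "xyz"],
    "title":      ["Q1 report", "Q1 report (copy)", "Notes"],
    "updated_at": ["2023-06-01", "2023-06-15", "2023-05-20"],
})

# One merge rule per field: keep the first title seen, but the most
# recent timestamp (ISO date strings compare correctly as text).
rules = {"title": "first", "updated_at": "max"}
merged = df.groupby("drive_id", as_index=False).agg(rules)
print(merged)
```

The real version needs custom functions for fields where neither "first" nor "max" is right, which is exactly where the line count explodes.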
It was exciting and intriguing.
Sometimes you spend more time coding edge case management than the core application logic.