My Master's Thesis & The Lessons Learnt
Halfway through my master’s thesis, I think it’s time to look back and reflect on what I’ve learnt so far.
Fact or Fiction?
During the 2016 U.S. presidential election there was an unprecedented amount of fake news. These articles were written by entities trying either to make money or to influence the outcome. Before the rise of digital media, fake news could be fact checked by journalists. Today that’s impossible due to the sheer number of articles written every day.
A solution to this problem would be to use computers to do basic fact checking. The method we chose was to create a fact database, or knowledge-base, and then evaluate a statement’s factuality against that database.
Since we’re computer scientists, we didn’t want to gather facts manually. Instead, we wanted the computer to construct the knowledge-base from trusted facts on the internet using the latest in natural language processing and machine learning.
The Three Lessons
During the project we worked with Spark to process big data corpora and used libraries to perform the information extraction. This project has really challenged me while at the same time being incredibly fun. Along the way I’ve learnt three lessons that I want to pass on.
Real world data is dirty
We worked a lot with free and open corpora such as Wikipedia. They are great, but the data is often very dirty. When you construct your pipeline you imagine, for example, that sentences are no longer than a handful of words. In reality there are outliers that are thousands of words long. Eventually you’ll learn to filter everything with upper bounds, lower bounds and restrictive regexes.
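To make that concrete, here is a minimal sketch of such a filter in plain Python. The thresholds, the character whitelist and the name `is_clean_sentence` are all assumptions made for illustration; this is not the filter from the thesis pipeline.

```python
import re

# Hypothetical thresholds; tune them for your own corpus.
MIN_WORDS = 3
MAX_WORDS = 60

# Restrictive whitelist: letters, digits, whitespace and basic punctuation only.
ALLOWED = re.compile(r"[A-Za-z0-9\s.,;:'\"()\-]+")


def is_clean_sentence(sentence: str) -> bool:
    """Return True if the sentence passes the length bounds and the regex."""
    n_words = len(sentence.split())
    if n_words < MIN_WORDS or n_words > MAX_WORDS:
        return False
    return ALLOWED.fullmatch(sentence) is not None


sentences = [
    "Stockholm is the capital of Sweden.",  # kept
    "Ok.",                                  # too short
    "{{Infobox | name = broken markup}}",   # leftover markup fails the regex
    "word " * 5000,                         # absurdly long outlier
]
clean = [s for s in sentences if is_clean_sentence(s)]
```

Only the first sentence survives. A predicate like this could probably be passed straight to a Spark `filter` call, but the point is mostly that every assumption about the data ends up as an explicit check.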
Data can explode
Once you’ve got your filtered, normalized, and reasonable dataset, everything is fine, right? No, that small 1.4GB dataset of sparse vectors might suddenly get turned into dense vectors because one library didn’t support sparse vectors and silently converted them internally.
So now it doesn’t require 1.4GB of memory, it requires 80GB, and executors are dying like they’re bad guys in a Rambo movie.
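A rough back-of-the-envelope calculation shows why the blow-up is so large. The dataset shape below is hypothetical (it is not the one from the thesis), and the estimate assumes 8-byte values and 4-byte indices while ignoring any per-object overhead, so treat it as a sketch only.

```python
def dense_bytes(n_rows: int, dim: int, value_bytes: int = 8) -> int:
    """Memory for dense vectors: every dimension is stored, zeros included."""
    return n_rows * dim * value_bytes


def sparse_bytes(n_rows: int, nnz_per_row: int,
                 value_bytes: int = 8, index_bytes: int = 4) -> int:
    """Memory for sparse vectors: only non-zero values plus their indices."""
    return n_rows * nnz_per_row * (value_bytes + index_bytes)


# Hypothetical shape: one million vectors, 10,000 dimensions,
# about 100 non-zero entries per vector.
n_rows, dim, nnz_per_row = 1_000_000, 10_000, 100

sparse_gb = sparse_bytes(n_rows, nnz_per_row) / 1e9
dense_gb = dense_bytes(n_rows, dim) / 1e9
blowup = dense_gb / sparse_gb
print(f"sparse: {sparse_gb:.1f} GB, dense: {dense_gb:.1f} GB, blow-up: {blowup:.0f}x")
```

With these made-up numbers the sparse form needs about 1.2 GB and the dense form 80 GB. The dense cost grows with the full dimensionality, no matter how few entries are actually non-zero.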
The downside of open source libraries
While working on your pipeline you might stumble upon an open source library. It has many stars on GitHub, the webpage looks solid and it promises to do everything you need. Little do you know that you’re going to spend the next one and a half weeks resolving obscure dependencies, locating memory leaks and interpreting poorly worded documentation.
The lesson here is: don’t use libraries unless you really need them. Do you only need a small function? Write it yourself, because TANSTAAFL and all libraries come with baggage.