My Master's Thesis & The Lessons Learnt

Monday, April 10, 2017 - 2 mins

Halfway through my master’s thesis, I think it’s time to look back and reflect on what I’ve learned so far.

Fact or Fiction?

During the 2016 U.S. presidential election, there was an unprecedented amount of fake news. These articles were written by entities trying either to make money or to influence the outcome. Before the rise of digital media, fake news could be fact-checked by journalists. Today that’s impossible due to the sheer number of articles written every day.

A solution to this problem would be to use computers to do basic fact-checking. The method we chose was to create a fact database, or knowledge-base, and then evaluate each statement’s factuality against it.

Since we’re computer scientists, we didn’t want to gather facts manually. Instead, we wanted the computer to construct the knowledge-base from trusted facts on the internet using the latest in natural language processing and machine learning.
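To make the knowledge-base idea concrete, here is a toy sketch in Python. It is not our actual thesis code: the `Fact` triple, the `KnowledgeBase` class, and the `capital_of` relation are all made up for illustration, and it assumes (simplistically) that if the knowledge-base knows a subject and relation, any other object contradicts it.

```python
from typing import Dict, NamedTuple, Optional, Set, Tuple


class Fact(NamedTuple):
    """A hypothetical (subject, relation, object) triple."""
    subject: str
    relation: str
    obj: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variations still match.
    return " ".join(text.lower().split())


class KnowledgeBase:
    """Toy fact store: maps (subject, relation) to the set of known objects."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], Set[str]] = {}

    def add(self, fact: Fact) -> None:
        key = (normalize(fact.subject), normalize(fact.relation))
        self._facts.setdefault(key, set()).add(normalize(fact.obj))

    def check(self, fact: Fact) -> Optional[bool]:
        """Return True if supported, False if contradicted, None if unknown."""
        key = (normalize(fact.subject), normalize(fact.relation))
        known = self._facts.get(key)
        if known is None:
            return None  # nothing known about this subject and relation
        # Simplifying assumption: a known key with a different object
        # counts as a contradiction.
        return normalize(fact.obj) in known
```

A statement that has already been turned into a triple could then be checked with something like `kb.check(Fact("Stockholm", "capital_of", "Norway"))`. Getting from free text to triples is the hard part, and that is where the information extraction comes in.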

The Three Lessons

During the project, we worked with Spark to process big data corpora and used libraries to perform the information extraction. This project has challenged me while at the same time being incredibly fun. Along the way, I’ve learned three lessons that I want to pass on.

Real-world data is dirty

We worked a lot with free and open corpora such as Wikipedia. They are great, but the data is often very dirty. When you construct your pipeline, you imagine, for example, that sentences are no longer than a handful of words. In reality, some outliers are thousands of words long. Eventually, you’ll learn to filter everything with upper bounds, lower bounds, and restrictive regexes.
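As a rough sketch of what such a filter can look like in Python: the thresholds `MIN_WORDS` and `MAX_WORDS` and the `CLEAN_SENTENCE` pattern below are invented for this example, and the right values depend entirely on your corpus.

```python
import re

# Hypothetical bounds; tune them to your own corpus.
MIN_WORDS = 3
MAX_WORDS = 60

# Restrictive whitelist: starts with a capital letter, contains only letters,
# digits, spaces, and basic punctuation, and ends with ., ! or ?.
CLEAN_SENTENCE = re.compile(r"[A-Z][A-Za-z0-9 ,;:'\"()\-]*[.!?]")


def is_reasonable(sentence: str) -> bool:
    """Keep a sentence only if it passes both the length and the regex check."""
    n_words = len(sentence.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        return False
    return CLEAN_SENTENCE.fullmatch(sentence) is not None


def filter_sentences(sentences):
    """Strip surrounding whitespace and drop everything that looks suspicious."""
    stripped = (raw.strip() for raw in sentences)
    return [s for s in stripped if is_reasonable(s)]
```

The whitelist is deliberately strict and will throw away plenty of valid sentences (anything with non-ASCII letters, for instance). With a big corpus, losing some good data is usually a fair price for not feeding markup debris and thousand-word monsters into the rest of the pipeline. In Spark, the same predicate could be passed to an RDD's `filter`.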

Data can explode

Once you have your filtered, normalized, and reasonable dataset, everything is fine, right? No, that small 1.4GB dataset of sparse vectors might suddenly get turned into dense vectors because one library didn’t support sparse vectors and silently converted them internally.

So now it doesn’t require 1.4GB of memory, it requires 80GB, and executors are dying like they’re bad guys in a Rambo movie.
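A back-of-the-envelope calculation shows how quickly this happens. The sketch below assumes 8-byte floats for values and 4-byte integers for sparse indices, and it ignores all object overhead. The vector count, dimension, and number of non-zeros in the usage note are illustrative numbers picked to land in the same ballpark, not the real figures from our dataset.

```python
BYTES_PER_FLOAT64 = 8  # one stored value
BYTES_PER_INT32 = 4    # one stored index in a sparse vector


def dense_bytes(n_vectors: int, dim: int) -> int:
    """A dense vector stores every entry, zero or not."""
    return n_vectors * dim * BYTES_PER_FLOAT64


def sparse_bytes(n_vectors: int, avg_nonzeros: int) -> int:
    """A sparse vector stores an (index, value) pair per non-zero entry."""
    return n_vectors * avg_nonzeros * (BYTES_PER_INT32 + BYTES_PER_FLOAT64)


def blowup_factor(dim: int, avg_nonzeros: int) -> float:
    """How many times larger the dense representation is than the sparse one."""
    return dense_bytes(1, dim) / sparse_bytes(1, avg_nonzeros)
```

With a made-up one million vectors of dimension 10,000 and 100 non-zeros each, `sparse_bytes(1_000_000, 100)` is 1.2GB while `dense_bytes(1_000_000, 10_000)` is 80GB. Running this kind of estimate before handing sparse data to a new library is a lot cheaper than watching executors die.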

The downside of open source libraries

While working on your pipeline, you might stumble upon an open source library. It has many stars on GitHub, the webpage looks solid, and it promises to do everything you need. Little do you know that you’re going to spend the next one and a half weeks resolving obscure dependencies, locating memory leaks, and interpreting poorly worded documentation.

The lesson here is: don’t use libraries unless you really need them. Do you only need a small function? Write it yourself, because TANSTAAFL (there ain’t no such thing as a free lunch) and all libraries come with baggage.

Erik Gärtner

Ph.D. in Computer Vision and Machine Learning