It is impossible not to notice all the political conversations in Sweden now, less than two weeks before election day. During times like these, parties focus a lot of energy on getting their point across to the public, but how much of it is just slogans that sound good when printed on a poster, and how much is rooted in the everyday work of their organisation?
Are the words printed on the posters on every street corner really the same as the ones being exchanged within the walls of the Swedish parliament building?
While ferociously staying away from the subject of who is right or wrong, let’s see if there is a way to evaluate whether what is talked about in the parliament’s everyday sessions matches what is printed in the manifestos released for the last two elections (2014 and 2018).
Data science isn’t much without data and luckily for us, the Swedish government is openly publishing a large amount of it through the http://data.riksdagen.se webpage and API. Data is pulled from the API using our in-house solution Findwise i3 and then indexed and visualised using Elasticsearch and Kibana.
Collecting and making data useful
As always, some assumptions and simplifications are necessary to make things a bit more efficient and the results more accessible. This time the following steps were taken:
- The manifesto data collection was performed on 19-24 July. Some of the parties’ manifestos had not been published by then; in those cases, the parties’ webpages with similar information were scraped instead.
- M, C, L and KD all shared one manifesto for the 2014 election. This manifesto has been used as input for each of those parties for 2014.
- The output entries from our chosen Named Entity Recognition and Part-of-Speech tagging models have been cleaned to a certain extent of political jargon and common terms that don’t contain much information about the topic at hand.
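The cleaning step in the last bullet can be pictured as a simple stop-list filter. The sketch below is only an illustration: the stop list, the function name `clean_terms` and the sample terms are assumptions made for this example, not the list or code actually used.

```python
# Illustrative stop list only (assumption): the terms actually removed
# from the model output are not listed in this post.
STOP_TERMS = {"herr talman", "fru talman", "regeringen", "riksdagen"}


def clean_terms(terms, stop_terms=STOP_TERMS):
    """Normalise case and whitespace, then drop terms found in the stop list."""
    cleaned = []
    for term in terms:
        normalised = " ".join(term.lower().split())
        if normalised and normalised not in stop_terms:
            cleaned.append(normalised)
    return cleaned


# Hypothetical model output, mixing jargon with more informative terms.
raw = ["Herr  talman", "sänka skatten", "Riksdagen", "bygga bostäder"]
print(clean_terms(raw))
```

Repeated terms are deliberately kept, since their frequency is what the later aggregations count.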
Natural Language Processing and other technologies used
Findwise has historically been an enterprise search consultancy, so two of the tools widely used here are the search engine Elasticsearch and its companion visualization tool, Kibana. These might not be the first tools a Data Scientist reaches for at times like this, but it is always good to use familiar tools in new ways in order to understand them further. Findwise i3 was used to support the data flow from the API through processing to Elasticsearch and Kibana.
Due to the huge number of sessions, manual processing of the data was out of the question. To get some information about what had been mentioned, an automated approach was necessary. In this case, we relied on a Named Entity Recognition model from OpenNLP (http://opennlp.apache.org)* to break out entities such as Persons, Places and Organisations, and a Part-of-Speech tagging model from the same foundation to create Verb-Noun pairs from the roughly 50,000 sessions present in the data between September 2014 and August 2018. The manifestos were also processed using the same pipeline.
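To make the idea of Verb-Noun pairs a bit more concrete, here is a minimal sketch of how such pairs could be pulled out of text that has already been POS-tagged. The pairing rule (each verb is matched with the first noun that follows within a small window), the tag names `VB` and `NN`, and the function name are all assumptions for illustration. The post does not describe the exact rule used in the real pipeline.

```python
def verb_noun_pairs(tagged, window=3, verb_tag="VB", noun_tag="NN"):
    """Pair each verb with the first noun found among the next `window` tokens.

    `tagged` is a list of (word, tag) tuples, as a POS tagger might produce.
    The search for a noun stops early if another verb shows up first.
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != verb_tag:
            continue
        for next_word, next_tag in tagged[i + 1 : i + 1 + window]:
            if next_tag == verb_tag:
                break  # a new verb starts; leave the noun to that verb
            if next_tag == noun_tag:
                pairs.append((word.lower(), next_word.lower()))
                break
    return pairs


# Hypothetical tagged sentence: "sänka skatten och bygga fler bostäder"
sentence = [
    ("sänka", "VB"), ("skatten", "NN"), ("och", "KN"),
    ("bygga", "VB"), ("fler", "JJ"), ("bostäder", "NN"),
]
print(verb_noun_pairs(sentence))
```

A fixed window is a blunt instrument compared to real syntactic parsing, but it is cheap enough to run over tens of thousands of sessions.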
One last mention goes to Elasticsearch’s significant terms aggregation. This aggregation is used to evaluate how common a term is in a specific subset of the data compared to the complete set. In our case, the significant terms for one single party will most likely (or perhaps should) correspond to the terms related to that party’s differentiating core issues.
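As a rough sketch of what this looks like in practice, the snippet below builds a search body that uses the `significant_terms` aggregation with one party as the subset, and adds a small scoring function modelled on the JLH heuristic that Elasticsearch documents as the default for this aggregation. The field names `party` and `verb_noun_pairs`, the aggregation name and the counts in the example are assumptions. The actual index mapping behind the dashboard is not shown in this post, and the scoring function is a simplification rather than Elasticsearch's own code.

```python
def significant_terms_request(party, field="verb_noun_pairs", size=20):
    """Search body: the query selects the subset, the whole index is the background."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {"term": {"party": party}},  # assumed field name
        "aggs": {
            "significant_pairs": {
                "significant_terms": {"field": field, "size": size}
            }
        },
    }


def jlh_like_score(subset_count, subset_size, total_count, total_size):
    """Simplified JLH-style score: absolute change times relative change.

    Terms that are no more common in the subset than overall score 0.
    """
    if subset_size == 0 or total_size == 0:
        return 0.0
    subset_share = subset_count / subset_size
    total_share = total_count / total_size
    if total_share == 0 or subset_share <= total_share:
        return 0.0
    return (subset_share - total_share) * (subset_share / total_share)


# Hypothetical counts: a pair found in 50 of one party's 1,000 speeches
# but in only 100 of all 50,000 speeches stands out clearly...
print(jlh_like_score(50, 1000, 100, 50000))
# ...while a pair that is equally common everywhere does not.
print(jlh_like_score(500, 1000, 25000, 50000))
```

The point of the score is that raw frequency is not enough: a phrase every party uses a lot says little about any one of them.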
Are politicians walking the talk.. or just talking?
The resulting dashboard can be seen here (only in Swedish). By selecting the party of your interest in the dropdown menu, you will be able to see what they have talked about and how many speeches per month they have held over the last four years. In the middle of the dashboard, we have listed results from the POS and NER models together with a word cloud showing the most frequent Verb-Noun pairs used. The Verb-Noun pairs listed from both sources use the significant terms aggregation to show whether what is mentioned in the parties’ manifestos and sessions is the same or not. Last but not least, at the bottom the raw data is presented in reverse chronological order for those who would like to dig deeper into what lies behind it all.
The power of open data
Due to the minimal manual processing, it might not be safe to draw any conclusions from the dashboard itself, but I hope that it is enough to stir up some thoughts in time for the upcoming election day. One thing that is for sure is that the Swedish government has a great amount of data freely available, and for a curious individual with the right tools there is much to be learned.
* 2018-09-04: The first version of the post mentioned that the model used for Named Entity Recognition was from OpenNLP, but my colleagues specializing in NLP here at Findwise made me aware that OpenNLP has no model available for the Swedish language. The model used is instead an in-house developed and trained model built on the Mallet (http://mallet.cs.umass.edu) Java package.