KPMG JADS Post-Master Data Science Project
In Short
Part of the KPMG JADS Post-Master in Data Science and Entrepreneurship, a joint program offered by Tilburg University, Eindhoven University of Technology and KPMG, is a final business-related project with a strong emphasis on Data Science. Business-related in the sense that you were required to find a company willing to support you with a business case or tooling, including a final presentation, hopefully leading to some sort of business relationship around the tool or client, but still with at least some academic depth. So it is a little bit of both worlds: whereas some courses were mathematical and technical, others focused entirely on the basics of scrum, company culture and leadership. The whole program takes about 1.5 years at roughly 8 hours per week.
For our final project we chose to improve KPMG's internal client risk assessment & due diligence, a crucial step in the client intake process that checks a prospective client's lawfulness. We used Owlin's news aggregation service together with NLP to both collect discrediting news articles and forecast whether the client should be marked as risky. There are various reasons for this internal department to stop a project at a client from starting, or to prevent a company from becoming a client in the first place, but most fall in the area of money laundering tied to criminal or terrorist activity. If a company has been affiliated with such practices, or there is a good chance it has, it is probably not ethical to take it on as a client.
Details
With Owlin taking over the laborious work of collecting news articles, we could focus purely on gathering that data, storing it and running NLP analyses on top of it. This mostly revolved around optimizing API queries, storing the data in CSVs small enough to be read and analysed individually, and running NLP algorithms against them.
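To give a flavour of the collection step, here is a minimal sketch that paginates through a news API and flushes the results to fixed-size CSV chunks. The endpoint, key and response shape are hypothetical stand-ins; Owlin's actual API differs.

```python
import csv
import requests

# Hypothetical endpoint, key and response shape: Owlin's real API
# differs, this only illustrates the paginate-then-chunk pattern.
API_URL = "https://api.example.com/v1/articles"
API_KEY = "..."
CHUNK_SIZE = 10_000  # rows per CSV: small enough to analyse file by file

def fetch_articles(query):
    """Yield article dicts page by page until the API runs dry."""
    page = 0
    while True:
        resp = requests.get(
            API_URL,
            params={"q": query, "page": page},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("articles", [])
        if not batch:
            return
        yield from batch
        page += 1

def write_chunk(path, rows):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def store_in_chunks(articles, prefix="owlin"):
    """Flush every CHUNK_SIZE articles to its own CSV file."""
    chunk, index = [], 0
    for article in articles:
        chunk.append(article)
        if len(chunk) == CHUNK_SIZE:
            write_chunk(f"{prefix}_{index:04d}.csv", chunk)
            chunk, index = [], index + 1
    if chunk:
        write_chunk(f"{prefix}_{index:04d}.csv", chunk)

store_in_chunks(fetch_articles("money laundering"), prefix="owlin_aml")
```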
Note that we stayed away from big-data-style solutions, because of the limited scope and timeline, and because the program focused on data science, solutions and the business side rather than the more technical engineering elements.
Back then, the best NLP algorithms around were ELMo, BERT and fastText, which we experimented with throughout. Additionally, we made use of Zalando's Flair project to further structure our pipelines and automate various steps in between. I'm not too sure what this field looks like today, but looking at the GLUE Benchmark, I must admit I cannot tell how much it has improved. The benchmark tests seem to have changed over time and not a single algorithm on the leaderboard rings a bell for me, so I'm going to assume it has gotten way better. Surely it has, right? I mean, we have ChatGPT nowadays, so it has to be.
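For what it's worth, the convenience of Flair was that these different embedding families all sit behind one interface. A minimal sketch, assuming a recent Flair release (class names have moved around between versions, and 'bert-base-uncased' stands in for whichever checkpoints we actually tried):

```python
from flair.data import Sentence
from flair.embeddings import (
    StackedEmbeddings,
    TransformerWordEmbeddings,
    WordEmbeddings,
)

# fastText vectors and a BERT-style model, stacked behind one interface;
# the concrete models here are illustrative, not our exact setup.
embedder = StackedEmbeddings([
    WordEmbeddings("en"),  # Flair's pretrained English fastText embeddings
    TransformerWordEmbeddings("bert-base-uncased"),
])

sentence = Sentence("Regulator investigates the firm over laundering claims.")
embedder.embed(sentence)  # attaches a vector to every token
for token in sentence:
    print(token.text, tuple(token.embedding.shape))
```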
Anyway, we would set up NLP pipelines through Zalando Flair, apply the aforementioned algorithms to obtain word embeddings and use those in our further analysis. These word embeddings are simply a translation from text to numerical vectors, and are therefore necessary to transform the input into something our machine learning algorithms can actually work with. Personally, I'm a mathematician-statistician, so I tend to cringe a little when applying machine learning models without understanding whether their assumptions and underlying ideas make sense, but you go with it. Everybody seemed to be doing it and Machine Learning & AI were everywhere, so, why not. We applied several well-known algorithms such as XGBoost, LightGBM, logistic regression, k-nearest neighbours, various neural net setups and finally elastic net. Generally, the results were not very good: predicting on word embeddings turned out to be harder than expected. We were, however, able to reliably predict sentiment (positive or negative) using clustering, so that became our final result.
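A rough sketch of that final approach, with the caveats flagged in the comments: the embedding model name and k=2 are assumptions for the sake of a runnable example, and the real pipeline involved more preprocessing and validation than shown here.

```python
import numpy as np
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.cluster import KMeans

# Illustrative choices, not the shipped configuration: the model name
# and k=2 are assumptions for the sake of a short, runnable example.
embedder = TransformerDocumentEmbeddings("bert-base-uncased")

def embed(texts):
    """One pooled document vector per article, ready for scikit-learn."""
    vectors = []
    for text in texts:
        sentence = Sentence(text)
        embedder.embed(sentence)
        vectors.append(sentence.embedding.detach().cpu().numpy())
    return np.vstack(vectors)

articles = [
    "Regulator fines the firm for systematic AML failures.",
    "The company reports record growth and a major new partnership.",
]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embed(articles))
# The clusters come unlabelled: deciding which one is 'negative' still
# requires eyeballing a handful of articles (or a small labelled sample).
print(labels)
```

The appeal of this setup, compared to the supervised models we tried first, is that it needs no labelled training data beyond that final inspection step, which mattered given how few confirmed-risky clients we had to learn from.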