OpenSchufa: Three things we’ve learned from parsing 2800 credit reports

10/21/2019

In February 2018, the organisations Open Knowledge Foundation Germany (OKFN) and AlgorithmWatch launched their crowdsourcing project OpenSCHUFA. Their goal was to "crack" Schufa, the most influential credit agency in Germany, and to learn more about its scoring algorithm. The participants requested their free personal credit report and provided scans or pictures of it to the project. We at SPIEGEL Data had the opportunity to work with the documents together with our colleagues from BR Data.

You can find the full project by SPIEGEL Data here and the BR Data version here.

1. Commit, pull, push

I really hate Git. It’s the most unintuitive thing in the world. But this project wouldn’t have been possible without a version control system: We were roughly four people working on the document parser. The BR colleagues are located in Munich and we are based in Hamburg.

We used R and Python to build the parser’s different parts. Our days were a constant commit, pull, push, commit, pull, push. For security reasons, we only pushed the scripts themselves into the Git system. The documents stayed on our local machines and aggregated datasets were only created and saved locally by the parser.

2. Check your OCR settings

I really like Abbyy FineReader. I've used it every now and then over the last years to extract text and spreadsheets from pictures like the ones a non-searchable PDF consists of. This is a process known as OCR – optical character recognition. So I was very happy when OKFN and AlgorithmWatch decided to use FineReader for the uploaded credit reports.

I had never thought about the different OCR settings FineReader offers. Its presets just worked fine for me – until this time. In Schufa’s credit report, you have information like dates (when Schufa transmitted a person’s score to a company) and percentages (the score itself says how likely it is that a person will meet his or her payment obligations).

Screenshot of Schufa report

FineReader exports documents as Excel files. When we imported these files into our R and Python scripts, we discovered that there were some weird-looking values in the date column like "43183". And in the score column, we sometimes had values in a string format like "97.12%" and sometimes in a float format like "0.9712".

So, in some cases, FineReader must have recognized dates and percentages and converted them into the Excel format: dates are saved as days since January 1st 1900 (it’s sad because it’s true) and percentages are represented as decimal numbers between 0 and 1.

To avoid that, you have to uncheck the box "Convert numeric values to numbers". This way, the actual values are preserved as strings.

3. Fuzzy matching and regex

From both the tables of transmitted scores (screenshot above) and the financial histories (screenshot below), we wanted to extract specific information.

Screenshot of the Schufa history

In a person’s financial history, you can find banks reporting the creation of a new account or loan, phone companies reporting a contract, online retailers requesting credit information. For each entry in a person’s financial history we wanted to know: What kind of entry is it (bank account, loan, phone contract, credit information request, …)? When was it reported? And by which company?

To identify what kind of entry a paragraph was, we used fuzzy matching with Levenshtein distance. It’s a way to check if a certain phrase appears in a text. And you’re able to specify how exact the match must be. In OCRed documents you’ll find misspellings like "Girakonto" instead of "Girokonto", where the OCR software confused a letter for another. With fuzzy matching you can handle such misspellings.

To extract an entry’s date and company, we used regex, short for regular expressions.

This is a type of language widely used to find patterns in text. For example, "." is used as a placeholder for any character, "+" stands for "more than once".

In our documents we had patterns like "On [DATE] did [COMPANY] note that, …" or "[COMPANY] noted that on [DATE] …". Extracting the dates was pretty easy. We used the regex query:

\d{2}[.,]\d{2}[.,]\d{4}

It can be "translated" as:

\d{2} two digits
[.,] followed by either a period or a comma (for OCR errors)
\d{2} then two more digits
[.,] again a period or a comma
\d{4} and finally four digits.

To extract the companies’ names, we had to prepare different regex queries for the different types of entries. Sometimes the name is in a paragraph’s beginning, sometimes it comes later. So we used queries like "did (.+) note" and "^(.+) noted that on". The text snippet before and after the companies’ names had to be long enough to ensure that no false positives were caught. And they had to be as short as possible to avoid unnecessary OCR errors.

If anyone knows a good Python library that combines fuzzy matching and regex, please let me know.

\d{2}[.,]\d{2}[.,]\d{4