How you can identify the author of a document

Click here to see a live online demo of the neural network forensic stylometry model described in this article.

In 2013 JK Rowling, the author of the Harry Potter series, published a new detective novel under the pen name Robert Galbraith. She wanted to publish a book without the hype resulting from the success of the Harry Potter books.

However, following a tip-off received by a journalist on Twitter, two professors of computational linguistics showed that JK Rowling was highly likely to be the author of the new detective novel.

How did they manage to do this? Needless to say, the crime novel is set in a strictly non-magical world, and superficially it has little in common with the famous wizarding series.

One of the professors involved in the analysis said that he calculates a “fingerprint” of all the authors he’s interested in, which shows the typical patterns in that author’s works.

What’s my linguistic fingerprint? Subconsciously we tend to favour some word patterns over others. Is your salad fork “on” the left of the plate, or “to” the left of the plate? Do you favour long words, or short words? By comparing the fingerprint of a mystery novel to the fingerprints of some known authors it’s possible to get a match.
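
To make the idea concrete, here is a minimal sketch of one possible fingerprint: the relative frequency of a handful of common function words, compared using cosine similarity. The word list and the function names (`fingerprint`, `cosine_similarity`, `closest_author`) are my own illustrative assumptions, not the method the professors used, and a real analysis would use far more features and far more text.

```python
import math
import re
from collections import Counter

# Hypothetical, very short list of function words chosen for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "on", "in", "a", "that"]


def fingerprint(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(mystery_text, known_texts):
    """Return the author whose fingerprint is most similar to the mystery text."""
    target = fingerprint(mystery_text)
    return max(
        known_texts,
        key=lambda name: cosine_similarity(target, fingerprint(known_texts[name])),
    )
```

You would call `closest_author(mystery_text, {"Author A": sample_a, "Author B": sample_b})` with a long sample of each candidate's writing. With only a few sentences per author the result is not reliable.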

Here are some (partial) fingerprints I made for three well-known female authors who used male pen names:

Identifying the author of a text is a field of computational linguistics called forensic stylometry.

With the advent of ‘deep learning’ software and greater computing power, forensic stylometry has become much easier. You no longer need to define the recipe for your fingerprint by hand; you just need lots of data.

My favourite way of approaching this problem is a Convolutional Neural Network, a deep learning technique that was developed for image recognition but also works very well for natural language!

The technology I’ve described has lots of commercial applications, such as:

  • Identifying the author of a terrorist pamphlet
  • Extracting information from company financial reports
  • Identifying spam emails, adverts, job postings
  • Triage of incoming emails
  • Analysis of legal precedents in a Common Law system

If you have a business problem in this area and you’d like some help developing and deploying, or just some consulting advice, please get in touch with me via the contact form.

On 5th July 2018 I will be running a workshop on forensic stylometry aimed at beginners and programmers, at the Digital Humanities Summer School at Oxford University. You can sign up here: http://www.dhoxss.net/from-text-to-tech.

Update: click here to download the presentation from the workshop.

Matchmaking with deep learning

If you’ve ever bought something on Amazon or another large online retailer, you’ll have noticed the ‘similar products’ that the site recommends to you after you’ve made your purchase. Sometimes the suggestions miss, but in my experience they hit the mark most of the time.

This is an area of machine learning called recommender systems.

How do recommender systems work? In the case of online retailers, the standard approach is to fill out huge matrices and work out the relationships between different products. You can then see which products normally go together in the same basket, and make recommendations accordingly. This is called collaborative filtering and it works mainly because most products have been purchased thousands or millions of times, allowing us to spot the patterns.
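
As a rough illustration of the basket idea, the sketch below counts how often pairs of products appear in the same basket and recommends the most frequent companions. This is a deliberately simplified co-occurrence count, not what any particular retailer runs, and the function names and sample baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of products appears in the same basket."""
    pair_counts = defaultdict(Counter)
    for basket in baskets:
        # set() ignores duplicates so each pair counts once per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def recommend(product, pair_counts, top_n=3):
    """Return the products most often bought together with `product`."""
    return [other for other, _ in pair_counts[product].most_common(top_n)]


# Made-up baskets, purely for illustration.
example_baskets = [
    ["tent", "sleeping bag", "torch"],
    ["tent", "sleeping bag"],
    ["tent", "stove"],
    ["sleeping bag", "torch"],
]
example_counts = build_cooccurrence(example_baskets)
top_for_tent = recommend("tent", example_counts, top_n=1)
```

The counts only become meaningful when each product has been bought many times, which is why this approach suits large retailers.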

Now imagine you run a dating website. Let’s simplify and say your site only caters for male-female pairings. How do you recommend female users to a male user who’s just registered?

This is when things get tricky. There are many users, new users are registering all the time, and most users have made few contact requests.

In this case we can work with what we do have:

  • The user’s profile text
  • The profile photo
  • The contact requests, if any.

One approach I like to use is a deep learning technique called vector embeddings, which goes like this:

  • You can convert every profile text into a ‘fingerprint’. For example it could be a vector in 100-dimensional space.
  • The 100-dimensional vector by itself is meaningless, but people with similar tastes should end up with similar vectors.
  • If you want to make recommendations for a new user, you can calculate their vector, measure its distance to the other vectors, and find its nearest neighbours!
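
The last step can be sketched in a few lines, assuming the vectors have already been produced by some model. Here I use tiny 3-dimensional vectors instead of 100 dimensions, cosine distance as the measure (one common choice among several), and hypothetical names such as `nearest_neighbours`. A production system would normally use an approximate nearest-neighbour index rather than sorting every profile.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbours(query, profiles, k=2):
    """Return the names of the k profiles whose vectors are closest to `query`."""
    ranked = sorted(profiles, key=lambda name: cosine_distance(query, profiles[name]))
    return ranked[:k]


# Made-up embedding vectors, purely for illustration.
example_profiles = {
    "alice": [0.9, 0.1, 0.0],
    "beth": [0.0, 1.0, 0.2],
    "cara": [0.8, 0.2, 0.1],
}
new_user_vector = [1.0, 0.1, 0.0]
suggestions = nearest_neighbours(new_user_vector, example_profiles, k=2)
```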

Of course, the tricky bit is how to go from a profile text and image to a vector. This is something that Convolutional Neural Networks (CNNs) are very good at.

Vector embeddings can be useful for making recommendations in other industries too:

  • Recruitment websites, where candidates have uploaded a CV and you want to recommend jobs.
  • Property sales, where you have a description of the house and a photo.

There are off-the-shelf recommender systems that you can use for online retail or movie recommendations. But for text- or image-based recommendations you really need a custom solution, and this is extremely complex to build.

I have set up Fast Data Science Ltd to provide consulting services in this area after 10 years’ experience working with machine learning on natural language data. If you have lots of text or image data and you’d like to build a custom recommender system I’d love to hear from you. Please contact me here.