Measuring the accuracy of AI for healthcare?

Left: a benign mammogram, right: a mammogram showing a cancerous tumour. Source:
National Cancer Institute

You may have read about the recent Google Health study where the researchers trained and evaluated an AI model to detect breast cancer in mammograms.

It was reported in the media that the Google team’s model was more accurate than a single radiologist at recognising tumours in mammograms, although admittedly inferior to a team of two radiologists.

But what does ‘more accurate’ mean here? And how can scientists report it for a lay audience?

Imagine that we have a model that categorises images into two groups: malignant and benign. Suppose the model categorises everything as benign, whereas in reality 10% of images are malignant and 90% are benign. This model would be useless, but it would also be 90% accurate.

This is a simple example of why accuracy can often be misleading.

In fact it is more helpful in a case like this to report two numbers: how many malignant images were misclassified as benign (false negatives), and how many benign images were misclassified as malignant (false positives).
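The useless-but-90%-accurate classifier is easy to demonstrate with a toy sketch (synthetic labels, not real mammogram data):

```python
# Hypothetical labels: 1 = malignant, 0 = benign.
# A model that predicts "benign" for everything on a 90%-benign dataset.
truth = [1] * 10 + [0] * 90          # 10% malignant, 90% benign
preds = [0] * 100                    # classify everything as benign

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
false_negatives = sum(t == 1 and p == 0 for t, p in zip(truth, preds))
false_positives = sum(t == 0 and p == 1 for t, p in zip(truth, preds))

print(accuracy)         # 0.9, despite the model being useless
print(false_negatives)  # 10 -- every malignant case is missed
print(false_positives)  # 0
```

Reporting the two error counts immediately exposes what the single accuracy figure hides.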

The Google team reported both error rates in their paper:

We show an absolute reduction of 5.7%… in false positives and 9.4%… in false negatives [compared to human radiologists].

McKinney et al, International evaluation of an AI system for breast cancer screening, Nature (2020)

This means that the model improved on both kinds of misclassification. If only one error rate had improved relative to the human experts, it would not have been possible to say whether the new AI was better or worse than the humans.

Calibrating a model

Sometimes we want even finer control over how our model performs. The mammogram model has two kinds of misdiagnoses: the false positive and the false negative. But they are not equal. Although neither kind of error is desirable, the consequences of missing a tumour are greater than the consequences of a false alarm.

For this reason we may want to calibrate the sensitivity of a model. Often the final stage of a machine learning model involves outputting a score: a probability of a tumour being present.

But ultimately we must decide which action to take: to refer the patient for a biopsy, or to discharge them. Should we act if our model’s score is greater than 50%? Or 80%? Or 30%?

If we set our cutoff to 50%, we are assigning equal weight to both actions.

However we probably want to set the cutoff to a lower value, perhaps 25%, meaning that we err on the side of caution because we don’t mind reporting some benign images as malignant, but we really want to avoid classifying malignant images as benign.

However we can’t set the cutoff to 0% – that would mean that our model would classify all images as malignant, which is useless!

So in practice we can vary the cutoff and set it to something that suits our needs.

Choosing the best cutoff is now a tricky balancing act.
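Applying a cutoff to a model's score is just a comparison. A minimal sketch, using hypothetical scores (the function name and values are invented for illustration):

```python
def decide(score, cutoff=0.25):
    """Refer for biopsy if the model's tumour probability exceeds the cutoff."""
    return "refer" if score > cutoff else "discharge"

scores = [0.05, 0.30, 0.55, 0.90]
print([decide(s, cutoff=0.5) for s in scores])   # ['discharge', 'discharge', 'refer', 'refer']
print([decide(s, cutoff=0.25) for s in scores])  # ['discharge', 'refer', 'refer', 'refer']
```

Lowering the cutoff from 50% to 25% refers one extra patient: the cautious option.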

ROC curves

If we want to evaluate how good our model is, regardless of its cutoff value, there is a neat trick we can try: we can set the cutoff to 0%, 1%, 2%, all the way up to 100%. At each cutoff value we check how many malignant→benign and benign→malignant errors we had.
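The sweep can be sketched in a few lines (hypothetical labels and scores, not the real study data):

```python
# 1 = malignant, 0 = benign; scores are the model's tumour probabilities
truth  = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05]

n_pos = sum(truth)              # number of malignant images
n_neg = len(truth) - n_pos      # number of benign images

roc_points = []
for cutoff in [i / 100 for i in range(101)]:   # 0%, 1%, ... 100%
    preds = [1 if s > cutoff else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, truth))
    roc_points.append((fp / n_neg, tp / n_pos))  # (false pos. rate, true pos. rate)

print(roc_points[0], roc_points[-1])  # (1.0, 1.0) at cutoff 0%, (0.0, 0.0) at 100%
```

Each pair is one point on the curve: cutoff 0% classifies everything as malignant, cutoff 100% classifies everything as benign, and the interesting trade-offs lie in between.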

Then we can plot the changing error rates as a graph.

We call this a ROC curve (ROC stands for Receiver Operating Characteristic).

This is the ROC curve of the Google mammogram model. The y axis is true positive rate, and the x axis is false positive rate. Source: McKinney et al (2020)

The nice thing about a ROC curve is that it lets you see how a model performs at a glance. If your model were just a coin toss, its ROC curve would be a straight diagonal line from the bottom left to the top right. The fact that Google’s ROC curve bends up and to the left shows that it’s better than a coin toss.

If we need a single number to summarise how good a model is, we can take the area under the ROC curve. This is called AUC (area under the curve) and it works a lot better than accuracy for comparing different models. A model with a high AUC is better than one with a low AUC. This means that ROC curves are very useful for comparing different AI models.
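Given the (false positive rate, true positive rate) points of a curve, the AUC is just the area under them; the trapezium rule is enough. The points below are invented for illustration:

```python
# (fpr, tpr) points of a hypothetical ROC curve, sorted by fpr
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (0.6, 0.95), (1.0, 1.0)]

# Trapezium rule: area of each slice between consecutive points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(round(auc, 3))  # 0.835 -- a coin toss would score exactly 0.5
```

An AUC of 1.0 would be a perfect classifier; 0.5 is the diagonal coin-toss line.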

You can also put human readers on a ROC curve. So Google’s ROC curve contains a green data point for the human radiologists who were interpreting the mammograms. The fact that the green point is closer to the diagonal than any point on the ROC curve confirms that the machine learning model was indeed better than the average human reader.

Whether the machine learning model outperformed the best human radiologists is obviously a different question.

Can we start using the mammogram AI in hospitals tomorrow?

In healthcare, as opposed to other areas of machine learning, the cost of a false negative or false positive can be huge. For this reason we have to evaluate models carefully and we must be very conservative when choosing the cutoff of a machine learning classifier like the mammogram classifier.

It is also important for a person not involved in the development of the model to evaluate and test the model very critically.

If the mammogram model were to be introduced into general clinical practice, I would expect to see the following robust checks to prove its suitability:

  • Test the model against not only the average human radiologist but also the best radiologists, to see where it is underperforming.
  • Check for any subtype of image where the model consistently gets it wrong. For example images with poor lighting.
  • Look at the explanations of the model’s correct and incorrect decisions using a machine learning interpretability package (see my earlier post on explainable machine learning models).
  • Test the model for any kind of bias with regards to race, age, body type, etc (see my post on bias).
  • Test the model in a new hospital, on a new kind of X-ray machine, to check how well it generalises. The Google team did this by training a model on British mammograms and testing on American mammograms.
  • Collect a series of pathological examples (images that are difficult to classify, even for humans) and stress test the model.
  • Assemble a number of atypical images, such as male mammograms, which will have been rare or absent in the training dataset, and check how well the model generalises.

If you think I have missed anything please let me know. I think we are close to seeing these models in action in our hospitals but there are still lots of unknown steps before the AI revolution conquers healthcare.

Thanks to Ram Rajamaran for some interesting discussions about this problem!


Hamzelou, AI system is better than human doctors at predicting breast cancer, New Scientist (2020).

McKinney et al, International evaluation of an AI system for breast cancer screening, Nature (2020).

Identifying fake news: NLP on the Relotius Affair

Is it possible to identify when somebody is not telling the truth? You may be aware of the subtle body language, tics and signals that give away a liar, but what about the written word?

Furthermore, what if you are reading a news article by a famous reporter and you suspect that it is not true?

I am going to tell you a little about a reporter called Claas Relotius, who was once one of Germany’s most respected reporters, and was later exposed as a fraud and was found to have fabricated hundreds of articles over an eight year period.

Then I will attempt some data science magic on Relotius’ articles, to see what we can learn.

Background: the migrant caravan in the Sonora desert

In 2018, a caravan of several thousand Central American migrants was making its way from Honduras through the Sonora desert in Mexico, onwards towards its final goal, the United States.

Juan Moreno, a 45 year old freelance reporter, was travelling alongside the migrant caravan and gathering some material for a feature piece for Der Spiegel, a prestigious German news magazine.

Moreno had been tasked with covering the caravan as they travelled through Mexico. He had spent several gruelling weeks in the desert and had already identified two young women who were willing to let him shadow them for a few days.

He was not happy to receive an email from the Spiegel editors saying that his young, successful colleague Claas Relotius would now be working on the article with him and would take editorial control over the final version.

Relotius had won more than 40 prizes in journalism and was widely regarded as a rising star in the field.

Relotius was to travel to Arizona and track down a militia, a group of volunteers who spend their time and money defending the US southern border from the perceived threat of illegal migration, while Moreno would stay in Mexico and continue to report on the migrants.

The suspicious article

After the assignment was finished, Moreno flew back to Germany.

When Moreno received Relotius’ drafts and the final article, titled Jaeger’s Border (German: Jaegers Grenze), something about it just didn’t feel right. Relotius claimed to have spent a few days in the company of a militia called the Arizona Border Recon. The members of Arizona Border Recon were armed and went by colourful nicknames such as Jaeger, Spartan and Ghost. Relotius even claimed to have witnessed Jaeger shooting at an unidentified figure in the desert. In short, the militia were portrayed as a stereotypical band of hillbillies, and some of the details seemed hard to believe.

Moreno started digging into Jaeger’s Border and Relotius’ articles. He spent his savings on his own private investigation. He travelled in Relotius’ footsteps to Arizona and other locations. It quickly became clear that Relotius had been fabricating stories rather than interviewing the subjects he claimed to have interviewed.

Many of Relotius’ articles relied on stereotypes and the stories seemed far-fetched and too good to be true. For me, the most absurd story centres on a brother and sister from Syria who were working in a Turkish sweatshop. Relotius invented a Syrian children’s song about two orphans who grow up to be king and queen of Syria. According to the article, every Syrian child “from Raqqah to Damascus” is familiar with this traditional song. But none of the Syrians that Moreno spoke to had ever heard of it.

Relotius exposed

After much persistence on Moreno’s part, the management at Der Spiegel reluctantly investigated Relotius’ articles and concluded that he had indeed fabricated the majority of his articles during his eight-year tenure.

Relotius had invented interviews that never took place, and people who never existed. He even wrote an article about rising sea levels in the Pacific island nation of Kiribati without bothering to take his connecting flight to the country.

Der Spiegel issued a mass retraction of the affected articles and the ‘Relotius Affair’ became a nationwide scandal, making news worldwide and prompting an intervention by the US ambassador to Germany who objected to the “anti-American sentiment” of some of the articles.

The article Jaeger’s Border and Relotius’ other texts can be downloaded as a PDF from Der Spiegel’s website. In total 59 articles are available for download, together with annotations by Der Spiegel indicating what content is genuine and what is pure invention.

There is a large amount of English language content available online on the Relotius scandal, including English translations of many of the articles.

Analysing Relotius’ texts

I downloaded all 59 available Relotius articles and Der Spiegel’s annotations and tried a few data science experiments on them.

First of all I checked the truth/falsehood status of the articles. You can see that more than half are fictitious, although there are some articles where it was not possible for Der Spiegel to determine if the article was genuine or not. I excluded the latter from my analysis.

The vast majority of Relotius’ articles were written by him alone. Moreno later stated that it was quite unusual at Der Spiegel for a reporter to take on so many solo assignments, but Relotius was the star reporter at the publication and seemed to have acquired a certain privilege in this regard.

Of course, we now know that it was easier for him to fabricate content when working alone.

There is something else interesting about the above graph. Relotius wrote only one article in a team of two. The other collaborative articles all involved larger teams of up to 14 authors.

The sole two-author article is Jaeger’s Border, the article which got Relotius caught out!

This shows that Relotius had a pattern of either writing articles alone, or in a large team. He managed to get away with this strategy for years until the Jaeger’s Border assignment. Perhaps when you are collaborating in a large group it is also easier to avoid scrutiny.

Word clouds

I tried generating a word cloud of the true and fake articles, to see if there is any discernible difference. A word cloud shows words in different font sizes according to how often they occur in a set of documents.

Word cloud for the genuine news articles. The largest (most common) word is sagt (says).
Word cloud for the fake news articles.

Unfortunately there is not a huge difference between the two sets.

However I can see some patterns.

  • There is more use of sei, würde in the genuine news articles, which are special verb forms that are used often in reported speech. It appears that the fake news involved more description of direct action and less tentative reporting or reported speech.
  • The word deutschen (‘German’) is more common in the genuine news articles. In Moreno’s book he explained that Relotius only faked his articles that involved travel outside Germany, as it would presumably be harder to make up fake German news for a German audience and make it sound convincing.

Finding the commonest words that distinguish both groups

I then tried a more scientific approach. I used a tool called a Naive Bayes Classifier to find the words which most strongly indicate that an article is genuine or fictitious.

The Naive Bayes approach assigns each word a score (a log probability): a large negative number for words that strongly indicate fake news, and a smaller negative number for words that indicate genuine news.
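The kind of score involved can be sketched with two invented mini-corpora (a handful of words standing in for the real articles): a multinomial Naive Bayes classifier computes the smoothed log probability of each word under each class.

```python
import math
from collections import Counter

# Toy corpora (invented, not the real Relotius data)
genuine = "sagt sei sagt mehr immer sagt".split()
fake = "enthaupteten korrupte verstümmelten korrupte".split()

def log_likelihoods(tokens, vocab):
    """log P(word | class) with Laplace (add-one) smoothing."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}

vocab = set(genuine) | set(fake)
ll_genuine = log_likelihoods(genuine, vocab)
ll_fake = log_likelihoods(fake, vocab)

# 'sagt' scores much higher (closer to zero) under the genuine class
print(ll_genuine["sagt"] > ll_fake["sagt"])          # True
print(ll_fake["korrupte"] > ll_genuine["korrupte"])  # True
```

Words that never appear in a class all share the same smoothing floor, which is why so many fake-indicating words in the table below carry an identical score.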

Here are the top 15 words that indicate that an article is genuine, with English translations and the scores from the Naive Bayes classifier:

sagt says -8.37
sei is (reported speech) -8.64
mehr more -8.69
immer always -8.77
geht goes -8.79
schon already -8.84
deutschen German -8.85
später later -8.88
nie never -8.89
sagte said -8.89
seit since -8.90
gibt gives/there is -8.92
bald soon -8.92
kommen come -8.92
gut good -8.93

and here are the top 15 words that indicate that an article is fictitious:

enthaupteten beheaded -9.29
verstümmelten mutilated -9.29
abgeladen unloaded -9.29
gegenwärtig current -9.29
abschrecken scared off -9.29
richteten directed -9.29
glieds member -9.29
öffentlichten published -9.29
umfangreiche extensive -9.29
preisgeben divulge -9.29
zurückgezogen withdrawn -9.29
hackten hacked -9.29
korrupte corrupt -9.29
bloggenden blogging -9.29
lebensbedrohlich life threatening -9.29

This is just a snapshot, but we can see some more patterns now. The fake news seems to be heavy in strong, emotive or very graphic language, such as ‘corrupt’ or ‘mutilated’. When I took the top 100 words, this effect was still noticeable.

I then tested to see if it was possible to use the Naive Bayes Classifier to predict if an unseen Relotius text was fake or genuine, but unfortunately this was not possible to any degree of accuracy.


It is not possible to build a fake news detector given that we only have 59 articles to work from, but knowing in retrospect that Relotius falsified some texts, it is definitely possible to observe patterns and significant differences between his genuine and fake articles:

  • The fake articles were written when Relotius was reporting as a lone wolf. Relotius was caught out the first time he was assigned to work in a team of two.
  • The fake articles contain more emotive, graphic or strong language.
  • There is less reported speech and tentative language in the fake articles.
  • Caveat: it’s possible that some of the linguistic differences mentioned above are due to the genuine articles tending to be multi-author pieces.
  • The fake articles take place outside of Germany.

Perhaps, knowing these effects, it may be possible to flag suspicious texts in the future. If a reporter seems overly keen on working alone and travelling abroad, interviews few subjects, but writes using colourful language that would be more appropriate in a novel, then perhaps something is amiss?

Epilogue: Relotius vs Moreno?

Naturally Relotius’ prizes were revoked and returned one by one, and he resigned from his position at Der Spiegel.

Juan Moreno, the whistleblower who discovered Relotius’ fraud, wrote a tell-all book about the Relotius Affair, titled A Thousand Lines of Lies (Tausend Zeilen Lüge). The book is a fascinating exposé of the world of print journalism in the digital age as well as a first hand account of how Relotius’ system unravelled.

Ironically in 2019 Relotius started legal proceedings against Moreno for alleged falsehoods in the book, which are ongoing at the time of writing.

Appendix: technical details on the Naive Bayes classifier

In case you would like more details, I used a multinomial Naive Bayes classifier with tf-idf scores. I evaluated it below using a ROC curve:

This is a ROC curve showing the performance of my Naive Bayes classifier under cross-validation for predicting unseen Relotius texts. A good classifier would have a line close to the top left-hand corner. The fact that the line sits on the diagonal shows that my predictions were no better than a coin toss. That means that if Relotius were still at large today, I would have no way of knowing whether his latest article was fictitious or not.
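For the curious, the setup can be sketched roughly like this, using scikit-learn on synthetic stand-in texts (the real corpus and labels are not reproduced here):

```python
# Rough sketch of a tf-idf + multinomial Naive Bayes pipeline
# evaluated under cross-validation, on invented stand-in texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["sagt sei mehr immer", "sagt deutschen später",
         "enthaupteten korrupte", "verstümmelten hackten"] * 5
labels = [1, 1, 0, 0] * 5   # 1 = genuine, 0 = fake

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=5)
print(scores.mean())
```

On the real 59-article corpus, this cross-validated score hovered around chance, which is exactly what the diagonal ROC curve above expresses.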


Juan Moreno, Tausend Zeilen Lüge: Das System Relotius und der deutsche Journalismus (A thousand lines of lies: the Relotius system and what it means for German journalism) (2019).

Claas Relotius, Bürgerwehr gegen Flüchtlinge: Jaegers Grenze (Militia against refugees: Jaeger’s Border), and all other Claas Relotius texts, Der Spiegel (2018).

Philip Oltermann, The inside story of Germany’s biggest scandal since the Hitler diaries, The Guardian (2019).

Ralf Wiegand, Claas Relotius geht gegen Moreno-Buch vor (Claas Relotius takes action against Moreno’s book), Sueddeutsche Zeitung (2019).

(Multinomial Naive Bayes) C.D. Manning, P. Raghavan and H. Schuetze, Introduction to Information Retrieval, pp. 234-265 (2008).

How can we eliminate bias from AI algorithms? The pen-testing manifesto

In recent weeks a number of Apple Card users in the US have been reporting that they and their partners have been allocated vastly different credit limits on the branded credit card, despite having the same income and credit score (see BBC article). Steve Wozniak, a co-founder of Apple, tweeted that his credit limit on the card was ten times higher than his wife’s, despite the couple having the same credit limit on all their other cards.

The Department of Financial Services in New York, a financial services regulator, is investigating allegations that the users’ gender may be the basis of the disparity. Apple is keen to point out that Goldman Sachs is responsible for the algorithm, seemingly at odds with Apple’s marketing slogan ‘Created by Apple, not a bank’.

Since the regulator’s investigation is ongoing and no bias has yet been proven, I am writing only in hypotheticals in this article.

The Apple Card story isn’t the only recent example of algorithmic bias hitting the headlines. In July last year the NAACP (National Association for the Advancement of Colored People) in the US signed a statement requesting a moratorium on the use of automated decision-making tools, since some of them have been shown to have racial bias when used to predict recidivism – in other words, how likely an offender is to re-offend.

In 2013, Eric Loomis was sentenced to six years in prison, after the state of Wisconsin used a program called COMPAS to calculate his odds of committing another crime. COMPAS is a proprietary algorithm whose inner workings are known only to its vendor Equivant. Loomis attempted to challenge the use of the algorithm in Wisconsin’s Supreme Court but his challenge was ultimately denied.

Unfortunately incidents such as these are only worsening the widely held perception of AI as a dangerous tool, opaque, under-regulated, capable of encoding the worst of society’s prejudices.

What went wrong?

I will focus here on the example of a loan application, since it is a simpler problem to frame and analyse, but the points I make are generalisable to any kind of bias and protected category.

I would like to point out first that I strongly doubt that anybody at Apple or Goldman Sachs has sat down and created an explicit set of rules that take gender into account for loan decisions.

Let us first of all imagine that we are creating a machine learning model which predicts the probability of a person defaulting on a loan. There are a number of ‘protected categories’, such as gender, which we are not allowed to discriminate on.

Developing and training a loan-decision AI is the kind of ‘vanilla’ data science problem that routinely pops up on Kaggle (a website that hosts data science competitions) and that aspiring data scientists can expect to be asked about in job interviews. The recipe for a robot loan officer is as follows:

Imagine you have a large table of 10,000 rows, all about loan applicants that your bank has seen in the past:

age, income, credit score, gender, education level, number of years at employer, job title, did they default?

The final column is what we want to predict.

You would take this data, and split the rows into three groups, called the training set, the validation set and the test set.

You then pick a machine learning algorithm, such as Linear Regression, Random Forest or Neural Networks, and let it ‘learn’ from the training rows without letting it see the validation rows. You then test it on the validation set. You rinse and repeat for different algorithms, tweaking the algorithms each time, and the model you will eventually deploy is the one that scored the highest on your validation rows.

When you have finished you are allowed to test your model on the test dataset and check its performance.
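The split itself is mechanical. A minimal sketch, with row indices standing in for the hypothetical applicant records (the 70/15/15 proportions are an assumption, not a rule from the source):

```python
import random

# Stand-ins for 10,000 applicant rows
rows = list(range(10_000))
random.seed(42)
random.shuffle(rows)

train = rows[:7_000]            # e.g. 70% for training
validation = rows[7_000:8_500]  # 15% for comparing and tweaking models
test = rows[8_500:]             # 15% held back until the very end

print(len(train), len(validation), len(test))  # 7000 1500 1500
```

The key discipline is that the test rows are touched exactly once, after all model selection is finished.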

The fallacy of removing a column and expecting bias to disappear

Now obviously if the ‘gender’ column was present in the training data, then there is a risk of building a biased model.

However the Apple/Goldman data scientists probably removed that column from their dataset at the outset.

So how can the digital money lender still be gender biased? Surely there’s no way for our algorithm to be sexist, right? After all it doesn’t even know an applicant’s gender!

Unfortunately and counter-intuitively, it is still possible for bias to creep in!

There might be information in our dataset that is a proxy for gender. For example: tenure in current job, salary and especially job title could all correlate with our applicant being male or female.

If it’s possible to train a machine learning model on your sanitised dataset to predict the gender with any degree of accuracy, then you are running the risk of your model accidentally being gender biased. Your loan prediction model could learn to use the implicit hints about gender in the dataset, even if it can’t see the gender itself.

A manifesto for unbiased AI

I would like to propose an addition to the workflow of AI development: we should attack our AI from different angles, attempting to discover any possible bias, before deploying it.

It’s not enough just to remove the protected categories from your dataset, dust off your hands and think ‘job done’.

AI bias pen-test

We also need to play devil’s advocate when we develop an AI, and instead of just attempting to remove causes of bias, we should attempt to prove the presence of bias.

If you are familiar with the field of cyber security, then you will have heard of the concept of a pen-test or penetration test. A person who was not involved in developing your system, perhaps an external consultant, attempts to hack your system to discover vulnerabilities.

I propose that we introduce the AI pen-test: an analogue of the security pen-test, for uncovering and eliminating AI bias.

What an AI pen-test would involve

To pen-test an AI for bias, either an external person, or an internal data scientist who was not involved in the algorithm development, would attempt to build a predictive model to reconstruct the removed protected categories.

So returning to the loan example, if you have scrubbed out the gender from your dataset, the pen-tester would try his or her hardest to make a predictive model to put it back. Perhaps you should pay them a bonus if they manage to reconstruct the gender with any degree of accuracy, reflecting the money you would otherwise have spent on damage control, had you unwittingly shipped a sexist loan prediction model.
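The spirit of the pen-test can be sketched with a toy proxy model: all rows, job titles and genders below are fabricated for illustration. Even a trivial majority-vote model per job title can beat the base rate, which is the red flag the pen-tester is looking for.

```python
from collections import Counter, defaultdict

# Fabricated (job title, removed gender) pairs from a 'sanitised' dataset
rows = [("nurse", "F"), ("nurse", "F"), ("nurse", "M"),
        ("engineer", "M"), ("engineer", "M"), ("engineer", "F"),
        ("teacher", "F"), ("teacher", "F")]

# Trivial proxy model: predict each job title's majority gender
by_title = defaultdict(Counter)
for title, gender in rows:
    by_title[title][gender] += 1
predict = {t: c.most_common(1)[0][0] for t, c in by_title.items()}

hits = sum(predict[title] == gender for title, gender in rows)
base_rate = max(Counter(g for _, g in rows).values()) / len(rows)
print(hits / len(rows), base_rate)  # 0.75 vs 0.625: job title leaks gender
```

If reconstruction accuracy is meaningfully above the always-guess-majority baseline, the ‘removed’ column is still present in disguise.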

Further AI bias stress tests

In addition to the pen-test above, I suggest the following further checks:

  • Segment the data by gender and evaluate the accuracy of the model for each segment.
  • Identify any tendency to over- or under-estimate the probability of default for either gender.
  • Identify any difference in model accuracy between genders.
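These segment checks are simple to run. A sketch on fabricated predictions (the tuples below are invented, not real loan data):

```python
# (gender, actual default, predicted default) -- fabricated records
records = [
    ("M", 1, 1), ("M", 0, 0), ("M", 1, 1), ("M", 0, 1),
    ("F", 1, 0), ("F", 0, 0), ("F", 1, 0), ("F", 0, 1),
]

accuracy_by_gender = {}
for gender in ("M", "F"):
    seg = [(a, p) for g, a, p in records if g == gender]
    accuracy_by_gender[gender] = sum(a == p for a, p in seg) / len(seg)

print(accuracy_by_gender)  # a large gap between segments is a red flag
```

Here the invented model is right 75% of the time for men but only 25% for women: exactly the kind of disparity the stress test is meant to surface.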

Further measures

I have not covered some of the more obvious causes of AI bias. For example it is possible that the training data itself is biased. This is highly likely in the case of some of the algorithms used in the criminal justice system.

What to do if you have discovered a bias?

Let’s assume that you have discovered that the algorithm you have trained does indeed exhibit a bias for a protected category such as gender. Your options to mitigate this are:

  • If the pen-test showed that another input parameter, such as job title, is serving as a proxy for gender, you can remove it, obfuscate its gender-related aspects, or sanitise the data further until the pen-tester is unable to reconstruct the gender.
  • You can reverse-engineer the result of the pen-test to artificially morph your training data until the gender is no longer discoverable.
  • You can manually correct the inner workings of your model to compensate for the bias.
  • You can check your training table for bias. If your AI is learning from biased data, then we cannot expect it to be unbiased.
  • If your predictions are less accurate for women than for men, it is likely that you have more training data for men than for women. In this case you can use data augmentation: duplicate every female entry in your data until your training dataset is balanced.
  • You can also go out of your way to collect extra training data for underrepresented categories.
  • You can try to make your model explainable and identify where the bias is creeping in. If you are interested in more detail on machine learning explainability, I invite you to read my earlier post about explainable AI.

An aside… bias in recruitment?

One application of this approach that I would be interested in investigating further, is how to eliminate bias if you are using machine learning for recruitment. Imagine you have an algorithm matching CVs to jobs. If it inadvertently spots gaps in people’s CVs that correspond to maternity leave and therefore gender, we run the risk of a discriminatory AI. I imagine this could be compensated for by some of the above suggestions, such as tweaking the training data and artificially removing this kind of signal. I think that the pen-test would be a powerful tool for this challenge.


Today large companies are very much aware of the potential for bad PR to go viral. So if the Apple Card algorithm is indeed biased I am surprised that nobody checked the algorithm more thoroughly before shipping it.

A loan limit differing by a factor of 10 depending on gender is an egregious error.

Had the data scientists involved in the loan algorithm, or indeed the recidivism prediction algorithm used by the state of Wisconsin, followed my checklist above for pen-testing and stress testing their algorithms, I imagine they would have spotted the PR disaster before it had a chance to make headlines.

Of course it is easy to point fingers after the fact, and the field of data science in big industry is as yet in its infancy. Some would call it a Wild West of under-regulation.

I think we can also be glad that some conservative industries such as healthcare have not yet adopted AI for important decisions. Imagine the fallout if a melanoma-analysing algorithm, or amniocentesis decision making model, turned out to have a racial bias.

For this reason I would strongly recommend that large companies releasing algorithms into the wild to take important decisions set aside a dedicated team of data scientists whose job is not to develop algorithms, but to pen-test and stress-test them.

The data scientists developing the models are under too much time pressure to be able to do this themselves, and as the cybersecurity industry has discovered through years of experience, sometimes it is best to have an external person play devil’s advocate and try to break your system.


How to predict how much a group of customers will spend

Earlier I wrote another post about predicting the spend of a single known customer. There is a related problem which is predicting the total spend of all your customers, or a sizeable segment of them.

Time series approach: segments of customers

If you don’t need to predict the spend of an individual customer, but you’re happy to predict it for groups of customers, you can bundle customers up into groups. For example rather than needing to predict the future spend of Customer No. 23745993, you may want to predict the average spend of all customers in Socioeconomic Class A at Store 6342.

In this case the great advantage is that you would not have so many empty values in your past time series. So your time series may look like this:

This means you can use a time series library such as Prophet, developed by Facebook.

Here’s what Prophet produces when I give it the data points I showed above, and ask it to produce a prediction for the next few days. You can see that it’s picked up the weekly cycle correctly.

This approach would be very useful if you only needed the data for budgeting or stock planning purposes for an individual store and not for individual customers.

However if you had small enough customer segments, you may find that the prediction for a customer’s segment is adequate as a prediction for that customer.

Multilevel models

The next step up in complexity is multilevel models, where you fit a separate model for each region or economic group of customers and combine them into a single overall model.


To get the maximum predictive power you can try ways of combining time series methods with a predictive modelling approach, such as taking the results of a time series prediction for a customer’s segment and using it as input to a predictive model.

Getting started

If you have a prediction problem in retail, or would like some help with another problem in data science, I’d love to hear from you. Please contact me via the contact form.

How well can you predict an individual customer’s spending habits?

You may have read my previous post about customer churn prediction. Another similar problem, just as important as predicting lost customers, is predicting customers’ daily expenditure.

Let me give you an example: you work for a large retailer which has a loyalty card scheme. You’d like to predict for a given customer how much they are likely to spend over the next week.

In this case there would normally be clear patterns:

  • customers buy more on Mondays than on Saturdays (weekly cycle)
  • there might be a monthly cycle and a yearly cycle
  • Christmas, Easter and bank holidays might drive an explosion in demand

However there are a few problems when you get down to customer level:

  • some customers may have visited your shop only once
  • some have visited hundreds of times
  • a customer might not enter the shop for a few months but then come back (dormant customer)

What this means is that, if you look at all customers’ expenditures (or averaged over a region), you will probably see some recognisable weekly, monthly and seasonal patterns:

However for a single customer it’s hard to make out any recognisable pattern among all the noise. The weekly and yearly trends were only apparent when we averaged over all customers.

So how can you go about predicting the future expenditure of a given customer the next time they enter the shop?

This problem is quite interesting as there are at least two very different approaches to solving it, from two different traditional disciplines:

  • Predictive modelling (from the field of machine learning) – focusing on an individual customer
  • Time series analysis (from the field of statistics) – focusing on groups of customers

This means that depending on whether you hire somebody with a machine learning background, or somebody with a statistics background, you may get two contradictory answers.

In this post I’ll talk only about the predictive modelling approach.

If you are interested in predicting the first graph, which is averages for groups of customers, you might want to look into my next post on time series analysis.

Predictive model: individual customer

The simplest way would be to use a predictive modelling machine learning approach. For example you could use Linear Regression. If you are unfamiliar with how to do this I recommend Andrew Ng’s Coursera course.

You would provide as input to your Regression model:

  • Last purchase value (if available)
  • Second last purchase value (if available)
  • Third last purchase value (if available)

The output you want it to predict is:

  • The next purchase value

This will predict the next purchase with some accuracy. After all, the biggest predictor of what someone will buy is what they bought in the past.
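As a sketch of this setup, here is an ordinary least squares fit on the three lag features. The numbers are toy data invented for illustration (the targets follow a 0.5/0.3/0.2 weighting of the lags); in practice you would use a library such as scikit-learn and far more history:

```python
import numpy as np

# Each row: [last purchase, second last, third last]; target: next purchase.
X = np.array([
    [20.0, 18.0, 22.0],
    [55.0, 60.0, 50.0],
    [10.0, 12.0,  9.0],
    [33.0, 30.0, 35.0],
])
y = np.array([19.8, 55.5, 10.4, 32.5])

# Add a bias column and solve the least-squares problem directly.
X_b = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)

def predict_next_purchase(last3):
    """Predict the next purchase value from the last three purchase values."""
    return float(np.dot(np.append(last3, 1.0), coef))

# Next spend for a customer whose last three purchases were 25, 24 and 26
print(round(predict_next_purchase([25.0, 24.0, 26.0]), 2))  # about 24.9
```

Customers with fewer than three past purchases would need their missing lags imputed or handled by a separate model.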

However I’m sure you can easily think of some cases where this will break down. For example

  • A customer with no past purchases
  • Over Christmas if purchases tend to be bigger

You can improve the performance of the Predictive Model approach by making it a little more sophisticated:

  • Add more input features to the Regression model such as “day of week”, “day of year”, “isChristmasSeason” etc.
  • Switch to a Polynomial Regression Model, or Random Forest Regression. This will allow your model to become more powerful if the relationships between your inputs and outputs are not entirely linear, although it comes with a risk of your predictions going crazy (like predicting huge numbers) if you are not careful!
  • Build separate models for different groups of customers, such as regions or socioeconomic segments
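The calendar features in the first bullet can be built with the standard library. The feature names and the definition of the Christmas season here are just illustrative:

```python
from datetime import date

def calendar_features(d):
    """Calendar features to feed into a regression model."""
    return {
        "day_of_week": d.weekday(),            # 0 = Monday .. 6 = Sunday
        "day_of_year": d.timetuple().tm_yday,
        # Crude Christmas-season flag: 1st to 27th of December.
        "is_christmas_season": int(d.month == 12 and d.day <= 27),
    }

print(calendar_features(date(2019, 12, 23)))
# {'day_of_week': 0, 'day_of_year': 357, 'is_christmas_season': 1}
```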

Getting started

If you have a prediction problem in retail, or would like some help with another problem in data science, I’d love to hear from you. Please contact me via the contact form.

Building explainable machine learning models

Sometimes as data scientists we will encounter cases where we need to build a machine learning model that should not be a black box, but which should make transparent decisions that humans can understand. This can go against our instincts as scientists and engineers, as we would like to build the most accurate model possible.

In my previous post about face recognition technology I compared some older hand-designed technologies which are easily understandable for humans, such as facial feature points, to the state of the art face recognisers which are harder to understand. This is an example of the trade-off between performance and interpretability.

The need for explanations

Imagine that you have applied for a loan and the bank’s algorithm rejects you without explanation. Or an insurance company gives you an unusually high quote when the time comes to renew. A medical algorithm may recommend a further invasive test, against the best instincts of the doctor using the program.

Or maybe the manager of the company you are building the model for doesn’t trust anything he or she doesn’t understand, and has demanded an explanation of why you predicted certain values for certain customers.

All of the above are real examples where a data scientist may have to trade some performance for interpretability. In some cases the choice comes from legislation. For example some interpretations of GDPR give an individual a ‘right to explanation’ of any algorithmic decision that affects them.

How can we make machine learning models interpretable?

One approach is to avoid highly opaque models such as Random Forests or Deep Neural Networks, in favour of more linear models. By simplifying the architecture you may end up with a less powerful model; however, the loss in accuracy may be negligible, and with fewer parameters the model may even be more robust and less prone to overfitting. You can also train a complex model first and use it to identify the most important features, or to suggest clever preprocessing steps that let you keep your final model linear.

An example would be if you have a model to predict sales volume based on product price, day, time, season and other factors. If your manager or customer wanted an explainable model, you might convert weekdays, hours and months into a one-hot encoding, and use these as inputs to a linear regression model.
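A minimal sketch of that encoding step for weekdays (the labels are illustrative):

```python
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def one_hot_weekday(day):
    """One-hot encode a weekday so a linear model gets one coefficient per day."""
    return [int(day == d) for d in WEEKDAYS]

print(one_hot_weekday("wed"))  # [0, 0, 1, 0, 0, 0, 0]
```

Each day then gets its own coefficient in the regression, which can be read directly as ‘the effect of that day on sales’, and that is what makes the model explainable.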

Computer vision

The best models for image recognition and classification are currently Convolutional Neural Networks (CNNs). But they present a problem from a human comprehension point of view: if you want to make the 10 million numbers inside a CNN understandable for a human, how would you proceed? If you’d like a brief introduction to CNNs please check out my previous post on face recognition.

You can make a start by breaking the problem up and looking at what the different layers are doing. We already know that the first layer in a CNN typically recognises edges, later layers are activated by corners, and then by gradually more and more complex shapes.

You can take a series of images of different classes and look at the activations at different points. For example if you pass a series of dog images through a CNN:

Image credit: Zeiler & Fergus (2014) [1]

…by the 4th layer you can see patterns like this, where the neural network is clearly starting to pick up on some kind of ‘dogginess’.

Image credit: Zeiler & Fergus (2014) [1]

Taking this one step further, we can tamper with different parts of the image and see how this affects the activation of the neural network at different stages. By greying out different parts of this Pomeranian we can see the effect on Layer 5 of the network, and then work out which parts of the original image scream ‘Pomeranian’ most loudly to the network.

Image credit: Zeiler & Fergus (2014) [1]
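The occlusion experiment can be sketched generically: slide a grey patch across the image and record how the model’s score changes at each position. The ‘model’ below is a stand-in function, not a real CNN:

```python
import numpy as np

def occlusion_map(image, model, patch=4, stride=4, fill=0.5):
    """Slide a grey patch over the image; record the model's score each time."""
    h, w = image.shape
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            heat[i, j] = model(occluded)  # low score = that region mattered
    return heat

# Stand-in 'model': scores images by brightness of the top-left quadrant.
toy_model = lambda img: float(img[:8, :8].mean())
image = np.ones((16, 16))
heat = occlusion_map(image, toy_model)
# The score only drops when the patch covers the top-left quadrant,
# telling us that is the region the 'model' cares about.
```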

Using these techniques, if your neural network face recogniser backfires and lets an intruder into your house, and you still have the input images, it would be possible to unpick the CNN and work out where it went wrong. Unfortunately going deep into a neural network like this takes a lot of time, so plenty of work remains to be done here.

Moving towards linear models

Imagine you have trained a price elasticity model that uses 3rd order polynomial regression, but your client requires something easier to understand. They want to know: for each additional penny knocked off the price of a product, what will be the increase in sales? Or for each additional year of a vehicle’s age, what is the depreciation in price?

You can try a few tricks to make this more understandable. For example you can convert your polynomial model to a series of joined linear regression models. This should give almost the same power but could be more interpretable.

Traditional polynomial regression fitting a curve, showing car price depreciation by age of vehicle
Splitting up the data into segments and applying a linear regression to each segment. This is useful because it shows a ballpark rate of depreciation at different stages, which salespeople might find useful for quick calculations.
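The segmented fit can be sketched by splitting the data on age ranges and fitting one straight line per segment. The depreciation figures below are invented for illustration:

```python
import numpy as np

# Toy data: car age (years) vs price, with steep depreciation early on.
ages   = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
prices = np.array([30000, 24000, 19500, 16000, 14000, 12500,
                   11500, 10800, 10300, 10000], dtype=float)

def segment_slopes(ages, prices, breakpoints=(3, 6)):
    """Fit one straight line per age segment; return pounds lost per year."""
    edges = [ages.min(), *breakpoints, ages.max() + 1]
    slopes = {}
    for lo, hi in zip(edges, edges[1:]):
        mask = (ages >= lo) & (ages < hi)
        slope, _ = np.polyfit(ages[mask], prices[mask], 1)
        slopes[f"{int(lo)}-{int(hi - 1)} years"] = round(float(slope))
    return slopes

print(segment_slopes(ages, prices))
# one ballpark depreciation rate per segment, e.g. {'0-2 years': -5250, ...}
```

Each slope is a single number a salesperson can quote: roughly how many pounds a car in that age bracket loses per year.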

Recommendation algorithms

Recommendation systems such as Netflix’s movie recommendations are notoriously hard to get right, and users are often mystified by what they see as strange recommendations. Recommendations are usually calculated, directly or indirectly, from shows the user has previously watched. So the simplest way of explaining a recommendation system is to display a message such as ‘we’re recommending you The Wire because you watched Breaking Bad’ – which is Netflix’s approach.

General method applicable to all models

There have been some efforts to arrive at a technique that can demystify and explain a machine learning model of any type, no matter how complex.

The technique that I described for investigating a convolutional neural network can be broadly extended to any kind of model: perturb the input to the machine learning model and monitor how its output responds. For example if you have a text classification model, you can change or remove different words in the document and watch what happens.
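A toy version of that perturbation idea: delete each word in turn and see how far the score moves. The classifier below is a stand-in function, not a real model:

```python
def word_importance(text, classify):
    """Score each word by how much removing it changes the classifier output."""
    words = text.split()
    base = classify(text)
    importance = {}
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        importance[word] = base - classify(perturbed)
    return importance

# Stand-in classifier: probability rises if the word 'magic' is present.
toy_classify = lambda t: 0.9 if "magic" in t.split() else 0.2
scores = word_importance("the magic wand", toy_classify)
print(scores)  # 'magic' scores highest (about 0.7); the other words score 0.0
```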


One implementation of this technique is called LIME, or Local Interpretable Model-Agnostic Explanations[2]. LIME takes an input, creates thousands of near-duplicates with small amounts of noise added, passes these duplicates to the ML model, and compares the output probabilities. This way it’s possible to investigate a model that would otherwise be a black box.

Trying out LIME on a CNN text classifier

I tried out LIME on my author identification model. I gave the model an excerpt of one of JK Rowling’s non-Harry Potter novels, where it correctly identified the author, and asked LIME for an explanation of the decision. LIME tried changing words in the text and checked which changes increased or decreased the probability that JK Rowling wrote it.

LIME explanation for an extract of The Cuckoo’s Calling by JK Rowling, for predictions made by a stylometry model trained on some of her earlier Harry Potter novels

LIME’s explanation of the stylometry model is interesting as it shows how the model has recognised the author by subsequences of function words such as ‘and I don’t…’ (highlighted in green) rather than strong content words such as ‘police’.

However the insight provided by LIME is limited because under the hood, LIME is perturbing words individually, whereas a neural network based text classifier looks at patterns in the document on a larger scale.

I think that for more sophisticated text classification models there is still some work to be done on LIME so that it can explain more succinctly what subsequences of words are the most informative, rather than individual words.

LIME on images

With images, LIME gives some more exciting results. You can get it to highlight the pixels in an image which led to a certain decision.

Image credit: Ribeiro, Singh, Guestrin (2016) [2]


There is a huge variety of machine learning models being used and deployed for diverse purposes, and their complexity is increasing. Unfortunately many of them are still used as black boxes, which can pose a problem when it comes to accountability, industry regulation, and user confidence in entrusting important decisions to algorithms as a whole.

The simplest solution is sometimes to make compromises, such as trading performance for interpretability. Simplifying machine learning models for the sake of human understanding can have the advantage of making models more robust.

Thankfully there have been some efforts to build explainability platforms to make black box machine learning more transparent. I have experimented with LIME in this article which aims to be model-agnostic, but there are other alternatives available.

Hopefully in time regulation will catch up with the pace of technology, and we will see better ways of producing interpretable models which do not reduce performance.


  1. Zeiler M.D., Fergus R. (2014) Visualizing and Understanding Convolutional Networks. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham
  2. Ribeiro M.T., Singh S., Guestrin C. (2016) “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. 97–101. doi:10.18653/v1/N16-3020

How to improve conversions without losing customer data

You may have had the experience of filling out a long form on a website. For example, creating an account to make a purchase, or applying for a job, or renewing your car insurance.

A long form can lead to customers losing interest and taking their business elsewhere. Each additional field can result in up to 10% more customers dropping out instead of completing the form.

If you have a business with a form like this, one reason you may not be able to simplify it is that the data you are requesting is valuable.

There are lots of ways to address the problem, such as improving the design of the form, splitting it across multiple pages, or removing the “confirm password” field. But most fields can’t be removed without degrading the data you collect on these new customers.

However with machine learning it’s possible to predict the values of some of these fields, and completely remove them from the form without sacrificing too much information. This way you gain more customers. You would need to have a history of what information customers have provided in the past, in order to remove the fields for new customers.

A few examples

  • On a small ads site, you require users to upload a photo, or fill out a description of the item they’re selling. With machine learning you can suggest a price from the description, or a title from the photo, resulting in less typing for the user.
  • On a recruitment website, you can use machine learning to deduce lots of data (name, address, salary, desired role) directly from the candidate’s CV when it’s uploaded. Even salary can be predicted although it’s not usually explicit in the CV.
  • On a car insurance website, it’s possible to retrieve make, model, car tax and insurance status from an image of the car.

If you are interested and would like to know more please send me a message.

For an example of how data can be inferred from an unstructured text field please check out my forensic stylometry demo.

Building a face recogniser: traditional methods vs deep learning

Face recognition technology has existed for quite some time, but until recently it was not accurate enough for most purposes.
Now it seems that face recognition is everywhere:

  • you upload a photo to Facebook and it suggests who is in the picture
  • your smartphone can probably recognise faces
  • lots of celebrity look-a-like apps have suddenly appeared on the app stores
  • police and antiterrorism units all over the world use the latest in face recognition technology

Facial recognition software has recently got a lot better and a lot faster thanks to the advent of deep learning: more powerful and parallelised computers, and better software design.
I’m going to talk about what’s changed.

Traditional face recognition: Eigenfaces
The first serious attempts to build a face recogniser were back in the 1980s and 90s and used something called Eigenfaces. An Eigenface is a blurry face-like image, and a face recogniser assumes that every face is made of lots of these images overlaid on top of each other pixel by pixel.

If we want to recognise an unknown face we just work out which Eigenfaces it’s likely to be composed of.
Not surprisingly the Eigenface method didn’t work very well. If you shift a face image a few pixels to the right or left, you can easily see how this method will fail, since the parts of the face won’t line up with the eigenface any more.
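In code, the Eigenface idea is principal component analysis on pixel vectors. A tiny sketch with random stand-in ‘faces’ (a real system would use thousands of aligned face images):

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((20, 64))        # 20 tiny 'face' images, 64 pixels each
mean_face = faces.mean(axis=0)

# Eigenfaces are the top principal components of the centred pixel matrix.
_, _, vt = np.linalg.svd(faces - mean_face, full_matrices=False)
eigenfaces = vt[:5]                 # keep the 5 strongest components

def encode(face):
    """Describe a face by its 5 eigenface weights instead of 64 raw pixels."""
    return (face - mean_face) @ eigenfaces.T

# Two faces are compared by the distance between their weight vectors.
a, b = encode(faces[0]), encode(faces[1])
print(np.linalg.norm(a - b))
```

Because the comparison happens pixel by pixel, shifting an image even slightly scrambles the weights, which is exactly the weakness described above.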

Next step up in complexity: facial feature points
The next generation of face recognisers would take each face image and find important points such as the corner of the mouth, or an eyebrow. The coordinates of these points are called facial feature points. One well known commercial program converts every face into 66 feature points. 

To compare two faces you simply compare the coordinates (after adjusting in case one image is slightly off alignment).
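That comparison reduces to a distance between coordinate vectors. A minimal sketch, where the data and the threshold are illustrative and the alignment step is omitted:

```python
import numpy as np

def face_distance(points_a, points_b):
    """Mean distance between corresponding facial feature points."""
    return float(np.linalg.norm(points_a - points_b, axis=1).mean())

rng = np.random.default_rng(1)
face_a = rng.random((66, 2))        # 66 (x, y) feature points, as in the text
face_b = face_a + 0.01              # nearly identical face, slightly offset
print(face_distance(face_a, face_b) < 0.05)  # True: likely the same person
```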

Not surprisingly the facial feature coordinates method is better than the Eigenfaces method but is still suboptimal. We are throwing lots of useful information away: hair colour, eye colour, any facial structure that isn’t captured by a feature point, etc.

Deep learning approach

The last method in particular involved a human programming into a computer the definition of an “eyebrow” etc. The current generation of face recognisers throws all this out of the window.

This approach uses convolutional neural networks (CNNs). This involves repeatedly walking a kind of stencil over the image and working out where subsections of the image match particular patterns.

The first time, you pick up corners and edges. After doing this five times, each time on the output of the previous run, you start to pick up parts of an eye or ear. After 30 times, you have recognised a whole face!
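The ‘stencil’ operation described above is a 2D convolution. Here is a minimal version in plain numpy, using a hand-picked vertical-edge stencil; a real CNN learns the stencil values from data instead:

```python
import numpy as np

def convolve2d(image, kernel):
    """Walk a small stencil over the image; high output = pattern matched."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A vertical-edge stencil: fires where dark meets light, left to right.
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
image = np.zeros((4, 4))
image[:, 2:] = 1.0                  # dark left half, bright right half
print(convolve2d(image, edge_kernel))
# strongest response down the column where the edge sits
```

Stacking layers of these operations, with the outputs of one layer feeding the next, is what lets the network progress from edges to eyes to whole faces.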

The neat trick is that nobody has defined the patterns that we are looking for but rather they come from training the network with millions of face images.

Of course this can be an Achilles’ heel of the CNN approach since you may have no idea exactly why a face recogniser gave a particular answer.

The obstacle you encounter if you want to develop your own CNN face recogniser is: where can you get millions of images to train the model? Lots of people scrape celebrity images from the internet to do this.

However you can get many more images if you can persuade people to give you their personal photos for free!

This is the reason why Facebook, Microsoft and Google have some of the most accurate face recognisers, since they have access to the resources necessary to train the models.

The CNN approach is far from perfect and many companies will have some adjustments on top of what I described in order to compensate for its limitations, such as correcting for pose and lighting, often using a 3D mesh model of the face. The field is advancing rapidly and every year the state of the art in face recognition brings a noticeable improvement.

If you’d like to know more about this field or similar projects please get in touch.

Predicting customer churn

One question faced by lots of companies in competitive markets, is… why are our customers leaving us? What drives them to switch to a competitor? This is called ‘customer churn’.

Imagine you run a utility company. You know this about each of your customers:

  • When they signed the first contract
  • How much power they use on weekdays, weekends, etc
  • Size of household
  • Zip code / Postcode

For millions of customers you also know whether they stayed with your company, or switched to a different provider.

Ideally you’d like to identify the people who are likely to switch their supply, before they do so! Then you can offer them promotions or loyalty rewards to convince them to stay.

How can you go about this?

If you have a data scientist or statistician at your company, they can probably run an analysis and produce a detailed report, telling you that high consumption customers in X or Y demographic are highly likely to switch supply.

It’s nice to have this report and it probably has some pretty graphs. But what I want to know is, for each of the 2 million customers in my database, what is the probability that the customer will churn?

If you build a machine learning model you can get this information. For example, customer 34534231 is 79% likely to switch to a competitor in the next month.

Surprisingly building a model like this is very simple. I like to use Scikit-learn for this which is a nice easy-to-use machine learning library in Python. It’s possible to knock up a program in a day which will connect to your database, and give you this probability, for any customer.

One problem you’ll encounter is that the data is very non-homogeneous. For example, the postcode or zip code is a kind of category, while power consumption is a continuous number. For this kind of problem I found the most suitable algorithms are Support Vector Machines, and Random Forest, both of which are in Scikit-learn. I also have a trick of augmenting location data with demographic data for that location, which improves the accuracy of the prediction.
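A minimal sketch of such a churn model using Scikit-learn’s Random Forest. The features and labels are synthetic stand-ins invented for illustration, not real customer data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy features: [years as customer, weekday usage, weekend usage, household size]
X = rng.random((500, 4))
# Toy rule: heavy weekday users with short tenure tend to churn.
y = ((X[:, 1] > 0.6) & (X[:, 0] < 0.4)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Churn probability for one customer: short tenure, heavy weekday usage.
customer = np.array([[0.1, 0.9, 0.3, 0.5]])
churn_probability = model.predict_proba(customer)[0, 1]
print(f"{churn_probability:.0%} likely to switch")
```

With real data you would pull these features from your database, one-hot encode the postcode or join on demographics, and score all two million customers in one `predict_proba` call.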

If customer churn is an issue for your business and you’d like to anticipate it before it happens, I’d love to hear from you! Get in touch via the contact form to find out more.

How you can identify the author of a document

Click here to see a live online demo of the neural network forensic stylometry model described in this article.

In 2013 JK Rowling, the author of the Harry Potter series, published a new detective novel under the pen name Robert Galbraith. She wanted to publish a book without the hype resulting from the success of the Harry Potter books.

However, following a tip-off received by a journalist on Twitter, two professors of computational linguistics showed that JK Rowling was highly likely to be the author of the new detective novel.

How did they manage to do this? Needless to say, the crime novel is set in a strictly non-magical world, and superficially it has little in common with the famous wizarding series.

One of the professors involved in the analysis said that he calculates a “fingerprint” of all the authors he’s interested in, which shows the typical patterns in that author’s works.

What’s my linguistic fingerprint? Subconsciously we tend to favour some word patterns over others. Is your salad fork “on” the left of the plate, or “to” the left of the plate? Do you favour long words, or short words? By comparing the fingerprint of a mystery novel to the fingerprints of some known authors it’s possible to get a match.
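A crude version of such a fingerprint is just the relative frequency of a set of common function words. The word list here is illustrative; real stylometry uses many more features:

```python
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "i"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(fp_a, fp_b):
    """Smaller distance = more similar writing style."""
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b))

sample = "the wand chose the wizard and the wizard took it"
print(fingerprint(sample))
```

To attribute a mystery text, you would compute its fingerprint and pick the known author whose fingerprint is closest.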

Here are some (partial) fingerprints I made for three well known female authors who used male pen names:

Identifying the author of a text is a field of computational linguistics called forensic stylometry.

With the advent of ‘deep learning’ software and computing power, forensic stylometry has become much easier. You don’t need to define the recipe for your fingerprint anymore, you just need lots of data.

My favourite way of approaching this problem is a Convolutional Neural Network, which is a deep learning technique that was developed for recognising photos but works very well for natural language!

The technology I’ve described has lots of commercial applications, such as

  • Identifying the author of a terrorist pamphlet
  • Extracting information from company financial reports
  • Identifying spam emails, adverts, job postings
  • Triage of incoming emails
  • Analysis of legal precedents in a Common Law system

If you have a business problem in this area and you’d like some help developing and deploying, or just some consulting advice, please get in touch with me via the contact form.

On 5th July 2018 I will be running a workshop on forensic stylometry aimed at beginners and programmers, at the Digital Humanities Summer School at Oxford University. You can sign up here:

Update: click here to download the presentation from the workshop.