I offer the full range of data science consulting services, from a simple overview and high-level consultation to building and deploying a machine learning model to production.
Look at how data is being gathered and used in your business, and identify opportunities to extract value from your large datasets
Use statistics and AI to identify meaningful patterns in your data, enabling you to make smart decisions
Design and train a machine learning model for your numeric, tabular, text or image data, making use of cutting-edge machine learning tools
Bring AI solutions through to production, deploying with your preferred technology stack and fully integrating with your systems and APIs
Do you have millions of customers and need to predict the likely behaviour of each individual one? Who's going to switch to a competitor? Which is the most appropriate product recommendation? Or perhaps you need to predict unknown values in the future such as vehicle unloading times, travel times, signup rates, or customer spend? Maybe you have large amounts of unstructured text or image data? In all of these cases I can help.
Observations about the latest developments in the AI universe.
You may have read about the recent Google Health study where the researchers trained and evaluated an AI model to detect breast cancer in mammograms.
It was reported in the media that the Google team’s model was more accurate than a single radiologist at recognising tumours in mammograms, although admittedly inferior to a team of two radiologists.
But what does ‘more accurate’ mean here? And how can scientists report it for a lay audience?
Imagine that we have a model to categorise images into two groups: malignant and benign. Now suppose the model categorises everything as benign, whereas in reality 10% of images are malignant and 90% are benign. This model would be useless, but it would also be 90% accurate.
This is a simple example of why accuracy can often be misleading.
In fact it is more helpful in a case like this to report two numbers: how many malignant images were misclassified as benign (false negatives), and how many benign images were misclassified as malignant (false positives).
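To make this concrete, here is a minimal sketch (with made-up labels, not the study's data) showing how a do-nothing classifier can score 90% accuracy while missing every tumour:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.random(1000) < 0.10          # ~10% of images are malignant (True)
y_pred = np.zeros(1000, dtype=bool)       # a "model" that calls everything benign

print(accuracy_score(y_true, y_pred))     # roughly 0.90 – looks impressive
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives: {fn}, false positives: {fp}")  # every tumour is missed
```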
The Google team reported both error rates in their paper:
"We show an absolute reduction of 5.7%… in false positives and 9.4%… in false negatives [compared to human radiologists]."
(McKinney et al., International evaluation of an AI system for breast cancer screening, Nature, 2020)
This means that the model improved on both kinds of misclassification. If only one error rate had improved relative to the human experts, it would not have been possible to say whether the AI was better or worse than the humans.
Sometimes we want even finer control over how our model performs. The mammogram model has two kinds of misdiagnoses: the false positive and the false negative. But they are not equal. Although neither kind of error is desirable, the consequences of missing a tumour are greater than the consequences of a false alarm.
For this reason we may want to calibrate the sensitivity of a model. Often the final stage of a machine learning model involves outputting a score: a probability of a tumour being present.
But ultimately we must decide which action to take: to refer the patient for a biopsy, or to discharge them. Should we act if our model’s score is greater than 50%? Or 80%? Or 30%?
If we set our cutoff to 50%, we are assigning equal weight to both actions.
However, we probably want to set the cutoff to a lower value, perhaps 25%. This means we err on the side of caution: we don't mind reporting some benign images as malignant, but we really want to avoid classifying malignant images as benign.
However we can’t set the cutoff to 0% – that would mean that our model would classify all images as malignant, which is useless!
So in practice we can vary the cutoff and set it to something that suits our needs.
Choosing the best cutoff is now a tricky balancing act.
If we want to evaluate how good our model is, regardless of its cutoff value, there is a neat trick we can try: we can set the cutoff to 0%, 1%, 2%, all the way up to 100%. At each cutoff value we check how many malignant→benign and benign→malignant errors we had.
Then we can plot the changing error rates as a graph.
We call this a ROC curve (ROC stands for Receiver Operating Characteristic).
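As an illustration, here is a minimal sketch of that sweep using made-up scores (not Google's model): at each cutoff we count the two error rates and collect them as a point on the curve.

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.random(1000) < 0.10                  # 10% of cases are malignant
scores = np.where(y_true, rng.beta(5, 2, 1000),   # hypothetical model scores:
                          rng.beta(2, 5, 1000))   # higher means "more likely malignant"

fprs, tprs = [], []
for cutoff in np.arange(0.0, 1.01, 0.01):
    pred = scores >= cutoff
    fp = np.sum(pred & ~y_true)          # benign classified as malignant
    fn = np.sum(~pred & y_true)          # malignant classified as benign
    fprs.append(fp / np.sum(~y_true))
    tprs.append(1 - fn / np.sum(y_true))

# plotting fprs on the x axis against tprs on the y axis gives the ROC curve
```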
The nice thing about a ROC curve is that it lets you see how a model performs at a glance. If your model is just a coin toss, its ROC curve would be a straight diagonal line from the bottom left to the top right. The fact that Google's ROC curve bends up and to the left shows that it's better than a coin toss.
If we need a single number to summarise how good a model is, we can take the area under the ROC curve. This is called AUC (area under the curve) and it works a lot better than accuracy for comparing different models: a model with a higher AUC is generally better than one with a lower AUC, which makes ROC curves very useful for comparing different AI models.
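In practice you rarely do the sweep by hand; continuing with the made-up scores from the sketch above, scikit-learn computes the curve and the area under it directly:

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, scores)   # the full cutoff sweep in one call
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}")                          # 0.5 for a coin toss, 1.0 for a perfect model
```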
You can also put human readers on a ROC curve. So Google’s ROC curve contains a green data point for the human radiologists who were interpreting the mammograms. The fact that the green point is closer to the diagonal than any point on the ROC curve confirms that the machine learning model was indeed better than the average human reader.
Whether the machine learning model outperformed the best human radiologists is obviously a different question.
In healthcare, as opposed to other areas of machine learning, the cost of a false negative or false positive can be huge. For this reason we have to evaluate models carefully and we must be very conservative when choosing the cutoff of a machine learning classifier like the mammogram classifier.
It is also important for a person not involved in the development of the model to evaluate and test the model very critically.
If the mammogram model were to be introduced into routine clinical practice, I would expect to see the following robust checks to prove its suitability:
If you think I have missed anything please let me know. I think we are close to seeing these models in action in our hospitals but there are still lots of unknown steps before the AI revolution conquers healthcare.
Thanks to Ram Rajamaran for some interesting discussions about this problem!
Hamzelou, AI system is better than human doctors at predicting breast cancer, New Scientist (2020).
McKinney et al, International evaluation of an AI system for breast cancer screening, Nature (2020).
Is it possible to identify when somebody is not telling the truth? You may be aware of the subtle body language, tics and signals that give away a liar, but what about the written word?
Furthermore, what if you are reading a news article by a famous reporter and you suspect that it is not true?
I am going to tell you a little about a reporter called Claas Relotius, once one of Germany's most respected journalists, who was later exposed as a fraud, having fabricated dozens of articles over an eight-year period.
Then I will attempt some data science magic on Relotius’ articles, to see what we can learn.
In 2018, a caravan of several thousand Central American migrants was making its way from Honduras through the Sonora desert in Mexico and onwards towards its final goal, the United States.
Juan Moreno, a 45-year-old freelance reporter, was travelling alongside the migrant caravan, gathering material for a feature piece for Der Spiegel, a prestigious German news magazine.
Moreno had been tasked with covering the caravan as they travelled through Mexico. He had spent several gruelling weeks in the desert and had already identified two young women who were willing to let him shadow them for a few days.
He was not happy to receive an email from the Spiegel editors saying that his young, successful colleague Claas Relotius would now be working on the article with him and would take editorial control over the final version.
Relotius had won more than 40 prizes in journalism and was widely regarded as a rising star in the field.
Relotius was to travel to Arizona and track down a militia, a group of volunteers who spend their time and money defending the US southern border from the perceived threat of illegal migration, while Moreno would stay in Mexico and continue to report on the migrants.
After the assignment was finished, Moreno flew back to Germany.
When Moreno received Relotius' drafts and the final article, titled Jaeger's Border (German: Jaegers Grenze), he sensed that something was not right. Relotius claimed to have spent a few days in the company of a militia called the Arizona Border Recon. The members of Arizona Border Recon were armed and went by colourful nicknames such as Jaeger, Spartan and Ghost. Relotius even claimed to have witnessed Jaeger shooting at an unidentified figure in the desert. In short, the militia were portrayed as a stereotypical band of hillbillies, and some of the details seemed hard to believe.
Moreno started digging into Jaeger’s Border and Relotius’ articles. He spent his savings on his own private investigation. He travelled in Relotius’ footsteps to Arizona and other locations. It quickly became clear that Relotius had been fabricating stories rather than interviewing the subjects he claimed to have interviewed.
Many of Relotius’ articles relied on stereotypes and the stories seemed far-fetched and too good to be true. For me, the most absurd story centres on a brother and sister from Syria who were working in a Turkish sweatshop. Relotius invented a Syrian children’s song about two orphans who grow up to be king and queen of Syria. According to the article, every Syrian child “from Raqqah to Damascus” is familiar with this traditional song. But none of the Syrians that Moreno spoke to had ever heard of it.
After much persistence on Moreno's part, the management at Der Spiegel reluctantly investigated Relotius' articles and concluded that he had indeed fabricated the majority of his articles during his eight-year tenure.
Relotius had invented interviews that never took place, and people who never existed. He even wrote an article about rising sea levels in the Pacific island of Kiribati without bothering to take his connecting flight to the country.
Der Spiegel issued a mass retraction of the affected articles and the ‘Relotius Affair’ became a nationwide scandal, making news worldwide and prompting an intervention by the US ambassador to Germany who objected to the “anti-American sentiment” of some of the articles.
The article Jaeger’s Border and Relotius’ other texts can be downloaded as a PDF from Der Spiegel’s website. In total 59 articles are available for download, together with annotations by Der Spiegel indicating what content is genuine and what is pure invention.
There is a large amount of English language content available online on the Relotius scandal, including English translations of many of the articles.
I downloaded all 59 available Relotius articles and Der Spiegel’s annotations and tried a few data science experiments on them.
First of all I checked the truth/falsehood status of the articles. You can see that more than half are fictitious, although there are some articles where it was not possible for Der Spiegel to determine if the article was genuine or not. I excluded the latter from my analysis.
The vast majority of Relotius' articles were written by him alone. Moreno later stated that it was quite unusual at Der Spiegel for a reporter to take on so many solo assignments, but Relotius was the star reporter at the publication and seemed to have acquired a certain privilege in this regard.
Of course, we now know that it was easier for him to fabricate content when working alone.
There is something else interesting about the above graph. Relotius wrote only one article in a team of two. The other collaborative articles all involved larger teams of up to 14 authors.
The sole two-author article is Jaeger’s Border, the article which got Relotius caught out!
This shows that Relotius had a pattern of either writing articles alone, or in a large team. He managed to get away with this strategy for years until the Jaeger’s Border assignment. Perhaps when you are collaborating in a large group it is also easier to avoid scrutiny.
I tried generating a word cloud of the true and fake articles, to see if there is any discernible difference. A word cloud shows words in different font sizes according to how often they occur in a set of documents.
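For those curious, this is roughly how such a word cloud can be generated with the Python wordcloud package; the file name here is just a placeholder for the concatenated articles in one of the two sets:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# placeholder file: all articles annotated as genuine, concatenated into one text file
with open("genuine_articles.txt", encoding="utf-8") as f:
    text = f.read()

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")   # word size reflects how often the word occurs
plt.axis("off")
plt.show()
```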
Unfortunately there is not a huge difference between the two sets.
However I can see some patterns.
I then tried a more scientific approach. I used a tool called a Naive Bayes Classifier to find the words which most strongly indicate that an article is genuine or fictitious.
The Naive Bayes approach assigns a large negative number to words that strongly indicate fake news and a smaller negative number to words that indicate genuine news.
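This is roughly the approach, sketched with scikit-learn. The texts and labels below are placeholders, and comparing the per-class log probabilities is one common way to rank the most indicative words:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["...article text...", "...article text..."]   # placeholder: one string per article
labels = [0, 1]                                         # placeholder: 0 = genuine, 1 = fictitious

vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

words = np.array(vectoriser.get_feature_names_out())
log_probs = model.feature_log_prob_     # log P(word | class): always negative numbers

# words whose probability is much higher under one class than the other
gap = log_probs[1] - log_probs[0]
print(words[np.argsort(gap)[-15:]])     # 15 words most indicative of a fictitious article
print(words[np.argsort(gap)[:15]])      # 15 words most indicative of a genuine article
```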
Here are the top 15 words that indicate that an article is genuine, with English translations and the scores from the Naive Bayes classifier:
| word | English translation | score |
|------|---------------------|-------|
| sei | is (reported speech) | -8.64 |
and here are some of the top 15 words that indicate that an article is fictitious:
This is just a snapshot, but we can already see some more patterns. The fake news seems to be quite heavy in strong, emotive or very graphic language such as 'corrupt' or 'mutilated'. When I took the top 100 words, this effect was still noticeable.
I then tested to see if it was possible to use the Naive Bayes Classifier to predict if an unseen Relotius text was fake or genuine, but unfortunately this was not possible to any degree of accuracy.
It is not possible to build a fake news detector given that we only have 59 articles to work from, but knowing in retrospect that Relotius falsified some texts, it is definitely possible to observe patterns and significant differences between his genuine and fake articles:
Knowing these effects, it may be possible to flag suspicious texts in the future. If a reporter seems overly keen on working alone and travelling abroad, interviews few subjects, and writes in colourful language that would be more at home in a novel, then perhaps something is amiss?
Naturally Relotius’ prizes were revoked and returned one by one, and he resigned from his position at Der Spiegel.
Juan Moreno, the whistleblower who discovered Relotius’ fraud, wrote a tell-all book about the Relotius Affair, titled A Thousand Lines of Lies (Tausend Zeilen Lüge). The book is a fascinating exposé of the world of print journalism in the digital age as well as a first hand account of how Relotius’ system unravelled.
Ironically in 2019 Relotius started legal proceedings against Moreno for alleged falsehoods in the book, which are ongoing at the time of writing.
In case you would like more details, I used a multinomial Naive Bayes classifier with tf*idf scores. I evaluated it below using a ROC curve:
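For completeness, here is a sketch of how such an evaluation can be set up, reusing the texts and labels lists from the earlier sketch (with the full set of articles rather than two placeholders). Every article is held out in turn so that the model is always scored on unseen text:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, roc_curve

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# predicted probability that each held-out article is fictitious
probs = cross_val_predict(pipeline, texts, labels, cv=5, method="predict_proba")[:, 1]

fpr, tpr, _ = roc_curve(labels, probs)     # points for the ROC curve
print(roc_auc_score(labels, probs))        # single-number summary of the classifier
```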
Juan Moreno, Tausend Zeilen Lüge: Das System Relotius und der deutsche Journalismus (A thousand lines of lies: the Relotius system and what it means for German journalism) (2019).
Claas Relotius, Bürgerwehr gegen Flüchtlinge: Jaegers Grenze (Militia against refugees: Jaeger’s Border), and all other Claas Relotius texts, Der Spiegel (2018).
Philip Oltermann, The inside story of Germany’s biggest scandal since the Hitler diaries, The Guardian (2019).
Ralf Wiegand, Claas Relotius geht gegen Moreno-Buch vor (Claas Relotius takes action against Moreno’s book), Sueddeutsche Zeitung (2019).
(Multinomial Naive Bayes) C.D. Manning, P. Raghavan and H. Schuetze, Introduction to Information Retrieval, pp. 234-265 (2008).
In recent weeks a number of Apple Card users in the US have been reporting that they and their partners have been allocated vastly different credit limits on the branded credit card, despite having the same income and credit score (see BBC article). Steve Wozniak, a co-founder of Apple, tweeted that his credit limit on the card was ten times higher than his wife’s, despite the couple having the same credit limit on all their other cards.
The Department of Financial Services in New York, a financial services regulator, is investigating allegations that the users' gender may be the basis of the disparity. Apple is keen to point out that Goldman Sachs is responsible for the algorithm, seemingly at odds with Apple's marketing slogan 'Created by Apple, not a bank'.
Since the regulator’s investigation is ongoing and no bias has yet been proven, I am writing only in hypotheticals in this article.
The Apple Card story isn’t the only recent example of algorithmic bias hitting the headlines. In July last year the NAACP (National Association for the Advancement of Colored People) in the US signed a statement requesting a moratorium on the use of automated decision-making tools, since some of them have been shown to have racial bias when used to predict recidivism – in other words, how likely an offender is to re-offend.
In 2013, Eric Loomis was sentenced to six years in prison, after the state of Wisconsin used a program called COMPAS to calculate his odds of committing another crime. COMPAS is a proprietary algorithm whose inner workings are known only to its vendor Equivant. Loomis attempted to challenge the use of the algorithm in Wisconsin’s Supreme Court but his challenge was ultimately denied.
Unfortunately, incidents such as these are only worsening the widely held perception of AI as a dangerous tool: opaque, under-regulated and capable of encoding the worst of society's prejudices.
I will focus here on the example of a loan application, since it is a simpler problem to frame and analyse, but the points I make are generalisable to any kind of bias and protected category.
I would like to point out first that I strongly doubt that anybody at Apple or Goldman Sachs has sat down and created an explicit set of rules that take gender into account for loan decisions.
Let us first of all imagine that we are creating a machine learning model which predicts the probability of a person defaulting on a loan. There are a number of ‘protected categories’, such as gender, which we are not allowed to discriminate on.
Developing and training a loan decision AI is the kind of 'vanilla' data science problem that routinely pops up on Kaggle (a website that lets you participate in data science competitions) and which aspiring data scientists can expect to be asked about in job interviews. The recipe to make a robot loan officer is as follows:
Imagine you have a large table of 10,000 rows describing loan applicants that your bank has seen in the past:
| age | income | credit score | gender | education level | number of years at employer | job title | did they default? |
|-----|--------|--------------|--------|-----------------|------------------------------|-----------|--------------------|
The final column is what we want to predict.
You would take this data, and split the rows into three groups, called the training set, the validation set and the test set.
You then pick a machine learning algorithm, such as Logistic Regression, Random Forests or Neural Networks, and let it 'learn' from the training rows without letting it see the validation rows. You then test it on the validation set. You rinse and repeat for different algorithms, tweaking the algorithms each time, and the model you will eventually deploy is the one that scored the highest on your validation rows.
When you have finished you are allowed to test your model on the test dataset and check its performance.
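A minimal sketch of that recipe in scikit-learn; the file name and column names are placeholders, and the categorical columns are assumed to have been encoded as numbers already:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("loan_applicants.csv")            # placeholder: the table described above
X = df.drop(columns=["did_they_default"])          # assume categoricals are already numeric
y = df["did_they_default"]

# roughly 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier().fit(X_train, y_train)
print(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))   # compare candidate models here
# only the final, chosen model gets evaluated once on X_test / y_test
```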
Now obviously if the ‘gender’ column was present in the training data, then there is a risk of building a biased model.
However the Apple/Goldman data scientists probably removed that column from their dataset at the outset.
So how can the digital money lender still be gender biased? Surely there’s no way for our algorithm to be sexist, right? After all it doesn’t even know an applicant’s gender!
Unfortunately and counter-intuitively, it is still possible for bias to creep in!
There might be information in our dataset that is a proxy for gender. For example: tenure in current job, salary and especially job title could all correlate with our applicant being male or female.
If it’s possible to train a machine learning model on your sanitised dataset to predict the gender with any degree of accuracy, then you are running the risk of your model accidentally being gender biased. Your loan prediction model could learn to use the implicit hints about gender in the dataset, even if it can’t see the gender itself.
I would like to propose an addition to the workflow of AI development: we should attack our AI from different angles, attempting to discover any possible bias, before deploying it.
It’s not enough just to remove the protected categories from your dataset, dust off your hands and think ‘job done’.
We also need to play devil’s advocate when we develop an AI, and instead of just attempting to remove causes of bias, we should attempt to prove the presence of bias.
If you are familiar with the field of cyber security, then you will have heard of the concept of a pen-test or penetration test. A person who was not involved in developing your system, perhaps an external consultant, attempts to hack your system to discover vulnerabilities.
I propose that we introduce the AI pen-test, an analogue of the security pen-test, for uncovering and eliminating AI bias:
To pen-test an AI for bias, either an external person, or an internal data scientist who was not involved in the algorithm development, would attempt to build a predictive model to reconstruct the removed protected categories.
So returning to the loan example, if you have scrubbed out the gender from your dataset, the pen-tester would try his or her hardest to make a predictive model to put it back. Perhaps you should pay them a bonus if they manage to reconstruct the gender with any degree of accuracy, reflecting the money you would otherwise have spent on damage control, had you unwittingly shipped a sexist loan prediction model.
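A minimal sketch of what this pen-test could look like, assuming the sanitised feature table and the withheld gender column are made available to the tester:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def proxy_bias_score(X_sanitised, protected_column):
    """Try to predict a removed protected attribute from the remaining features.

    An AUC near 0.5 means the features carry little proxy information;
    an AUC near 1.0 means the attribute can still be reconstructed.
    """
    scores = cross_val_score(RandomForestClassifier(), X_sanitised, protected_column,
                             cv=5, scoring="roc_auc")
    return scores.mean()

# e.g. proxy_bias_score(loan_features_without_gender, gender_column)
```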
In addition to the pen-test above, I suggest the following further checks:
I have not covered some of the more obvious causes of AI bias. For example it is possible that the training data itself is biased. This is highly likely in the case of some of the algorithms used in the criminal justice system.
Let’s assume that you have discovered that the algorithm you have trained does indeed exhibit a bias for a protected category such as gender. Your options to mitigate this are:
One application of this approach that I would be interested in investigating further is how to eliminate bias if you are using machine learning for recruitment. Imagine you have an algorithm matching CVs to jobs. If it inadvertently spots gaps in people's CVs that correspond to maternity leave, and therefore to gender, we run the risk of a discriminatory AI. I imagine this could be compensated for by some of the above suggestions, such as tweaking the training data and artificially removing this kind of signal. I think that the pen-test would be a powerful tool for this challenge.
Today large companies are very much aware of the potential for bad PR to go viral. So if the Apple Card algorithm is indeed biased I am surprised that nobody checked the algorithm more thoroughly before shipping it.
A credit limit differing by a factor of 10 depending on gender is an egregious error.
Had the data scientists involved in the loan algorithm, or indeed the recidivism prediction algorithm used by the state of Wisconsin, followed my checklist above for pen-testing and stress testing their algorithms, I imagine they would have spotted the PR disaster before it had a chance to make headlines.
Of course it is easy to point fingers after the fact, and the field of data science in big industry is as yet in its infancy. Some would call it a Wild West of under-regulation.
I think we can also be glad that some conservative industries such as healthcare have not yet adopted AI for important decisions. Imagine the fallout if a melanoma-analysing algorithm, or amniocentesis decision making model, turned out to have a racial bias.
For this reason I would strongly recommend that large companies releasing algorithms into the wild to make important decisions set aside a dedicated team of data scientists whose job is not to develop algorithms, but to pen-test and stress-test them.
The data scientists developing the models are under too much time pressure to be able to do this themselves, and as the cybersecurity industry has discovered through years of experience, sometimes it is best to have an external person play devil’s advocate and try to break your system.
In an earlier post I wrote about predicting the spend of a single known customer. A related problem is predicting the total spend of all your customers, or of a sizeable segment of them.
Time series approach: segments of customers
If you don’t need to predict the spend of an individual customer, but you’re happy to predict it for groups of customers, you can bundle customers up into groups. For example rather than needing to predict the future spend of Customer No. 23745993, you may want to predict the average spend of all customers in Socioeconomic Class A at Store 6342.
In this case the great advantage is that you would not have so many empty values in your past time series. So your time series may look like this:
This means you can use a time series library such as Prophet, developed by Facebook.
Here’s what Prophet produces when I give it the data points I showed above, and ask it to produce a prediction for the next few days. You can see that it’s picked up the weekly cycle correctly.
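A minimal sketch of how a forecast like this is produced; the CSV name is a placeholder, and the columns ds (date) and y (value) are the names Prophet expects:

```python
import pandas as pd
from prophet import Prophet   # in older versions the package is called fbprophet

# placeholder: one row per day with the average spend of a customer segment
df = pd.read_csv("segment_daily_spend.csv")       # columns: ds (date), y (average spend)

model = Prophet()             # weekly and yearly seasonality are handled automatically
model.fit(df)

future = model.make_future_dataframe(periods=14)  # extend 14 days beyond the last observation
forecast = model.predict(future)
model.plot(forecast)          # produces the kind of plot shown above
```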
This approach would be very useful if you only needed the data for budgeting or stock planning purposes for an individual store and not for individual customers.
However if you had small enough customer segments, you may find that the prediction for a customer’s segment is adequate as a prediction for that customer.
The next step up in complexity is a multilevel model, where you fit a separate model for each region or economic group of customers and combine them into a single overall model.
To get the maximum predictive power you can try ways of combining time series methods with a predictive modelling approach, such as taking the results of a time series prediction for a customer’s segment and using it as input to a predictive model.
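As a rough sketch of that combination, the segment-level forecast simply becomes one more input column for a standard regression model; all of the file and column names here are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# placeholder table: one row per customer per day, where 'segment_forecast' holds the
# Prophet prediction for that customer's segment on that day
customers = pd.read_csv("customer_days.csv")

features = ["segment_forecast", "days_since_last_purchase", "average_basket_value"]
X, y = customers[features], customers["next_day_spend"]

model = GradientBoostingRegressor().fit(X, y)     # per-customer prediction built on the segment forecast
```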
If you have a prediction problem in retail, or would like some help with another problem in data science, I'd love to hear from you. Please contact me via the contact form.
An overview of some of the projects I have been involved with in the past.
A large retail company had GPS records of vehicle telematics. I built an ML model to predict how long it takes to unload a vehicle and close the loading bay door, taking into account product types, time of day and other variables. The model was required to return a prediction within a few milliseconds. It was deployed and integrated into the company's traffic planning software, allowing them to work with more accurate schedules and improving efficiency.
An internet-based company had a signup form where users would upload some text files and then fill out a large number of small text and dropdown fields. By training a machine learning model on past data I was able to accurately predict some of the values, allowing some fields to be removed from the form. In an A/B test this was shown to improve conversions.