I offer the full range of data science consulting services, from a simple overview and high level consultation, to building and deploying a machine learning model to production.
Look at how data is being gathered and used in your business, and identify opportunities to extract value from your large datasets
Use statistics and AI to identify meaningful patterns in your data, enabling you to make smart decisions
Design and train a machine learning model for your numeric, tabular, text or image data, making use of cutting edge machine learning tools
Bring AI solutions through to production, deploying with your preferred technology stack and fully integrating with your systems and APIs
Do you have millions of customers and need to predict the likely behaviour of each individual one? Who's going to switch to a competitor? Which is the most appropriate product recommendation? Or perhaps you need to predict unknown values in the future such as vehicle unloading times, travel times, signup rates, or customer spend? Maybe you have large amounts of unstructured text or image data? In all of these cases I can help.
Observations about the latest developments in the AI universe.
When I talk to my colleagues in data science about successful projects that we’ve done in the past, one recurring theme comes up. We ask ourselves, which of our data science projects made it through to deployment and are used by the company that commissioned them?
I think for most of us the reality is that only a minority of what we do ends up making a difference.
According to a recent Gartner report, only between 15% and 20% of data science projects get completed. Of those projects that do complete, CEOs say that only about 8% generate value. If these figures are accurate, this would amount to an astonishing overall success rate of under 2%.
So what is going wrong?
If you talk to the data scientists and analysts, you might hear: "I made a great model with wonderful accuracy, so why did nobody use it? The business stakeholders and executives were hard to get hold of and unengaged."
If you talk to the stakeholders, they will say: "The data scientists made a pretty model, and I was impressed by their qualifications, but it doesn't answer our question."
There could be a number of possible causes: some on the business side, some on the data science side, and some on both.
We need to structure the data science project effectively into a series of stages, so that engagement between the analytics team and the business does not break down.
Business question: First the project should start with a business question instead of focussing on data or technologies. The data scientists and executives should spend time together in a workshop formulating exactly what the question is that they want to solve. This is the initial hypothesis.
Data collection: Secondly the data scientist should move on to collecting only the relevant data that is needed to accept or reject the hypothesis. This should be done as quickly as possible rather than trying to do everything perfectly.
Back to stakeholders: Thirdly, the data scientist needs to present initial insights to the stakeholders so that the project can be properly scoped and both sides can establish what they want to achieve. The business stakeholders should be thoroughly involved, and the data scientist should make sure they understand what the ROI will be if the project proceeds. If the decision makers are not engaged at this stage, it would be a waste of money to continue with the project.
Investigation stage: Now the data scientist proceeds with the project. I recommend at least weekly catch-ups with the main stakeholder, and slightly less regular catch-ups with the high-ranking executive whose support is needed for the project. The data scientist should favour simple over complex and choose transparent AI solutions wherever possible. At all stages the data scientist should strive to maintain engagement: time spent in meetings with the stakeholder is not wasted, it nurtures business engagement. At all points both parties should keep an eye on whether the investigation is heading towards an ROI for the organisation.
Presentation of insights: Finally, at the end of the project the data scientist should present their insights and recommendations for the business to the stakeholder and all other high-ranking executives. It is worth going overboard with materials: produce a presentation, a video recording and a white paper, and also hand over source code, notebooks and data, so that both executive summaries and in-depth handover material are available for all levels in the commissioning organisation, from technical people to the CEO.
If the above steps are followed, by this point the value should be clear for the high ranking executives. The two-way communication between the data science team and the stakeholders should ensure ongoing buy-in and support from the business, and should also keep the data science work on track to delivering value by the end of the project.
You may have seen the news about Facebook’s new chatbot trained for empathy on 1.5 billion Reddit posts.
You might be wondering how it is possible to make a computer program converse with humans in a natural way.
Natural language dialogue systems, also known as virtual assistants or chatbots, are an interesting area of artificial intelligence, and a field with a long history.
There are several challenges we encounter when making a chatbot.
Back in 1950, Alan Turing proposed that we can consider a computer as being capable of thinking like a human being, if we can interrogate it and we are unable to distinguish its answers from the answers of a real human. This thought experiment is now called the Turing Test. Of course whether a bot would pass the Turing Test depends a lot on the kind of questions we ask: if we limit ourselves to a small domain then bots tend to perform quite well.
In the 1960s Joseph Weizenbaum created a program called ELIZA, a psychotherapist bot which re-phrased the user’s inputs by means of clever rules:
User: You are like my father in some ways.
ELIZA: What resemblance do you see?
User: You are not very aggressive but I think you don’t want me to notice that.
ELIZA: What makes you think I am not aggressive?
Example conversation with ELIZA
Clearly, even if ELIZA managed to fluke the Turing Test, there is no human-like intelligence present. ELIZA has no idea of the meaning of aggressive, and would not understand that belligerent is a synonym. If you told ELIZA "You are potato", it would probably blindly respond with "Why do you think I am potato?"
Fast forward to the 2010s and chatbots were already becoming a common solution for large organisations to cut costs in call centre staff. If you visit the website of any large airline, retailer or bank, you are often greeted by a little chat window where an avatar offers to guide you through the site.
These bots have two things in common: they operate within a narrow domain, and they are normally rule based, which means that a human has carefully crafted a set of rules to determine what response the bot should give in what context. In short, the same trick as ELIZA used but with more smoke and mirrors.
For example if you give an input I want to open an account, a banking bot will probably be listening for keywords open + account, and will trigger the corresponding pre-written response. Normally there is a cascade of rules that the bot attempts to match, going from strict to broad. So first the bot will check for open + account, and other two-word triggers, then simply account, and then fall back to a catch-all response such as I’m sorry, I didn’t quite understand what you’re looking for.
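A minimal sketch of this keyword cascade might look like the following (the triggers and responses here are invented for illustration, not taken from any real banking bot):

```python
# Rules are checked strictest first: two-word triggers, then one word,
# then a catch-all fallback -- the cascade described above.
RULES = [
    ({"open", "account"}, "Great, let's open an account. Which type would you like?"),
    ({"close", "account"}, "I'm sorry to see you go. I'll connect you to account closures."),
    ({"account"}, "I can help with accounts. Do you want to open or close one?"),
]

FALLBACK = "I'm sorry, I didn't quite understand what you're looking for."

def respond(utterance: str) -> str:
    words = set(utterance.lower().split())
    for keywords, response in RULES:
        if keywords <= words:  # all trigger words present in the input
            return response
    return FALLBACK

print(respond("I want to open an account"))
print(respond("Tell me a joke"))
```

Note that, exactly as with ELIZA, there is no understanding here: rephrase "open an account" as "set up an account" and the bot falls through to the one-word rule or the fallback.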
Retail website chatbots perform acceptably for the purpose for which they were designed, and can even cope with maintaining dialogue context, pronouns such as it/he/she, and rudimentary small talk. However they can be easily thrown by phrases or situations that they haven’t been designed for, and it’s labour intensive to develop them.
They are nearly always designed with a chat handover: when the bot fails to understand the input, the user is handed over to a human operator.
One development which brought chatbots to the public consciousness more than any other in the last ten years was Apple's introduction of Siri on the iPhone in 2011. Siri was a program that allowed you to say things like "Set my alarm for 5 am tomorrow" instead of doing this via the touchscreen.
To the best of my knowledge Siri was no more sophisticated than the bots I have described above, but the idea of combining a bot with voice interaction and to put it on a smartphone was very novel at the time, and brought a storm of publicity to the previously niche field of dialogue systems.
Siri sparked an arms race with other electronics companies, mobile phone manufacturers and Silicon Valley giants rushing to acquire or develop their own voice controlled virtual assistant. The next few years saw the release of Microsoft’s Cortana, Samsung’s Bixby, Amazon’s Alexa and Google Now.
Now it is quite easy to get started making your own chatbot: Google, Microsoft and Amazon, for example, all offer ways to build a bot for free.
We are starting to move away from hand-developed rules. Modern bot design interfaces have you enter a set of sample phrases that you want to recognise, and use machine learning to generalise them into a pattern, so that a new, unseen utterance can be correctly categorised.
In recent years we’ve seen some exciting advances in deep learning for natural language processing.
For example, we no longer need to listen for keywords in an utterance in order to guess at the user's intent.
In 2003 Yoshua Bengio and colleagues developed the idea of word embeddings. Every word in the vocabulary is assigned a vector in a multi-dimensional space. For example, want and desire mean nearly the same thing, so their word vectors would be close together in the space.
If you use word embeddings then you can start to calculate distances between words and move towards a probability that a user wants to open an account, or contact support.
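As an illustration, here is a toy example of measuring the distance between word vectors using cosine similarity. The three-dimensional vectors below are invented for the sake of the example; real embeddings have hundreds of dimensions and are learned from large text corpora:

```python
import math

# Invented toy "word vectors" -- real embeddings are learned, not hand-set.
vectors = {
    "want":   [0.9, 0.1, 0.2],
    "desire": [0.8, 0.2, 0.2],
    "bank":   [0.1, 0.9, 0.4],
}

def cosine_similarity(a, b):
    # 1.0 means identical direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Near-synonyms should score close to 1; unrelated words much lower.
print(cosine_similarity(vectors["want"], vectors["desire"]))
print(cosine_similarity(vectors["want"], vectors["bank"]))
```

A bot can then compare the vector of an incoming utterance against the vectors of its known intents, rather than matching literal keywords.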
The next step up from word embeddings is a technology called BERT, developed in 2018. BERT is a neural network design that allows us to calculate a word vector taking into account the entire sentence, so that bank in the sense of financial institution and in the sense of riverbank would have different vectors. With BERT it’s possible to calculate a vector of an entire sentence.
Currently in all the dialogue system software that I’ve tried, you can upload a list of sample utterances to train a model, and you manually define what values you want your bot to listen for (destination cities, account types, product names, etc). You then manually define the desired behaviour if the user utters the right words.
What I would like to imagine on the horizon is machine learning being leveraged to improve chatbots from all angles, and researchers are experimenting with a number of ideas along these lines at the moment.
If you think I’ve missed anything important please add it in the comments below.
Of course it takes some time for any of these ideas to become commercially viable. But we can expect to see some exciting leaps in the next decade as the field becomes more democratised and more accessible to non-programmers.
One challenge that large organisations face today is the problem of understanding and predicting which employees are going to leave the business, called employee turnover or workforce attrition.
Employees do not always participate in offboarding processes, may not be truly forthcoming in the HR exit interview, and by the time the exit interview comes around it’s too late to address the issues which caused the employee to leave in the first place.
Furthermore, if you have a large workforce, then you may want to be able to predict which employees are at risk of leaving at any given time, how long they are expected to stay, and get a hint of which interventions may have a chance of reducing attrition.
Fortunately most organisations today will have some form of employee database. This can be a gold mine for data scientists who want to predict or explain employee turnover.
This problem is a little trickier than predicting customer spend. Any employee database is going to contain highly sensitive information. If you are in the UK or EU, the GDPR limits the type of analysis you can do on an employee database, the actions you are allowed to take based on employee data, and even the technology you can use. You may not be able to use external data storage and processors such as cloud services, and if you do, you may be restricted to European servers.
Typically around your organisation you will have a disparate set of databases such as salary databases, employee address databases, onboarding and recruitment records, etc. They are probably maintained by different departments.
The first step would be to find a way to unify the datasets so that for every current and past employee you can easily access all data about them. You want to know when someone joined the organisation, when they advanced pay grades, and when they left.
It is worth making a point here about the data strategies which you can adopt as an organisation to make this kind of data science experiment easier.
Once you have found out how to join together each employee's records, the next step is to transform the dataset into a single flat table, which is the easiest format to feed into a machine learning algorithm.
There are many ways of transforming your employee data into a single table but here is one of the simplest:
You create a single table representing every employee present in the organisation on 1 January 2019, with columns for values such as the time they have spent in the organisation, and a final column set to TRUE or FALSE (a Boolean value), indicating whether they left the organisation by 31 January or not. This can be your training data.
Then create the same snapshot table for 1 January 2020, which can be your test data.
You can be creative with your columns. For example if you have the employees’ home address on the snapshot date then you can calculate their distance to the office, travel time, etc. The important thing is that all values in your table should be the values at the date of the snapshot, so the distance in the training table should be the distance from the office on 1 Jan 2019 to their home on 1 Jan 2019, and likewise the age should be the employee’s age on that date.
This latter point is quite tricky to get right, and if you are doing the joining operation in SQL you will need to use window functions.
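To make the snapshot idea concrete, here is a minimal pure-Python sketch. The records and dates are invented for illustration; in practice you would build this against your joined databases, in SQL or pandas:

```python
from datetime import date

# Hypothetical raw employment records, one per employee.
# leave_date is None for employees still in the organisation.
records = [
    {"employee_id": 1, "join_date": date(2015, 3, 1), "leave_date": date(2019, 1, 20)},
    {"employee_id": 2, "join_date": date(2018, 6, 15), "leave_date": None},
    {"employee_id": 3, "join_date": date(2019, 5, 1), "leave_date": None},
]

def snapshot(records, snapshot_date, horizon_end):
    """One row per employee present on snapshot_date, with feature values
    as they were on that date and a Boolean label indicating whether the
    employee left before horizon_end."""
    rows = []
    for r in records:
        joined = r["join_date"] <= snapshot_date
        still_there = r["leave_date"] is None or r["leave_date"] > snapshot_date
        if not (joined and still_there):
            continue  # not employed on the snapshot date
        rows.append({
            "employee_id": r["employee_id"],
            "days_in_organisation": (snapshot_date - r["join_date"]).days,
            "left": r["leave_date"] is not None and r["leave_date"] <= horizon_end,
        })
    return rows

# Training snapshot: everyone employed on 1 Jan 2019, labelled by
# whether they left during January 2019.
train = snapshot(records, date(2019, 1, 1), date(2019, 1, 31))
print(train)
```

Employee 3 joined after the snapshot date and is correctly excluded; every feature is computed as of the snapshot date, which is the tricky invariant discussed above.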
Now you have both tables, you can feed them into a machine learning algorithm of your choice. You tell the algorithm these two things:
I know the employee’s age, distance to the office, time at pay grade, time in the organisation
I want to predict the employee’s turnover over the following month
If you like Python I recommend to try a Random Forest or Gradient Boosted Tree, or you can also use a cloud based auto ML tool such as Microsoft Azure or Google Cloud Platform. There are a plethora of tutorials available to get started.
Make sure to exclude the Employee ID from the analysis, since otherwise you run the risk of your model just memorising which employees have left!
You can train your model on the 2019 data and evaluate on the 2020 data. If it performs well then you know your model is robust enough to learn patterns and apply them on a cohort of employees a year in the future.
It means that you can analyse the current cohort of employees and produce a ranking of those who are most at risk of dropping out, allowing your human resources department to target retention measures effectively.
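The train-and-evaluate workflow above can be sketched in scikit-learn as follows. The column set, the synthetic data and the "ground truth" attrition rule are all invented for illustration; in reality the two tables would come from your 2019 and 2020 snapshots:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the snapshot tables. Columns:
# [age, distance_to_office_km, days_at_pay_grade, days_in_organisation]
def make_snapshot(n):
    X = np.column_stack([
        rng.uniform(20, 65, n),
        rng.uniform(0, 50, n),
        rng.uniform(0, 2000, n),
        rng.uniform(0, 5000, n),
    ])
    # Invented ground truth: long commutes drive attrition, plus noise.
    y = (X[:, 1] + rng.normal(0, 5, n)) > 35
    return X, y

X_2019, y_2019 = make_snapshot(2000)  # training snapshot (1 Jan 2019)
X_2020, y_2020 = make_snapshot(500)   # test snapshot (1 Jan 2020)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_2019, y_2019)

acc = accuracy_score(y_2020, model.predict(X_2020))
print("held-out accuracy:", round(acc, 3))
print("feature importances:", model.feature_importances_)
```

Note that the Employee ID column never appears in the feature matrix, and the model's feature importances (discussed further below) correctly pick out commute distance as the driver in this synthetic setup.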
(In practice I would not just make a snapshot on a single date but rather take various snapshots throughout the year, trying to keep even amounts of data from every month to eliminate bias from seasonal effects. There is no reason to limit the monitored period to 1 month either – you can always train it to predict attrition in the next year or decade if you have enough data.)
Of course your model will not have foreseen the Covid-19 pandemic of 2020. This will always be a limitation of machine learning, which involves learning patterns from the past to apply to the future. However you can design any system using your model to allow for a manual ‘adjustment factor’, for example to let you adjust attrition for all employees by a user defined constant during an economic downturn.
Most machine learning models will allow you to look inside and analyse how they are making the decisions that they return. This is called model explainability, or feature importances.
If you find that distance from home to office is a major factor in attrition then you can adjust recruitment policy to prioritise candidates who live close by, or include a relocation package, or put on a company bus or car pooling scheme. Of course you can’t use this information to discriminate on age.
The benefits of applying a model like this extend beyond its pure prediction capabilities towards insights that can modify the operations of the organisation as a whole. The cost savings to the organisation are two-fold as HR professionals can use the model’s explanations to develop retention policies across the business and also target high risk individuals with retention initiatives.
Other than the classification model described above, there are two other ways I can think of that you could try to model turnover.
Firstly – instead of classification, why not try a regression model to predict the total time that an employee will stay in the business from a given date?
Actually, I would not recommend using regression, for the following reasons. For the employees present on 1 January 2019, we know how much longer everybody stayed in the company, measured up to today, 29 April 2020, i.e. a maximum of 484 days. For anybody who is still in the business at day 484, we know that their total stay is greater than or equal to 484 days, but we cannot define it exactly. You would have to think of a workaround for the model. If you set the stay to 484, or any arbitrarily large value, then you are introducing a bias which a regression model will not handle correctly. If you simply exclude these people, you will introduce another bias. Statisticians would say that our data is right censored.
If you wanted to use machine learning to predict the total remaining time someone will stay in the business, then I would suggest training a separate classification model to predict attrition within 1 month, 2 months, etc, and combining them when you want to make a prediction.
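One hedged sketch of how those separate monthly models might be combined is below. The probabilities are invented stand-ins for classifier outputs, and linear interpolation between the modelled horizons is just one possible design choice:

```python
# Hypothetical outputs of separate classifiers for one employee:
# p_left_by[k] = predicted probability the employee has left within k months.
p_left_by = {1: 0.05, 2: 0.12, 3: 0.20, 6: 0.35, 12: 0.60}

def survival(month):
    """Probability the employee is still present after `month` months,
    linearly interpolating between the modelled horizons."""
    if month <= 0:
        return 1.0
    prev_m, prev_p = 0, 0.0
    for m in sorted(p_left_by):
        if month <= m:
            frac = (month - prev_m) / (m - prev_m)
            return 1.0 - (prev_p + frac * (p_left_by[m] - prev_p))
        prev_m, prev_p = m, p_left_by[m]
    return 1.0 - prev_p  # beyond the last modelled horizon

# Expected stay within the modelled year: sum of monthly survival probabilities.
expected_months = sum(survival(m) for m in range(1, 13))
print(round(expected_months, 2))
```

Summing the survival probabilities gives an expected tenure within the modelled window, without ever needing an (impossible) exact label for censored employees.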
As another alternative to the classification model, one possible tool we could borrow from statistics is survival analysis.
This is used, for example, in clinical studies on diseases with high fatality rates (mainly cancer, heart attack and stroke patients), to analyse the proportion of a starting cohort of patients who have not died by various points in time.
The survival rates can be plotted on a curve called a Kaplan-Meier curve.
You can also calculate a number called the Kaplan-Meier estimator which is an approximation of the survival rate at any point in time.
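The Kaplan-Meier estimator itself is simple enough to sketch in a few lines of Python. The tenures and left/censored flags below are invented; in practice each employee would contribute one duration:

```python
# Each employee contributes a tenure in days and a flag: True if they
# left (an observed "event"), False if still employed (right censored).
durations = [100, 200, 200, 350, 400, 500]
left =      [True, True, False, True, False, False]

def kaplan_meier(durations, observed):
    """Return [(t, S(t))]: the estimated survival probability at each
    time where at least one leaving event was observed."""
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        if events:
            surv *= 1 - events / at_risk
            curve.append((t, surv))
        # everyone with duration t (left or censored) exits the risk set
        at_risk -= sum(1 for d in durations if d == t)
    return curve

print(kaplan_meier(durations, left))
```

Notice that the censored employees still count towards the at-risk denominator for as long as they were observed, which is exactly how the estimator stays robust to right censoring.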
Survival analysis is robust to right censoring and so could be used to analyse employee attrition on a longer time scale than the machine learning model, however it becomes more complex to use when we are predicting from lots of independent variables (commute length, age, pay grade etc).
I am not aware of any businesses using survival analysis to predict employee attrition but I would be interested to hear if anyone is doing it.
Automated machine learning is software which in theory allows anybody to design, train and deploy machine learning models to production environments without needing to write any code. It is often a drag-and-drop experience similar to PowerPoint.
You may have heard a lot about automated machine learning recently. Examples include Microsoft's Azure ML Studio, Google's Cloud AutoML and Amazon's AWS AutoPilot, among others.
On 7th April Forbes even ran the headline AutoML 2.0: Is The Data Scientist Obsolete? (Their conclusion: no they aren’t.)
In fact according to the marketing literature of the companies selling automated ML, there is no need to hire data scientists any more. Automated ML will democratise data science and allow non technical people to build their own models.
However I have tried out a couple of these tools and found that although they are extremely useful, they by no means automate even half of my work.
What’s the catch?
For one, if you look through the examples in the tutorials of any of these platforms, you will see that you nearly always need a nice neat table of, say, your customers' banking history, with a final column of 0s and 1s indicating whether they were granted a loan.
In real life, the organisation building the model would not have a nice table of clean data like this lying around. A person's banking or purchase history will be spread over many rows of different tables in different systems. You would go through several iterations of finding the different data sources and joining them up into the format that the automated ML tools expect, and you would spend a lot of time pestering managers in remote departments of the company for access to data. It is this data gathering and cleaning (as well as pestering) which often makes up 90% of a data scientist's job.
Furthermore, when you dig into the tutorials of these packages, you find that the automated ML tools only allow you to do an extremely limited set of things through the drag-and-drop interface, and once you get away from the beginners' examples you find yourself having to start programming in Python to use the automated ML libraries. I think this was always inevitable: nobody seriously suggests that software development will be replaced by a drag-and-drop interface, so why are we having this conversation about data science?
Having said that, there are some things that I found automated ML to be extremely useful for. Often, once we have done the data preparation step I described above, we end up doing a painstaking search through many different ML algorithms (Random Forest, Gradient Boosted Tree, neural networks, etc.) with all different configurations. With one of the automated ML packages, you can be coding in Python and simply train an automated ML model, and under the hood the software will run every algorithm in its toolbox and pick the best performing one.
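To illustrate what such a search looks like under the hood, here is a rough scikit-learn sketch that tries a few algorithms and keeps the one with the best cross-validated score. A built-in dataset stands in for real project data, and the candidate list is of course far shorter than a commercial AutoML toolbox:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = [
    LogisticRegression(max_iter=5000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Evaluate every candidate with 5-fold cross-validation, keep the best.
best_model, best_score = None, 0.0
for model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
    if score > best_score:
        best_model, best_score = model, score

print("Best:", type(best_model).__name__)
```

AutoML products add hyperparameter search, feature preprocessing and early stopping on top of this loop, but the basic idea is the same.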
I have been using Azure ML for my last few projects (predictive models in healthcare) and I found that in terms of accuracy it outperformed the basic models that I was building in Scikit-learn, and was quicker to use as well because I only had to write a few lines of code.
In conclusion I think that automated ML allows data scientists to be more productive and is another useful tool in a data scientist’s repertoire. In addition it provides a degree of democratisation by allowing non-data scientists to see and participate in data science for the first time. But nobody’s job is going to be automated just yet.
Ryohei Fujimaki, AutoML 2.0: Is The Data Scientist Obsolete?, Forbes (2020)
An overview of some of the projects I have been involved with in the past.
A large retail company had GPS records of vehicle telematics. I built an ML model to produce predictions of how long it takes to unload a vehicle and close the loading bay door, taking into account product types, time of day, and other variables. The predictive model had a constraint that it should return a prediction within a few milliseconds. The model was deployed and integrated into their traffic planning software, allowing the company to work with more accurate schedules, improving efficiency.
An internet-based company had a signup form where users would upload some text files and then fill out a large number of small text and dropdown fields. By training a machine learning model on the past data I was able to accurately predict some of the values, allowing some fields to be removed from the form. In an A/B test this was shown to improve conversions.