
7 ways to catch a Data Scientist’s lies and deception

source link: https://towardsdatascience.com/7-ways-to-catch-a-data-scientists-lies-and-deception-5eaae79d2303?gi=f7f8bae11ae2

7 simple principles to make sure you’re not being taken advantage of by someone selling you “AI” and “Machine Learning”

Whether you are a business leader, entrepreneur, angel investor, part of your company’s middle management, a judge at a hackathon or someone otherwise involved in ‘tech’, at some point you are likely to end up in a situation where someone is trying to ‘sell’ you their “AI product”, “Machine Learning software” or some other fancy fusion of buzzwords. If you find yourself in such a situation, it is natural to feel that you lack the knowledge and expertise to make a sound decision. Stand your ground and don’t be overwhelmed! The following are 7 common-sense principles for separating the signal from the noise. They will help you cut through the BS and understand the core value proposition of the Machine Learning solution you are being sold.

1. “We used A.I. to…”


Be very careful when someone says “AI”. While it is probably just fanciful marketing, it could also be a sincere effort to abstract away painfully complicated details so as not to bother you. Give them the benefit of the doubt BUT delve into the details. Find out which specific Machine Learning model they used. Here are a few other critical questions to ask:

  1. Which other methods (models/algorithms/techniques) did you try, and how did their results compare to those of the chosen solution? (ask for graphical evidence if possible)
  2. Why did you choose this method over the others?
  3. Why do you think this method outperforms the others on this data?
  4. Has someone else solved a similar problem? If yes, which method did they use?

At first, you may not necessarily understand all the details of the answers to these questions, but you should ask, clarify and understand as much as you can.
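
To make the first question concrete, here is a minimal sketch (in Python with scikit-learn) of what an honest model comparison could look like. The dataset and the three candidate models are illustrative stand-ins, not anyone’s actual product:

```python
# Comparing candidate models with cross-validation (illustrative sketch).
# The dataset and the candidate models below are placeholders, not a
# description of any vendor's real setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Report the mean and spread of one agreed-upon metric for every candidate,
# so the comparison behind "we used AI" can actually be inspected.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name:>20}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Someone who has genuinely compared methods should be able to show you something like this output, with the chosen model, the alternatives and the metric side by side.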

If you can’t explain it simply, you don’t understand it well enough. — Albert Einstein

In my experience, I have not come across a single Machine Learning concept that cannot be explained by analogy. So, if communicating the technical details is a challenge, ask for a high-level explanation. Such scrutiny will not only expand your understanding, it will also show how well the solution has been thought through. (It will also establish that your meeting room is a no-BS zone 😎)

2. Survival of the Adaptable


In the 1990s and early 2000s, a spam filter in your email inbox would look for spelling errors and other simple indicators to automatically put the spam emails into the spam folder. Now, spammers have become smarter and spam emails have become increasingly difficult to detect. The Machine Learning models used by modern email providers have had to adapt and become more sophisticated in identifying spam emails correctly.

“All failure is failure to adapt, all success is successful adaptation” — Max McKeown

One thing you must clarify is how readily the Machine Learning model can be re-trained on new data, or replaced with a better-performing one, as time passes and the input data evolves. This is essential because you deserve to know whether there is an ‘expiry date’ on the solution you are being sold.
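
To picture what ‘adaptable’ means in practice, here is a hedged sketch of a retraining check. The function, the 0.80 threshold and the use of scikit-learn are assumptions made purely for illustration, not a description of any particular pipeline:

```python
# Sketch of a drift-aware retraining check (an assumed workflow, not any
# vendor's actual pipeline): score the deployed model on the newest
# labelled batch of data and re-fit it when performance falls too low.
from sklearn.base import clone
from sklearn.metrics import f1_score

def maybe_retrain(model, X_recent, y_recent, threshold=0.80):
    """Return a (possibly re-fitted) model and its score on recent data."""
    score = f1_score(y_recent, model.predict(X_recent))
    if score < threshold:
        # Re-fit a fresh copy of the model on the newest data; training on
        # a rolling window of old + new data is an equally reasonable choice.
        model = clone(model).fit(X_recent, y_recent)
    return model, score
```

Whatever the exact mechanism, the point to establish is that a mechanism exists, and to ask how often it runs and who is responsible for it.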

3. Garbage In Garbage Out


A Machine Learning model is only as good as its data. Therefore, you should ascertain the quality of the data used to train it. While “quality” is difficult to define and differs with context, one simple way to gauge the quality of training data is to ask how similar and representative it is compared to the ‘real world’ data the model will face.

“In God we trust, all others bring (good quality) data.” — W Edwards Deming

No matter how fancy or cutting-edge a Machine Learning model might be, if the data on which it is trained is of poor quality, the results are bound to be lousy.
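
One concrete, if simplified, way to probe that similarity is to compare how each feature is distributed in the training data versus the live data. The sketch below assumes the data arrives as pandas DataFrames and uses a two-sample Kolmogorov-Smirnov test; the function and the significance threshold are illustrative, not a standard recipe:

```python
# Rough check of how similar the training data is to the 'real world' data:
# flag numeric columns whose distributions differ noticeably between the two.
# Inputs are assumed to be pandas DataFrames with matching columns.
import numpy as np
from scipy.stats import ks_2samp

def distribution_shift_report(train_df, live_df, alpha=0.01):
    """Return numeric columns whose training and live distributions differ."""
    drifted = []
    for col in train_df.select_dtypes(include=np.number).columns:
        result = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        if result.pvalue < alpha:
            drifted.append((col, result.statistic))
    # Largest distribution gap first.
    return sorted(drifted, key=lambda item: -item[1])
```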

4. More, more, more!


In general, the more data a model has been trained on, the better it performs (ceteris paribus). This is especially true for Deep Learning models. You can think of a Machine Learning model as a high school student practicing questions for the SATs. Practicing a larger number and variety of questions increases the likelihood of the student performing better on the test.

“It is a capital mistake to theorize before one has (ample) data.” — Sherlock Holmes

It is essential to ensure that ample data has been used to train any Machine Learning model. How much data is enough? It is difficult to say exactly how much is needed, but the more the better! Ideally, the data should come from reliable sources, and those sources should be used exhaustively.
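
A practical, if rough, way to check whether more data would still help is to look at a learning curve: model performance plotted (or printed) against training-set size. The dataset and model below are toy stand-ins used only to show the idea:

```python
# Learning-curve sketch: if the validation score is still climbing at the
# largest training size, the model would probably benefit from more data.
# Dataset and model are illustrative placeholders.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training examples -> validation accuracy {score:.3f}")
```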

5. Interpretability


In Machine Learning, there is often a trade-off between how well a model performs and how easily its behaviour, especially its poor performance, can be explained. Generally, for complex data, more sophisticated and complicated models tend to do better. However, because these models are more complicated, it becomes difficult to explain the effect of the input data on the output result. For example, imagine that you are using a very complex Machine Learning model to predict the sales of a product, where the inputs are the amounts of money spent on TV, newspaper and radio advertising. The complex model may give you very accurate sales predictions but may not be able to tell you which of the 3 advertising outlets impacts sales more and is more worth the money. A simpler model, on the other hand, might give a less accurate prediction but would be able to tell you which outlet is more worth the money.

You need to be aware of this trade-off between model performance and interpretability. This is crucial because where the balance between explainability and performance should lie depends on the objective, and hence should be your decision to make.
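
To make the advertising example concrete, here is a small illustration of the interpretable end of the trade-off: a plain linear regression on synthetic data whose coefficients read directly as ‘extra sales per extra unit of spend’. The numbers are made up for the sketch:

```python
# Interpretability sketch: a simple linear model exposes how much each
# advertising channel contributes to sales. The data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ad_spend = rng.uniform(0, 100, size=(500, 3))   # columns: TV, radio, newspaper
sales = 5.0 + 0.25 * ad_spend[:, 0] + 0.10 * ad_spend[:, 1] + rng.normal(0, 2, 500)

model = LinearRegression().fit(ad_spend, sales)
for channel, coef in zip(["TV", "radio", "newspaper"], model.coef_):
    # Each coefficient reads as "extra sales per extra unit spent" on that
    # channel -- exactly the kind of answer a more complex model may not give.
    print(f"{channel:>9}: {coef:+.3f}")
```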

6. Measuring the Right Thing in the Right Way


Accuracy is a very common metric for measuring the performance of a classification Machine Learning model. For example, a Machine Learning model for classifying pictures of cats and dogs with an accuracy of 96% could be considered very good: out of 100 pictures of cats and dogs, the model guesses 96 correctly. Now imagine a bank applies the same metric to classifying fraudulent transactions. Because fraudulent transactions are very rare, a fraud classifier might easily reach 96% accuracy, even one that flags almost nothing. However, catching fraudulent transactions is not really about being right 96% of the time. It is about catching as many of the fraudulent transactions as possible, because letting the small fraction of genuinely fraudulent transactions slip through as ‘not fraudulent’ could do a whole lot of damage.

Measurement is fabulous. Unless you’re busy measuring what’s easy to measure as opposed to what’s important. — Seth Godin

For the bank-fraud example, the number of false negatives is far more indicative of the model’s performance than accuracy. Other metrics, such as precision, recall, specificity and the F1 score, should be used instead of accuracy, depending on the problem. Here is an awesome article by Mohammed Sunasra that talks about when each of these should be used. Thus, it is critical to be mindful of using the right metric, and a variety of metrics if possible.
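
The bank-fraud argument is easy to reproduce in a few lines. In the sketch below, a do-nothing ‘model’ that never flags fraud still reaches 96% accuracy, while recall immediately exposes the problem; the numbers are invented to mirror the example above:

```python
# Why accuracy misleads on imbalanced data: a classifier that never flags
# fraud scores 96% accuracy yet catches zero fraudulent transactions.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([1] * 40 + [0] * 960)   # 4% of transactions are fraudulent
y_pred = np.zeros_like(y_true)            # a 'model' that always predicts "not fraud"

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.96, looks great
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0, catches nothing
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("F1 score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```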

7. So…what are your strengths and weaknesses?


A cliché in the world of corporate interviewing, the strengths-and-weaknesses question can come in very handy when evaluating a Machine Learning solution. When someone proposes a Machine Learning solution, you should definitely ask them about its limitations. It is essential to know the limitations to answer two key questions:

  1. Do the strengths outweigh the limitations enough to implement the solution?
  2. Could the limitations hamper performance in the future?

“The key to success is understanding one’s weaknesses and successfully compensating for them. People who lack that ability fail chronically.” — Ray Dalio

From the standpoint of implementing an effective and sustainable Machine Learning solution, knowing its limitations is critical to its success. Moreover, asking the proponents to come clean about the limitations of their solution will give you an idea of their level of transparency. It will indicate how well the solution has been thought through and how trustworthy the people proposing it are.

Conclusion

Regardless of how lacking in knowledge and overwhelmed you might feel, you have one secret weapon that can help you: a flashlight to guide you through the fog. That secret weapon is your ability to ask questions. Ask questions! Question, clarify and scrutinise everything you are not sure about. The 7 ideas above give you a holistic strategy and 7 critical dimensions along which to ask those questions. You can count on them to deepen your understanding and help you soundly evaluate a Machine Learning solution.

