
Acquiring the un-acquirable: rarely taught data skills that you need to know

source link: https://datandi.com/acquiring-the-un-acquirable-rarely-taught-data-skills-that-you-need-to-know/

There is both a shortage of data scientists and a growing demand to train more of them. Increasingly this happens in classrooms, but a high proportion of practicing data scientists are still self-taught (more than 1 in 4 in recent Kaggle research). Lifelong learning also matters: whether you are a formally trained data scientist or self-taught, you will continue learning, relying on a combination of MOOCs, textbooks and other self-teaching to do so.

So with all this learning happening, for the benefit of both future and present data scientists, what are the professional skills that are important to know but difficult to teach? Where is the biggest gap between working with data in real life and in theory? Here are 4 areas that are unlikely to ever be taught proficiently, but are a critical part of your success and your ability to achieve data science outcomes in the workforce.

Getting and cleaning the data

I guarantee you one thing. Starting a data science exercise in the real world will not entail clicking a link, downloading a CSV file and then proceeding on your merry way. It is much more complicated (and time-consuming) than that.

Firstly, you will most likely need to find the data sources you want to use and start stitching them together into something resembling what you wish to work with, before doing an extensive exercise to ensure that the data is clean and usable. In parallel, you will be building an understanding of the business rules that may have been used to create any features you are not immediately familiar with, to ensure that you use them correctly. Add to this any feature engineering, imputation and other additive steps you may want to make, and chances are you have just spent a considerable amount of your overall project time simply getting ready.
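
To make that concrete, here is a minimal, hypothetical pandas sketch of that preparation stage. The file names, columns and cleaning rules are invented purely for illustration; your own sources will look different:

```python
import pandas as pd

# Hypothetical extracts: in practice these might come from a warehouse,
# an internal API or several teams' spreadsheets rather than tidy CSV files.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("customers.csv")

# Stitch the sources together on a shared key.
df = orders.merge(customers, on="customer_id", how="left")

# Basic cleaning: drop exact duplicates, standardise a text field,
# and impute a numeric column with its median.
df = df.drop_duplicates()
df["country"] = df["country"].str.strip().str.upper()
df["order_value"] = df["order_value"].fillna(df["order_value"].median())

# Simple feature engineering: days since the customer's first order.
first_order = df.groupby("customer_id")["order_date"].transform("min")
df["days_since_first_order"] = (df["order_date"] - first_order).dt.days
```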

To do this takes a combination of business, technical and soft skills, as well as a dash of imagination, skills that are hard to teach. Definitely do not underestimate how much data preparation and sourcing you will be doing, and how excelling in this area enables the remainder of your work.

Managing workflow

For the sake of consistency, most teaching guides explain the data science process using a single programming language or tool, with no deviation from it. This gives the impression that everything can and should be achieved within one tool. That is certainly possible, but it is unlikely to be effective.

To be effective, you will most likely be jumping from tool to tool as you move through the various stages of the data science process. You may do your EDA in a different tool from the one in which you prepare your data, which again may differ from the tool you use for visualisation. Learning which tools work for you and your organisation, so that you achieve the outcomes you want in the shortest amount of time, is one of the primary skills in data science, yet one that is rarely taught. Be tool agnostic and focus on achieving actionable results.
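
As a small, hypothetical illustration of one such handoff: you might prepare data in pandas and then write it to a flat file that the next tool in your chain (a BI tool, an R session, another notebook) picks up. The file and column names here are invented:

```python
import pandas as pd

# Prepare the data in one tool (here, pandas)...
events = pd.read_csv("raw_events.csv")
summary = (
    events.groupby("channel", as_index=False)["revenue"]
          .sum()
          .sort_values("revenue", ascending=False)
)

# ...then hand it to the next tool in the chain. A flat file (CSV or Parquet)
# is a simple, tool-agnostic interface that a BI tool, an R session or a
# notebook can all pick up from.
summary.to_csv("channel_revenue.csv", index=False)
summary.to_parquet("channel_revenue.parquet", index=False)  # needs pyarrow or fastparquet
```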

Storytelling and using data to influence

Although some courses cover the basic visualisation techniques, very few actually talk about how to tie data science results to business outcomes and how to communicate effectively to non-technical audiences, even though this is often cited as a key data science skill.

How do you communicate the results you have produced to an audience that does not care for loss matrices, confidence intervals and the like? Essentially, this is a combination of learning 'soft' communication skills and learning how to present to and influence decision makers through effective presentation techniques (as well as a dose of effective stakeholder management).

I will be touching on this in detail in later posts, but here are two quick things you can do immediately to become more effective in this area.

Learn how to do a financial projection to determine the future value of the insights that your models produce

Learning how to do a financial projection of any insights you generate, and then using that projection to prioritise those insights, is a very effective yet seldom used practice. Putting a value on your work before presenting it speaks directly to your more commercial stakeholders, in the language that informs business decision making.

If you are uncomfortable working with commercial forecasting models, at least reach out to a commercial team member within your organisation and ask them to model out the value to the business of implementing any actionable insights you generate.

For example, if you are deploying an algorithmic attribution model for your marketing team, which allows them to move away from a last-click, linear or first-click attribution model, try to figure out what value this may bring in terms of uplift in ROI, via better budget allocation decisions, compared with the previous methodology. Or how much time will a newly deployed NLP model save your staff on rote tasks, and what value does that time translate to?

It is not easy and it involves making educated assumptions, but even a back-of-the-napkin sketch of the forecast value is better than nothing, and goes a long way towards bridging the gap between data and commercial teams.
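
As a hypothetical example of such a back-of-the-napkin projection for the attribution scenario above, with every number an assumption to be replaced by your own figures (or your commercial team's):

```python
# Back-of-the-napkin projection of the value of moving from last-click to
# algorithmic attribution. Every number below is an illustrative assumption
# to be replaced with (or sanity-checked by) your commercial team.
monthly_media_budget = 200_000   # assumed monthly marketing spend
baseline_roi = 1.8               # assumed return per dollar under last-click
assumed_roi_uplift = 0.05        # assumed 5% ROI improvement from better allocation

monthly_uplift = monthly_media_budget * baseline_roi * assumed_roi_uplift
annual_uplift = monthly_uplift * 12

print(f"Projected incremental return: ~${monthly_uplift:,.0f} per month, "
      f"~${annual_uplift:,.0f} per year")
```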

Utilise builds and tell a story.

Telling the story behind the process you undertook will help your stakeholders understand and appreciate your work. This should be done at both the macro level ("I had this problem, which I wanted to solve using this data set; the methodology I used was this, which allowed me to achieve these results…") and the micro level.

At the micro (insights) level it can be effective to use builds. A build is essentially a way of telling the story of a single data insight, often by making small progressive call-outs to the same graph or visual.

Say you did a (customer) clustering exercise. Your first slide could be a scatter plot of all the data points you began with. The second would be a visual build on top of the first, showing the key clusters that were identified. You might then have a few slides zooming into each cluster and examining its characteristics, before representing all the results in a visual (but not graphical) view, where you perhaps further humanise the clusters you have identified (say, by giving each a persona).
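
A rough sketch of how the graphics behind those builds might be produced, using synthetic data and scikit-learn purely for illustration (your own data, features and cluster count will differ):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for your customer data (two features, so it plots easily).
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Build 1: all the data points, no interpretation yet.
plt.figure()
plt.scatter(X[:, 0], X[:, 1], s=10, color="grey")
plt.title("Our customers, plotted on two behavioural dimensions")
plt.savefig("build_1_raw_points.png")

# Build 2: the same plot, with the identified clusters layered on top.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], s=10, c=labels)
plt.title("Four distinct customer groups emerge")
plt.savefig("build_2_clusters.png")

# Builds 3+: zoom into one cluster at a time to discuss its characteristics.
for k in range(4):
    plt.figure()
    plt.scatter(X[labels == k, 0], X[labels == k, 1], s=10)
    plt.title(f"A closer look at cluster {k}")
    plt.savefig(f"build_3_cluster_{k}.png")
```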

Utilising builds essentially allows you to reveal information incrementally and weave a narrative around your insights, leading your audience towards the conclusion rather than just dumping data on them.

In short, a story at the insight level goes: "I had this data, and I was able to discover this, which is relevant because of this, and can be actioned by doing this…"

Good documentation is key to good teamwork

Ask yourself this: how do you document your work, and how do you make it easily replicable for a future team member? How do you capture your learnings so that they can be scaled and used as a basis for future analysis?

You will most likely be working as part of a team, so be the ultimate team player and make people's (work) lives easier through good documentation.

This one would not have made my list even a few months ago, but I think it is growing in importance for data scientists. The reason? Effectiveness.

If I argue for anything, it is that data scientists should choose the path of least resistance to being effective. So how does good documentation allow you to be more effective?

Good documentation allows you to:

  • Convey your message and approach clearly to a technical audience (minimising the time spent understanding your approach and outcomes), allowing for cross-collaboration and quick up-skilling. This is the macro storytelling referred to above, but aimed directly at your peers and immediate team members.
  • Potentially make code (or parts of code) reusable by someone looking to take a similar approach or use a similar dataset. Why clean the data twice if it has been documented clearly the first time?
  • Manage upwards. Have a manager who likes to be in the detail? Make their job easier by organising your thoughts and documenting your approach, saving both of you time.

The documentation I am advocating here goes beyond commenting code. Essentially, you want to lay out the process you undertook and the results you achieved. The idea is that you use a Jupyter notebook or R Markdown document to retroactively describe the steps you took and the methodology you used. Retroactively is key here, because while you are doing the work you are optimising for results at speed. Documenting means going back, reflecting on what you have achieved and writing it up for the benefit of others.

Such a document will contain code snippets lifted directly from your working environment (including already commented code); you are simply adding rich text elements for further context. As a minimum it will have:

  • An introduction
  • A description of the problem
  • A description of the data set
  • The methodology that you used:
    • Methodology to prepare the data
    • Machine Learning / statistical analysis approach taken to achieve the results
  • The Results
  • Recommendations based on the results

Once you have the document, its audience is other technical people within your group who may wish to replicate the results, learn from them, or leverage them to do something similar. File it away and add it to your portfolio.
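
For illustration only, here is one way such a write-up might be skeletoned out in a Jupyter notebook using the "percent" cell format, with markdown cells mirroring the sections listed above. The project name and prompts are placeholders, not a prescribed template:

```python
# %% [markdown]
# # Churn model write-up (illustrative project name)
#
# ## Introduction and description of the problem
# Why the analysis was done and the business question it answers.
#
# ## Description of the data set
# Sources used, time period covered, known quality issues.
#
# ## Methodology
# ### Data preparation
# Joins, cleaning rules, imputation and feature engineering applied.
# ### Machine learning / statistical analysis approach
# Methods tried, validation strategy, the chosen approach and why.

# %%
# Code snippets lifted from the working environment go in cells like this,
# already commented, with the markdown cells around them adding context.

# %% [markdown]
# ## Results
# Headline metrics and what they mean in business terms.
#
# ## Recommendations
# Concrete actions suggested by the results, with projected value where possible.
```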

Conclusion

So, if you are still at university, or are learning new skills to complement your existing ones, keep an eye on how these skills can be acquired and who does them well within your immediate circle of peers. Begin to model their behaviour, experiment with your own variations of each, and integrate them into your own skill set. Not every data science skill can be taught, and usually those that are difficult to acquire can make a big difference.

