
Why Data Scientists Should Learn Causal Inference

source link: https://leihua-ye.medium.com/why-data-scientists-should-learn-causal-inference-a70c4ffb4809

Experimentation and Causal Inference


Climb up the ladder of causation

Photo by Sudan Ouyang on Unsplash

Nobel Prize Goes To …

By now, you have probably heard that three economists — David Card, Joshua Angrist, and Guido Imbens — won the 2021 Nobel Prize in Economic Sciences. Their contributions to research methodology (i.e., causal inference) have both cheered and puzzled the data community:

What is Causal Inference anyway?

How does it differ from other tracks of Data Science?

As an ex-academic working in the tech sector, I have seen both sides of the fence and become quite familiar with their distinctive use cases. In today’s post, let’s start with conceptual clarifications and the centrality of causal reasoning in business decision-making. Then we’ll elaborate on why Data Scientists should adopt a causal mentality and how they can do so.

Data Science as A Field

Data Science is an umbrella term covering a wide range of sub-fields, each requiring different data skills. These sub-fields follow either correlation- or causation-based tracks. Machine Learning is the poster child of the correlational track and is stealing the thunder right now. In contrast, its causal sibling is less prominent but deserves more attention in the industry.

As Prof. Judea Pearl, the 2011 Turing Award winner, puts it:

“Machine Learning systems have made astounding progress at analyzing data patterns, but that is the low-hanging fruit of Artificial Intelligence.

To reach the higher fruit, AI needs a ladder, which we call the Ladder of Causation.”

From his WSJ article “AI Can’t Reason Why”

In many real-life scenarios, merely knowing that two things are related is not actionable; instead, we want to move up the ladder of causation and answer these “what if” questions:

  • What if we had rolled out the feature 6 months ago?
  • How would the new subscription service affect user engagement in the long term?
  • What if we had chosen another optimization strategy? What would its impact on revenue be?

Machine Learning, or any other correlation-based methodology, can’t answer these counterfactual questions. To move up the ladder of causation, Data Scientists need to develop a new set of data skills: causal reasoning.
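A small simulation illustrates why correlation alone misleads: when a confounder drives both the treatment and the outcome, the naive group comparison is biased, while adjusting for the confounder recovers the true effect. All variable names and numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical example: user "engagement" (a confounder) drives both
# feature adoption (the treatment) and revenue (the outcome).
engagement = rng.normal(size=n)
adopted = (engagement + rng.normal(size=n) > 0).astype(float)
revenue = 1.0 * adopted + 2.0 * engagement + rng.normal(size=n)

# The naive comparison mixes the true effect (1.0) with the confounder.
naive = revenue[adopted == 1].mean() - revenue[adopted == 0].mean()

# Adjusting for the confounder (here, via regression) recovers ~1.0.
X = np.column_stack([np.ones(n), adopted, engagement])
beta = np.linalg.lstsq(X, revenue, rcond=None)[0]
print(f"naive difference: {naive:.2f}, adjusted effect: {beta[1]:.2f}")
```

The naive difference is roughly three times the true effect here; no amount of extra data fixes that, because the bias is structural, not statistical.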

Causal Inference In Three Ways

There are three ways of establishing a causal relationship: Randomized Controlled Trials (RCTs, aka A/B testing), quasi-experimental designs, and observational designs. An RCT is the cleanest way of doing causal inference; quasi-experimental and observational designs come in handy when A/B testing is off the table. The non-experimental designs can still deliver trustworthy results, provided their statistical assumptions are explicit and testable.

In the next section, let’s delve into each of these designs.

Option 1: A/B Testing

As the saying goes, A/B testing is the gold standard of causal inference. This is because we have complete information about the Data Generation Process (DGP): in online experimentation, Data Scientists have strict control over the treatment assignment, which facilitates causal attribution. In addition, the ability to enroll millions of users with ease gives our estimators favorable statistical properties.
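As a minimal illustration (with made-up numbers), the readout of a completed A/B test on a conversion metric can be sketched as a two-proportion z-test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions out of users, control vs. treatment.
conversions = [1_210, 1_325]
users = [10_000, 10_000]

# Two-proportion z-test on the difference in conversion rates.
z, p = proportions_ztest(conversions, users)
lift = conversions[1] / users[1] - conversions[0] / users[0]
print(f"absolute lift: {lift:.3%}, z = {z:.2f}, p = {p:.4f}")
```

Because assignment was randomized, the lift estimate here carries a causal interpretation; the same arithmetic on observational data would not.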

However, RCTs are neither practical nor applicable for every type of business question. To name a few scenarios: suppose the testing setup would be unpleasant for end users, who may churn once the unpleasantness reaches a tipping point. A/B testing is risky in that scenario, and we would be better off with viable alternatives.

Here is another reason RCTs may not be desirable: in marketing analytics, we often face user interference, which violates the Stable Unit Treatment Value Assumption (SUTVA) required for valid causal inference. We therefore need designs more advanced than user-level randomization (e.g., cluster-level randomization).

Finally, if the qualifying conditions are too restrictive and only a few users qualify for the test, it can take months to reach the required sample size. Such a delayed feedback loop fits poorly with the A/B testing framework, which is designed to capture short-term effects.

Due to these constraints, Data Scientists are eager to find alternatives that offer less optimal but good-enough answers in a timely manner. Quasi-experimental and observational designs can help.

Option 2: Quasi-Experimental Designs

The prefix “quasi” means we have only partial information about — not full control over — the DGP. This incomplete information makes the quasi-experimental family less ideal than RCTs in terms of causal inferential power.

Due to the lack of randomization at baseline, the biggest challenge is setting up a proper “apples-to-apples” comparison for quasi-experimental designs. Different quasi-designs make different assumptions; common examples include difference-in-differences, regression discontinuity, and interrupted time series.
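As one illustration, difference-in-differences (DiD) — a common quasi-experimental design — estimates the effect as the interaction between a treated-group indicator and a post-launch indicator. The data below is simulated purely for illustration, with a true effect of 2:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical setting: a feature launched in one market ("treated")
# but not another; "post" marks observations after the launch.
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
y = 10 + 1.5 * treated + 0.8 * post + 2.0 * treated * post + rng.normal(size=n)

df = pd.DataFrame({"y": y, "treated": treated, "post": post})
m = smf.ols("y ~ treated * post", data=df).fit()

# The coefficient on treated:post is the DiD estimate of the effect.
print(f"DiD estimate: {m.params['treated:post']:.2f}")
```

The design’s key assumption — parallel trends — is that, absent the launch, both markets’ outcomes would have moved in parallel; the interaction term then isolates the treatment effect.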

Option 3: Observational Designs

Observational methods are considered the last resort when the other two approaches are unavailable. Data Scientists have no prior information about the DGP and must rely on statistical assumptions to ensure the design suits the business question at hand.

One caveat is that observational designs are sensitive to metric and model selection. As a result, observational and experimental designs may reach different conclusions. For example, researchers at Facebook once found that observational and experimental designs lead to different estimates of marketing effects (paper link).

My rule of thumb is to treat estimates from quasi-experimental and observational designs as directional. If we want a precise estimate, conducting an RCT is recommended.

The observational approach centers on the Propensity Score (PS) and the various ways of using it, such as matching, stratification, and inverse probability weighting.
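A minimal sketch of the propensity-score workflow, on simulated data with a true effect of 1 (all numbers hypothetical): fit a model for the probability of treatment given covariates, then reweight outcomes by inverse propensities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical observational data: covariate x drives both treatment
# take-up and the outcome; the true treatment effect is 1.
x = rng.normal(size=(n, 1))
p_treat = 1 / (1 + np.exp(-x[:, 0]))
t = rng.binomial(1, p_treat)
y = 1.0 * t + 2.0 * x[:, 0] + rng.normal(size=n)

# Step 1: estimate the propensity score e(x) = P(T = 1 | x).
ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# Step 2: inverse probability weighting (IPW) estimate of the ATE.
ate = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
print(f"IPW ATE estimate: {ate:.2f}")
```

The catch is the “no unmeasured confounding” assumption: if a confounder is missing from the propensity model, the estimate is biased and no reweighting can fix it — which is why observational estimates are best treated as directional.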

Medium recently evolved its Writer Partner Program, which supports ordinary writers like myself. If you are not a subscriber yet and sign up via the following link, I’ll receive a portion of the membership fees.

Online Resources

As a Data Scientist, I prioritize self-growth and allocate dedicated research time to active learning, then apply what I’ve learned to my day job. This iterative input-output process improves my Data Science craft quickly.

I include the following resources for their effectiveness and top quality; I have no affiliated interest in any of them.

My overall approach is to iterate and start with the big picture. Nowadays we are drowning in a sea of information, so we need to be selective with our learning strategy.

As a first step, I’d suggest taking the courses to build an overall understanding of the field. It saves us a ton of time and keeps us on track in the early days. Later, we can choose a textbook and learn at our own pace. In the final stage, focus on the application and review top tech companies’ technical blogs.

Online Courses

1. A Crash Course in Causality: Inferring Causal Effects from Observational Data, Coursera

Reasons for recommendation: This is the best online course I’ve taken on the topic. The instructor is Prof. Jason A. Roy.

2. Measuring Causal Effects in the Social Sciences, Coursera

Reasons for recommendation: A good refresher on the methods learned in the previous course.

3. Causal Data Science with Directed Acyclic Graphs, Udemy

Reasons for recommendation: There are two notational traditions in causal inference, and this course introduces the one (DAGs) used more often by computer scientists and ML researchers.

4. Causal Inference — Online Lectures (M.Sc/PhD Level), by Prof. Ben Elsner

Reasons for recommendation: these are free YouTube videos offered by an Economics professor. Who wouldn’t want that?

Textbook

1. Angrist, J.D. and Pischke, J.S., 2008. Mostly Harmless Econometrics: An Empiricist’s Companion

Reasons for recommendation: this is the Bible of research design for grad students. Slightly advanced reading.

2. Wooldridge, J.M., 2016. Introductory econometrics: A modern approach

Reasons for recommendation: lay the foundation for econometric methods; it’s a go-to book when you get stuck.

3. Pearl, J. and Mackenzie, D., 2018. The Book of Why: The New Science of Cause and Effect

Reasons for recommendation: this book lies at the intersection of causation and Machine Learning.

4. Impact Evaluation in Practice, by World Bank

Reasons for recommendation: it’s an easy read and covers a wide range of topics in Program Evaluation.

Takeaways

  • There is a common misunderstanding in the field that Data Science = Machine Learning. As a result, follow-on discussions center on algorithm choice or data.
  • However, these correlation-based approaches can’t answer causal questions. Data Scientists need advanced causal reasoning skills to render actionable insights.
  • There are several ways of doing causal inference. Recognizing the proper use cases of different designs trumps a one-size-fits-all approach.
  • Focus on self-learning and be patient with the learning curve.

Enjoy reading this one?

Please find me on LinkedIn and YouTube.

Also, check my other posts on Artificial Intelligence and Machine Learning.
