
The Myth of the Impartial Machine


From voice assistants to image recognition, fraud detection to social media feeds, machine learning (ML) and artificial intelligence (AI) are becoming an increasingly important part of society. The two fields have made enormous strides in recent years thanks to gains in computing power and the so-called “information explosion.” Such algorithms are being used in fields as varied as medicine, agriculture, insurance, transportation and art, and the number of companies rushing to embrace what ML and AI can offer has increased rapidly in recent years.

According to a survey conducted by Teradata in July 2017, 80% of enterprises have already begun investing in AI technologies and 30% plan to increase their spending over the next 36 months. Investment in such models is also forecasted to grow from $12 billion in 2017 to $57.6 billion by 2021. Billed as being more accurate, consistent and objective than human judgment, the promises and expectations of what ML and AI can achieve have never been greater.

What’s the difference between Artificial Intelligence and Machine Learning?

Artificial intelligence and machine learning are often used interchangeably, but there are in fact differences between the two.


Artificial intelligence refers to the broader science of getting computers to act intelligently without being explicitly programmed.

Machine learning is the use of statistical algorithms to detect patterns in large datasets. It is one way in which computers can become better at a task, and is thus considered a subset of artificial intelligence.

However, for every neural network that can defeat Jeopardy champions and outplay Go masters, there are other well-documented instances where these algorithms have produced highly disturbing results. Facial-analysis programs were found to have an error rate of 20 to 34 percent when trying to determine the gender of African-American women, compared to an error rate of less than one percent for white men. ML algorithms used to predict which criminals are most likely to reoffend tended to incorrectly flag black defendants as being high risk at twice the rate of white defendants. A word embedding model used to help machines determine the meaning of words based on their similarity likewise associated men with being computer programmers and women with homemakers.

If data-trained models are supposed to be objective and impartial, how did these algorithms get things so wrong? Can such bias be fixed?

The Machine Learning Pipeline

Being able to use data to meaningfully answer questions via machine learning requires several steps. Before getting into the details of bias, it is important to understand them.

  1. Data gathering. All machine learning models require data as inputs. In today’s increasingly digitized world, data can be derived from various sources including user interactions on a website, collections of photo images and sensor recordings.
  2. Data preparation. Data collected are rarely in a usable state as-is. Data often need to be cleaned, transformed and checked for errors before they are ready to be fed into a model.
  3. Split dataset into training and testing sets. The training dataset is used to build and train the model while the testing dataset, which is kept separate, is used to evaluate how well the model performs. It is important to assess the model on data it has not seen before in order to ensure that it has indeed learned something about the underlying structure of the data rather than simply “memorized” the training data.
  4. Fit and train models. This is the step where various types of ML models such as regression models, random forests and neural networks are built and applied to the training data. Models are iterated on by making small adjustments to their parameters in order to improve their performance with the goal of generating the most accurate predictions possible.
  5. Evaluate model on the test dataset. The top performing model is used on the testing data to get a sense of how the model will perform on real world data it’s never seen before. Based on the results, further refinement and tuning of the model may be needed.
  6. Make predictions! Once the model is finalized, it can begin to be used to answer the question it was designed for.
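The six steps above can be sketched end to end with a deliberately tiny, standard-library-only example. The synthetic dataset, the 80/20 split, and the simple threshold "model" are illustrative assumptions chosen to keep the sketch self-contained, not a recommendation for real projects:

```python
import random

# 1. Data gathering: synthetic (feature, label) pairs. The label is 1
#    when the feature exceeds 0.5, with a small amount of label noise.
random.seed(42)
data = []
for _ in range(200):
    x = random.random()
    y = 1 if x > 0.5 else 0
    if random.random() < 0.05:  # 5% of labels are flipped (noise)
        y = 1 - y
    data.append((x, y))

# 2. Data preparation: here just shuffling; real data would also need
#    cleaning, transformation, and error checking.
random.shuffle(data)

# 3. Split into training (80%) and held-out testing (20%) sets.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 4. Fit a model: pick the decision threshold that maximizes accuracy
#    on the training data (a stand-in for real model fitting).
def accuracy(threshold, rows):
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

best_t = max((t / 100 for t in range(101)), key=lambda t: accuracy(t, train))

# 5. Evaluate on the test set the model has never seen.
print(f"threshold={best_t:.2f}, test accuracy={accuracy(best_t, test):.2f}")

# 6. Make predictions on new inputs with the finalized model.
predict = lambda x: int(x > best_t)
```

Because the threshold was tuned only on the training rows, the test-set accuracy in step 5 is an honest estimate of how the model would behave on unseen data, which is exactly why the split in step 3 matters.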

Sources of bias

There are two key ways bias can be introduced and amplified during the machine learning process: by using non-representative data and while fitting and training models.

Biased data

The first source of bias is data that are not representative of the population being modeled.

When one examines a data sample, it is imperative to check whether the sample is representative of the population of interest. A non-representative sample, in which some groups are over- or under-represented, inevitably introduces bias into the statistical analysis. A dataset may be non-representative due to sampling errors or non-sampling errors.

Sampling errors refer to the difference between a population value and a sample estimate that exists only because of the sample that happened to be selected. Sampling errors are especially problematic when the sample size is small relative to the size of the population. For example, suppose we sample 100 residents to estimate the average US household income. A sample that happened to include Jeff Bezos would result in an overestimate, while a sample that happened to include predominantly low-income households would result in an underestimate.
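A short simulation makes the income example concrete. The population below is hypothetical (mostly modest incomes plus one extreme outlier standing in for a "Jeff Bezos" household); the point is only to show how wildly small-sample estimates can scatter around the true mean:

```python
import random
import statistics

# Hypothetical population: 10,000 household incomes, mostly modest,
# plus one extreme outlier household.
random.seed(0)
population = [random.gauss(60_000, 20_000) for _ in range(9_999)]
population.append(10_000_000_000)  # the outlier

true_mean = statistics.mean(population)

# Draw many samples of 100 and record each sample's estimate of the mean.
estimates = [statistics.mean(random.sample(population, 100)) for _ in range(1_000)]

print(f"population mean:        {true_mean:,.0f}")
print(f"sample estimates range: {min(estimates):,.0f} to {max(estimates):,.0f}")
```

Most samples of 100 miss the outlier entirely and badly underestimate the mean, while the rare sample that includes it overestimates by orders of magnitude: the estimate depends on which 100 households happened to be selected, which is precisely what sampling error describes.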

