Statistics for Programmers - Measures of Central Tendency

source link: https://nishtahir.com/2-statistics-for-programmers-measures-of-central-tendency/
By Nish Tahir in Statistics For Programmers — Mar 23, 2024

Measures of central tendency are summary statistics that describe the "typical" or central value around which data tend to cluster. Let's look at the three primary measures: the Mean, the Median, and the Mode.

Mean

The Mean, also known as the average, is the most commonly used measure of central tendency. To calculate it, we sum all the values in a dataset and then divide by the number of data points.

\[ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where:

  • \( n \) is the total number of data points in the dataset.
  • \( x_i \) represents each data point in the dataset.

The ∑ sign, or summation (sum for short), signifies that we want to add up all the values in the series x, which can be expressed as \( x_1 + x_2 + x_3 + \dots + x_n \).

Let's see this in action using an array of measurements:

const arr = [80, 95, 93, 95, 5];

By applying the following function, we compute the mean of the values:

function mean(arr) {
  const n = arr.length;
  
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += arr[i];
  }
  return sum / n;
}

The mean for our example array is:

console.log(mean(arr));
// Output: 73.6

The result is significantly lower than most values in our dataset because the Mean is sensitive to "outliers." Here, the single value 5 drags it down.

Outliers are values that deviate significantly from the rest of the data and can pull the mean up or down when included in the dataset. By removing the outlier from our data, we can observe a more intuitive mean value:

const arrWithNoOutliers = [80, 95, 93, 95];
console.log(mean(arrWithNoOutliers));
// Output: 90.75
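
The loop in mean can also be written with Array.prototype.reduce, which mirrors the summation in the formula more directly. The sketch below is one possible variant rather than part of the original function; the name meanReduce and the choice to return NaN for an empty array are assumptions made for illustration.

```javascript
// Hypothetical variant of mean() using reduce; the name is illustrative.
function meanReduce(values) {
  // Assumption: an empty dataset has no mean, so return NaN explicitly.
  if (values.length === 0) {
    return NaN;
  }
  // reduce plays the role of the summation: it adds every x_i to a running total.
  const sum = values.reduce((total, x) => total + x, 0);
  return sum / values.length;
}

console.log(meanReduce([80, 95, 93, 95, 5])); // 73.6
console.log(meanReduce([80, 95, 93, 95]));    // 90.75
```

Starting reduce from 0 keeps the call safe for any input, and the explicit length check makes the empty-array behavior a deliberate decision.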

Median

The Median seeks to find the center point of an ordered dataset, where half of the data is less than the value and the other half is greater. To calculate it, we sort the data in either ascending or descending order and locate the value at the center of the array.

\[ \text{Median} = x_{(k+1)} \]

Where:

  • \( x_{(k+1)} \) is the value at the (k+1)-th position when the data is arranged in ascending or descending order.

Here, k is the zero-based index of the center of the array, so position k+1 in the formula corresponds to index k in code. We can calculate it as \( k = \frac{n - 1}{2} \), where n is the length of the array.

Let's express this as code:

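function median(arr) {
  // Copy before sorting so the caller's array is not mutated, and pass a
  // numeric comparator: the default sort compares elements as strings.
  const sorted = [...arr].sort((a, b) => a - b);
  const n = sorted.length;
  const k = (n - 1) / 2;

  return sorted[k];
}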

Applying this to our earlier sample array, we get the following output, which is consistent with our data:

const arr = [80, 95, 93, 95, 5];
console.log(median(arr));
// Output: 93

However, there's an important observation to make. This method only works with datasets of odd length. For a dataset of even length, the formula yields a fractional index (for example, k = 1.5 when n = 4), and indexing an array with it returns undefined.

const arr = [80, 95, 93, 95];
console.log(median(arr));
// Output: undefined

To address this situation, we can update our formula to calculate the Mean of the two middle values:

\[ \text{Median} = \frac{x_{(k+1)} + x_{(k+2)}}{2}, \quad k = \left\lfloor \frac{n - 1}{2} \right\rfloor \]

Similarly, we need to update our function:

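function median(arr) {
  // Sort a copy numerically; the default sort compares elements as strings.
  const sorted = [...arr].sort((a, b) => a - b);
  const n = sorted.length;
  const k = Math.floor((n - 1) / 2);

  // Even length: average the two middle values.
  if (n % 2 === 0) {
    return (sorted[k] + sorted[k + 1]) / 2;
  }

  return sorted[k];
}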

Now, we calculate the midpoint as before but round down to the nearest whole number to ensure a valid index. If the dataset has an even length, we return the mean of the two middle values; otherwise, we return the single middle value.

Testing this function with a dataset of even length gives the following result:

const arr = [80, 95, 93, 95];
console.log(median(arr));
// Output: 94
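
The median depends on the data being sorted numerically. In JavaScript, Array.prototype.sort called without a comparator converts elements to strings and compares them lexicographically, and it also sorts the array in place. The sketch below illustrates both caveats; sortedNumeric is a hypothetical helper name used only for this example.

```javascript
// Hypothetical helper: returns a numerically sorted copy of the input.
function sortedNumeric(values) {
  // Spread into a new array so the caller's data is left untouched,
  // then compare as numbers rather than as strings.
  return [...values].sort((a, b) => a - b);
}

const measurements = [5, 80, 100];

// Default sort compares "100", "5", and "80" as strings.
console.log([...measurements].sort()); // [ 100, 5, 80 ]

// A numeric comparator gives the order the median needs.
console.log(sortedNumeric(measurements)); // [ 5, 80, 100 ]

// The original array is unchanged because both calls sorted a copy.
console.log(measurements); // [ 5, 80, 100 ]
```

With the default ordering, the middle element of this array would be 5 rather than 80, so a median built on it would be wrong.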

Mode

The Mode is simply the value that occurs most often in a given dataset. Although it is one of the simpler measures, expressing it can seem complex due to the mathematical notation involved:

\[ \text{Mode} = \arg\max_{x} \sum_{i=1}^{n} (x_i = x) \]

Let's break down this expression into its components to understand the steps involved:

  1. The summation symbol \( \sum_{i=1}^{n} \) indicates that we will sum up the result of the nested expression for each value \( x_i \) in the dataset.

  2. The expression \( (x_i = x) \) acts as a condition or indicator function. It returns 1 if \( x_i \), the value at the given index, is equal to the value under consideration \( x \), and 0 otherwise. By using this function with the summation, we can count the number of times each value \( x \) appears in the dataset.

  3. Finally, the \( \arg\max_{x} \) before the summation selects the argument that maximizes the expression. In this context, we are looking for the value of \( x \) that has the maximum number of occurrences among all unique values.
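
Steps 1 and 2 can be sketched on their own before assembling the full function. In the snippet below, countOccurrences is a hypothetical helper name; it evaluates the inner sum for a single candidate value x.

```javascript
// Hypothetical helper: evaluates the inner sum for a single candidate x.
function countOccurrences(values, x) {
  let count = 0;
  for (const xi of values) {
    // Indicator: contributes 1 when xi equals x, and 0 otherwise.
    count += xi === x ? 1 : 0;
  }
  return count;
}

const sample = [1, 2, 3, 2, 4, 2, 5, 4, 2];
console.log(countOccurrences(sample, 2)); // 4
console.log(countOccurrences(sample, 4)); // 2
console.log(countOccurrences(sample, 7)); // 0
```

Step 3 then amounts to picking the x for which this count is largest.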

Translating this mathematical expression into code provides further insight into the process:

function mode(arr) {
  let argMax = null;
  let maxFrequency = 0;
  let n = arr.length;

  for (let i = 0; i < n; i++) {
    const x = arr[i];
    let frequency = 0;

    for (let j = 0; j < n; j++) {
      if (arr[j] === x) {
        frequency++;
      }
    }

    if (frequency > maxFrequency) {
      maxFrequency = frequency;
      argMax = x;
    }
  }

  return argMax;
}

For each element in the array, we count how many times its value appears and keep the value with the highest frequency. If several values tie, the one that appears first is returned, and an empty array yields null. Let's apply it to a sample dataset to see its output:

const arr = [1, 2, 3, 2, 4, 2, 5, 4, 2];
console.log(mode(arr));

// Output: 2

While this function correctly computes the mode for the given input, it is inefficient. The nested loop over the array makes the number of operations grow quadratically, \( O(n^2) \), with the input size.

An optimized version of this function may use a map to remember the number of times a unique value has been seen:

function mode(arr) {
  // A Map keeps keys as-is, so 1 and "1" are counted separately
  // (plain-object keys would be coerced to strings).
  const frequencyMap = new Map();
  let maxFrequency = 0;
  let result = null;

  for (const x of arr) {
    // Increment the count for x, starting from 0 on first sight.
    const frequency = (frequencyMap.get(x) ?? 0) + 1;
    frequencyMap.set(x, frequency);

    // Track the most frequent value seen so far.
    if (frequency > maxFrequency) {
      maxFrequency = frequency;
      result = x;
    }
  }

  return result;
}

Applying this optimized version to our sample dataset yields the same output:

const arr = [1, 2, 3, 2, 4, 2, 5, 4, 2];
console.log(mode(arr));

// Output: 2

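With this approach, each element is visited only once, so the running time grows roughly linearly, \( O(n) \), with the input size, at the cost of extra memory for the frequency map. This makes it a more practical solution for larger datasets.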
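
A dataset can have more than one mode, in which case the functions above return only one of the tied values. As a minimal sketch, assuming we want every tied value, the same frequency-counting idea can be followed by a second pass over the counts. The name modes is hypothetical and not from the original article.

```javascript
// Hypothetical extension: returns every value that shares the highest frequency.
function modes(values) {
  const frequencyMap = new Map();
  let maxFrequency = 0;

  // First pass: count occurrences and track the highest count.
  for (const x of values) {
    const frequency = (frequencyMap.get(x) ?? 0) + 1;
    frequencyMap.set(x, frequency);
    maxFrequency = Math.max(maxFrequency, frequency);
  }

  // Second pass: collect the values whose count equals the maximum.
  // A Map iterates in insertion order, so results follow first appearance.
  const result = [];
  for (const [x, frequency] of frequencyMap) {
    if (frequency === maxFrequency) {
      result.push(x);
    }
  }
  return result;
}

console.log(modes([1, 2, 3, 2, 4, 2, 5, 4, 2])); // [ 2 ]
console.log(modes([1, 1, 2, 2, 3]));             // [ 1, 2 ]
console.log(modes([]));                          // []
```

Returning an array keeps the result type the same whether there is one mode, several, or none.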

