Descriptive Statistics

Descriptive statistics help data scientists analyze and summarize high-level observations within a data set prior to preprocessing the data. Interviewers expect you to understand the key definitions in this lesson, as well as how to interpret each statistic measure in sample data.

What to expect

Example questions include:

  • In a positively skewed distribution, which measure of central tendency would likely be larger: the mean or the median? Why?
  • You are analyzing the distribution of customer ages for an e-commerce platform. The dataset contains the ages of 5000 randomly selected customers with the mean age = 45 years, median = 30, mode = 28. Interpret the results in the context of customer demographics. Discuss how each measure of central tendency can inform marketing strategies for different age groups.
  • How does standard deviation differ from variance, and when would you prefer to use one over the other?
  • How would you interpret a correlation coefficient of -0.8 between two variables in a dataset?

This lesson will discuss:

  • Skewness and kurtosis
  • Measures of central tendency
  • Measures of variability

For each topic, we’ll provide a brief description and explain its function in real-world data science scenarios.

Skewness and kurtosis

Skewness and kurtosis are valuable measures for understanding the distribution of data and assessing its suitability for analysis and modeling in data science applications.

Many statistical models make assumptions about the distribution of the error terms rather than the raw data itself.

For example, in linear regression, the classical assumption is that the residuals are normally distributed with constant variance, which is important for valid hypothesis testing and confidence intervals. Deviations from these assumptions may affect the reliability of statistical inference.
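As an illustration of how this assumption can be checked in practice, the sketch below fits a simple linear regression to synthetic data and runs a normality test on the residuals. All data and parameter values here are invented for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: a linear signal plus normally distributed noise
# (slope, intercept, and noise scale are illustrative choices)
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=500)

# Fit a simple linear regression and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# D'Agostino-Pearson normality test: a large p-value means the residuals
# show no significant departure from normality
statistic, p_value = stats.normaltest(residuals)
print(f"slope ≈ {slope:.2f}, normality p-value = {p_value:.3f}")
```

A very small p-value here would flag a violation of the normality assumption and suggest investigating transformations or a different model.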

Skewness

Skewness measures the asymmetry of the probability distribution of a dataset around its mean. It quantifies the degree to which the data are skewed to the left or right of the mean.

A distribution is positively skewed (right-skewed) when the tail on the right side of the distribution is longer or fatter than the tail on the left side, meaning the distribution is stretched out towards higher values. In a positively skewed distribution, the mean is typically greater than the median, and the bulk of the data is concentrated at the lower end of the range.

Household income tends to be right-skewed, meaning that there are relatively few households with very high incomes compared to the majority of households with lower incomes.
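To make the income example concrete, here is a minimal sketch using a synthetic lognormal sample as stand-in "income" data (the distribution parameters are illustrative, not real income statistics):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "incomes" drawn from a lognormal distribution
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=0.8, size=5000)

print(skew(incomes) > 0)                      # True: positive (right) skew
print(np.mean(incomes) > np.median(incomes))  # True: the long right tail pulls the mean up
```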

Conversely, a distribution is negatively skewed (left-skewed) when the tail on the left side is longer or fatter than the tail on the right side, indicating a longer left tail. In a negatively skewed distribution, the mean is typically less than the median, and the bulk of the data is concentrated at the higher end of the range.

The distribution of uptime percentages for a high-availability service in a tech company, where most measurements cluster near 100% and occasional incidents drag a few values down, tends to be left-skewed.

Kurtosis

Kurtosis measures the "tailedness" of a probability distribution, indicating how sharply or heavily the data are concentrated around the mean compared to a normal distribution (mesokurtic).

Positive kurtosis indicates a distribution with heavier tails and a higher peak than a normal distribution ("heavy-tailed" distribution). High kurtosis values indicate that the data have heavy tails and are more prone to extreme values or outliers.

Negative kurtosis indicates a distribution with lighter tails and a lower peak than a normal distribution ("light-tailed" distribution). Low kurtosis values indicate that the data have light tails and are less prone to extreme values.

In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level of risk for an investment because it indicates that there are high probabilities of extremely large and extremely small returns, whereas a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.
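A quick sketch of this contrast, using synthetic "returns" (a Student's t distribution with 3 degrees of freedom as a heavy-tailed stand-in; all parameters are illustrative). Note that SciPy's `kurtosis()` reports *excess* kurtosis by default, so a normal distribution scores approximately 0:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
normal_returns = rng.normal(size=100_000)           # mesokurtic baseline
heavy_returns = rng.standard_t(df=3, size=100_000)  # Student's t(3): heavy tails

print(abs(kurtosis(normal_returns)) < 0.1)  # True: approximately mesokurtic
print(kurtosis(heavy_returns) > 1)          # True: heavy tails, prone to extreme values
```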

[Figure: Skewness & Kurtosis]

Measures of central tendency

Data scientists use measures of central tendency to quickly summarize and understand data distributions. They are easy to interpret and communicate to business stakeholders.

Mean

The mean, also known as the average, is the sum of all values divided by the number of values. Because it takes the magnitude of every value into account, it is sensitive to extreme values (outliers) in the data.

The mean is widely used in various statistical analyses and modeling techniques, such as linear regression, where it represents the expected value of a continuous variable.

A specific use case of the mean is analyzing overall employee performance ratings, where it helps identify trends, areas for improvement, and high-performing employees.

Median

The median is the middle value in a dataset when the values are sorted in ascending or descending order. It is less sensitive to outliers compared to the mean, making it a robust measure of central tendency, particularly for skewed distributions.

The median is often used to describe the central value of a distribution when extreme values or skewness are present, providing a better representation of the typical value.

In real estate, property prices can vary widely, with a few high-value or low-value properties skewing the average. The median is less affected by extreme values, making it a more robust measure of central tendency for real estate pricing analysis.

Mode

The mode is the most frequently occurring value in a dataset. Unlike the mean and median, the mode can be used for both categorical and numerical data.

The mode is useful for identifying the most common category or value in a dataset and is often used to describe the central tendency of categorical variables, such as preferences or types.

In retail, identifying the mode, or the most frequently occurring product in sales transactions, helps in identifying best-selling items.
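Putting the three measures together, here is a minimal sketch with a small hypothetical sample of customer ages (the values are invented for illustration):

```python
from statistics import mean, median, mode

# Hypothetical sample of customer ages, skewed to the right
ages = [28, 28, 28, 30, 31, 35, 52, 60, 95]

print(mean(ages))    # 43
print(median(ages))  # 31
print(mode(ages))    # 28
```

The pattern mean > median > mode is the classic signature of a right-skewed distribution: a few older customers pull the mean well above the typical (median) age, while the mode shows the single most common age.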

[Figure: Mean vs. Median vs. Mode]

Measures of variability

Standard deviation

Standard deviation measures the spread or dispersion of data points around the mean.

Standard deviation formula

$$s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}$$

Where

  • $n$ = the sample size
  • $x_i$ = each individual data point
  • $\bar{x}$ = the sample mean

A larger standard deviation indicates greater variability, while a smaller standard deviation indicates less variability. High variability may indicate inconsistency, noise, or measurement error in the data, while low variability suggests greater consistency and precision.

If the data are normally distributed, approximately 68% of the data points fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
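This 68–95–99.7 rule can be verified empirically on synthetic normal data (the location and scale below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=5, size=100_000)  # synthetic normal data

m = data.mean()
s = data.std(ddof=1)  # ddof=1 gives the sample standard deviation (n - 1 denominator)

within_1 = np.mean(np.abs(data - m) < s)      # fraction within 1 standard deviation
within_2 = np.mean(np.abs(data - m) < 2 * s)  # fraction within 2 standard deviations
print(round(within_1, 2), round(within_2, 2))  # ≈ 0.68 and ≈ 0.95
```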

Variance

The variance is the average of the squared differences between each data point and the mean. The standard deviation (σ for a population and s for a sample) is calculated as the square root of the variance.
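This square-root relationship is easy to confirm directly (the data values below are arbitrary):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = np.var(data, ddof=1)  # sample variance (n - 1 denominator)
std = np.std(data, ddof=1)       # sample standard deviation

print(np.isclose(std, np.sqrt(variance)))  # True: std is the square root of variance
```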

Range

Range is a simple statistical measure that represents the difference between the maximum and minimum values in a dataset.

A larger range indicates greater variability in the dataset, while a smaller range suggests less variability. A related measure, the interquartile range (IQR), the difference between the 75th and 25th percentiles, is commonly used for outlier detection because it ignores the extreme tails.
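The common 1.5 × IQR fence for flagging outliers can be sketched as follows (the data values are invented):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 14, 95])  # 95 looks suspicious

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the 1.5 * IQR fences

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [95]
```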

Correlation vs. covariance

Correlation and covariance help data scientists understand how changes in one variable are associated with changes in another, allowing for the identification of patterns, trends, and dependencies in the data.

The key difference between correlation and covariance is that correlation is a standardized measure, whereas covariance is not and depends on the scales of $X$ and $Y$. As a result, while covariance indicates only the direction of the linear relationship between two variables, correlation measures both the direction and the strength.

Mathematically, the sample covariance between two variables $X$ and $Y$ is calculated from the products of the deviations of each variable from their respective means (the $n-1$ denominator matches the sample standard deviation defined above):

$$\text{Cov}(X,Y) = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{n-1}$$

Where

  • $x_i$ and $y_i$: individual data points
  • $\bar{x}$ and $\bar{y}$: means of $X$ and $Y$
  • $n$: number of data points

The sign of covariance (positive, negative, or zero) indicates the direction of the linear relationship between the variables:

  • Positive covariance: X tends to be larger when Y is larger.
  • Negative covariance: X tends to be smaller when Y is larger.
  • Zero covariance: no linear relationship between X and Y (note that zero covariance does not imply the variables are independent).

Correlation removes the scale dependency of covariance, and thus measures both the direction and strength of the linear relationship between two variables.

The correlation coefficient $r$ is calculated as the covariance between two variables divided by the product of their standard deviations:

$$r=\frac{\text{Cov}(X,Y)}{s_X \cdot s_Y}$$

Where

  • $s_X$ and $s_Y$: standard deviations of $X$ and $Y$, respectively

Correlation $r$ ranges from -1 to 1:

  • $r = 1$: Perfect positive linear relationship.
  • $r = -1$: Perfect negative linear relationship.
  • $r = 0$: No linear relationship.

While covariance indicates the direction of the linear relationship between variables, correlation additionally measures the strength and standardizes the relationship, making it more useful for comparisons across different datasets.
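As a closing sketch, both measures can be computed with NumPy (the nearly linear relationship below is constructed for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # almost perfectly linear in x

cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance (scale-dependent)
r = np.corrcoef(x, y)[0, 1]          # correlation (scale-free, in [-1, 1])

print(cov_xy > 0)       # True: x and y move together
print(0.99 < r <= 1.0)  # True: a very strong positive linear relationship
```

Rescaling x or y (say, converting units) would change `cov_xy` but leave `r` unchanged, which is exactly why correlation is preferred for comparisons across datasets.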