6.2 Descriptive Statistics

What is the data telling us?
What is the quality of our dataset?

Descriptive Statistics
Stat       PRICE     HEIGHT  WIDTH  SIGNED  PICTURE  HOUSE
Min.       0.01041    3.90    6.70  0.0000     1.00  1.000
1st Qu.    0.60015   23.12   28.52  1.0000    87.25  1.000
Median     1.31278   25.60   31.90  1.0000   179.50  2.000
Mean       3.09000   27.65   32.11  0.8209   182.64  1.612
3rd Qu.    3.85000   31.45   36.20  1.0000   274.75  2.000
Max.      33.01350   78.70   89.00  1.0000   387.00  3.000


where:

  • Min.: the minimum value
  • 1st Qu.: the first quartile. 25% of values are lower than this.
  • Median: the median value. Half the values are lower; half are higher.
  • Mean: the arithmetic average of the values.
  • 3rd Qu.: the third quartile. 75% of values are lower than this.
  • Max.: the maximum value


What does the mean value of the variable PICTURE tell us?


6.2.1 Data Summary and Presentation

Well-constructed data summaries and displays are essential to good statistical thinking because they focus the analyst on important features of the data and provide insight into the type of analysis or model best suited to the problem.


Mean and Median

The mean and the median are summary measures used to describe the most “typical” value in a set of values.

The difference between the mean and median can be illustrated with an example. Suppose we draw a sample of five teenage boys and measure their weights. They weigh 100 pounds, 100 pounds, 130 pounds, 140 pounds, and 150 pounds.

  • To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values. Thus, in the sample of five boys, the median value is 130 pounds, since 130 pounds is the middle weight.

  • The mean of a sample or a population is computed by adding all of the observations and dividing by the number of observations. Returning to the example of the five teenage boys, the mean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.

If the \(n\) observations in a sample are denoted by \(x_1,x_2, \ldots, x_n\), the sample mean is

\[ \bar{x} = \dfrac{x_1 + x_2 + \ldots + x_n}{n} = \dfrac{\sum_{i=1}^n x_i}{n}\]
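The two calculations above can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module and the five weights from the example; the variable names are illustrative only.

```python
from statistics import mean, median

# Weights (in pounds) of the five teenage boys from the example.
weights = [100, 100, 130, 140, 150]

# Mean: sum of all observations divided by the number of observations.
sample_mean = mean(weights)

# Median: middle value of the observations sorted from smallest to largest.
sample_median = median(weights)

print(f"mean   = {sample_mean:.1f} pounds")    # mean   = 124.0 pounds
print(f"median = {sample_median:.1f} pounds")  # median = 130.0 pounds
```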

As measures of central tendency, the mean and the median each have advantages and disadvantages. Some pros and cons of each measure are:

  • The median may be a better indicator of the most typical value if a set of scores has an outlier. An outlier is an extreme value that differs greatly from other values.

  • However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.

To illustrate these points, consider the following example. Suppose we examine a sample of 10 households to estimate the typical family income. Nine of the households have incomes between 20,000 USD and 100,000 USD, but the tenth household has an annual income of 1,000,000,000 USD. That tenth household is an outlier. If we choose a measure to estimate the income of a typical household, the mean will greatly overestimate the income of a typical family (because of the outlier), while the median will not.
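The effect of the outlier can be sketched in Python. The text gives only the range for the nine ordinary households, so the individual incomes below are hypothetical values chosen within that range. Only the tenth income is taken from the example.

```python
from statistics import mean, median

# Nine hypothetical incomes between 20,000 and 100,000 USD (assumed values).
typical_incomes = [20_000, 30_000, 40_000, 50_000, 60_000,
                   70_000, 80_000, 90_000, 100_000]

# The tenth household from the example: the outlier.
incomes = typical_incomes + [1_000_000_000]

mean_income = mean(incomes)
median_income = median(incomes)

print(f"mean   = {mean_income:,.0f} USD")    # mean   = 100,054,000 USD
print(f"median = {median_income:,.0f} USD")  # median = 65,000 USD
```

With these assumed values the mean exceeds the income of every one of the nine ordinary households by a factor of about a thousand, while the median stays inside their range.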



How to Measure Variability

The most common measures of variability are the range, the interquartile range (IQR), variance, and standard deviation.

  • The range is the difference between the largest and smallest values in a set of values. For example, consider the following numbers: 1, 3, 4, 5, 5, 6, 7, 11. For this set of numbers, the range is 11 - 1 = 10.

  • The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.

Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by \(Q_1\), \(Q_2\), and \(Q_3\), respectively.

  • \(Q_1\) is the “middle” value in the first half of the rank-ordered data set.
  • \(Q_2\) is the median value in the set.
  • \(Q_3\) is the “middle” value in the second half of the rank-ordered data set.

The interquartile range is equal to \(Q_3\) minus \(Q_1\). For example, consider the following numbers: 1, 2, 3, 4, 5, 6, 7, 8.

\(Q_2\) is the median of the entire data set - the middle value. In this example, we have an even number of data points, so the median is equal to the average of the two middle values. Thus, \(Q_2 = (4 + 5)/2\) or \(Q_2 = 4.5\). \(Q_1\) is the middle value in the first half of the data set. Since there are an even number of data points in the first half of the data set, the middle value is the average of the two middle values; that is, \(Q_1 = (2 + 3)/2\) or \(Q_1 = 2.5\). \(Q_3\) is the middle value in the second half of the data set. Again, since the second half of the data set has an even number of observations, the middle value is the average of the two middle values; that is, \(Q_3 = (6 + 7)/2\) or \(Q_3 = 6.5\). The interquartile range is \(Q_3\) minus \(Q_1\), so \(IQR = 6.5 - 2.5 = 4\).
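The range and the quartile calculation described above can be sketched in Python as follows. The helper `quartiles` is a hypothetical function written for this illustration, not a library routine. For an odd number of observations it leaves the middle value out of both halves, which is an assumption because the text only works through an even-sized example. Library functions such as `statistics.quantiles` use interpolation rules and may return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the halves (needs >= 2 values)."""
    data = sorted(values)           # rank-order the data
    n = len(data)
    half = n // 2
    lower = data[:half]             # first half of the ranked data
    upper = data[n - half:]         # second half (middle value skipped if n is odd)
    return median(lower), median(data), median(upper)

# Range example from the text.
range_data = [1, 3, 4, 5, 5, 6, 7, 11]
value_range = max(range_data) - min(range_data)
print(value_range)   # 10

# Interquartile range example from the text.
numbers = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3 = quartiles(numbers)
iqr = q3 - q1
print(q1, q2, q3)    # 2.5 4.5 6.5
print(iqr)           # 4.0
```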

  • Variance is the average squared deviation from the population mean.

  • The standard deviation is the square root of the variance.

If the \(n\) observations in a sample are denoted by \(x_1,x_2, \ldots, x_n\), the sample variance is

\[ s^2 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

The sample standard deviation, \(s\), is the positive square root of the sample variance.
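The formula for the sample variance can be written out directly in Python. This is a minimal sketch: `sample_variance` is a hypothetical helper defined here for illustration, and the data set 1, 2, ..., 8 is reused from the interquartile range example. The standard-library functions `statistics.variance` and `statistics.stdev` also divide by \(n-1\) and are used as a cross-check.

```python
from math import sqrt
from statistics import mean, stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

numbers = [1, 2, 3, 4, 5, 6, 7, 8]

s2 = sample_variance(numbers)   # squared deviations sum to 42; 42 / 7 = 6
s = sqrt(s2)                    # sample standard deviation

print(s2)            # 6.0
print(round(s, 3))   # 2.449

# Cross-check against the standard library (both use the n - 1 divisor).
assert abs(s2 - variance(numbers)) < 1e-12
assert abs(s - stdev(numbers)) < 1e-12
```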
