Head First Statistics

Book details

Author: Dawn Griffiths
Year: 2008
Rating: 5/5

Basic statistics concepts

Frequency - the number of times a certain event has taken place in an experiment. It can simply be defined as the count of a certain event that has occurred.
Frequency density - the frequency per unit for the data in each class. It is calculated by dividing the frequency by the class width: $\text{frequency density} = \frac{\text{frequency}}{\text{class width}}$. For continuous random variables, which can take a range of values with associated probabilities, we can plot the values against their probability density on a graph called the probability distribution.
Cumulative frequency - the total of the absolute frequencies of all events at or below a certain point in an ordered list of events.
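
As a minimal sketch of the three definitions above, the snippet below computes them for grouped data. The class boundaries and counts are made-up numbers for illustration, not figures from the book.

```python
from itertools import accumulate

# Hypothetical grouped data: (class lower bound, class upper bound, frequency)
classes = [(0, 10, 5), (10, 20, 12), (20, 50, 9)]

frequencies = [f for _, _, f in classes]
# Frequency density = frequency / class width
densities = [f / (hi - lo) for lo, hi, f in classes]
# Cumulative frequency = running total of the frequencies in class order
cumulative = list(accumulate(frequencies))

for (lo, hi, f), d, c in zip(classes, densities, cumulative):
    print(f"{lo}-{hi}: frequency={f}, density={d:.2f}, cumulative={c}")
```

Note that the widest class (20-50) has the lowest density even though its frequency is not the lowest, which is why a histogram uses density for bar height.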

Displaying your statistical data with charts and graphs

To choose the most suitable chart type to visualise your statistical data, you need to understand the types of data you are dealing with: categorical or numerical. Categorical data, also called qualitative data, describes characteristics of different categories - e.g. the breed of dogs. Numerical data, also called quantitative data, deals with numbers such as measurements or counts, and is often presented as the frequencies of numerical ranges or intervals. For example, how many players scored between 1-500 points or 501-800 points in an online video game last Saturday.

Chart types

Pie chart

Pie charts are great for showing the relative percentages of different categories that belong to the same data set, especially when the frequency differences between categories are large. In other words, the frequencies of the categories should all add up to 100%.

Bar chart

Used when the frequency differences between categories are small or when representing multiple sets of data per category. A horizontal bar chart is more readable when we have long category names. A segmented bar chart can be used to show the total frequency of each category as well as the composition or percentage within each category.

Histogram

A histogram is most suitable for charting grouped data. The height of each bar represents the frequency density of the group rather than its frequency itself; the frequency itself is represented by bar area.

Line chart

Line charts can be used to plot cumulative frequencies or to show trends in your data. You can use different lines to represent different sets of data, especially when making basic predictions, to easily spot the shape of the trend.

Line charts should only be used for numerical data, not categorical. When used for time measurements, line charts are often called time series charts.

Sometimes you can use either line charts or bar charts for the same data set. Line charts are better at showing trends, while bar charts are better for comparing values across categories.

Box and whiskers chart

A box and whiskers chart, also known as a box plot, is particularly useful for showing ranges, quartiles, and upper and lower bounds. A box shows where the quartiles are. The width or height of the box (depending on the orientation of the box plot) indicates the interquartile range. The whiskers give the upper and lower bounds, or sometimes custom-specified bounds such as $P_{5}$ and $P_{95}$ with extreme outliers shown as dots beyond the whiskers. More than one set of data can be shown on the same chart for easy comparison.

Central tendency

It can be difficult to make sense of a big pile of numbers and figures, and finding the average is often the simplest first step to see the big picture. There are three types of average in statistics: mean, median and mode.

Mean

Mean is the average most people are used to, where you add up all the numbers and then divide by the number of numbers. The mathematical representation of mean:

$μ = \frac{\sum x}{n} = \frac{\sum f x}{\sum f}$

Median

Median is the value in the middle when you line up all the values in ascending order. If there is an even number of values, add the middle two together and divide by two.

When there are outliers in the dataset, your data distribution is no longer symmetrical. The outliers will either skew the graph to the left or to the right, and mean could be misleading in representing the average of the dataset. To mitigate the effect outliers have on mean, we use median to represent the average.

Mode

Mode is most useful when there are two or more clusters of data in the data set. The values with the highest frequencies are the modes. Mode is the only average that works with categorical data, while mean and median can only be used for numerical values. Unlike mean or median, the mode has to be in the data set.
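
A small sketch of the three averages using Python's standard `statistics` module. The scores and dog breeds are invented for illustration; the value 48 is a deliberate outlier to show how it pulls the mean but not the median.

```python
import statistics

scores = [3, 4, 4, 5, 6, 7, 48]  # hypothetical data; 48 is an outlier

mean = statistics.mean(scores)         # sum of the values / number of values
median = statistics.median(scores)     # middle value of the sorted data
modes = statistics.multimode(scores)   # every value with the highest frequency

# The mode also works for categorical data, where mean and median do not
breeds = ["poodle", "beagle", "poodle", "corgi"]
breed_modes = statistics.multimode(breeds)

print(mean, median, modes, breed_modes)
```

Here the mean is dragged well above most of the values by the single outlier, while the median stays in the middle of the bulk of the data.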

Mean, mode and median characteristics

	Mean	Mode	Median
Easy to understand and calculate?	Yes	Yes	Sometimes
Uses all of the data values in a set?	Yes	No	No
Can be used for qualitative data?	No	Yes	No
Can be used for quantitative data?	Yes	Yes	Yes
Is affected by extreme values?	Yes	No	Rarely
Is one of the data values in the set?	Rarely	Yes	Sometimes
Provides only a single result?	Yes	Sometimes	Yes

Adapted from Willoughby, D. (2015). An Essential Guide to Business Statistics. Chichester: Wiley, p. 134.

Measures of dispersion: range, variance, standard deviation, quantiles and z-score

Range

The range is a way of measuring how spread out a set of values is. It only describes the width of the data, not how it’s dispersed between the bounds. Because it depends only on the two most extreme values, the range is extremely sensitive to outliers and can easily give a misleading impression of the underlying data.

Quantiles

Quantiles are values that split ordered data into equal-sized groups. To reduce the impact outliers have on range, we use quartiles, the quantiles that split the data into quarters. The lowest quartile is called the lower quartile $Q_{1}$, and the highest is called the upper quartile $Q_{3}$. The middle quartile $Q_{2}$ is the median. The range of values between $Q_{1}$ and $Q_{3}$ is called the interquartile range (IQR), and it is less sensitive to outliers than the range.

Percentiles

Percentiles are similar to quartiles. Instead of splitting data into quarters, percentiles split data into hundredths. We use $P_{x}$ to denote percentiles: $P_{25} = Q_{1}$, $P_{75} = Q_{3}$, and $P_{50} = Q_{2} = \text{median}$. Percentiles are useful for determining or benchmarking ranks or positions. They let you see how high a particular value is relative to all the others.
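
The sketch below compares the range with the interquartile range on invented scores containing one outlier. It relies on `statistics.quantiles` with its default "exclusive" method; other tools and textbooks use slightly different quartile conventions, so treat the exact cut points as one convention among several.

```python
import statistics

# Hypothetical scores; 95 is a deliberate outlier
scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 95]

data_range = max(scores) - min(scores)           # dominated by the outlier
q1, q2, q3 = statistics.quantiles(scores, n=4)   # the three quartiles
iqr = q3 - q1                                    # spread of the middle half

# n=100 returns 99 cut points, so P_x is at index x - 1
percentiles = statistics.quantiles(scores, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(data_range, (q1, q2, q3), iqr)
```

The outlier stretches the range to 93, while the IQR of 6 still reflects where the middle half of the values sits.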

Variance

While range, quantile and percentile tell us the difference between high and low values, they don’t indicate the spread of the values within those bounds. To measure spread we use variance - the average of the squared distance of each value from the mean:

$\text{variance} = \frac{\sum (x - μ)^{2}}{n} = \frac{\sum x^{2}}{n} - μ^{2}$

Standard deviation

The square root of the variance is called the standard deviation, which gives the spread in terms of the distance from the mean, not the distance squared:

$σ = \sqrt{\text{variance}}$

Standard score (z-score)

The standard score, or z-score, gives you a way of comparing values across different sets of data where the mean and standard deviation differ. It’s a way of comparing related values in different circumstances by converting values in a data set to a more generic distribution, where your data keeps the same basic shape. For example, you can use the z-score to compare two players’ performance against their personal track records to see who performed better (against their own ability) in the same game. To calculate the standard score of a particular value $x$ , you use:

$z = \frac{x - μ}{σ}$
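
To make the variance, standard deviation and z-score formulae concrete, here is a minimal sketch that applies them directly. The data values are made up, and `z_score` is just an illustrative helper name.

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
n = len(values)

mu = sum(values) / n
# Variance: the average squared distance from the mean
variance = sum((x - mu) ** 2 for x in values) / n
# Equivalent shortcut: mean of the squares minus the square of the mean
variance_alt = sum(x * x for x in values) / n - mu ** 2
# Standard deviation: the square root of the variance
sigma = math.sqrt(variance)

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

print(mu, variance, sigma, z_score(9, mu, sigma))
```

For this data set the mean is 5 and the standard deviation is 2, so the value 9 sits two standard deviations above the mean.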

Probability theory

Probability theory is about events - occurrences that have probabilities attached to them. It indicates how likely you are to get the desired outcome. An example would be tossing a coin and getting a head.

Probability is measured on a scale of 0 to 1. To find the probability of an event, you count how many ways there are to get the desired result, and then divide it by the number of all possible outcomes, also known as the possibility space or sample space:

$P (A) = \frac{n ( A )}{n ( S )}$

The event that $A$ does not occur is called the complementary event of $A$, and is written as $A^{'}$. Since $A^{'}$ covers every possibility that is not $A$, we can say $P (A) + P (A^{'}) = 1$.

Sometimes you can add probabilities together, but it doesn’t work in all circumstances. It depends on whether the two events are mutually exclusive.

A Venn diagram is a good tool to visualise the situation. If two events are not mutually exclusive, we say they intersect, and they follow:

$P (A \cup B) = P (A) + P (B) - P (A \cap B)$

Following this formula, if events $A$ and $B$ are mutually exclusive, $P (A \cap B) = 0$ and $P (A \cup B) = P (A) + P (B)$ .
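
The counting definition of probability, the complement rule and the addition rule can be checked on a single roll of a fair die. This is only a sketch: the events `A` (even number) and `B` (greater than 3) are chosen for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))               # one roll of a fair six-sided die
A = {x for x in sample_space if x % 2 == 0}   # even number: {2, 4, 6}
B = {x for x in sample_space if x > 3}        # greater than 3: {4, 5, 6}

def prob(event):
    """P(event) = n(event) / n(S), valid when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_not_a = prob(sample_space - A)              # complementary event A'
p_union = prob(A | B)                         # counted directly
p_union_rule = prob(A) + prob(B) - prob(A & B)  # addition rule

print(p_not_a, p_union, p_union_rule)
```

`A` and `B` intersect (4 and 6 are in both), so simply adding $P (A)$ and $P (B)$ would count those outcomes twice.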

If the probability of one event is influenced by another event occurring, we call it a conditional probability.

$P (A ∣ B) = \frac{P ( A \cap B )}{P ( B )}$

To work out conditional probabilities, we use a probability tree for visualisation. For example, to find $P (A \cap B)$ , we can multiply the probabilities for the two branches: $P (B) \times P (A ∣ B)$ , which is the same equation as above.

With a probability tree starting from $P (A)$ , we can work out $P (B)$ using the law of total probability:

$P (B) = P (A) \times P (B ∣ A) + P (A^{'}) \times P (B ∣ A^{'})$

Combining the above two formulae, we can find reverse conditional probabilities from a given probability tree. This is called Bayes’ Theorem:

$P (A ∣ B) = \frac{P ( A ) \times P ( B ∣ A )}{P ( A ) \times P ( B ∣ A ) + P ( A ^{'} ) \times P ( B ∣ A ^{'} )}$
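
A minimal sketch of Bayes’ Theorem as a function. The name `bayes` and the input probabilities are assumptions for illustration only (think of a rare condition $A$ and a positive test result $B$).

```python
from fractions import Fraction

def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A) and P(B | A')."""
    p_not_a = 1 - p_a
    # Law of total probability gives the denominator P(B)
    p_b = p_a * p_b_given_a + p_not_a * p_b_given_not_a
    return p_a * p_b_given_a / p_b

# Hypothetical inputs: P(A) = 1/100, P(B | A) = 9/10, P(B | A') = 1/20
posterior = bayes(Fraction(1, 100), Fraction(9, 10), Fraction(1, 20))
print(posterior, float(posterior))
```

Even with $P (B ∣ A) = 0.9$, the reverse probability $P (A ∣ B)$ is only 2/13 here, because $A$ itself is rare.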

If two events $A$ and $B$ are said to be independent, their respective outcomes have no effect on each other:

$P (A ∣ B) = P (A)$

If two events are independent, the multiplication rule $P (A \cap B) = P (B) \times P (A ∣ B)$ simplifies, because $P (A ∣ B) = P (A)$:

$P (A \cap B) = P (A) \times P (B)$

Expectations

A probability distribution tells us the probability of all possible outcomes of an event, but it doesn’t tell you the overall impact of these outcomes. In statistics, you can use expectations to work out the long-term impact of an event. For example: what is the expected average gain/loss when playing a slot machine?

To find the expectation $E (X)$ of a variable $X$ , you multiply each value $x$ by the probability of getting that value, and then sum the results:

$E (X) = \sum x P (X = x)$

Similar to a data set, variance and standard deviation can be calculated for the probability distribution of a variable $X$, where $μ = E (X)$:

$σ^{2} = Var (X) = \sum (x - μ)^{2} P (X = x) = E ((X - μ)^{2})$

$σ = \sqrt{Var (X)}$

In other words, we can use the formulae above to answer questions like what is the expected average winnings from playing a slot machine and how far away are we likely to be from the expectation for every pull of the lever.

If the consequences or pay-offs of the outcomes change, we don’t have to recalculate the probability distribution to work out the new expectation. It follows these linear transformations:

$E (a X + b) = a E (X) + b$

$Var (a X + b) = a^{2} Var (X)$
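
The sketch below works through expectation, variance and the linear transformation rules for a made-up slot machine. The pay-offs, the probabilities and the constants `a` and `b` are all assumptions for illustration.

```python
from fractions import Fraction
import math

# Hypothetical net gain per pull: lose 1 with probability 9/10, win 4 with 1/10
dist = {-1: Fraction(9, 10), 4: Fraction(1, 10)}

def expectation(dist):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = sum of (x - mu)^2 * P(X = x)."""
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

e_x, var_x = expectation(dist), variance(dist)
sigma = math.sqrt(var_x)

# New pay-offs Y = aX + b, recalculated the long way for comparison
a, b = 2, -1
transformed = {a * x + b: p for x, p in dist.items()}

print(e_x, var_x, sigma, expectation(transformed), variance(transformed))
```

Recalculating from the transformed distribution gives the same answers as $a E (X) + b$ and $a^{2} Var (X)$, which is the point of the shortcut.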

The outcome of an event is called an observation. For independent observations - for example, playing two slot machines of the same type at the same time - the expectations of two distinct instances of the observation (which has the same underlying probability distribution) become:

$E (X_{1} + X_{2}) = 2 E (X)$

and the variance:

$Var (X_{1} + X_{2} + ... + X_{n}) = n Var (X)$

If $X$ and $Y$ are independent variables:

$E (X + Y) = E (X) + E (Y)$

$Var (X + Y) = Var (X) + Var (Y)$

The linear transformations we looked at earlier become:

$E (a X + bY) = a E (X) + b E (Y)$

$Var (a X + bY) = a^{2} Var (X) + b^{2} Var (Y)$

$E (a X - bY) = a E (X) - b E (Y)$

$Var (a X - bY) = a^{2} Var (X) + b^{2} Var (Y)$
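
One way to convince yourself that variances add even when the variables are subtracted is to enumerate a small case. This sketch uses two independent fair dice as $X$ and $Y$; `combine` is a hypothetical helper that builds the distribution of $a X + bY$.

```python
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}  # one fair die

def expectation(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def combine(dist_x, dist_y, a, b):
    """Distribution of a*X + b*Y for independent X and Y."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        value = a * x + b * y
        # Independence lets us multiply the two probabilities
        out[value] = out.get(value, 0) + px * py
    return out

diff = combine(die, die, 2, -3)  # the variable 2X - 3Y
print(expectation(diff), variance(diff))
```

The expectation of $2X - 3Y$ is negative, but its variance is $2^{2} Var (X) + 3^{2} Var (Y)$: subtracting a variable still adds uncertainty.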

Decision tree

A decision tree is a modelling method that combines two probability theory tools: the probability tree and expected value, to work out which strategy should be adopted among many options with different probabilities and pay-offs. Expected Monetary Value (EMV) is often used to determine the best course of action. Sensitivity analysis is a closely related tool. If a strategy is adopted due to a known probability $p$ of event $A$ , sensitivity analysis calculates at what value of $p$ another strategy will result in a better EMV, and therefore warrants a strategy switch.
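
As a rough sketch of EMV and sensitivity analysis, consider two invented strategies: "launch", which pays 100 with probability $p$ and loses 40 otherwise, and "hold", which pays a certain 20. The strategy names and pay-offs are assumptions, not an example from the book.

```python
from fractions import Fraction

def emv(outcomes):
    """Expected monetary value of a list of (probability, pay-off) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

def launch(p):
    """Branches of the hypothetical 'launch' strategy for success probability p."""
    return [(p, 100), (1 - p, -40)]

hold = [(Fraction(1), 20)]  # a certain pay-off of 20

p = Fraction(3, 5)
options = {"launch": emv(launch(p)), "hold": emv(hold)}
best = max(options, key=options.get)

# Sensitivity analysis: solve 100p - 40(1 - p) = 20 for p
break_even = Fraction(20 + 40, 100 + 40)

print(options, best, break_even)
```

With $p = 0.6$ the EMV favours "launch", but if $p$ falls below the break-even value of 3/7 the better strategy switches to "hold".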

Permutations and combinations

A permutation is the number of ways in which you can choose objects from a pool, where the order in which you choose them counts. It’s a lot more specific than a combination as you want to count the number of ways in which each position is filled.

$^{n} P_{r} = \frac{n !}{( n - r )!}$

A combination is the number of ways in which you can choose objects from a pool, without caring about the exact order in which you choose them. It’s a lot more general than a permutation as you don’t need to know how each position has been filled. It’s enough to know which objects have been chosen.

$^{n} C_{r} = \frac{n !}{r ! ( n - r )!}$
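
Both formulae are available in Python's standard library as `math.perm` and `math.comb` (Python 3.8 or later). The sketch below checks them against the factorial definitions for an arbitrary choice of $n = 5$ and $r = 3$.

```python
import math

n, r = 5, 3  # hypothetical: choose 3 objects from a pool of 5

permutations = math.perm(n, r)   # order matters
combinations = math.comb(n, r)   # order does not matter

# The same results from the factorial formulae
perm_formula = math.factorial(n) // math.factorial(n - r)
comb_formula = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(permutations, combinations)
```

Each combination of 3 objects can be arranged in $3! = 6$ orders, which is why there are six times as many permutations as combinations here.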

Probability distributions

Geometric distribution

The geometric distribution covers situations where:

You run a series of independent trials
Each trial can result in either a success or a failure, and the probability of success is the same for each trial
The main thing you’re interested in is how many trials are needed to get the first successful outcome

If the probability of success in a trial is $p$ and the probability of failure is $1 - p$ (which we’ll call $q$ ), we can work out the probability for any value $r$ (where $r$ is the number of trials needed to get the first success) using:

$P (X = r) = q^{r - 1} p$

$P (X = r)$ is at its highest when $r = 1$ , and it gets lower and lower as $r$ increases. This means that the probability of getting a success is highest for the first trial, and the mode of any geometric distribution is always 1, as this is the value with the highest probability. This may sound counterintuitive, but it’s most likely that only one attempt will be needed for a successful outcome.

$P (X > r)$ is the probability that more than $r$ trials will be needed to get the first successful outcome. For the number of trials needed for a success to be greater than $r$ , there must have been $r$ failures. We don’t need $p$ in this formula because we don’t need to know exactly which trial was successful, just that there must be more than $r$ trials:

$P (X > r) = q^{r}$

We can use this to find $P (X \leq r)$ , the probability that $r$ or fewer trials are needed for there to be a successful outcome:

$P (X \leq r) = 1 - P (X > r) = 1 - q^{r}$

If a variable $X$ follows a geometric distribution where the probability of success in a trial is $p$ , this can be written as:

$X \sim Geo (p)$

The expectation and variance of a geometric distribution can be expressed as:

$E (X) = \frac{1}{p}$

$Var (X) = \frac{q}{p ^{2}}$
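
The geometric formulae translate directly into code. In this sketch the function names and the success probability $p = 1/5$ are assumptions for illustration; exact fractions are used so the results can be compared without rounding.

```python
from fractions import Fraction

def geo_pmf(r, p):
    """P(X = r): r - 1 failures followed by the first success."""
    return (1 - p) ** (r - 1) * p

def geo_more_than(r, p):
    """P(X > r): the first r trials all fail."""
    return (1 - p) ** r

def geo_at_most(r, p):
    """P(X <= r) = 1 - P(X > r)."""
    return 1 - geo_more_than(r, p)

p = Fraction(1, 5)   # hypothetical probability of success per trial
q = 1 - p

expectation = 1 / p
variance = q / p ** 2

print(geo_pmf(3, p), geo_more_than(3, p), geo_at_most(3, p), expectation, variance)
```

Summing `geo_pmf` for $r = 1, 2, 3$ gives the same value as `geo_at_most(3, p)`, and `geo_pmf` is largest at $r = 1$, matching the mode of 1.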

Binomial distribution

Use the binomial distribution if you’re running a fixed number of independent trials, where each one can have a success or failure, the probability of success is the same for each trial, and you’re interested in the number of successes or failures. For example, answering 30 questions correctly out of 40 in the game Who Wants To Be A Millionaire.

The formula for the probability of getting $r$ successes out of $n$ trials is:

$P(X=r)=\,^nC_r \times p^r\times q^{n-r}$

Where $p$ is the probability of success in a trial, $q = 1 - p$ , $n$ is the number of trials, and $X$ is the number of successes in the $n$ trials.

If variable $X$ follows a binomial distribution, we can denote it as:

$X \sim B (n, p)$

The expectation and variance of a binomial distribution can be calculated using:

$E (X) = n p$

$Var (X) = n pq$
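
A short sketch of the binomial formula, using a made-up quiz of 5 questions where each guess has a 1/4 chance of being right. The function name `binomial_pmf` is an assumption.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 5, Fraction(1, 4)   # hypothetical: 5 questions, 4 options each
q = 1 - p

p_two_right = binomial_pmf(2, n, p)
expectation = n * p
variance = n * p * q

# Cross-check E(X) by summing over the whole distribution
mean_from_pmf = sum(r * binomial_pmf(r, n, p) for r in range(n + 1))

print(p_two_right, expectation, variance)
```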

Approximating the binomial distribution with a normal distribution

If $X$ follows a binomial distribution $X \sim B (n, p)$ and $n p > 5$ and $n q > 5$ , you can use the normal distribution $X \sim N (n p, n pq)$ to approximate the binomial distribution. If you do so, you need to apply a continuity correction to make sure your results are accurate.
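
To see what the continuity correction does, the sketch below compares an exact binomial probability with its normal approximation, with and without the correction. The parameters $n = 40$, $p = 0.5$ and the cut-off 15 are arbitrary choices; `statistics.NormalDist` takes the standard deviation, so the variance $n pq$ is square-rooted.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 40, 0.5   # hypothetical: np = 20 and nq = 20, both above 5
q = 1 - p

# Exact binomial probability P(X <= 15)
exact = sum(comb(n, r) * p ** r * q ** (n - r) for r in range(16))

# Normal approximation X ~ N(np, npq)
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
with_correction = approx_dist.cdf(15.5)   # P(X <= 15) becomes P(X < 15.5)
without_correction = approx_dist.cdf(15)

print(round(exact, 4), round(with_correction, 4), round(without_correction, 4))
```

The corrected value lands much closer to the exact probability, because the discrete value 15 covers the interval from 14.5 to 15.5 on the continuous scale.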

Approximating the binomial distribution with a Poisson distribution

A binomial distribution $X \sim B (n, p)$ can be approximated by a Poisson distribution $X \sim Po (n p)$ if $n$ is large (say over 50) and $p$ is small (say less than 0.1).
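
A quick numerical comparison shows how close the two distributions are under these conditions. This is a sketch with assumed parameters $n = 100$ and $p = 0.02$, using the Poisson probability formula given in the Poisson distribution section.

```python
from math import comb, exp, factorial

n, p = 100, 0.02   # hypothetical: n over 50 and p under 0.1
lam = n * p        # Poisson parameter λ = np

def binomial_pmf(r):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def poisson_pmf(r):
    return exp(-lam) * lam ** r / factorial(r)

for r in range(6):
    print(r, round(binomial_pmf(r), 4), round(poisson_pmf(r), 4))
```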

Poisson distribution

The Poisson distribution covers situations where:

Individual events occur at random and independently in a given interval. This can be an interval of time or space - for example, during a week, or per mile.
You know the mean number of occurrences in the interval or the rate of occurrences, and it’s finite. The mean number of occurrences is normally represented by the Greek letter $λ$ (lambda).

We can use the variable $X$ to represent the number of occurrences in the given interval, for instance the number of breakdowns in a week. If $X$ follows a Poisson distribution with a mean of $λ$ occurrences per interval (or rate), we write this as:

$X \sim Po (λ)$

To find the probability that there are $r$ occurrences in a specific interval, use the formula:

$P (X = r) = \frac{e ^{- λ} λ ^{r}}{r !}$

The variance of the Poisson distribution is the same as the expectation:

$E (X) = Var (X) = λ$
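
The Poisson formula is easy to sketch in code. The rate $λ = 3.4$ (say, breakdowns per week) is an invented figure, and the mean and variance are checked numerically by summing far enough into the tail that the remainder is negligible.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) for X ~ Po(lam)."""
    return exp(-lam) * lam ** r / factorial(r)

lam = 3.4  # hypothetical mean number of breakdowns per week

p_none = poisson_pmf(0, lam)
p_at_most_2 = sum(poisson_pmf(r, lam) for r in range(3))

# Numerical check that E(X) = Var(X) = λ (terms beyond r = 60 are negligible here)
mean = sum(r * poisson_pmf(r, lam) for r in range(60))
var = sum((r - mean) ** 2 * poisson_pmf(r, lam) for r in range(60))

print(round(p_none, 4), round(p_at_most_2, 4), round(mean, 4), round(var, 4))
```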

Approximating the Poisson distribution with a normal distribution

If $X$ follows the Poisson distribution $X \sim Po (λ)$ and $λ > 15$ , you can approximate $X$ using the normal distribution $X \sim N (λ, λ)$ . Don’t forget to apply a continuity correction when approximating a Poisson distribution with the normal distribution.

Normal distribution

For discrete probability distributions, we look at the probability of getting a particular value; for continuous probability distributions, we look at the probability of getting a particular range.

For continuous random variables, probabilities are given by area. To find the probability of getting a particular range of values, we start by sketching the probability density function. The probability of getting a particular range of values is given by the area under the curve between those values.

The normal distribution, also known as the Gaussian distribution, is the most-used distribution model in statistics. It is an ‘ideal’ model for continuous data. It’s what you’d ‘normally’ expect to see in real life for a lot of continuous data.

The normal distribution is defined by two parameters, $μ$ and $σ^{2}$ . $μ$ tells you where the centre of the curve is, and $σ$ gives you the spread. If variable $X$ follows a normal distribution, we say:

$X \sim N (μ, σ^{2})$

When plotted on a coordinate plane, the normal distribution shows as a symmetrical bell curve with infinite tails on each side. $μ$ determines the location of the distribution and $σ$ defines the dispersion - the ‘broadness’ of the curve.

A normal distribution is called the standard normal distribution when $μ = 0$ and $σ = 1$ . You can use the standard score (z-score) to transform a regular normal distribution into a standard normal distribution. This process is called standardisation:

$z = \frac{X - μ}{σ}$

You can look up the distribution table or use Excel to work out the probabilities of $Z$ once you have transformed your normal distribution into a standard normal distribution.
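
Instead of a printed table or Excel, Python's `statistics.NormalDist` can play the same role. The sketch below standardises a value and looks up its probability; the distribution $N (100, 15^{2})$ and the value 130 are assumptions for illustration.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical X ~ N(100, 15^2)
x = 130

z = (x - mu) / sigma                      # standardisation
p_via_z = NormalDist().cdf(z)             # P(Z < z) from the standard normal N(0, 1)
p_direct = NormalDist(mu, sigma).cdf(x)   # P(X < x) without standardising

print(z, round(p_via_z, 4), round(p_direct, 4))
```

Both routes give the same probability, which is exactly what standardisation promises.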

With the help of the rules of linear transformation of expectations, if $X \sim N (μ_{x}, σ_{x}^{2})$ and $Y \sim N (μ_{y}, σ_{y}^{2})$ , and $X$ and $Y$ are independent, then:

$X + Y \sim N (μ_{x} + μ_{y}, σ_{x}^{2} + σ_{y}^{2})$

$X - Y \sim N (μ_{x} - μ_{y}, σ_{x}^{2} + σ_{y}^{2})$

If $X \sim N (μ, σ^{2})$ and $a$ and $b$ are numbers, then:

$a X + b \sim N (a μ + b, a^{2} σ^{2})$

If $X_{1}, X_{2}, ..., X_{n}$ are independent observations of $X$ where $X \sim N (μ, σ^{2})$ :

$X_{1} + X_{2} + ... + X_{n} \sim N (n μ, n σ^{2})$
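
`statistics.NormalDist` supports this arithmetic directly, treating the operands as independent, so it can be used to check the rules above. The parameters are made up, and note that `NormalDist` stores the standard deviation rather than the variance.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(mu=10, sigma=3)   # hypothetical X ~ N(10, 3^2)
Y = NormalDist(mu=5, sigma=4)    # hypothetical Y ~ N(5, 4^2)

total = X + Y        # N(10 + 5, 9 + 16)
difference = X - Y   # N(10 - 5, 9 + 16): the variances still add
scaled = 2 * X + 1   # N(2 * 10 + 1, 2^2 * 9)

# Sum of n independent observations of X: N(n * mu, n * sigma^2)
n = 4
sum_of_n = NormalDist(mu=n * X.mean, sigma=sqrt(n) * X.stdev)

print(total, difference, scaled, sum_of_n)
```

Four independent observations have a standard deviation of 6, not 12: summing independent copies is different from multiplying a single observation by 4.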

Predictions with samples

Sampling

A statistical population refers to the entire group of things that you’re trying to measure, study, or analyse. It can refer to anything from humans to scores to gumballs. The key thing is that a population refers to all of them. A closely related concept is the target population, which means the group that you’re researching and want to collect results for. The target population you choose depends, to a large extent, on the purpose of your study. For example, do you want to gather data about all the gumballs in the world, one particular brand, or one particular type?

A census is a study or survey involving the entire population. A census can provide you with accurate information about your population, but it’s not always practical, especially when populations are very large or infinite.

A statistical sample is a selection of items taken from a population. You choose your sample so that it’s fairly representative of the population as a whole; it’s a subset of the population. To take a sample, start by defining your target population - the population you want to study. Then decide on your sampling units - the sorts of things you need to sample. Once you’ve done that, draw up a sampling frame: a list of all the sampling units in your target population. If you don’t design your sample well, your sample may not be accurate. A poorly designed sample can introduce biases.

A study or survey involving just a sample of the population is called a sample survey. A lot of the time, conducting a survey is more practical than a census. It’s usually less time-consuming and less expensive, as you don’t have to deal with the entire population.

A sampling unit refers to the sort of object you are going to sample. For example, this could be a single gumball or a packet of gumballs. Once you have a list of all the sampling units within your target population, preferably with each sampling unit either named or numbered, you have the sampling frame. The sampling frame is essentially a list from which you can choose your sample.

A sample is biased if it isn’t representative of your target population. There are lots of sources of bias, and a lot of it comes down to how you choose your sample:

A sampling frame where items have been left off, such that not everything in the target population is included. If it’s not in your sampling frame, it won’t be in your sample.
An incorrect sampling unit. Instead of individual gumballs, maybe the sampling unit should have been boxes of gumballs.
Individual sampling units you chose for your sample weren’t included in your actual sample. For example, you might send out a questionnaire that not everybody responds to.
Poorly designed questions in a questionnaire. Design your questions so that they’re neutral and everyone can answer them. An example of a biased question is “Mighty Gumball candy is tastier than any other brand, do you agree?” It would be better to ask the person being surveyed for the name of their favourite brand of confectionery.
Samples that aren’t random. For example, if you’re conducting a survey on the street, you may avoid questioning anyone who looks too busy to stop, or too aggressive. This means you exclude aggressive or busy-looking people from your survey.

Here are a few ways we can choose our samples:

Simple random sampling is where you choose sampling units at random to form your sample. This can be with or without replacement. You can perform simple random sampling by drawing lots or using random number generators. Sampling with replacement means that when you’ve selected each unit and recorded relevant information about it, you put it back into the population.
Stratified sampling is where you divide the population into groups of similar units, or strata (each individual group is called a stratum). Each stratum is as different from the others as possible. Once you’ve done this, you perform simple random sampling within each stratum.
Cluster sampling is where you divide the population into clusters where each cluster is as similar to the others as possible. You use simple random sampling to choose a selection of clusters. You then sample every unit in these clusters. The problem with cluster sampling is that it might not be entirely random. For example, it’s likely that all of the gumballs in a packet will have been produced by the same factory. If there are differences between the factories, you may not pick these up.
Systematic sampling is where you choose a number, $k$, and sample every $k$th unit. The disadvantage of systematic sampling is that if there’s some sort of cyclic pattern in the population, your sample will be biased. As an example, if gumballs are produced such that every $k$th gumball is red, you will end up only sampling red gumballs, which could lead to misleading conclusions about your population.
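
The four sampling methods above can be sketched with Python's `random` module. Everything here is illustrative: the population of 60 gumballs, the colours, the packet size of 5 and the sample sizes are all assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 60 gumballs produced in a red, green, blue cycle
population = [{"id": i, "colour": colour}
              for i, colour in enumerate(["red", "green", "blue"] * 20)]

# Simple random sampling, without and with replacement
simple = rng.sample(population, k=6)
with_replacement = rng.choices(population, k=6)

# Stratified sampling: simple random sampling within each stratum (colour)
strata = {}
for unit in population:
    strata.setdefault(unit["colour"], []).append(unit)
stratified = [u for group in strata.values() for u in rng.sample(group, k=2)]

# Cluster sampling: pick whole packets of 5, then sample every unit in them
clusters = [population[i:i + 5] for i in range(0, len(population), 5)]
cluster_sample = [u for packet in rng.sample(clusters, k=2) for u in packet]

# Systematic sampling: every k-th unit from a random starting point
k = 10
systematic = population[rng.randrange(k)::k]

# The cyclic-pattern problem: k = 3 matches the colour cycle
biased = population[0::3]
print({u["colour"] for u in biased})
```

With $k = 3$ the systematic sample lines up with the production cycle and contains only red gumballs, which is the bias described above.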

Predictions with samples

It is not always feasible to calculate the exact value of the population parameters when the population is enormous. So instead of calculating them using the population, we estimate them using the sample data. To do this, we use point estimators to come up with a best guess of the population parameters.

We use $\bar{x}$ to denote the sample mean. $μ$ has a very precise meaning - the population mean. $\bar{x}$ is the sample equivalent of $μ$:

$\bar{x} = \frac{\sum x}{n}$

We can use this relationship to write a shorthand expression for the point estimator for the population. Since we can estimate the population mean using the mean of the sample, this means that:

$\hat{μ} = \bar{x}$

When you choose a sample, you have a smaller number of values than with the population, and since you have fewer values, there’s a good chance they’re more clustered around the mean than they would be in the population. More extreme values are less likely to be in your sample, as there are generally fewer of them.

To find a closer match to the value of the population variance, we use:

$\hat{σ}^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1}$

Dividing a set of numbers by $n - 1$ gives a higher result than dividing by $n$ , and this difference is most noticeable when $n$ is fairly small. It means that the formula is similar to the variance of the sample data but gives a slightly higher result.

The population variance tends to be higher than the variance of the data in the sample. This means that the formula above is a slightly better point estimator for the population variance.
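
In Python's `statistics` module, `pvariance` divides by $n$ and `variance` divides by $n - 1$, so the two can be compared side by side. The sample values below are invented for illustration.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

x_bar = statistics.mean(sample)                      # point estimate of μ
sample_data_variance = statistics.pvariance(sample)  # divides by n
sigma_hat_sq = statistics.variance(sample)           # divides by n - 1

print(x_bar, sample_data_variance, round(sigma_hat_sq, 3))
```

The $n - 1$ version is slightly larger, and the gap shrinks as the sample size grows.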

T-distribution

The t-distribution is used when the sample size is small, the population is normal, and the population variance is unknown (you need to estimate it with sample variance).

References

Griffiths, D. (2008). Head First Statistics. Sebastopol, CA: O’Reilly.
Willoughby, D. (2015). An Essential Guide to Business Statistics. Chichester: Wiley.
