9 General Statistics
Introduction
Any significant application of R includes statistics or models or graphics. This chapter addresses the statistics. Some recipes simply describe how to calculate a statistic, such as relative frequency. Most recipes involve statistical tests or confidence intervals. The statistical tests let you choose between two competing hypotheses; that paradigm is described next. Confidence intervals reflect the likely range of a population parameter and are calculated based on your data sample.
Null Hypotheses, Alternative Hypotheses, and p-Values
Many of the statistical tests in this chapter use a time-tested paradigm of statistical inference. In the paradigm, we have one or two data samples. We also have two competing hypotheses, either of which could reasonably be true.
One hypothesis, called the null hypothesis, is that nothing happened: the mean was unchanged; the treatment had no effect; you got the expected answer; the model did not improve; and so forth.
The other hypothesis, called the alternative hypothesis, is that something happened: the mean rose; the treatment improved the patients’ health; you got an unexpected answer; the model fit better; and so forth.
We want to determine which hypothesis is more likely in light of the data:
To begin, we assume that the null hypothesis is true.
We calculate a test statistic. It could be something simple, such as the mean of the sample, or it could be quite complex. The critical requirement is that we must know the statistic’s distribution. We might know the distribution of the sample mean, for example, by invoking the Central Limit Theorem.
From the statistic and its distribution we can calculate a p-value, the probability of a test statistic value as extreme or more extreme than the one we observed, while assuming that the null hypothesis is true.
If the p-value is too small, we have strong evidence against the null hypothesis. This is called rejecting the null hypothesis.
If the p-value is not small, then we have no such evidence. This is called failing to reject the null hypothesis.
There is one necessary decision here: When is a p-value “too small”?
Note
In this book, we follow the common convention that we reject the null hypothesis when p < 0.05 and fail to reject it when p > 0.05. In statistical terminology, we chose a significance level of α = 0.05 to define the border between strong evidence and insufficient evidence against the null hypothesis.
But the real answer is, “it depends.” Your chosen significance level depends on your problem domain. The conventional limit of p < 0.05 works for many problems. In our work, the data are especially noisy and so we are often satisfied with p < 0.10. For someone working in high-risk areas, p < 0.01 or p < 0.001 might be necessary.
In the recipes, we mention which tests include a p-value so that you can compare the p-value against your chosen significance level of α. We worded the recipes to help you interpret the comparison. Here is the wording from Recipe 9.4, Testing Categorical Variables for Independence, a test for the independence of two factors:
Conventionally, a p-value of less than 0.05 indicates that the variables are likely not independent whereas a p-value exceeding 0.05 fails to provide any such evidence.
This is a compact way of saying:
The null hypothesis is that the variables are independent.
The alternative hypothesis is that the variables are not independent.
For α = 0.05, if p < 0.05 then we reject the null hypothesis, giving strong evidence that the variables are not independent; if p > 0.05, we fail to reject the null hypothesis.
You are free to choose your own α, of course, in which case your decision to reject or fail to reject might be different.
Remember, the recipe states the informal interpretation of the test results, not the rigorous mathematical interpretation. We use colloquial language in the hope that it will guide you toward a practical understanding and application of the test. If the precise semantics of hypothesis testing is critical for your work, we urge you to consult the reference cited under See Also or one of the other fine textbooks on mathematical statistics.
Confidence Intervals
Hypothesis testing is a well-understood mathematical procedure, but it can be frustrating. First, the semantics is tricky. The test does not reach a definite, useful conclusion. You might get strong evidence against the null hypothesis, but that’s all you’ll get. Second, it does not give you a number, only evidence.
If you want numbers then use confidence intervals, which bound the estimate of a population parameter at a given level of confidence. Recipes in this chapter can calculate confidence intervals for means, medians, and proportions of a population.
For example, Recipe 9.9, “Forming a Confidence Interval for a Mean”, calculates a 95% confidence interval for the population mean based on sample data. The interval is 97.16 < μ < 103.98, which means there is a 95% probability that the population’s mean, μ, is between 97.16 and 103.98.
See Also
Statistical terminology and conventions can vary. This book generally follows the conventions of Mathematical Statistics with Applications, 6th ed., by Dennis Wackerly et al. (Duxbury Press). We recommend this book also for learning more about the statistical tests described in this chapter.
9.1 Summarizing Your Data
Problem
You want a basic statistical summary of your data.
Solution
The summary function gives some useful statistics for vectors, matrices, factors, and data frames:
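The original sample is not reproduced in this excerpt; a minimal sketch with a hypothetical lognormal sample (chosen so the mean exceeds the median, as discussed below) might look like this:

vec <- rlnorm(100)   # hypothetical data; any numeric vector works
summary(vec)
#> (prints Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for vec)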
Discussion
The Solution exhibits the summary of a vector. The 1st Qu. and 3rd Qu. are the first and third quartiles, respectively. Having both the median and mean is useful because you can quickly detect skew. The preceding Solution, for example, shows a mean that is larger than the median; this indicates a possible skew to the right, as one would expect from a lognormal distribution.
The summary of a matrix works column by column. Here we see the summary of a matrix, mat, with three columns named Samp1, Samp2, and Samp3:
summary(mat)
#> Samp1 Samp2 Samp3
#> Min. : 1.0 Min. :-2.943 Min. : 0.04
#> 1st Qu.: 25.8 1st Qu.:-0.774 1st Qu.: 0.39
#> Median : 50.5 Median :-0.052 Median : 0.85
#> Mean : 50.5 Mean :-0.067 Mean : 1.60
#> 3rd Qu.: 75.2 3rd Qu.: 0.684 3rd Qu.: 2.12
#> Max. :100.0 Max. : 2.150 Max. :13.18
The summary of a factor gives counts:
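For instance, with a hypothetical factor fac:

fac <- factor(sample(c("Yes", "No", "Maybe"), 100, replace = TRUE))   # hypothetical factor
summary(fac)
#> (prints the count of observations in each level)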
The summary of a character vector is pretty useless, just the vector length:
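For instance, converting that hypothetical factor to character:

chr <- as.character(fac)
summary(chr)
#>    Length     Class      Mode
#>       100 character character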
The summary of a data frame incorporates all these features. It works column by column, giving an appropriate summary according to the column type. Numeric values receive a statistical summary and factors are counted (character strings are not summarized):
library(readr)   # read_csv() comes from the readr package
suburbs <- read_csv("./data/suburbs.txt")
summary(suburbs)
#> city county state
#> Length:17 Length:17 Length:17
#> Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character
#>
#>
#>
#> pop
#> Min. : 5428
#> 1st Qu.: 72616
#> Median : 83048
#> Mean : 249770
#> 3rd Qu.: 102746
#> Max. :2853114
The “summary” of a list is pretty funky: just the data type of each list member. Here is a summary of a list of vectors:
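The list vec_list is not defined in this excerpt; a construction consistent with the output below would be:

vec_list <- list(x = rnorm(100),
                 y = rnorm(100),
                 z = sample(letters, 100, replace = TRUE))   # hypothetical data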
summary(vec_list)
#> Length Class Mode
#> x 100 -none- numeric
#> y 100 -none- numeric
#> z 100 -none- character
To summarize the data inside a list of vectors, map summary to each list element:
library(purrr)
map(vec_list, summary)
#> $x
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -2.572 -0.583 0.006 0.018 0.710 2.413
#>
#> $y
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -1.772 -0.740 0.024 0.027 0.597 2.293
#>
#> $z
#> Length Class Mode
#> 100 character character
Unfortunately, the summary function does not compute any measure of variability, such as standard deviation or median absolute deviation. This is a serious shortcoming, so we usually call sd or mad (median absolute deviation) right after calling summary.
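A sketch, assuming vec holds your sample:

summary(vec)
sd(vec)    # standard deviation
mad(vec)   # median absolute deviation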
See Also
See Recipes 2.6, “Computing Basic Statistics”, and 6.1, “Applying a Function to Each List Element”.
9.2 Calculating Relative Frequencies
Problem
You want to count the relative frequency of certain observations in your sample.
Solution
Identify the interesting observations by using a logical expression; then use the mean function to calculate the fraction of observations it identifies. For example, given a vector x, you can find the relative frequency of positive values in this way:
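A minimal sketch:

mean(x > 0)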
Discussion
A logical expression, such as x > 0, produces a vector of logical values (TRUE and FALSE), one for each element of x. The mean function converts those values to 1s and 0s, respectively, and computes the average. This gives the fraction of values that are TRUE—in other words, the relative frequency of the interesting values. In the Solution, for example, that’s the relative frequency of positive values.
The concept here is pretty simple. The tricky part is dreaming up a suitable logical expression. Here are some examples:
mean(lab == "NJ")
    Fraction of lab values that are New Jersey
mean(after > before)
    Fraction of observations for which the effect increases
mean(abs(x - mean(x)) > 2*sd(x))
    Fraction of observations that exceed two standard deviations from the mean
mean(diff(ts) > 0)
    Fraction of observations in a time series that are larger than the previous observation
9.3 Tabulating Factors and Creating Contingency Tables
Problem
You want to tabulate one factor or build a contingency table from multiple factors.
Solution
The table function produces counts of one factor. It can also produce contingency tables (cross-tabulations) from two or more factors, and it works for characters, too, not only factors:
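Minimal sketches of all three uses (the factors f1 and f2 and the character vector ch are hypothetical):

table(f1)        # counts of one factor
table(f1, f2)    # contingency table (cross-tabulation) of two factors
table(ch)        # character data works, too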
Discussion
The table function counts the levels of one factor or characters, such as these counts of initial and outcome (which are factors):
set.seed(42)
initial <- factor(sample(c("Yes", "No", "Maybe"), 100, replace = TRUE))
outcome <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))
table(initial)
#> initial
#> Maybe No Yes
#> 21 39 40
table(outcome)
#> outcome
#> Fail Pass
#> 56 44
The greater power of table is in producing contingency tables, also known as cross-tabulations. Each cell in a contingency table counts how many times that row/column combination occurred:
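The call itself is simply the following; its output, a 3 x 2 table of counts, is discussed next:

table(initial, outcome)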
This table shows that the combination of initial = Yes and outcome = Fail occurred 13 times, the combination of initial = Yes and outcome = Pass occurred 17 times, and so forth.
See Also
The xtabs function can also produce a contingency table. It has a formula interface, which some people prefer.
9.4 Testing Categorical Variables for Independence
Problem
You have two categorical variables that are represented by factors. You want to test them for independence using the chi-squared test.
Solution
Use the table function to produce a contingency table from the two factors. Then use the summary function to perform a chi-squared test of the contingency table. In this example we have two vectors of factor values, which we created in the prior recipe:
summary(table(initial, outcome))
#> Number of cases in table: 100
#> Number of factors: 2
#> Test for independence of all factors:
#> Chisq = 0.03, df = 2, p-value = 1
The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the variables are likely not independent, whereas a p-value exceeding 0.05 fails to provide any such evidence.
Discussion
This example performs a chi-squared test on the contingency table from Recipe 9.3, “Tabulating Factors and Creating Contingency Tables”, and yields a p-value of 1:
summary(table(initial, outcome))
#> Number of cases in table: 100
#> Number of factors: 2
#> Test for independence of all factors:
#> Chisq = 0.03, df = 2, p-value = 1
The large p-value indicates that the two factors, initial and outcome, are probably independent. Practically speaking, we conclude there is no connection between the variables. This makes sense, as this example data was created by simply drawing random data using the sample function in the prior recipe.
See Also
The chisq.test function can also perform this test.
9.5 Calculating Quantiles (and Quartiles) of a Dataset
Problem
Given a fraction f, you want to know the corresponding quantile of your data. That is, you seek the observation x such that the fraction of observations below x is f.
Solution
Use the quantile function. The second argument is the fraction, f.
For quartiles, simply omit the second argument altogether:
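Minimal sketches, assuming your data is in a vector vec (the fraction 0.95 is just an illustration):

quantile(vec, 0.95)   # the observation below which 95% of the data falls
quantile(vec)         # with no fraction given: min, quartiles, and max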
Discussion
Suppose vec contains 1,000 observations between 0 and 1. The quantile function can tell you which observation delimits the lower 5% of the data:
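A sketch of the call (output omitted):

quantile(vec, 0.05)   # the 5th percentile of vec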
The quantile documentation refers to the second argument as a “probability,” which is natural when we think of probability as meaning relative frequency.
In true R style, the second argument can be a vector of probabilities; in this case, quantile returns a vector of corresponding quantiles, one for each probability:
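For example, to find the quantiles bounding the middle 90% of the data:

quantile(vec, c(0.05, 0.95))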
That is a handy way to identify the middle 90% (in this case) of the observations.
If you omit the probabilities altogether, then R assumes you want the probabilities 0, 0.25, 0.50, 0.75, and 1.0—in other words, the quartiles:
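That is:

quantile(vec)   # returns the 0%, 25%, 50%, 75%, and 100% quantiles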
Amazingly, the quantile function implements nine (yes, nine) different algorithms for computing quantiles. Study the help page before assuming that the default algorithm is the best one for you.
9.6 Inverting a Quantile
Problem
Given an observation x from your data, you want to know its corresponding quantile. That is, you want to know what fraction of the data is less than x.
Solution
Assuming your data is in a vector vec, compare the data against the observation and then use mean to compute the relative frequency of values less than x—say, 1.6 as per this example:
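A minimal sketch:

mean(vec < 1.6)   # fraction of vec below the observation 1.6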
Discussion
The expression vec < x compares every element of vec against x and returns a vector of logical values, where the nth logical value is TRUE if vec[n] < x. The mean function converts those logical values to 0 and 1: 0 for FALSE and 1 for TRUE. The average of all those 1s and 0s is the fraction of vec that is less than x, or the inverse quantile of x.
See Also
This is an application of the general approach described in Recipe 9.2, “Calculating Relative Frequencies”.
9.7 Converting Data to z-Scores
Problem
You have a dataset, and you want to calculate the corresponding z-scores for all data elements. (This is sometimes called normalizing the data.)
Solution
Use the scale function:
scale(x)
#> [,1]
#> [1,] 2.2638
#> [2,] -0.0463
#> [3,] -0.9400
#> [4,] -0.7755
#> [5,] 1.0020
#> [6,] 0.0650
#> [7,] -1.0227
#> [8,] -0.5412
#> [9,] 0.1408
#> [10,] -0.1458
#> attr(,"scaled:center")
#> [1] 2.53
#> attr(,"scaled:scale")
#> [1] 2.15
This works for vectors, matrices, and data frames. In the case of a vector, scale returns the vector of normalized values. In the case of matrices and data frames, scale normalizes each column independently and returns columns of normalized values in a matrix.
Discussion
You might also want to normalize a single value y relative to a dataset x. You can do so by using vectorized operations as follows:
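A sketch of that calculation:

(y - mean(x)) / sd(x)   # z-score of the single value y relative to the dataset x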
9.8 Testing the Mean of a Sample (t Test)
Problem
You have a sample from a population. Given this sample, you want to know if the mean of the population could reasonably be a particular value m.
Solution
Apply the t.test function to the sample x with the argument mu = m:
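For example, to test whether the population mean could be 95 (a sketch; x holds the sample):

t.test(x, mu = 95)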
The output includes a p-value. Conventionally, if p < 0.05 then the population mean is unlikely to be m, whereas p > 0.05 provides no such evidence.
If your sample size n is small, then the underlying population must be normally distributed in order to derive meaningful results from the t test. A good rule of thumb is that “small” means n < 30.
Discussion
The t test is a workhorse of statistics, and this is one of its basic
uses: making inferences about a population mean from a sample. The
following example simulates sampling from a normal population with mean
μ = 100. It uses the t test to ask if the population mean could be
95, and t.test
reports a p-value of 0.005:
x <- rnorm(75, mean = 100, sd = 15)
t.test(x, mu = 95)
#>
#> One Sample t-test
#>
#> data: x
#> t = 3, df = 74, p-value = 0.005
#> alternative hypothesis: true mean is not equal to 95
#> 95 percent confidence interval:
#> 96.5 103.0
#> sample estimates:
#> mean of x
#> 99.7
The p-value is small and so it’s unlikely (based on the sample data) that 95 could be the mean of the population.
Informally, we could interpret the low p-value as follows. If the population mean were really 95, then the probability of observing our test statistic (t = 3 or something more extreme) would be only 0.005. That is very improbable, yet that is the value we observed. Hence we conclude that the null hypothesis is wrong; therefore, the sample data does not support the claim that the population mean is 95.
In sharp contrast, testing for a mean of 100 gives a p-value of 0.9:
t.test(x, mu = 100)
#>
#> One Sample t-test
#>
#> data: x
#> t = -0.2, df = 74, p-value = 0.9
#> alternative hypothesis: true mean is not equal to 100
#> 95 percent confidence interval:
#> 96.5 103.0
#> sample estimates:
#> mean of x
#> 99.7
The large p-value indicates that the sample is consistent with assuming a population mean μ of 100. In statistical terms, the data does not provide evidence against the true mean being 100.
A common case is testing for a mean of zero. If you omit the mu argument, it defaults to zero.
See Also
The t.test function is a many-splendored thing. See Recipes 9.9, “Forming a Confidence Interval for a Mean”, and 9.15, “Comparing the Means of Two Samples”, for other uses.
9.9 Forming a Confidence Interval for a Mean
Problem
You have a sample from a population. Given that sample, you want to determine a confidence interval for the population’s mean.
Solution
Apply the t.test function to your sample x:
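A minimal call (x holds the sample):

t.test(x)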
The output includes a confidence interval at the 95% confidence level.
To see intervals at other levels, use the conf.level argument.
As in Recipe 9.8, “Testing the Mean of a Sample”, if your sample size n is small, then the underlying population must be normally distributed for there to be a meaningful confidence interval. Again, a good rule of thumb is that “small” means n < 30.
Discussion
Applying the t.test function to a vector yields a lot of output. Buried in the output is a confidence interval:
t.test(x)
#>
#> One Sample t-test
#>
#> data: x
#> t = 53, df = 49, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#> 94.2 101.5
#> sample estimates:
#> mean of x
#> 97.9
In this example, the confidence interval is approximately 94.2 < μ < 101.5, which is sometimes written simply as (94.2, 101.5).
We can raise the confidence level to 99% by setting conf.level=0.99:
t.test(x, conf.level = 0.99)
#>
#> One Sample t-test
#>
#> data: x
#> t = 53, df = 49, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 99 percent confidence interval:
#> 92.9 102.8
#> sample estimates:
#> mean of x
#> 97.9
That change widens the confidence interval to 92.9 < μ < 102.8.
9.10 Forming a Confidence Interval for a Median
Problem
You have a data sample, and you want to know the confidence interval for the median.
Solution
Use the wilcox.test function, setting conf.int=TRUE:
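A minimal sketch, assuming x holds the sample:

wilcox.test(x, conf.int = TRUE)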
The output will contain a confidence interval for the median.
Discussion
The procedure for calculating the confidence interval of a mean is well defined and widely known. The same is not true for the median, unfortunately. There are several procedures for calculating the median’s confidence interval. None of them is “the” procedure, but the Wilcoxon signed rank test is pretty standard.
The wilcox.test function implements that procedure. Buried in the output is the 95% confidence interval, which is approximately (–0.102, 0.646) in this case:
wilcox.test(x, conf.int = TRUE)
#>
#> Wilcoxon signed rank test
#>
#> data: x
#> V = 220, p-value = 0.1
#> alternative hypothesis: true location is not equal to 0
#> 95 percent confidence interval:
#> -0.102 0.646
#> sample estimates:
#> (pseudo)median
#> 0.311
You can change the confidence level by setting conf.level, such as conf.level=0.99 or other such values.
The output also includes something called the pseudomedian, which is defined on the help page. Don’t assume it equals the median; they are different:
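A sketch of the comparison (output omitted):

median(x)   # typically differs from the pseudomedian reported above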
See Also
The bootstrap procedure is also useful for estimating the median’s confidence interval; see Recipes 8.5, “Generating a Random Sample” and 13.8, “Bootstrapping a Statistic”.
9.11 Testing a Sample Proportion
Problem
You have a sample of values from a population consisting of successes and failures. You believe the true proportion of successes is p, and you want to test that hypothesis using the sample data.
Solution
Use the prop.test function. Suppose the sample size is n and the sample contains x successes:
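A schematic call (substitute your own values for x, n, and p):

prop.test(x, n, p)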
The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the true proportion is unlikely to be p, whereas a p-value exceeding 0.05 fails to provide such evidence.
Discussion
Suppose you encounter some loudmouthed fan of the Chicago Cubs early in the baseball season. The Cubs have played 20 games and won 11 of them, or 55% of their games. Based on that evidence, the fan is “very confident” that the Cubs will win more than half of their games this year. Should he be that confident?
The prop.test function can evaluate the fan’s logic. Here, the number of observations is n = 20, the number of successes is x = 11, and p is the true probability of winning a game. We want to know whether it is reasonable to conclude, based on the data, that p > 0.5. Normally, prop.test would check for p ≠ 0.5, but we can check for p > 0.5 instead by setting alternative="greater":
prop.test(11, 20, 0.5, alternative = "greater")
#>
#> 1-sample proportions test with continuity correction
#>
#> data: 11 out of 20, null probability 0.5
#> X-squared = 0.05, df = 1, p-value = 0.4
#> alternative hypothesis: true p is greater than 0.5
#> 95 percent confidence interval:
#> 0.35 1.00
#> sample estimates:
#> p
#> 0.55
The prop.test output shows a large p-value, 0.4, so we cannot reject the null hypothesis; that is, we cannot reasonably conclude that p is greater than 1/2. The Cubs fan is being overly confident based on too little data. No surprise there.
9.12 Forming a Confidence Interval for a Proportion
Problem
You have a sample of values from a population consisting of successes and failures. Based on the sample data, you want to form a confidence interval for the population’s proportion of successes.
Solution
Use the prop.test function. Suppose the sample size is n and the sample contains x successes:
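A schematic call (substitute your own values for x and n):

prop.test(x, n)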
The function output includes the confidence interval for p.
Discussion
We subscribe to a stock market newsletter that is well written, but includes a section purporting to identify stocks that are likely to rise. It does this by looking for a certain pattern in the stock price. It recently reported, for example, that a certain stock was following the pattern. It also reported that the stock rose six times after the last nine times that pattern occurred. The writers concluded that the probability of the stock rising again was therefore 6/9, or 66.7%.
Using prop.test, we can obtain the confidence interval for the true proportion of times the stock rises after the pattern. Here, the number of observations is n = 9 and the number of successes is x = 6. The output shows a confidence interval of (0.309, 0.910) at the 95% confidence level:
prop.test(6, 9)
#> Warning in prop.test(6, 9): Chi-squared approximation may be incorrect
#>
#> 1-sample proportions test with continuity correction
#>
#> data: 6 out of 9, null probability 0.5
#> X-squared = 0.4, df = 1, p-value = 0.5
#> alternative hypothesis: true p is not equal to 0.5
#> 95 percent confidence interval:
#> 0.309 0.910
#> sample estimates:
#> p
#> 0.667
The writers are pretty foolish to say the probability of rising is 66.7%. They could be leading their readers into a very bad bet.
By default, prop.test calculates a confidence interval at the 95% confidence level. Use the conf.level argument for other confidence levels:
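For example, for a 99% confidence interval using the same data:

prop.test(6, 9, conf.level = 0.99)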
See Also
See Recipe 9.11, “Testing a Sample Proportion”.
9.13 Testing for Normality
Problem
You want a statistical test to determine whether your data sample is from a normally distributed population.
Solution
Use the shapiro.test function:
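A minimal call (x holds the sample):

shapiro.test(x)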
The output includes a p-value. Conventionally, p < 0.05 indicates that the population is likely not normally distributed, whereas p > 0.05 provides no such evidence.
Discussion
This example reports a p-value of .7765 for x:
The large p-value suggests the underlying population could be normally distributed. The next example reports a very small p-value for y, so it is unlikely that this sample came from a normal population:
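The samples themselves are not reproduced in this excerpt; the calls are simply:

shapiro.test(x)   # p-value of 0.7765 in the text's example
shapiro.test(y)   # very small p-value in the text's example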
We have highlighted the Shapiro–Wilk test because it is a standard R function. You can also install the package nortest, which is dedicated entirely to tests for normality. This package includes:
Anderson–Darling test (ad.test)
Cramer–von Mises test (cvm.test)
Lilliefors test (lillie.test)
Pearson chi-squared test for the composite hypothesis of normality (pearson.test)
Shapiro–Francia test (sf.test)
The problem with all these tests is their null hypothesis: they all assume that the population is normally distributed until proven otherwise. As a result, the population must be decidedly non-normal before the test reports a small p-value and you can reject that null hypothesis. That makes the tests quite conservative, tending to err on the side of normality.
Instead of depending solely upon a statistical test, we suggest also using histograms (Recipe 10.19, “Creating a Histogram”) and quantile-quantile plots (Recipe 10.21, “Creating a Normal Quantile–Quantile Plot”) to evaluate the normality of any data. Are the tails too fat? Is the peak too peaked? Your judgment is likely better than a single statistical test.
See Also
See Recipe 3.10, “Installing Packages from CRAN”, for how to install the nortest
package.
9.14 Testing for Runs
Problem
Your data is a sequence of binary values: yes/no, 0/1, true/false, or other two-valued data. You want to know: Is the sequence random?
Solution
The tseries package contains the runs.test function, which checks a sequence for randomness. The sequence should be a factor with two levels:
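A sketch of the call, assuming s holds your two-valued sequence:

library(tseries)
runs.test(as.factor(s))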
The runs.test function reports a p-value. Conventionally, a p-value of less than 0.05 indicates that the sequence is likely not random, whereas a p-value exceeding 0.05 provides no such evidence.
Discussion
A run is a subsequence composed of identical values, such as all 1s or all 0s. A random sequence should be properly jumbled up, without too many runs. Similarly, it shouldn’t contain too few runs, either. A sequence of perfectly alternating values (0, 1, 0, 1, 0, 1, …) contains no runs longer than a single value, but would you say that it’s random?
The runs.test function checks the number of runs in your sequence. If there are too many or too few, it reports a small p-value.
This first example generates a random sequence of 0s and 1s and then tests the sequence for runs. Not surprisingly, runs.test reports a large p-value, indicating the sequence is likely random:
s <- sample(c(0, 1), 100, replace = T)
runs.test(as.factor(s))
#>
#> Runs Test
#>
#> data: as.factor(s)
#> Standard Normal = -0.2, p-value = 0.9
#> alternative hypothesis: two.sided
This next sequence, however, consists of three runs and so the reported p-value is quite low:
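A hypothetical example of such a sequence:

s <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0)   # three runs
runs.test(as.factor(s))
#> (reports a small p-value)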
See Also
See Recipes 5.4, “Creating a Factor”, and 8.6, “Generating Random Sequences”.
9.15 Comparing the Means of Two Samples
Problem
You have one sample each from two populations. You want to know if the two populations could have the same mean.
Solution
Perform a t test by calling the t.test function:
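A minimal call (x and y hold the two samples):

t.test(x, y)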
By default, t.test assumes that your data are not paired. If the observations are paired (i.e., if each xi is paired with one yi), then specify paired=TRUE:
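That is:

t.test(x, y, paired = TRUE)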
In either case, t.test will compute a p-value. Conventionally, if p < 0.05 then the means are likely different, whereas p > 0.05 provides no such evidence.
If either sample size is small, then the populations must be normally distributed. Here, “small” means fewer than 20 data points.
If the two populations have the same variance, specify var.equal=TRUE to obtain a less conservative test.
Discussion
We often use the t test to get a quick sense of the difference between two population means. It requires that the samples be large enough (i.e., both samples have 20 or more observations) or that the underlying populations be normally distributed. We don’t take the “normally distributed” part too literally. Being bell-shaped and reasonably symmetrical should be good enough.
A key distinction here is whether or not your data contains paired observations, since the results may differ in the two cases. Suppose we want to know if coffee in the morning improves scores on SATs. We could run the experiment two ways:
Randomly select one group of people. Give them the SAT twice, once with morning coffee and once without morning coffee. For each person, we will have two SAT scores. These are paired observations.
Randomly select two groups of people. One group has a cup of morning coffee and takes the SAT. The other group just takes the test. We have a score for each person, but the scores are not paired in any way.
Statistically, these experiments are quite different. In experiment 1, there are two observations for each person (caffeinated and decaf) and they are not statistically independent. In experiment 2, the data are independent.
If you have paired observations (experiment 1) and erroneously analyze them as unpaired observations (experiment 2), then you could get this result with a p-value of 0.3:
load("./data/sat.rdata")
t.test(x, y)
#>
#> Welch Two Sample t-test
#>
#> data: x and y
#> t = -1, df = 198, p-value = 0.3
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -46.4 16.2
#> sample estimates:
#> mean of x mean of y
#> 1054 1069
The large p-value forces you to conclude there is no difference between the groups. Contrast that result with the one that follows from analyzing the same data but correctly identifying it as paired:
t.test(x, y, paired = TRUE)
#>
#> Paired t-test
#>
#> data: x and y
#> t = -18, df = 99, p-value <2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -16.8 -13.5
#> sample estimates:
#> mean of the differences
#> -15.1
The p-value plummets to 2e-16, and we reach the exactly opposite conclusion.
See Also
If the populations are not normally distributed (bell-shaped) and either sample is small, consider using the Wilcoxon–Mann–Whitney test described in Recipe 9.16, “Comparing the Locations of Two Samples Nonparametrically”.
9.16 Comparing the Locations of Two Samples Nonparametrically
Problem
You have samples from two populations. You don’t know the distribution of the populations, but you know they have similar shapes. You want to know: Is one population shifted to the left or right compared with the other?
Solution
You can use a nonparametric test, the Wilcoxon–Mann–Whitney test, which is implemented by the wilcox.test function. For paired observations (every xi is paired with yi), set paired=TRUE:
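A minimal sketch (x and y hold the paired samples):

wilcox.test(x, y, paired = TRUE)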
For unpaired observations, let paired default to FALSE:
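That is:

wilcox.test(x, y)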
The test output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the second population is likely shifted left or right with respect to the first population, whereas a p-value exceeding 0.05 provides no such evidence.
Discussion
When we stop making assumptions regarding the distributions of populations, we enter the world of nonparametric statistics. The Wilcoxon–Mann–Whitney test is nonparametric and so can be applied to more datasets than the t test, which requires that the data be normally distributed (for small samples). This test’s only assumption is that the two populations have the same shape.
In this recipe, we are asking: Is the second population shifted left or right with respect to the first? This is similar to asking whether the average of the second population is smaller or larger than the first. However, the Wilcoxon–Mann–Whitney test answers a different question: it tells us whether the central locations of the two populations are significantly different or, equivalently, whether their relative frequencies are different.
Suppose we randomly select a group of employees and ask each one to complete the same task under two different circumstances: under favorable conditions and under unfavorable conditions, such as a noisy environment. We measure their completion times under both conditions, so we have two measurements for each employee. We want to know if the two times are significantly different, but we can’t assume they are normally distributed.
The data are paired, so we must set paired=TRUE:
load(file = "./data/workers.rdata")
wilcox.test(fav, unfav, paired = TRUE)
#>
#> Wilcoxon signed rank test
#>
#> data: fav and unfav
#> V = 12, p-value = 1e-04
#> alternative hypothesis: true location shift is not equal to 0
The p-value is essentially zero. Statistically speaking, we reject the assumption that the completion times were equal. Practically speaking, it’s reasonable to conclude that the times were different.
In this example, setting paired=TRUE is critical. Treating the data as unpaired would be wrong because the observations are not independent; and this, in turn, would produce bogus results. Running the example with paired=FALSE produces a p-value of 0.1022, which leads to the wrong conclusion.
See Also
See Recipe 9.15, “Comparing the Means of Two Samples”, for the parametric test.
9.17 Testing a Correlation for Significance
Problem
You calculated the correlation between two variables, but you don’t know if the correlation is statistically significant.
Solution
The cor.test function can calculate both the p-value and the confidence interval of the correlation. If the variables came from normally distributed populations, then use the default measure of correlation, which is the Pearson method:
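A minimal call (x and y hold the two variables):

cor.test(x, y)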
For non-normal populations, use the Spearman method instead:
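That is:

cor.test(x, y, method = "spearman")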
The function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant, whereas p > 0.05 indicates it is not.
Discussion
In our experience, people often fail to check a correlation for significance. In fact, many people are unaware that a correlation can be insignificant. They jam their data into a computer, calculate the correlation, and blindly believe the result. However, they should ask themselves: Was there enough data? Is the magnitude of the correlation large enough? Fortunately, the cor.test function answers those questions.
Suppose we have two vectors, x and y, with values from normal populations. We might be very pleased that their correlation is greater than 0.75:
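The vectors themselves are not reproduced here; the naïve check is simply:

cor(x, y)   # about 0.75 in the text's example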
But that is naïve. If we run cor.test, it reports a relatively large p-value of 0.09:
cor.test(x, y)
#>
#> Pearson's product-moment correlation
#>
#> data: x and y
#> t = 2, df = 4, p-value = 0.09
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.155 0.971
#> sample estimates:
#> cor
#> 0.751
The p-value is above the conventional threshold of 0.05, so we conclude that the correlation is unlikely to be significant.
You can also check the correlation by using the confidence interval. In this example, the confidence interval is (–0.155, 0.971). The interval contains zero and so it is possible that the correlation is zero, in which case there would be no correlation. Again, you could not be confident that the reported correlation is significant.
The cor.test output also includes the point estimate reported by cor (at the bottom, labeled “sample estimates”), saving you the additional step of running cor.
By default, cor.test calculates the Pearson correlation, which assumes that the underlying populations are normally distributed. The Spearman method makes no such assumption because it is nonparametric. Use method="spearman" when working with nonnormal data.
See Also
See Recipe 2.6, “Computing Basic Statistics”, for calculating simple correlations.
9.18 Testing Groups for Equal Proportions
Problem
You have samples from two or more groups. The group’s elements are binary-valued: either success or failure. You want to know if the groups have equal proportions of successes.
Solution
Use the prop.test function with two vector arguments:
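The vectors ns and nt are not defined in this excerpt; values consistent with the output shown (an assumption) would be:

ns <- c( 48,  64)   # successes in each group
nt <- c(100, 100)   # trials in each group
prop.test(ns, nt)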
#>
#> 2-sample test for equality of proportions with continuity
#> correction
#>
#> data: ns out of nt
#> X-squared = 5, df = 1, p-value = 0.03
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#> -0.3058 -0.0142
#> sample estimates:
#> prop 1 prop 2
#> 0.48 0.64
These are parallel vectors. The first vector, ns, gives the number of successes in each group. The second vector, nt, gives the size of the corresponding group (often called the number of trials).
The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that it is likely the groups’ proportions are different, whereas a p-value exceeding 0.05 provides no such evidence.
Discussion
In Recipe 9.11, “Testing a Sample Proportion”, we tested a proportion based on one sample. Here, we have samples from several groups and want to compare the proportions in the underlying groups.
One of the authors recently taught statistics to 38 students and awarded a grade of A to 14 of them. A colleague taught the same class to 40 students and awarded an A to only 10. We wanted to know: Is the author fostering grade inflation by awarding significantly more A grades than the other teacher did?
We used prop.test. “Success” means awarding an A, so the vector of successes contains two elements: the number awarded by the author and the number awarded by the colleague. The number of trials is the number of students in the corresponding class:
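That is:

successes <- c(14, 10)   # A grades awarded by the author and the colleague
trials    <- c(38, 40)   # students in each class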
The prop.test output yields a p-value of 0.4:
prop.test(successes, trials)
#>
#> 2-sample test for equality of proportions with continuity
#> correction
#>
#> data: successes out of trials
#> X-squared = 0.8, df = 1, p-value = 0.4
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#> -0.111 0.348
#> sample estimates:
#> prop 1 prop 2
#> 0.368 0.250
The relatively large p-value means that we cannot reject the null hypothesis: the evidence does not suggest any difference between the teachers’ grading.
See Also
See Recipe 9.11, “Testing a Sample Proportion”.
9.19 Performing Pairwise Comparisons Between Group Means
Problem
You have several samples, and you want to perform a pairwise comparison between the sample means. That is, you want to compare the mean of every sample against the mean of every other sample.
Solution
Place all data into one vector and create a parallel factor to identify the groups. Use pairwise.t.test to perform the pairwise comparison of means:
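A schematic call (x is the combined data vector and f is the parallel grouping factor):

pairwise.t.test(x, f)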
The output contains a table of p-values, one for each pair of groups. Conventionally, if p < 0.05 then the two groups likely have different means, whereas p > 0.05 provides no such evidence.
Discussion
This is more complicated than Recipe 9.15, “Comparing the Means of Two Samples”, where we compared the means of two samples. Here we have several samples and want to compare the mean of every sample against the mean of every other sample.
Statistically speaking, pairwise comparisons are tricky. It is not the same as simply performing a t test on every possible pair. The p-values must be adjusted, as otherwise you will get an overly optimistic result. The help pages for pairwise.t.test and p.adjust describe the adjustment algorithms available in R. Anyone doing serious pairwise comparisons is urged to review the help pages and consult a good textbook on the subject.
Suppose we are using a larger sample of the data from Recipe 5.5, “Combining Multiple Vectors into One Vector and a Factor”, where we combined data for freshmen, sophomores, and juniors into a data frame called comb. The data frame has two columns: the data in a column called values, and the grouping factor in a column called ind. We can use pairwise.t.test to perform pairwise comparisons between the groups:
pairwise.t.test(comb$values, comb$ind)
#>
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: comb$values and comb$ind
#>
#> fresh soph
#> soph 0.001 -
#> jrs 3e-04 0.592
#>
#> P value adjustment method: holm
Notice the table of p-values. The comparisons of sophomores versus freshmen and of juniors versus freshmen produced small p-values: 0.001 and 0.0003, respectively. We can conclude there are significant differences between those groups. However, the comparison of sophomores versus juniors produced a (relatively) large p-value of 0.592, so they are not significantly different.
See Also
See Recipes 5.5, “Combining Multiple Vectors into One Vector and a Factor” and 9.15, “Comparing the Means of Two Samples”.
9.20 Testing Two Samples for the Same Distribution
Problem
You have two samples, and you are wondering: Did they come from the same distribution?
Solution
The Kolmogorov–Smirnov test compares two samples and tests them for being drawn from the same distribution. The ks.test function implements that test:
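A minimal call (x and y hold the two samples):

ks.test(x, y)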
The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the two samples (x and y) were drawn from different distributions, whereas a p-value exceeding 0.05 provides no such evidence.
Discussion
The Kolmogorov–Smirnov test is wonderful for two reasons. First, it is a nonparametric test and so you needn’t make any assumptions regarding the underlying distributions: it works for all distributions. Second, it checks the location, dispersion, and shape of the populations, based on the samples. If these characteristics disagree then the test will detect that, allowing us to conclude that the underlying distributions are different.
Suppose we suspect that the vectors x and y come from differing distributions. Here, ks.test reports a p-value of 0.04:
ks.test(x, y)
#>
#> Two-sample Kolmogorov-Smirnov test
#>
#> data: x and y
#> D = 0.2, p-value = 0.04
#> alternative hypothesis: two-sided
From the small p-value we can conclude that the samples are from different distributions. However, when we test x against another sample, z, the p-value is much larger (0.6); this suggests that x and z could have the same underlying distribution:
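A sketch of that comparison (the sample z is not reproduced in this excerpt):

ks.test(x, z)   # p-value of about 0.6 in the text's example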