13 Beyond Basic Numerics and Statistics

Introduction

This chapter presents a few advanced techniques such as those you might encounter in the first or second year of a graduate program in applied statistics.

Most of these recipes use functions available in the base distribution. Through add-on packages, R provides some of the world’s most advanced statistical techniques. This is because researchers in statistics now use R as their lingua franca, showcasing their newest work. Anyone looking for a cutting-edge statistical technique is urged to search CRAN and the web for possible implementations.

13.1 Minimizing or Maximizing a Single-Parameter Function

Problem

Given a single-parameter function f, you want to find the point at which f reaches its minimum or maximum.

Solution

To minimize a single-parameter function, use optimize. Specify the function to be minimized and the bounds for its domain (x):
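
A minimal sketch, using a toy function and arbitrary bounds as stand-ins for your own:

f <- function(x) (x - 2)^2           # toy function; replace with your own
optimize(f, lower = 0, upper = 5)    # returns $minimum and $objective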

If you instead want to maximize the function, specify maximum = TRUE:
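
For example, reusing the toy function above:

optimize(f, lower = 0, upper = 5, maximum = TRUE)    # returns $maximum and $objective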

Discussion

The optimize function can handle functions of one argument. It requires upper and lower bounds for x that delimit the region to be searched. The following example finds the minimum of a polynomial, 3x⁴ – 2x³ + 3x² – 4x + 5:
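
A sketch of that calculation (the bounds here are arbitrary but comfortably wide):

f <- function(x) 3 * x^4 - 2 * x^3 + 3 * x^2 - 4 * x + 5
optimize(f, lower = -20, upper = 20)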

The returned value is a list with two elements: minimum, the x value that minimizes the function; and objective, the value of the function at that point.

A tighter range for lower and upper means a smaller region to be searched and hence a faster optimization. However, if you are unsure of the appropriate bounds, then use big but reasonable values such as lower = -1000 and upper = 1000. Just be careful that your function does not have multiple minima within that range! The optimize function will find and return only one such minimum.

13.2 Minimizing or Maximizing a Multiparameter Function

Problem

Given a multiparameter function f, you want to find the point at which f reaches its minimum or maximum.

Solution

To minimize a multiparameter function, use optim. You must specify the starting point, which is a vector of initial arguments for f:
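
A minimal sketch; the toy two-parameter function and the starting point (1, 1) are stand-ins for your own:

f <- function(v) sum((v - c(3, -2))^2)    # toy function of a two-element vector
optim(c(1, 1), f)                         # begin the search at (1, 1)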

To maximize the function instead, specify this control parameter:
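
For example, with a made-up function g that has a maximum at (3, –2):

g <- function(v) -sum((v - c(3, -2))^2)
optim(c(1, 1), g, control = list(fnscale = -1))    # fnscale = -1 flips optim into maximizing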

Discussion

The optim function is more general than optimize (see Recipe 13.1, “Minimizing or Maximizing a Single-Parameter Function”) because it handles multiparameter functions. To evaluate your function at a point, optim packs the point’s coordinates into a vector and calls your function with that vector. The function should return a scalar value. optim will begin at your starting point and move through the parameter space, searching for the function’s minimum.

Here is an example of using optim to fit a nonlinear model. Suppose you believe that the paired observations z and x are related by zᵢ = (xᵢ + α)^β + εᵢ, where α and β are unknown parameters and where the εᵢ are nonnormal noise terms. Let’s fit the model by minimizing a robust metric, the sum of the absolute deviations:

∑ |zᵢ – (xᵢ + a)^b|

First we define the function to be minimized. Note that the function has only one formal parameter, a two-element vector. The actual parameters to be evaluated, a and b, are packed into the vector in locations 1 and 2:
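
A sketch of such a function, assuming the observed values are stored in vectors x and z:

f <- function(v) {
  a <- v[1]
  b <- v[2]
  sum(abs(z - (x + a)^b))    # sum of absolute deviations
}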

The following code makes a call to optim, starts from (1, 1), and searches for the minimum point of f:
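
For example (x and z are the data vectors assumed above; we do not reproduce the original data here):

optim(c(1, 1), f)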

The returned list includes convergence, the indicator of success or failure. If this indicator is 0, then optim found a minimum; otherwise, it did not. Obviously, the convergence indicator is the most important returned value because other values are meaningless if the algorithm did not converge.

The returned list also includes par, the parameters that minimize our function, and value, the value of f at that point. In this case, optim did converge and found a minimum point at approximately a = 10.0 and b = 0.7.

There are no lower and upper bounds for optim, just the starting point that you provide. A better guess for the starting point means a faster minimization.

The optim function supports several different minimization algorithms, and you can select among them. If the default algorithm does not work for you, see the help page for alternatives. A typical problem with multidimensional minimization is that the algorithm gets stuck at a local minimum and fails to find a deeper, global minimum. Generally speaking, the algorithms that are more powerful are less likely to get stuck. However, there is a trade-off: they also tend to run more slowly.

See Also

The R community has implemented many tools for optimization. On CRAN, see the task view for Optimization and Mathematical Programming for more solutions.

13.3 Calculating Eigenvalues and Eigenvectors

Problem

You want to calculate the eigenvalues or eigenvectors of a matrix.

Solution

Use the eigen function. It returns a list with two elements, values and vectors, which contain (respectively) the eigenvalues and eigenvectors.

Discussion

Suppose we have a matrix such as the Fibonacci matrix:
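
One way to construct it:

fibmat <- matrix(c(0, 1, 1, 1), 2, 2)    # a 2 x 2 Fibonacci matrix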

Given the matrix, the eigen function will return a list of its eigenvalues and eigenvectors:
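
For example, using the matrix constructed above:

eigen(fibmat)    # a list with $values and $vectors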

Use either eigen(fibmat)$values or eigen(fibmat)$vectors to select the needed value from the list.

13.4 Performing Principal Component Analysis

Problem

You want to identify the principal components of a multivariable dataset.

Solution

Use the prcomp function. The first argument is a formula whose righthand side is the set of variables, separated by plus signs (+). The lefthand side is empty:
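
A minimal sketch with three made-up variables standing in for your own:

x <- rnorm(30)
y <- x + rnorm(30)
z <- rnorm(30)
r <- prcomp(~ x + y + z)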

Discussion

The base software for R includes two functions for principal component analysis (PCA), prcomp and princomp. The documentation mentions that prcomp has better numerical properties, so that’s the function presented here.

An important use of PCA is to reduce the dimensionality of your dataset. Suppose your data contains a large number N of variables. Ideally, all the variables are more or less independent and contribute equally. But if you suspect that some variables are redundant, PCA can tell you the number of sources of variance in your data. If that number is near N, then all the variables are useful. If the number is less than N, then your data can be reduced to a dataset of smaller dimensionality.

The PCA recasts your data into a vector space where the first dimension captures the most variance, the second dimension captures the second most, and so forth. The actual output from prcomp is an object that, when printed, gives the needed vector rotation:
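
For example, printing the prcomp object r from the sketch above:

print(r)    # shows the component standard deviations and the rotation matrix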

We typically find the summary of PCA much more useful. It shows the proportion of variance that is captured by each component:
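
For example, again with the sketch’s r:

summary(r)    # shows the proportion of variance and the cumulative proportion per component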

In this example, the first component captured 85% of the variance and the second component only 15%, so we know the first component captured most of it.

After calling prcomp, use plot(r) to view a bar chart of the variances of the principal components and predict(r) to rotate your data to the principal components.

See Also

See Recipe 13.9, “Factor Analysis” for an example of using principal component analysis. Further uses of PCA in R are discussed in Modern Applied Statistics with S-Plus by W. N. Venables and B. D. Ripley (Springer).

13.5 Performing Simple Orthogonal Regression

Problem

You want to create a linear model using orthogonal regression in which the variances of x and y are treated symmetrically.

Solution

Use prcomp to perform PCA on x and y. From the resulting rotation, compute the slope and intercept:
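
A minimal sketch, assuming the data are in paired vectors x and y:

r <- prcomp(~ x + y)
slope <- r$rotation[2, 1] / r$rotation[1, 1]      # loading of y on PC1 divided by loading of x
intercept <- r$center[2] - slope * r$center[1]    # r$center holds the means of x and y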

Discussion

Orthogonal regression is also known as total least squares (TLS).

The ordinary least squares (OLS) algorithm has an odd property: it is asymmetric. That is, calculating lm(y ~ x) is not the mathematical inverse of calculating lm(x ~ y). The reason is that OLS assumes the x values to be constants and the y values to be random variables, so all the variance is attributed to y and none is attributed to x. This creates an asymmetric situation.


Figure 13.1: Ordinary least squares versus orthogonal regression

The asymmetry is illustrated in Figure 13.1, where the upper-left panel displays the fit for lm(y ~ x). The OLS algorithm tries to minimize the vertical distances, which are shown as dotted lines. The upper-right panel shows the identical dataset fit with lm(x ~ y) instead, so the algorithm is minimizing the horizontal dotted lines. Obviously, you’ll get a different result depending upon which distances are minimized.

The lower panel of Figure 13.1 is quite different. It uses PCA to implement orthogonal regression. Now the distances being minimized are the orthogonal distances from the data points to the regression line. This is a symmetric situation: reversing the roles of x and y does not change the distances to be minimized.

Implementing a basic orthogonal regression in R is quite simple. First, perform the PCA:
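
Continuing the sketch from the Solution (x and y are the paired data vectors):

r <- prcomp(~ x + y)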

Next, use the rotations to compute the slope:
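
One way to do that, using the loadings of the first principal component:

slope <- r$rotation[2, 1] / r$rotation[1, 1]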

and then, from the slope, calculate the intercept:
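
For example, using the variable means that prcomp stores in r$center:

intercept <- r$center[2] - slope * r$center[1]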

We call this a “basic” regression because it yields only the point estimates for the slope and intercept, not the confidence intervals. Obviously, we’d like to have the regression statistics, too. Recipe 13.8, “Bootstrapping a Statistic”, shows one way to estimate the confidence intervals using a bootstrap algorithm.

See Also

Principal component analysis is described in Recipe 13.4, “Performing Principal Component Analysis”. The graphics in this recipe were inspired by the work of Vincent Zoonekynd and his tutorial on regression.

13.6 Finding Clusters in Your Data

Problem

You believe your data contains clusters: groups of points that are “near” each other. You want to identify those clusters.

Solution

Your dataset, x, can be a vector, data frame, or matrix. Assume that n is the number of clusters you desire:
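
A minimal sketch, with x and n as the placeholders just described:

d <- dist(x)                  # pairwise distances between observations
hc <- hclust(d)               # hierarchical clustering
clust <- cutree(hc, k = n)    # cut the tree into n clusters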

The result, clust, is a vector of numbers between 1 and n, one for each observation in x. Each number classifies its corresponding observation into one of the n clusters.

Discussion

The dist function computes distances between all the observations. The default is Euclidean distance, which works well for many applications, but other distance measures are also available.

The hclust function uses those distances to form the observations into a hierarchical tree of clusters. You can plot the result of hclust to create a visualization of the hierarchy, called a dendrogram, as shown in Figure 13.2.

Finally, cutree extracts clusters from that tree. You must specify either how many clusters you want or the height at which the tree should be cut. Often the number of clusters is unknown, in which case you will need to explore the dataset for clustering that makes sense in your application.

We’ll illustrate clustering of a synthetic dataset. We start by generating 99 normal variates, each with a randomly selected mean of either –3, 0, or +3:
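
A sketch of one way to generate such data (the original code may differ in details):

means <- sample(c(-3, 0, +3), 99, replace = TRUE)    # randomly chosen true means
x <- means + rnorm(99)                               # normal variates around those means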

For our own curiosity, we can compute the true means of the original clusters. (In a real situation, we would not have the means factor and would be unable to perform this computation.) We can confirm that the groups’ means are pretty close to –3, 0, and +3:
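
For example, grouping x by the means used to generate it:

tapply(x, factor(means), mean)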

To “discover” the clusters, we first compute the distances between all points:
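
Continuing the sketch:

d <- dist(x)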

Then we create the hierarchical clusters:
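
For example:

hc <- hclust(d)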

And we can plot the hierarchical cluster dendrogram by calling plot on the hc object:
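
One way to draw it (labels are suppressed because 99 leaf labels would be unreadable):

plot(hc, sub = "", labels = FALSE)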


Figure 13.2: Hierarchical cluster dendrogram

Guided by the dendrogram in Figure 13.2, we can now extract the three largest clusters. Obviously, we have a huge advantage here because we know the true number of clusters; real life is rarely that easy. However, even if we didn’t already know we were dealing with three clusters, the dendrogram gives us a good clue that there are three big clusters in the data.
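
A sketch of the extraction, using hc from above:

clust <- cutree(hc, k = 3)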

clust is a vector of integers between 1 and 3, one integer for each observation in the sample, that assigns each observation to a cluster. Here are the first 20 cluster assignments:
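
For example:

head(clust, 20)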

By treating the cluster number as a factor, we can compute the mean of each identified cluster (see Recipe 6.6, “Applying a Function to Groups of Data”):
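
For example (clust and x from the sketch above):

tapply(x, factor(clust), mean)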

R did a good job of splitting the data into clusters: the means appear distinct, with one near –2.7, one near 0.27, and one near +3.2. (The order of the extracted means does not necessarily match the order of the original groups, of course.) The extracted means are similar but not identical to the original means. Side-by-side boxplots can show why:
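
One way to draw such a comparison with base graphics (the original figure may have been produced differently):

par(mfrow = c(1, 2))                                      # two panels side by side
boxplot(x ~ factor(means), main = "Original groups")
boxplot(x ~ factor(clust), main = "Identified clusters")
par(mfrow = c(1, 1))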


Figure 13.3: Cluster boxplots

The boxplots are shown in Figure 13.3. The clustering algorithm perfectly separated the data into nonoverlapping groups. The original clusters overlapped, whereas the identified clusters do not.

This illustration used one-dimensional data, but the dist function works equally well on multidimensional data stored in a data frame or matrix. Each row in the data frame or matrix is treated as one observation in a multidimensional space, and dist computes the distances between those observations.

See Also

This demonstration uses the clustering functions dist, hclust, and cutree from the stats package, which is part of the base R distribution. Other packages, such as mclust, offer alternative clustering mechanisms.

13.7 Predicting a Binary-Valued Variable (Logistic Regression)

Problem

You want to perform logistic regression, a regression model that predicts the probability of a binary event occurring.

Solution

Call the glm function with family = binomial to perform logistic regression. The result is a model object:
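
A minimal sketch; b, x1, x2, and x3 are the placeholders described next:

m <- glm(b ~ x1 + x2 + x3, family = binomial)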

Here, b is a factor with two levels (e.g., TRUE and FALSE, or 0 and 1), while x1, x2, and x3 are predictor variables.

Use the model object, m, and the predict function to predict a probability from new data:
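
A sketch, where dfrm stands in for a data frame of new values for x1, x2, and x3:

prob <- predict(m, newdata = dfrm, type = "response")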

Discussion

Predicting a binary-valued outcome is a common problem in modeling. Will a treatment be effective or not? Will prices rise or fall? Who will win the game, team A or team B? Logistic regression is useful for modeling these situations. In the true spirit of statistics, it does not simply give a “thumbs up” or “thumbs down” answer; rather, it computes a probability for each of the two possible outcomes.

In the call to predict, we set type = "response" so that predict returns a probability. Otherwise, it returns log-odds, which most of us don’t find intuitive.

In his unpublished book entitled Practical Regression and ANOVA Using R, Julian Faraway gives an example of predicting a binary-valued variable: test from the dataset pima is true if the patient tested positive for diabetes. The predictors are diastolic blood pressure and body mass index (BMI). Faraway uses linear regression, so let’s try logistic regression instead:
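
A sketch of that fit, assuming the pima dataset is loaded from the faraway package:

data(pima, package = "faraway")
b <- factor(pima$test)
m <- glm(b ~ diastolic + bmi, family = binomial, data = pima)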

The summary of the resulting model, m, shows that the respective p-values for the diastolic and the bmi variables are 0.8 and (essentially) zero. We can therefore conclude that only the bmi variable is significant:
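
For example, inspecting the fitted model from the sketch above:

summary(m)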

Because only the bmi variable is significant, we can create a reduced model like this:
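
For example (the name m.red is just illustrative):

m.red <- glm(b ~ bmi, family = binomial, data = pima)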

Let’s use the model to calculate the probability that someone with an average BMI (32.0) will test positive for diabetes:
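
A sketch of that calculation, using the reduced model from above:

newdata <- data.frame(bmi = 32.0)
predict(m.red, newdata = newdata, type = "response")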

According to this model, the probability is about 33.3%. The same calculation for someone at the 90th percentile of BMI gives a probability of 54.9%:
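
For instance (a sketch, taking the 90th percentile of bmi in the pima data):

newdata <- data.frame(bmi = quantile(pima$bmi, probs = 0.90))
predict(m.red, newdata = newdata, type = "response")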

See Also

Using logistic regression involves interpreting the deviance to judge the significance of the model. We suggest you review a text on logistic regression before attempting to draw any conclusions from your regression.

13.8 Bootstrapping a Statistic

Problem

You have a dataset and a function to calculate a statistic from that dataset. You want to estimate a confidence interval for the statistic.

Solution

Use the boot package. Apply the boot function to calculate bootstrap replicates of the statistic:
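
A minimal sketch; data stands in for your dataset, and the median is only an illustrative statistic:

library(boot)

bootfun <- function(data, indices) {
  median(data[indices])                 # compute your statistic on the bootstrap sample
}
reps <- boot(data, bootfun, R = 999)    # 999 bootstrap replications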

Here, data is your original dataset, which can be stored in either a vector or a data frame. The statistic function (bootfun in this case) should expect two arguments: data is the original dataset, and indices is a vector of integers that selects the bootstrap sample from data.

Next, use the boot.ci function to estimate a confidence interval from the replications:
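
For example, requesting the percentile and BCa intervals:

boot.ci(reps, type = c("perc", "bca"))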

Discussion

Anybody can calculate a statistic, but that’s just the point estimate. We want to take it to the next level: What is the confidence interval (CI)? For some statistics, we can calculate the CI analytically. The CI for a mean, for instance, is calculated by the t.test function. Unfortunately, that is the exception and not the rule. For most statistics, the mathematics is too tortuous or simply unknown, and there is no known closed-form calculation for the CI.

The bootstrap algorithm can estimate a CI even when no closed-form calculation is available. It works like this. The algorithm assumes that you have a sample of size N and a function to calculate the statistic:

  1. Randomly select N elements from the sample, sampling with replacement. That set of elements is called a bootstrap sample.

  2. Apply the function to the bootstrap sample to calculate the statistic. That value is called a bootstrap replication.

  3. Repeat steps 1 and 2 many times to yield many (typically thousands) of bootstrap replications.

  4. From the bootstrap replications, compute the confidence interval.

That last step may seem mysterious, but there are several algorithms for computing the CI. A simple one uses percentiles of the replications, such as taking the 2.5 percentile and the 97.5 percentile to form the 95% CI.
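
For instance, with a boot object reps as in the Solution, a rough percentile interval is just (a sketch):

quantile(reps$t, probs = c(0.025, 0.975))    # reps$t holds the bootstrap replications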

We’re huge fans of the bootstrap: we work daily with obscure statistics, we need to know their confidence intervals, and there is no known closed-form formula for them. The bootstrap gives us a good approximation.

Let’s work an example. In Recipe 13.5, “Performing Simple Orthogonal Regression”, we estimated the slope of a line using orthogonal regression. That gave us a point estimate, but how can we find the CI? First, we encapsulate the slope calculation within a function:
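
A sketch of such a function (the name boot_slope is ours; data is expected to be a data frame with columns x and y):

boot_slope <- function(data, indices) {
  r <- prcomp(~ x + y, data = data[indices, ])
  r$rotation[2, 1] / r$rotation[1, 1]    # the slope, as in Recipe 13.5
}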

Notice that the function is careful to select the subset defined by indices and to compute the slope from that exact subset.

Next, we calculate 999 replications of the slope. Recall that we had two vectors, x and y, in the original recipe; here, we combine them into a data frame:
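
For example, with the boot_slope function from above and the original x and y vectors:

library(boot)
reps <- boot(data.frame(x = x, y = y), boot_slope, R = 999)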

The choice of 999 replications is a good starting point. You can always repeat the bootstrap with more and see if the results change significantly.

The boot.ci function can estimate the CI from the replications. It implements several different algorithms, and the type argument selects which algorithms are performed. For each selected algorithm, boot.ci will return the resulting estimate:
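
For example, with the replications computed above:

boot.ci(reps, type = c("perc", "bca"))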

Here we chose two algorithms, percentile and BCa, by setting type = c("perc","bca"). The two resulting estimates appear at the bottom under their names. Other algorithms are available; see the help page for boot.ci.

You will note that the two confidence intervals are slightly different: (1.068, 1.992) versus (1.086, 2.050). This is an uncomfortable but inevitable result of using two different algorithms. We don’t know any method for deciding which is better. If the selection is a critical issue, you will need to study the reference and understand the differences. In the meantime, our best advice is to be conservative and use the minimum lower bound and the maximum upper bound; in this case, that would be (1.068, 2.050).

By default, boot.ci estimates a 95% CI. You can change that via the conf argument, like this:
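
For example, for a 90% CI:

boot.ci(reps, type = c("perc", "bca"), conf = 0.90)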

See Also

See Recipe 13.5, “Performing Simple Orthogonal Regression”, for the slope calculation. A good tutorial and reference for the bootstrap algorithm is An Introduction to the Bootstrap by Bradley Efron and Robert Tibshirani (Chapman & Hall/CRC).

13.9 Factor Analysis

Problem

You want to perform factor analysis on your dataset, usually to discover what your variables have in common.

Solution

Use the factanal function, which requires your dataset and your estimate of the number of factors:
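
A minimal sketch; data is your data frame or matrix and n is your estimate of the number of factors:

factanal(data, factors = n)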

The output includes n factors, showing the loadings of each input variable for each factor.

The output also includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the number of factors is too small and does not capture the full dimensionality of the dataset; a p-value exceeding 0.05 indicates that there are likely enough (or more than enough) factors.

Discussion

Factor analysis creates linear combinations of your variables, called factors, that abstract the variables’ underlying commonality. If your N variables are perfectly independent, then they have nothing in common and N factors are required to describe them. But to the extent that the variables have an underlying commonality, fewer factors capture most of the variance and so fewer than N factors are required.

For each factor and variable, we calculate the correlation between them, known as the loading. Variables with a high loading are well explained by the factor. We can square the loading to know what fraction of the variable’s total variance is explained by the factor.

Factor analysis is useful when it shows that a few factors capture most of the variance of your variables. Thus, it alerts you to redundancy in your data. In that case you can reduce your dataset by combining closely related variables or by eliminating redundant variables altogether.

A more subtle application of factor analysis is interpreting the factors to find interrelationships between your variables. If two variables both have large loadings for the same factor, then you know they have something in common. What is it? There is no mechanical answer. You’ll need to study the data and its meaning.

There are two tricky aspects of factor analysis. The first is choosing the number of factors. Fortunately, you can use PCA to get a good initial estimate of the number of factors. The second tricky aspect is interpreting the factors themselves.

Let’s illustrate factor analysis by using stock prices or, more precisely, changes in stock prices. The dataset contains six months of price changes for the stocks of 12 companies. Every company is involved in the petroleum and gasoline industry. Their stock prices probably move together, since they are subject to similar economic and market forces. We might ask: How many factors are required to explain their changes? If only one factor is required, then all the stocks are the same and one is as good as another. If many factors are required, we know that owning several of them provides diversification.

We start by doing a PCA on diffs, the data frame of price changes. Plotting the PCA results shows the variance captured by the components:
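
A sketch of that step (diffs is the data frame of price changes described above):

plot(prcomp(diffs))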


Figure 13.4: PCA results plot

We can see in Figure 13.4 that the first component captures much of the variance, but we don’t know if more components are required. So we perform the initial factor analysis while assuming that two factors are required:
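
A sketch of that call:

factanal(diffs, factors = 2)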

We can ignore most of the output because the p-value at the bottom is very close to zero (.00184). The small p-value indicates that two factors are insufficient to capture the dataset’s full dimensionality, so we try again with three factors instead:
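
A sketch of that call:

factanal(diffs, factors = 3)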

The large p-value (0.218) confirms that three factors are sufficient, so we can use the analysis.

The output includes a table of explained variance, shown here:

               Factor1 Factor2 Factor3
SS loadings      2.814   1.774   0.951
Proportion Var   0.256   0.161   0.086
Cumulative Var   0.256   0.417   0.504

This table shows that the proportion of variance explained by each factor is 0.256, 0.161, and 0.086, respectively. Cumulatively, they explain 0.504 of the variance, which leaves 1 – 0.504 = 0.496 unexplained.

Next we want to interpret the factors, which is more like voodoo than science. Let’s look at the loadings, repeated here:

Loadings:
    Factor1 Factor2 Factor3
APC  0.747   0.270   0.230 
BP   0.298   0.459   0.224 
BRY                  0.123 
CVX  0.442   0.672   0.197 
HES  0.589   0.299   0.434 
MRO  0.703   0.350   0.167 
NBL  0.760   0.249   0.124 
OXY  0.592   0.357         
ETP  0.194           0.489 
VLO  0.198   0.264   0.535 
XOM  0.355   0.753   0.190

Each row is labeled with the variable name (stock symbol): APC, BP, BRY, and so forth. The first factor has many large loadings, indicating that it explains the variance of many stocks. This is a common phenomenon in factor analysis. We are often looking at related variables, and the first factor captures their most basic relationship. In this example, we are dealing with stocks, and most stocks move together in concert with the broad market. That’s probably captured by the first factor.

The second factor is more subtle. Notice that the loadings for CVX (0.67) and XOM (0.75) are the dominant ones, with BP not far behind (0.46), but all other stocks have noticeably smaller loadings. This indicates a connection between CVX, XOM, and BP. Perhaps they operate together in a common market (e.g., multinational energy) and so tend to move together.

The third factor also has three dominant loadings: VLO, ETP, and HES. These are somewhat smaller companies than the global giants we see in the first factor. Possibly these three share similar markets or risks and so their stocks also tend to move together.

In summary, it seems there are three groups of stocks here:

  • CVX, XOM, BP

  • VLO, ETP, HES

  • Everything else

Factor analysis is an art and a science. We suggest that you read a good book on multivariate analysis before employing it.

See Also

See Recipe 13.4, “Performing Principal Component Analysis”, for more about PCA.