7 Strings and Dates
Introduction
Strings? Dates? In a statistical programming package?
As soon as you read files or print reports, you need strings. When you work with real-world problems, you need dates.
R has facilities for both strings and dates. They are clumsy compared to string-oriented languages such as Perl, but then it’s a matter of the right tool for the job. We wouldn’t want to perform logistic regression in Perl.
Some of this clunkiness with strings and dates has been improved through the tidyverse packages stringr and lubridate. As with other chapters in this book, the examples here will pull from both base R and add-on packages that make life easier, faster, and more convenient.
Classes for Dates and Times
R has a variety of classes for working with dates and times, which is nice if you prefer having a choice but annoying if you prefer living simply. There is a critical distinction among the classes: some are date-only classes, some are datetime classes. All classes can handle calendar dates (e.g., March 15, 2019), but not all can represent a datetime (e.g., 11:45 AM on March 1, 2019).
The following classes are included in the base distribution of R:
Date
The Date class can represent a calendar date but not a clock time. It is a solid, general-purpose class for working with dates, including conversions, formatting, basic date arithmetic, and time-zone handling. Most of the date-related recipes in this book are built on the Date class.
POSIXct
This is a datetime class, and it can represent a moment in time with an accuracy of one second. Internally, the datetime is stored as the number of seconds since January 1, 1970, and so it’s a very compact representation. This class is recommended for storing datetime information (e.g., in data frames).
POSIXlt
This is also a datetime class, but the representation is stored in a nine-element list that includes the year, month, day, hour, minute, and second. This representation makes it easy to extract date parts, such as the month or hour. Obviously, this is much less compact than the
POSIXct
class; hence, it is normally used for intermediate processing and not for storing data.
The base distribution also provides functions for easily converting
between representations: as.Date
, as.POSIXct
, and as.POSIXlt
.
The following helpful packages are available for downloading from CRAN:
chron
The chron package can represent both dates and times but without the added complexities of handling time zones and Daylight Saving Time. It’s therefore easier to use than Date but less powerful than POSIXct and POSIXlt. It would be useful for work in econometrics or time series analysis.
lubridate
This is a tidyverse package designed to make working with dates and times easier while keeping the important bells and whistles such as time zones. It’s especially clever regarding datetime arithmetic. This package introduces some helpful constructs like durations, periods, and intervals.
lubridate is part of the tidyverse, so it is installed when you run install.packages('tidyverse'), but it is not part of “core tidyverse,” so it does not get loaded when you run library(tidyverse). This means you must explicitly load it by running library(lubridate).
mondate
This is a specialized package for handling dates in units of months in addition to days and years. It can be helpful in accounting and actuarial work, for example, where month-by-month calculations are needed.
timeDate
This is a high-powered package with well-thought-out facilities for handling dates and times, including date arithmetic, business days, holidays, conversions, and generalized handling of time zones. It was originally part of the
Rmetrics
software for financial modeling, where precision in dates and times is critical. If you have a demanding need for date facilities, consider this package.
Which class should you select? The article “Date and Time Classes in R” by Gabor Grothendieck and Thomas Petzoldt offers this general advice:
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
See Also
See help(DateTimeClasses)
for more details regarding the built-in
facilities. See the June 2004 article
“Date and Time Classes in R”
by Gabor Grothendieck and Thomas Petzoldt for a great introduction to the date
and time facilities.
The June 2001 article
“Date-Time Classes”
by Brian Ripley and Kurt Hornik discusses the two POSIX classes in particular.
Chapter 16, “Dates and Times”, from the book R for Data Science by Garrett Grolemund and Hadley Wickham provides a great intro to lubridate
.
7.1 Getting the Length of a String
Problem
You want to know the length of a string.
Solution
Use the nchar
function, not the length
function.
Discussion
The nchar
function takes a string and returns the number of characters
in the string:
If you apply nchar
to a vector of strings, it returns the length of
each string:
You might think the length
function returns the length of a string.
Nope. It returns the length of a vector. When you apply the length
function to a single string, R returns the value 1
because it views that
string as a singleton vector—a vector with one element:
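As a minimal sketch of the difference, the following compares nchar and length on the same data. The variable names (s, words, and so on) are our own, chosen only for illustration:

```r
# nchar counts characters; length counts vector elements.
s <- "Moe"
n_chars <- nchar(s)            # characters in a single string
n_chars
#> [1] 3

words <- c("Moe", "Larry", "Curly")
word_lengths <- nchar(words)   # one count per element of the vector
word_lengths
#> [1] 3 5 5

n_elements <- length(s)        # a single string is a one-element vector
n_elements
#> [1] 1
```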
7.2 Concatenating Strings
Problem
You want to join together two or more strings into one string.
Solution
Use the paste
function.
Discussion
The paste
function concatenates several strings together. In other
words, it creates a new string by joining the given strings end to end:
By default, paste
inserts a single space between pairs of strings,
which is handy if that’s what you want and annoying otherwise. The sep
argument lets you specify a different separator. Use an empty string
(""
) to run the strings together without separation:
paste("Everybody", "loves", "stats.", sep = "-")
#> [1] "Everybody-loves-stats."
paste("Everybody", "loves", "stats.", sep = "")
#> [1] "Everybodylovesstats."
It’s a common idiom to want to concatenate strings together with no separator at all. The function paste0
makes this very convenient:
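For instance, this short sketch reuses the strings from the previous example; the variable name joined is ours:

```r
# paste0 behaves like paste(..., sep = "")
joined <- paste0("Everybody", "loves", "stats.")
joined
#> [1] "Everybodylovesstats."
```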
The function is very forgiving about nonstring arguments. It tries to
convert them to strings using the as.character
function silently behind the scenes:
paste("The square root of twice pi is approximately", sqrt(2 * pi))
#> [1] "The square root of twice pi is approximately 2.506628274631"
If one or more arguments are vectors of strings, paste
will combine the arguments element by element, recycling the shorter
arguments as needed, and return one string per element:
stooges <- c("Moe", "Larry", "Curly")
paste(stooges, "loves", "stats.")
#> [1] "Moe loves stats." "Larry loves stats." "Curly loves stats."
Sometimes you want to join even those combinations into one big string.
The collapse
parameter lets you define a top-level separator and
instructs paste
to concatenate the generated strings using that
separator:
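A small sketch of collapse, assuming the stooges vector from the previous example and a separator string of our own choosing:

```r
stooges <- c("Moe", "Larry", "Curly")

# First paste builds one string per stooge; collapse then joins
# those strings into a single string using ", and " as the separator.
one_string <- paste(stooges, "loves", "stats", collapse = ", and ")
one_string
#> [1] "Moe loves stats, and Larry loves stats, and Curly loves stats"
```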
7.3 Extracting Substrings
Problem
You want to extract a portion of a string according to position.
Solution
Use substr(string, start, end) to extract the substring that begins at start and ends at end.
Discussion
The substr
function takes a string, a starting point, and an ending
point. It returns the substring between the starting and ending points:
substr("Statistics", 1, 4) # Extract first 4 characters
#> [1] "Stat"
substr("Statistics", 7, 10) # Extract last 4 characters
#> [1] "tics"
Just like many R functions, substr
lets the first argument be a vector
of strings. In that case, it applies itself to every string and returns
a vector of substrings:
ss <- c("Moe", "Larry", "Curly")
substr(ss, 1, 3) # Extract first 3 characters of each string
#> [1] "Moe" "Lar" "Cur"
In fact, all the arguments can be vectors, in which case substr
will
treat them as parallel vectors. From each string, it extracts the
substring delimited by the corresponding entries in the starting and
ending points. This can facilitate some useful tricks. For example, the
following code snippet extracts the last two characters from each
string; each substring starts on the penultimate character of the
original string and ends on the final character:
cities <- c("New York, NY", "Los Angeles, CA", "Peoria, IL")
substr(cities, nchar(cities) - 1, nchar(cities))
#> [1] "NY" "CA" "IL"
You can extend this trick into mind-numbing territory by exploiting the Recycling Rule, but we suggest you avoid the temptation.
7.4 Splitting a String According to a Delimiter
Problem
You want to split a string into substrings. The substrings are separated by a delimiter.
Solution
Use strsplit
, which takes two arguments: the string and the delimiter
of the substrings:
The delimiter
can be either a simple string or a regular expression.
Discussion
It is common for a string to contain multiple substrings separated by
the same delimiter. One example is a filepath, whose components are
separated by slashes (/
):
We can split that path into its components by using strsplit
with a
delimiter of /
:
Notice that the first “component” is actually an empty string because nothing preceded the first slash.
Also notice that strsplit
returns a list and that each element of the
list is a vector of substrings. This two-level structure is necessary
because the first argument can be a vector of strings. Each string is
split into its substrings (a vector), and then those vectors are returned in
a list.
If you are operating only on a single string, you can pop out the first element like this:
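The single-string case can be sketched as follows. The path is the first one used in the three-path example in this recipe, and the variable names path and components are ours:

```r
path <- "/home/mike/data/trials.csv"

# strsplit always returns a list, even for one input string;
# [[1]] pulls out the vector of components for that string.
components <- strsplit(path, "/")[[1]]
components
#> [1] ""           "home"       "mike"       "data"       "trials.csv"

length(components)   # the leading empty string counts as a component
#> [1] 5
```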
This example splits three filepaths and returns a three-element list:
paths <- c(
"/home/mike/data/trials.csv",
"/home/mike/data/errors.csv",
"/home/mike/corr/reject.doc"
)
strsplit(paths, "/")
#> [[1]]
#> [1] "" "home" "mike" "data" "trials.csv"
#>
#> [[2]]
#> [1] "" "home" "mike" "data" "errors.csv"
#>
#> [[3]]
#> [1] "" "home" "mike" "corr" "reject.doc"
The second argument of strsplit
(the delimiter
argument) is
actually much more powerful than these examples indicate. It can be a
regular expression, letting you match patterns far more complicated than
a simple string. In fact, to turn off the regular expression feature (and
its interpretation of special characters), you must include the
fixed=TRUE
argument.
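A hedged illustration of why this matters: in a regular expression, a period matches any character, so splitting on "." without fixed=TRUE treats every character as a delimiter. The version string below is just sample data of our own:

```r
version <- "3.6.1"

# As a regular expression, "." matches every character,
# so nothing is left between the delimiters.
as_regex <- strsplit(version, ".")
as_regex
#> [[1]]
#> [1] "" "" "" "" ""

# With fixed = TRUE the period is taken literally.
as_literal <- strsplit(version, ".", fixed = TRUE)
as_literal
#> [[1]]
#> [1] "3" "6" "1"
```

Escaping the period in the pattern ("\\.") is an alternative when you need the rest of the regular expression machinery.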
See Also
To learn more about regular expressions in R, see the help page for
regexp
.
See O’Reilly’s Mastering Regular
Expressions, by Jeffrey E.F.
Friedl, to learn more about regular expressions in general.
7.5 Replacing Substrings
Problem
Within a string, you want to replace one substring with another.
Solution
Use sub
to replace the first instance of a substring:
Use gsub
to replace all instances of a substring:
Discussion
The sub
function finds the first instance of the old substring within
string and replaces it with the new substring:
str <- "Curly is the smart one. Curly is funny, too."
sub("Curly", "Moe", str)
#> [1] "Moe is the smart one. Curly is funny, too."
gsub
does the same thing, but it replaces all instances of the
substring (a global replace), not just the first:
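Using the same string as in the sub example, a global replacement looks like this (the variable name all_replaced is ours):

```r
str <- "Curly is the smart one. Curly is funny, too."

# gsub replaces every occurrence, not just the first
all_replaced <- gsub("Curly", "Moe", str)
all_replaced
#> [1] "Moe is the smart one. Moe is funny, too."
```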
To remove a substring altogether, simply set the new substring to be empty:
sub(" and SAS", "", "For really tough problems, you need R and SAS.")
#> [1] "For really tough problems, you need R."
The old argument can be a regular expression, which allows you to match
patterns much more complicated than a simple string. This is actually
assumed by default, so you must set the fixed=TRUE
argument if you
don’t want sub
and gsub
to interpret old
as a regular expression.
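The following sketch shows how the two interpretations can differ. The sample string is our own; the point is only that a period is a special character in a regular expression:

```r
price <- "Cost: $5.00 (approx.)"

# Regular expression: "." matches any character, so the first
# character of the string is replaced.
regex_result <- sub(".", "!", price)
regex_result
#> [1] "!ost: $5.00 (approx.)"

# fixed = TRUE: "." is a literal period, so the first period is replaced.
literal_result <- sub(".", "!", price, fixed = TRUE)
literal_result
#> [1] "Cost: $5!00 (approx.)"
```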
See Also
To learn more about regular expressions in R, see the help page for
regexp
. See Mastering Regular
Expressions to learn more
about regular expressions in general.
7.6 Generating All Pairwise Combinations of Strings
Problem
You have two sets of strings, and you want to generate all combinations from those two sets (their Cartesian product).
Solution
Use the outer
and paste
functions together to generate the matrix of
all possible combinations:
Discussion
The outer
function is intended to form the outer product. However, it
allows a third argument to replace simple multiplication with any
function. In this recipe we replace multiplication with string
concatenation (paste
), and the result is all combinations of strings.
Suppose you have four test sites and three treatments:
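The definitions below are a reconstruction: the values are inferred from the output shown in this recipe, so treat them as an assumption about the original data rather than a verbatim listing.

```r
# Four test sites and three treatments, inferred from the recipe's output
locations <- c("NY", "LA", "CHI", "HOU")
treatments <- c("T1", "T2", "T3")
```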
We can apply outer
and paste
to generate all combinations of test
sites and treatments like so:
outer(locations, treatments, paste, sep = "-")
#> [,1] [,2] [,3]
#> [1,] "NY-T1" "NY-T2" "NY-T3"
#> [2,] "LA-T1" "LA-T2" "LA-T3"
#> [3,] "CHI-T1" "CHI-T2" "CHI-T3"
#> [4,] "HOU-T1" "HOU-T2" "HOU-T3"
The fourth argument of outer
is passed to paste
. In this case, we
passed sep="-"
in order to define a hyphen as the separator between
the strings.
The result of outer
is a matrix. If you want the combinations in a
vector instead, flatten the matrix using the as.vector
function.
In the special case when you are combining a set with itself and order does not matter, the result will contain duplicate combinations:
outer(treatments, treatments, paste, sep = "-")
#> [,1] [,2] [,3]
#> [1,] "T1-T1" "T1-T2" "T1-T3"
#> [2,] "T2-T1" "T2-T2" "T2-T3"
#> [3,] "T3-T1" "T3-T2" "T3-T3"
Or we can use expand.grid
to get a pair of vectors representing all combinations:
expand.grid(treatments, treatments)
#> Var1 Var2
#> 1 T1 T1
#> 2 T2 T1
#> 3 T3 T1
#> 4 T1 T2
#> 5 T2 T2
#> 6 T3 T2
#> 7 T1 T3
#> 8 T2 T3
#> 9 T3 T3
But suppose we want all unique pairwise combinations of treatments. We
can eliminate the duplicates by removing the lower triangle (or upper
triangle). The lower.tri
function identifies that triangle, so
inverting it identifies all elements outside the lower triangle:
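A minimal sketch of that idea, assuming the three treatments implied by the output above. Note that the diagonal (each treatment paired with itself) is kept, because lower.tri excludes the diagonal by default:

```r
treatments <- c("T1", "T2", "T3")
m <- outer(treatments, treatments, paste, sep = "-")

# lower.tri(m) is TRUE strictly below the diagonal; negating it
# keeps the diagonal and the upper triangle, in column order.
unique_pairs <- m[!lower.tri(m)]
unique_pairs
#> [1] "T1-T1" "T1-T2" "T2-T2" "T1-T3" "T2-T3" "T3-T3"
```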
See Also
See Recipe 7.2, “Concatenating Strings”, for using paste
to generate
combinations of strings.
The gtools
package on CRAN has the functions combinations
and permutations
, which may be of help with related tasks.
7.7 Getting the Current Date
Problem
You need to know today’s date.
Discussion
The Sys.Date
function returns a Date
object. In the preceding
example it seems to return a string because the result is printed inside
double quotes. What really happened, however, is that Sys.Date
returned a Date
object and then R converted that object into a string
for printing purposes. You can see this by checking the class of the
result from Sys.Date
:
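A short sketch of the check; the printed date depends on the day you run it, so only the class is shown as output. The variable name today is ours:

```r
today <- Sys.Date()
today          # prints in quotes, like a string; the value depends on the current day

class(today)   # but the object is a Date, not a character string
#> [1] "Date"
```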
See Also
See Recipe 7.9, “Converting a Date into a String”.
7.8 Converting a String into a Date
Problem
You have the string representation of a date, such as "2018-12-31"
, and
you want to convert that into a Date
object.
Solution
You can use as.Date
, but you must know the format of the string. By
default, as.Date
assumes the string looks like yyyy-mm-dd. To handle
other formats, you must specify the format
parameter of as.Date
. Use
format="%m/%d/%Y"
if the date is in American style, for instance.
Discussion
This example shows the default format assumed by as.Date
, which is the
ISO 8601 standard format of yyyy-mm-dd:
The as.Date
function returns a Date
object that (as in the prior recipe) is being
converted here back to a string for printing; this explains the double quotes
around the output.
The string can be in other formats, but you must provide a format
argument so that as.Date
can interpret your string. See the help page
for the strftime
function for details about allowed formats.
Being simple Americans, we often mistakenly try to convert the usual
American date format (mm/dd/yyyy) into a Date
object, with these
unhappy results:
as.Date("12/31/2018")
#> Error in charToDate(x): character string is not in a standard unambiguous format
Here is the correct way to convert an American-style date:
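A sketch of the conversion, using the same date as the failing example. The variable names are ours, and the two-digit-year line is an extra illustration:

```r
# Tell as.Date how the string is laid out: month/day/four-digit year
us_date <- as.Date("12/31/2018", format = "%m/%d/%Y")
us_date
#> [1] "2018-12-31"

# Two-digit years need a lowercase y in the format string
short_year <- as.Date("12/31/18", format = "%m/%d/%y")
short_year
#> [1] "2018-12-31"
```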
Observe that the Y
in the format string is capitalized to indicate a
four-digit year. If you’re using two-digit years, specify a lowercase y
.
7.9 Converting a Date into a String
Problem
You want to convert a Date
object into a character string, usually
because you want to print the date.
Solution
Use either format
or as.character
:
Both functions allow a format
argument that controls the formatting.
Use format="%m/%d/%Y"
to get American-style dates, for example:
Discussion
The format
argument defines the appearance of the resulting string.
Normal characters, such as slash (/
) or hyphen (-
) are simply copied
to the output string. Each two-letter combination of a percent sign
(%
) followed by another character has special meaning. Some common
ones are:
%b
Abbreviated month name (“Jan”)
%B
Full month name (“January”)
%d
Day as a two-digit number
%m
Month as a two-digit number
%y
Year without century (00–99)
%Y
Year with century
See the help page for the strftime
function for a complete list of
formatting codes.
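To make the codes concrete, here is a small sketch using format on a fixed date (chosen by us so the output is reproducible). The month-name codes depend on your locale, so no output is shown for that line:

```r
d <- as.Date("2019-03-15")

default_string <- format(d)            # default is yyyy-mm-dd
default_string
#> [1] "2019-03-15"

us_string <- format(d, format = "%m/%d/%Y")
us_string
#> [1] "03/15/2019"

short_string <- format(d, format = "%d-%m-%y")
short_string
#> [1] "15-03-19"

format(d, format = "%d %B %Y")         # full month name; spelling depends on locale
```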
7.10 Converting Year, Month, and Day into a Date
Problem
You have a date represented by its year, month, and day in different variables. You want to
merge these elements into a single Date
object representation.
Solution
Use the ISOdate
function:
The result is a POSIXct
object that you can convert into a Date
object:
Discussion
It is common for input data to contain dates encoded as three numbers:
year, month, and day. The ISOdate
function can combine them into a
POSIXct
object:
You can keep your date in the POSIXct
format. However, when working
with pure dates (not dates and times), we often convert to a Date
object and truncate the unused time information:
Trying to convert an invalid date results in NA
:
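The steps above can be sketched with dates of our own choosing: a valid leap day, its conversion to a Date object, and an invalid date.

```r
# ISOdate returns a POSIXct set to noon GMT on the given day
leap_day <- ISOdate(2020, 2, 29)
leap_day
#> [1] "2020-02-29 12:00:00 GMT"

# Converting to Date drops the time of day
leap_date <- as.Date(leap_day)
leap_date
#> [1] "2020-02-29"

# 2013 was not a leap year, so February 29 is invalid
bad_day <- ISOdate(2013, 2, 29)
bad_day
#> [1] NA
```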
ISOdate
can process entire vectors of years, months, and days, which
is quite handy for mass conversion of input data. The following example
starts with the year/month/day numbers for the third Friday in
January of several years and then combines them all into Date
objects:
years <- c(2010, 2011, 2012, 2013, 2014)
months <- c(1, 1, 1, 1, 1)
days <- c(15, 21, 20, 18, 17)
ISOdate(years, months, days)
#> [1] "2010-01-15 12:00:00 GMT" "2011-01-21 12:00:00 GMT"
#> [3] "2012-01-20 12:00:00 GMT" "2013-01-18 12:00:00 GMT"
#> [5] "2014-01-17 12:00:00 GMT"
as.Date(ISOdate(years, months, days))
#> [1] "2010-01-15" "2011-01-21" "2012-01-20" "2013-01-18" "2014-01-17"
Purists will note that the vector of months is redundant and that the last expression can therefore be further simplified by invoking the Recycling Rule:
as.Date(ISOdate(years, 1, days))
#> [1] "2010-01-15" "2011-01-21" "2012-01-20" "2013-01-18" "2014-01-17"
You can also extend this recipe to handle year, month, day, hour,
minute, and second data by using the ISOdatetime
function (see the
help page for details):
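A minimal sketch with values of our own choosing. We pass tz = "GMT" explicitly so the result does not depend on the local time zone of the machine running the code:

```r
# year, month, day, hour, minute, second
dt <- ISOdatetime(2019, 3, 15, 11, 45, 0, tz = "GMT")
dt
#> [1] "2019-03-15 11:45:00 GMT"

class(dt)
#> [1] "POSIXct" "POSIXt"
```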
7.11 Getting the Julian Date
Problem
Given a Date
object, you want to extract the Julian date—which is, in
R, the number of days since January 1, 1970.
Solution
Either convert the Date
object to an integer or use the julian
function:
Discussion
A Julian “date” is simply the number of days since a more-or-less arbitrary starting point. In the case of R, that starting point is January 1, 1970, the same starting point as Unix systems. So the Julian date for January 1, 1970 is zero, as shown here:
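Both approaches can be sketched as follows, using a sample date of our own alongside the January 1, 1970 origin:

```r
d <- as.Date("2019-03-15")

days_int <- as.integer(d)   # days since 1970-01-01
days_int
#> [1] 17970

days_julian <- julian(d)    # same number, plus an "origin" attribute
days_julian
#> [1] 17970
#> attr(,"origin")
#> [1] "1970-01-01"

origin_days <- as.integer(as.Date("1970-01-01"))
origin_days
#> [1] 0
```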
7.12 Extracting the Parts of a Date
Problem
Given a Date
object, you want to extract a date part such as the day
of the week, the day of the year, the calendar day, the calendar month,
or the calendar year.
Solution
Convert the Date
object to a POSIXlt
object, which is a list of date
parts. Then extract the desired part from that list:
Discussion
The POSIXlt
object represents a date as a list of date parts. Convert
your Date
object to POSIXlt
by using the as.POSIXlt
function,
which will give you a list with these members:
sec
Seconds (0–61)
min
Minutes (0–59)
hour
Hours (0–23)
mday
Day of the month (1–31)
mon
Month (0–11)
year
Years since 1900
wday
Day of the week (0–6, 0 = Sunday)
yday
Day of the year (0–365)
isdst
Daylight Saving Time flag
Using these date parts, we can learn that April 2, 2020, is a Thursday
(wday
= 4) and the 93rd day of the year (because yday
= 0 on January
1):
A common mistake is failing to add 1900 to the year, giving the impression you are living a long, long time ago:
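The date parts discussed above can be sketched like this, using April 2, 2020; the variable names d and p are ours:

```r
d <- as.Date("2020-04-02")
p <- as.POSIXlt(d)

p$mday          # day of the month
#> [1] 2
p$mon           # month, counted from 0 (so 3 means April)
#> [1] 3
p$wday          # day of the week (4 = Thursday)
#> [1] 4
p$yday          # day of the year, counted from 0
#> [1] 92

p$year          # years since 1900: looks like the distant past
#> [1] 120
p$year + 1900   # the calendar year
#> [1] 2020
```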
7.13 Creating a Sequence of Dates
Problem
You want to create a sequence of dates, such as a sequence of daily, monthly, or annual dates.
Solution
The seq
function is a generic function that has a version for Date
objects. It can create a Date
sequence similarly to the way it creates
a sequence of numbers.
Discussion
A typical use of seq
specifies a starting date (from
), ending date
(to
), and increment (by
). An increment of 1 indicates daily dates:
s <- as.Date("2019-01-01")
e <- as.Date("2019-02-01")
seq(from = s, to = e, by = 1) # One month of dates
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06" "2019-01-07" "2019-01-08" "2019-01-09" "2019-01-10"
#> [11] "2019-01-11" "2019-01-12" "2019-01-13" "2019-01-14" "2019-01-15"
#> [16] "2019-01-16" "2019-01-17" "2019-01-18" "2019-01-19" "2019-01-20"
#> [21] "2019-01-21" "2019-01-22" "2019-01-23" "2019-01-24" "2019-01-25"
#> [26] "2019-01-26" "2019-01-27" "2019-01-28" "2019-01-29" "2019-01-30"
#> [31] "2019-01-31" "2019-02-01"
Another typical use specifies a starting date (from
), increment
(by
), and number of dates (length.out
):
seq(from = s, by = 1, length.out = 7) # One week of daily dates
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06" "2019-01-07"
The increment (by
) is flexible and can be specified in days, weeks,
months, or years:
seq(from = s, by = "month", length.out = 12) # First of the month for one year
#> [1] "2019-01-01" "2019-02-01" "2019-03-01" "2019-04-01" "2019-05-01"
#> [6] "2019-06-01" "2019-07-01" "2019-08-01" "2019-09-01" "2019-10-01"
#> [11] "2019-11-01" "2019-12-01"
seq(from = s, by = "3 months", length.out = 4) # Quarterly dates for one year
#> [1] "2019-01-01" "2019-04-01" "2019-07-01" "2019-10-01"
seq(from = s, by = "year", length.out = 10) # Year-start dates for one decade
#> [1] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01" "2023-01-01"
#> [6] "2024-01-01" "2025-01-01" "2026-01-01" "2027-01-01" "2028-01-01"
Be careful with by="month"
near month-end. In this example, the end of
February overflows into March, which is probably not what you wanted:
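A sketch of the problem, with a start date of our own choosing. Because 2019 has no February 29, the second element of the sequence rolls over into March:

```r
# Start near the end of January and step by calendar month
month_steps <- seq(as.Date("2019-01-29"), by = "month", length.out = 3)
month_steps
#> [1] "2019-01-29" "2019-03-01" "2019-03-29"
```

If you need month-end dates, one workaround is to generate first-of-month dates (which never overflow) and subtract one day from each.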