Sections 1-5 are introductory explanations of this tutorial and R. Sections 6-9 are the main learning parts of this tutorial. Section 10 is a summary of the previous sections. Sections 11-16 are much more advanced and introduce data import into R, descriptive statistics, graphics, control-flow constructs, functions and project-oriented workflow, respectively. Sections 17-21 can be seen as appendices for future reference.


1 Goal of this tutorial

It is not intended to make you an R expert, at least not yet. This is impossible to achieve in a few hours or even days. But some aspects of the R logic are essential to understand and master in order to be able to learn by yourself. These basics are what I will try to provide you with through this tutorial.
Clearly, this is not the most exciting part, but it might be the most difficult one to learn, especially by yourself. Once you are fluent with the concepts addressed here, you will be independent with R. These concepts relate mainly to types (see section Modes), structures (see section Classes) and manipulation (see section Manipulate objects) of objects.

This tutorial addresses only the use of R, its logic and syntax. Some applications (descriptive stats, graphics…) will only be covered briefly (see sections Descriptive statistics and Graphics). These applications, and others, will not be addressed in detail for several reasons. First, each useR has specific applications; as such, there are too many possible uses of R to address them all here. Second, if you focus too much on applications before being able to correctly use R, you will be able to use R by copying/editing pieces of code found online, but you might not necessarily understand these chunks of code and you might therefore be unable to adapt them to your needs. You might even wrongly believe these chunks are adapted to your data. In the long run, you would probably still come to understand how R works, but that would take even more time and energy (experience talking here!). Third, if you understand the functioning of R, you will have no problem learning the different applications. And fourth, I am not an expert in statistics and data science, so I do not want to step into this territory.

I will try to emphasize terminology and to highlight some “good coding practices”. I believe it is important to use the correct terms (to be able to discuss with colleagues or just to ask for help on forums/mailing lists) and to clearly organize the code (to make it readable for yourself, potential helpers and colleagues). See Coding style for details.

Finally, this tutorial is very long. Do not expect to do it in one day, or even a few days. The learning process will take time; you will also probably need to rework some sections several times until you have completely understood and mastered the content. Also, remember that, once you are done with this tutorial, it is very important to use R regularly. Like any language, you will lose it if you do not practice. So make sure you start this tutorial with enough time to go through it first, but also with enough time over the next months to consolidate what you have learned.


2 What’s R?

R is an open-source (i.e., among other things, free) software package for data mining and analysis, including graphics.
There are basically no menus and buttons to click on; everything works through the command line (but see e.g. R commander Rcmdr and easieR). That makes it difficult, especially at the beginning, but it is also why everything is possible; you are not limited by what is available in the menus.

R is being developed by a community of useRs who can contribute with packages (see section Packages). There are thousands of packages, each containing functions (see section Function) to run specific operations.


3 How to install R?

Visit the R-Project website.
In the list on the left side, click on ‘CRAN’ (Comprehensive R Archive Network) under ‘Download’ and choose a download mirror.
Under ‘Download and Install R’, click on the link corresponding to your OS.


3.1 Windows

Click on ‘base’ or ‘install R for the first time’, and ‘Download R X.X.X for Windows’. Download and execute the installation file. Follow instructions.
I advise you to install as administrator. It is not required, but I have the impression that things do not always work properly if installed in the user folders. To do so, right click on the installation file and select “run as administrator”. Do the same to install packages (see section Packages for details).

You can choose to install message translations (translations are not always available and are not necessarily good, but it can help).
‘Start options’: if you choose ‘yes’, you can then choose whether R opens in a single multi-panel window (good with 1 screen, default mode) or whether there are several separated windows (good with several screens). But this is not critical if you use RStudio anyway (see section RStudio).


3.2 macOS

It is recommended to use a computer with macOS 11 Big Sur or later, even though earlier versions of macOS are supported by previous R versions (where the installation of R and packages is more complicated).

Download the installation file (‘R-X.X.X.pkg’) and start the installation. Follow instructions.

As explained on the R website, the latest version of XQuartz is required, so download (https://www.xquartz.org/) and install it.


3.3 Linux

Untested and probably outdated, but I guess Linux users will manage to install it on their own!

To install R under Ubuntu (an LTS version 14.04 or 16.04 is recommended) or another Linux distribution, just look for and install the packages r-base and r-base-dev from the package manager.

It is also possible to add the CRAN website to the list of repositories, so that R packages can be downloaded, installed and updated from the package manager. Check the section ‘Download R for Linux’ on the CRAN for more details.


3.4 Install RStudio

RStudio is an interface that, among other more advanced features (see e.g. section Data, scripting, projects and repeatability), improves some functionalities of the editor (colors, auto-completion, syntax checking…) and is very useful, especially for Windows and Linux useRs. Other so-called integrated development environments (IDE) are available for R, but in any case, it is highly recommended to use one.

Go to the Posit/RStudio website, click on ‘DOWNLOAD RSTUDIO DESKTOP’ and download the installation file corresponding to your OS (‘RStudio-X.X.X.exe’ or ‘.dmg’). Start installation and follow instructions.


4 Tutorial

Before going further, here are some explanations regarding this tutorial.

It contains text and R code. R code is greyish, and output is displayed in white frames (at least in the HTML file).

In the exercises below, code is hidden by default. Try before displaying the answer code. You can then display it (button hide/code on the right) to check your answer/try. Explanations are usually given right below the code. So do not scroll down too fast!

If you notice any mistakes or typos, please let me know!


5 R programming

5.1 The console

When you open R, you can find a menu bar and a window named R Console. It is in this window that commands should be written to run them. The prompt (symbol >) followed by the blinking cursor shows that R is ready!
Some operations take a long time to run; as long as the > and the cursor are not displayed on the last line of the console, R is working. In these cases, you just have to wait. In case of problems, the STOP button can stop running operations.

Input, output, warnings and errors are all displayed in the console with different colors.

It is possible to navigate through the command history using the keyboard up and down arrows.


5.2 The editor

Using the menu bar, you can create a new document, either with the dedicated icon or through File > New file/document/script. A new window will open, the editor.
The editor is used to create scripts, i.e. files that contain a list of commands to run, possibly on several datasets. It is possible to save these scripts like any other file (CTRL+s on Windows/Linux, cmd+s on macOS). I therefore advise you, even during your learning sessions, to write all your commands in the editor and to save the file, instead of using the console only and saving the workspace (see section Exit R).

In the editor, select the command(s) to run and use the key combination CTRL+r (R Windows), CTRL+ENTER (RStudio Windows) or cmd+ENTER (macOS) to send it (them) automatically to the console. That’s way easier and less error-prone than copy/paste.

The symbol # indicates comments in a chunk of R code. Do not hesitate to comment your script!

The macOS editor is really good. For Windows and Linux useRs, I recommend using RStudio (see sections Install RStudio and RStudio).

Try with the script initiationR.R in the main folder.


5.3 Exit R

When you exit R, a pop-up appears asking whether you want to save the workspace. It is strongly advised not to save the workspace, to avoid loading old data (and errors) into a new R session.
In the macOS RGui and in RStudio, these settings can be defined globally.


5.4 RStudio

The organization of the RStudio window is a bit different, but the same panels are present: console and editor, plus two other panels (files, packages, graphs, help and history). These panels can be reorganized in the preferences/options.

In the settings (Tools > Global Options > General), I (and others) advise you to untick the boxes to ‘Restore .RData into workspace at startup’ and ‘Always save history (even when not saving .RData)’, and to set ‘Save workspace to .RData on exit’ to ‘Never’, as explained in the section Exit R.


5.5 Coding style

I advise you to follow some coding style to make your code easier to read, both for you and for anyone trying to decipher it!
The tidyverse style guide is a very good reference. I try to follow it throughout the tutorial.


5.6 Debugging

This part is largely copied from John K. Kruschke (2015), “Doing Bayesian data analysis: a tutorial with R, JAGS, and Stan”, 2nd edition (p. 82).

When you encounter an error, here are some hints regarding how to diagnose and repair the problem:

  • The error messages displayed by R can sometimes be cryptic but are usually very helpful to isolate where the error occurred and what caused it. When an error message appears, don’t just ignore what it says. Be sure to actually read the error message and see if it makes sense. Remember too that Google is your friend (well, use Swisscows, Brave Search or similar instead if you value your privacy).
  • Isolate the first point in the code that causes an error. Run the code sequentially a line at a time, or in small blocks, until the first error is encountered. Fix this error, and it might cure subsequent problems too.
  • When you have found that a particular line of code causes an error, and that line involves a complex nesting of embedded functions, check the nested functions from the inside out. Do this by selecting, with the cursor, the inner-most variable or function and running it. Then work outward.
  • If you have defined a new function that had been working but mysteriously has stopped working, be sure that it is not relying on a variable that is defined outside the function. For example, suppose that you specify N=30 at the command line without putting it into the script. Then, in the script, you define a function, addN <- function(x) {x+N}. The function will work without complaint until you start a new R session that does not have N defined. In general, be wary of using variables inside a function without explicitly putting them as arguments in the function definition (a short demonstration follows this list). See section How to write a function.
  • Once you have a working script, save it, close your R session (without saving the workspace of course, see section Exit R) and start a new one. Now test your script again. It happens often that you define and re-define and re-re-define variables while developing a script. At some point, you lose track of it. Restarting with a fresh session makes sure that the variables defined in your script are sufficient to run it.
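
Here is a minimal demonstration of the scoping pitfall mentioned above, based on the addN() example:

N <- 30
addN <- function(x) {x + N} # N is not an argument: it is silently taken from the workspace
addN(2)  # Works (returns 32) only because N happens to exist in the workspace
rm(N)    # Simulate a fresh session in which N was never defined
addN(2)  # Error: object 'N' not found
addN2 <- function(x, n) {x + n} # Safer: N is passed explicitly as an argument
addN2(2, n = 30)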

For more advanced functions, see ?debug, ?browser and ?traceback (see here for details).
RStudio also has debugging functionalities (see RStudio support article Debugging with RStudio).
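
Here is a minimal sketch of traceback() in action (the exact output depends on your R version):

f <- function(x) g(x)
g <- function(x) log(x) # Fails if x is not numeric
f("a")      # Error in log(x): non-numeric argument to mathematical function
traceback() # Displays the sequence of calls that led to the error: 2: g(x), then 1: f("a")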


For the examples below and for the rest of this document, you can copy/paste commands into the console (or even better: into a script and send them to the console) to try by yourself to understand how it works. Do try everything that comes to your mind; the best way to learn is to try and make mistakes (which can be translated as: “By constantly trying, you end up succeeding. Therefore: the more you fail, the more likely it will work”)!


6 What is an object?

6.1 Definition

Here is the definition from the R Language Definition:
“In every computer language variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory but rather provides a number of specialized data structures we will refer to as objects. These objects are referred to through symbols or variables. In R, however, the symbols are themselves objects and can be manipulated in the same way as any other object. This is different from many other languages and has wide ranging effects.”

So, R works with objects: data are stored into objects, these objects are manipulated and operations are run on these objects. In other words, an object is the basic unit in R, a variable that can contain data of any type (see section Modes) and any structure (see section Classes).
This implies that data must be stored into an object to be usable: even if data appear in the console, R will not be able to work with these data if they are not stored into an object.

Objects exist only in the R workspace as long as they are not saved as files. This means that:

  • Imported data files are not modified (except if you overwrite them when saving R objects with the same file names)
  • If you exit R, all unsaved objects will be lost

6.2 Assignment

Most of the time, you create an object directly from the output of a function, another object, or from numeric/character values. To assign data to an object, one of these symbols should be used:

  • <- (a less-than sign followed by a minus, without a space in between), or
  • = (but I do not recommend it to avoid confusions, as recommended on the tidyverse style guide, on this blog entry and on that one)

I recommend always leaving a space before and after these symbols to have clean code (see also the tidyverse style guide).

For example, to assign the value 1 to an object named x:
x <- 1

The value has been assigned to x, but no ‘result’ is displayed in the console. Indeed, we just asked R to create x; we did not ask to display x. To display the values stored into an object, just type its name into the console:
x

[1] 1

Note that if you type 1 in the console, R will display directly the value in the console, indicating that no object has been created; the command just asked to display the value 1:
1

[1] 1

It is also possible to store more complex information. For example, it is possible to combine values using the function c(), with the different values being separated by commas:
y <- c(1.5, 3, 10.05)
y

[1]  1.50  3.00 10.05

The function : creates an integer sequence between two limits. To store the integer sequence from 1 to 10 into an object named z:
z <- 1:10
z

 [1]  1  2  3  4  5  6  7  8  9 10

To assign the mean (function mean()) of an integer sequence from 1 to 10 into an object named mean1:
mean1 <- mean(1:10)
mean1

[1] 5.5

It is also possible to use objects created previously:
mean2 <- mean(z)
mean2

[1] 5.5

Both means are identical because z contains the integers 1 to 10. In the case of mean2, we applied the function mean() to the object z.

Applying operations to existing objects is how R works.

By the way, RStudio for Windows has a keyboard shortcut to write the <- operator (including the space before and after): ALT + -


6.3 Naming of objects

You can name objects however you want. But you should make sure that these names reflect which data the object contains!

There are nevertheless 4 important rules for naming objects:

  • the name can contain numbers, but it must start with a letter
  • . and _ are allowed, the former being also allowed at the beginning of a name. All other symbols are forbidden
  • R is case-sensitive (this means that R is different from r)
  • Some words are reserved in R and cannot be used for objects: see ?Reserved

Examples of valid and different names: my.data, my_data, My.Data
Examples of invalid names: 2data, /data, $data

Refer to the tidyverse style guide for good object names. Others prefer to use the camelCase notation.

A common error in R is: Error: object 'x' not found
There can be 3 reasons:

  • You misspelled the name of the object (remember: R is case-sensitive)
  • This object does not exist because you have not created it (no assignment). Typing ls() into the console lists the existing objects
  • You meant a character string and not an object name; in this case, quote the string (see section Character)
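
A quick illustration of the last two points (my_obj is a hypothetical name that was never assigned):

ls()     # Lists the objects existing in the workspace
my_obj   # Error: object 'my_obj' not found, because it was never created
"my_obj" # Quoted: a character string, so no error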

7 Modes

Objects can be of different modes and classes (see section Classes).

The mode of an object corresponds to the type of data it contains.
The main modes are: numeric, character, logical and function.

It is possible to check the mode of an object with the function mode():
x <- 1:10
mode(x)

[1] "numeric"

The function typeof() is more detailed:
x <- 1:10
typeof(x)

[1] "integer"

7.1 Numeric

Objects of mode numeric obviously contain numbers: integers (1; 20; 500; -3) or doubles (1.00; 20.25; 500.1; -3.55). Complex numbers (3+2i) are closely related but have their own mode, complex.

See section Modes above for an example of a numeric (integer).
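
A few quick checks of the numeric sub-types (the suffix L forces an integer):

typeof(1L)   # "integer"
typeof(1.5)  # "double"
mode(1L)     # "numeric", as for doubles
mode(3+2i)   # "complex"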


7.2 Character

A character is a value made of text, i.e. a string of characters: “abc”, “a1c”, “2bc”, ” 2” (note the space before the 2).
In R, characters will always be displayed quoted. Single (‘example’) and double (“example”) quotation marks are equivalent; just make sure that you close the quotation with the same symbol used for opening it. Still, it is recommended to use double quotes whenever possible.

For example:
my_char <- "abc"
my_char

[1] "abc"

mode(my_char)

[1] "character"

typeof(my_char)

[1] "character"

Numbers can also be stored as characters:
my_num <- "2"
my_num

[1] "2"

mode(my_num)

[1] "character"

typeof(my_num)

[1] "character"

7.3 Logical

The mode logical corresponds to only two possible values: TRUE and FALSE, with upper case letters and without quotation marks. When converted to numeric, these values correspond to 1 and 0, respectively.

In some documentation, you may see the abbreviations T and F. They work most of the time, but it is strongly advised to always use the complete versions TRUE and FALSE.

For example:
my_logi <- TRUE
my_logi

[1] TRUE

my_logi_abb <- T
my_logi_abb

[1] TRUE

mode(my_logi)

[1] "logical"

typeof(my_logi)

[1] "logical"

7.4 Function

7.4.1 General information

A function applies one or several operations to an object and outputs a result.
Many functions are built in base R (for example mean()), and there exist many more in contributed packages (see section Packages). It is also possible to create your own functions (see section How to write a function).

An object of mode function does not contain the result but only the list of operations or commands. This list of commands can be displayed by typing the name of the function (remember: a function is an object) in the console, for example mean:

function (x, ...) 
UseMethod("mean")
<bytecode: 0x000001f71cb156a8>
<environment: namespace:base>

But in general, you want to call this function on a data object (input) and store the result (output) into another object. In this case, brackets are needed after the function’s name to tell R that we want to call the function and to apply its operations.
That is what we have done previously:
x <- c(3, 4, 6, 8, 12, 15, 20) # The function c() is called to combine values into the data object x
x

[1]  3  4  6  8 12 15 20

mode(x) # The function mode() is called to output (in this case also to display) the mode of x

[1] "numeric"

mean(x) # The function mean() is called to compute the mean of x

[1] 9.714286

To get the help page of a function, just type its name preceded by a question mark, or use the function help(), in the console (see section Understand the help page of a function):
?mean
help(mean)

7.4.2 Arguments

Arguments are the options/settings of a function. All of them are named and these names are given in the function’s definition on the help page. Arguments are listed within the brackets of the function’s call, separated by commas, and the symbol = is used to assign values (options) to arguments.

It is not necessary to name an argument to set it; the order is enough. But it is still advisable to spend the few seconds typing the names to make sure that you or someone else will know what value corresponds to which argument.

Some arguments are given default values that can be identified in the help page: the name of the argument is then followed by the symbol = and the default value. These default values are used when no value has been assigned to these arguments in the function’s call.
If an argument does not have a default value, a value must be assigned to it in the function’s call; if not, R will return an error of the type argument "x" is missing, with no default.
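
For example, calling mean() without any value for its argument x triggers an error of this kind:

mean()

Error in mean.default() : argument "x" is missing, with no default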

In the function’s call, arguments can be used in three different ways:

  • arguments can be ignored so that default values are used (only for arguments having default values)
  • arguments can be named to assign a value to them with the symbol =; unlisted arguments will be assigned default values
  • values can be assigned to unnamed arguments, but the order of the values should correspond to the order of arguments as defined in the function’s definition. It is not necessary to assign values to all arguments: for example, if you list 3 values, these values will be assigned to the 3 first arguments and default values will be assigned for the following arguments

7.4.3 Example

Let us take an example, with the function matrix() to create matrices, which we will use in the section Matrix.
On its help page, you can see that it has 5 arguments named data, nrow, ncol, byrow and dimnames, in this order.
They all have default values because they are all followed by = and a value (respectively: NA, 1, 1, FALSE and NULL).

The argument data asks for input data. By default, missing values (NA, for ‘not available’, corresponds to the empty cells in MS Excel) fill the matrix. Arguments nrow and ncol indicate the number of rows and columns, respectively, of the matrix, by default 1 row and 1 column. Let us ignore the last two arguments.

These 3 lines are identical and allow to create a matrix containing the integers from 1 to 10 distributed over 5 rows and 2 columns:

  • matrix(data = 1:10, nrow = 5, ncol = 2) : the argument data contains the integers 1 to 10, arguments nrow and ncol set up the number of rows and columns respectively, and the default values are used for arguments byrow and dimnames so they are not listed
  • matrix(1:10, 5, 2) : same thing but without naming the first three arguments. The order is then crucial (compare with matrix(1:10, 2, 5) for example)
  • matrix(ncol = 2, nrow = 5, data = 1:10) : named arguments can be listed in any order
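
Running any of these three commands displays the same matrix:

matrix(data = 1:10, nrow = 5, ncol = 2)

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10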

7.5 Exercises

  • Create objects of mode numeric, character and logical. Check their modes with the functions mode() and typeof(). Compare the outputs of each mode.
  • Also check the mode of the four functions that we already used (mode, mean, c and matrix) as objects.
  • Create a matrix mat with 2 rows and 3 columns filled with the letters “a” to “f”.
  • How many arguments does the function mean() have? What are their names and default values?

Answers:

# Object of mode numeric
x <- 1:10
mode(x)
typeof(x)
# Output without quotation marks
x
    
# Object of mode character
y <- c("abc", "def")
mode(y)
typeof(y)
# Output with quotation marks
y
    
# Object of mode logical
z <- c(TRUE, TRUE, FALSE)
mode(z)
typeof(z)
# Output without quotation marks
z
    
# Objects of mode function
mode(mean)
typeof(mean)
mode(mode)
typeof(mode)
mode(c)
typeof(c)
mode(matrix)
typeof(matrix)

# Matrix
let <- c("a", "b", "c", "d", "e", "f")
mat <- matrix(data = let, nrow = 2, ncol = 3)
mat <- matrix(let, 2, 3)
mat <- matrix(nrow = 2, data = let, ncol = 3)

# mean()
# The function has 3 arguments: x, trim and na.rm
# Their default values are: 0 and FALSE for trim and na.rm respectively
# x does not have a default value, so it is necessary to assign it a value in the function's call

8 Classes

The class of an object is an attribute that dictates how that object should be treated by functions. In most cases, it mirrors the structure of its data. Here I present the main classes used in R. New classes can be defined, and some functions actually assign new classes to objects (see e.g. section Reading XLS or XLSX files directly).


8.1 Vector

A vector is the fundamental unit in R. You can see a vector as a group/collection/assemblage of values (numbers, characters…).
A vector, as seen from our human point of view, has one dimension. But in R, it does not have a dimension attribute and can therefore be considered dimensionless (just accept it!).

All the elements (i.e. values) of a vector must be of the same mode. If required, the elements will be coerced (i.e. converted) into a common mode. For example, a vector containing the values 1, 2, 3, a, 5 will be of character mode, with numbers coerced into characters (because characters cannot be coerced into numbers).

The function c() that we already used (see section Assignment) combines data into a vector (some values are coerced if necessary).

The function class() outputs the class of an object.

Examples:
x <- 1:10
mode(x)

[1] "numeric"

class(x)

[1] "integer"

y <- c("a", "b", "c")
mode(y)

[1] "character"

class(y)

[1] "character"

z <- c(1, 2, 3, "a", 5)
z

[1] "1" "2" "3" "a" "5"

mode(z)

[1] "character"

class(z)

[1] "character"

And just to show that a vector has no dimension:
dim(z)

NULL

8.2 Matrix

Matrices are 2-dimensional arrays (see section Array). They are presented as tables.
From R 4.0.0, they are of class matrix and array.

Example (see also 7.4.3 Example):
mat <- matrix(x, nrow = 5, ncol = 2)
mat

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

mode(mat)

[1] "numeric"

class(mat)

[1] "matrix" "array" 

8.3 Array

An array is a single vector with dimensions. It means that all elements of an array have to be of the same mode. Moreover, the number of elements should equal the product of the lengths of the dimensions: in the case of a matrix (i.e. a 2D array), the number of elements should be equal to the number of columns multiplied by the number of rows.

Example:
arr <- array(1:12, dim = c(2, 3, 2))
arr

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

Number of elements: length(1:12) = 12
Product of the lengths of the dimensions: 2 × 3 × 2 = 12

mode(arr)

[1] "numeric"

class(arr)

[1] "array"

8.4 List

Lists are the most flexible objects: elements can be of different modes and lengths.
A list can have a hierarchical structure, meaning that each element of a list can be a vector, an array or a list, and so on.
The only constraint is that each element of the lowest level is a vector; its elements are therefore of the same mode.

Examples:
my.list <- list(a = 1:3, b = c("a", "d"))
my.list

$a
[1] 1 2 3

$b
[1] "a" "d"

mode(my.list)

[1] "list"

class(my.list)

[1] "list"

A list is of mode and class list!

my.list2 <- list(a = list(num = 1:3, let = c("a", "b")), b = mat, d = c(4, 5, "6d", 7))
my.list2

$a
$a$num
[1] 1 2 3

$a$let
[1] "a" "b"


$b
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

$d
[1] "4"  "5"  "6d" "7" 

my.list2 is a list composed of 3 elements (a, b and d).
Element 1 of my.list2 is named a and is a list composed of 2 elements (num and let).
Element 1 of a is named num and is a vector of mode numeric composed of 3 elements (1, 2 and 3).
Element 2 of a is named let and is a vector of mode character composed of 2 elements (a and b, which have nothing to do with the elements of my.list2).

Element 2 of my.list2 is named b and is a matrix of mode numeric with values coming from mat (i.e. 1:10).

Element 3 of my.list2 is named d and is a vector of mode character composed of 4 elements ("4", "5", "6d" and "7", all coerced to characters).

When lines of code become long and complex with several nested function calls, use spaces or line breaks to separate units (see tidyverse style guide).


8.5 Data.frame

To simplify (this holds true in most cases), a data.frame is a list with only one hierarchical level, and all elements have to have the same length (i.e. the same number of sub-elements). The elements are vectors (so all sub-elements have the same mode), but the different elements can be of different modes. Data.frames are usually used to store tables: each column is an element (vector) of the data.frame and columns can be of different modes. All columns must therefore have the same number of rows (so that the elements of the data.frame have the same length; missing values can be used to fill gaps).

Even if matrices and data.frames look similar, these classes are very different in R. A matrix is ONE vector distributed in rows and columns. A data.frame is a LIST of vectors, presented in columns.

Example:
my.df <- data.frame(num = 1:3, let = c("a", "b", "c"))
my.df

  num let
1   1   a
2   2   b
3   3   c

Rows are numbered if they are unnamed. So be careful not to confuse the row numbers with the values of the first column (num).

mode(my.df)

[1] "list"

Data.frames are of mode list, which is expected.

class(my.df)

[1] "data.frame"

Computations are generally faster on matrices. So with large datasets, matrices might be a better option than data.frames (if at all possible).
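
A rough way to see this for yourself (timings are only illustrative and will vary across machines):

m <- matrix(rnorm(1e6), ncol = 100)
d <- as.data.frame(m)
system.time(for (i in 1:100) rowSums(m)) # Matrix: usually faster
system.time(for (i in 1:100) rowSums(d)) # Data.frame: extra overhead from conversion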


8.6 Factor

A factor is a vector containing characters coded as numeric. It is actually composed of an integer vector, each integer being associated with an attribute (label).

In other words, factors are the categorical, discrete variables used in statistics, for example sex (possible values = male or female), eye color (possible values = blue, brown, green…), etc. Each possible value, called level, is associated with one integer (from 1 to the number of unique values).

Some software packages need to create ‘dummy variables’ to treat this kind of data. R processes these discrete variables easily with factors, both for stats and graphs.

Before R 4.0.0, by default, when you created a data.frame, character columns were converted to factors. This is not the case anymore, so be careful when you use older code.

Factors look more like a mode, but for R, they represent a class.

Examples:
my.fac <- factor(c("woman", "man", "woman", "man"))
my.fac

[1] woman man   woman man  
Levels: man woman

Note that the values are of mode numeric (this is why they are not quoted)…
mode(my.fac)

[1] "numeric"

… and of class factor:
class(my.fac)

[1] "factor"

By converting a factor to numeric, it is possible to observe the numerical representation of factors:
as.numeric(my.fac)

[1] 2 1 2 1

Levels are by default ordered alphabetically:
levels(my.fac)

[1] "man"   "woman"

It is of course possible to change the order of the levels. Let’s use another example, with clothing sizes:
my.fac2 <- factor(c("S", "S", "M", "M", "L", "L", "XL", "XL"))
my.fac2

[1] S  S  M  M  L  L  XL XL
Levels: L M S XL

To reorder a factor, re-apply the function factor() but with the argument levels:
my.fac2.new.lev <- factor(my.fac2, levels = c("S", "M", "L", "XL"))
my.fac2.new.lev

[1] S  S  M  M  L  L  XL XL
Levels: S M L XL

Note that the order of the levels has changed, but that the values have not.

However, even though the order makes sense for us, it does not for R, meaning that comparisons are not meaningful (see section Manipulate objects for explanations on the meaning of the square brackets):
my.fac2.new.lev[1] < my.fac2.new.lev[3]

Warning in Ops.factor(my.fac2.new.lev[1], my.fac2.new.lev[3]): '<' not
meaningful for factors
[1] NA

In that case, it is possible to tell R that the factors are ordered, i.e. that they are sorted from small to large:
my.fac2.order <- factor(my.fac2, levels = c("S", "M", "L", "XL"), ordered = TRUE)
my.fac2.order

[1] S  S  M  M  L  L  XL XL
Levels: S < M < L < XL

my.fac2.order[1] < my.fac2.order[3]

[1] TRUE

If you want to change the labels of your factors, use the argument labels to the factor() function:
my.fac2.order.new.lab <- factor(my.fac2.order, labels = c("Small", "Medium", "Large", "Extra large"))
my.fac2.order.new.lab

[1] Small       Small       Medium      Medium      Large       Large      
[7] Extra large Extra large
Levels: Small < Medium < Large < Extra large

This changes the values, so be careful with the order when you do it. For example, this could lead to disaster in subsequent analyses:
my.fac2.order.wrong <- factor(my.fac2.order, labels = c("Medium", "Small", "Large", "Extra large"))
my.fac2.order.wrong

[1] Medium      Medium      Small       Small       Large       Large      
[7] Extra large Extra large
Levels: Medium < Small < Large < Extra large

For more details, see e.g. this page on factor manipulations with base R, or the package forcats.


8.7 Dates and time

R can of course deal with dates. Dates are actually numeric values computed as a number of days since a predefined origin. This origin is not on the same day in R (1970-01-01 at midnight UTC), in MS Excel (1900-01-01, but it is different in older Mac versions), and in LibreOffice Calc (1899-12-30), which does not make things easy, although things seem to have improved in recent years.
Dates are displayed in R in ISO format (i.e. YYYY-MM-DD).

There exist lots of functions and packages specifically designed for manipulating dates and time (e.g. from the packages chron and lubridate), but we will not get into details here.
Because of the different origins in different software packages, it is easier to read in / convert dates from characters than from numeric values.
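
For example, converting a numeric value requires specifying the origin explicitly, whereas a character string carries its own format:

as.Date(17009, origin = "1970-01-01")

[1] "2016-07-27"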

Examples:
dat <- as.Date("27-07-2016", format = "%d-%m-%Y")
dat

[1] "2016-07-27"

mode(dat)

[1] "numeric"

class(dat)

[1] "Date"

as.numeric(dat)

[1] 17009

The function as.POSIXlt() can be used for date+time values, for example:
datetime <- as.POSIXlt("27-07-2016 11:30:25", format = "%d-%m-%Y %H:%M:%S", tz = "Europe/Berlin")
datetime

[1] "2016-07-27 11:30:25 CEST"

8.8 Exercises

  • Create objects of class vector, matrix, array, list (simple and nested) and data.frame. Check their structures with the functions class() and str(). Compare the outputs of each class.
  • Create a vector of class factor with values for eye colors. Make sure to order the levels from lightest to darkest.
  • Create a character vector with dates given in the format “11 August 2023”, and convert it to a date vector using as.Date() and as.POSIXlt().

Answers:

# Object of class vector
x <- 1:10
class(x)
str(x)
# Output without dimensions
x
    
# Object of class matrix
mat <- matrix(x, nrow = 5, ncol = 2)  
class(mat)
str(mat)
# Output with rows and columns, all values have the same mode
mat
    
# Object of class array
arr <- array(1:12, dim = c(2, 3, 2)) 
class(arr)
str(arr)
# Output with more than 2 dimensions, all values have the same mode
arr

# Object of class list (simple)
my.list <- list(a = 1:3, b = c("a", "d")) 
class(my.list)
str(my.list)
# Output with groups of vectors
my.list

# Object of class list (nested)
my.list2 <- list(a = list(num = 1:3, let = c("a", "b")), b = mat, d = c(4, 5, "6d", 7))
class(my.list2)
str(my.list2)
# Output with hierarchical groups of vectors
my.list2

# Object of class data.frame
my.df <- data.frame(num = 1:3, let = c("a", "b", "c"))
class(my.df)
str(my.df)
# Output with rows and columns, values in different columns may have different modes
my.df

# Create factor vector for eye color
eye.colors <- factor(c("blue", "blue", "green", "brown", "green", "brown"))
eye.colors

# Re-order the levels from lightest to darkest.
eye.colors2 <- factor(eye.colors, levels = c("blue", "green", "brown")) 
eye.colors2

# Create character vector of dates
date.german <- c("11 August 2023", "14 August 2023")
date.german

# Convert to dates
date.iso <- as.Date(date.german, format = "%d %B %Y")
date.iso
date.posix <- as.POSIXlt(date.german, format = "%d %B %Y")
date.posix

9 Manipulate objects

That’s where you start working!

This section addresses the manipulation of objects, and especially how to subset objects. It is essential because you will not want to always apply operations on all columns/rows of a table and you will need to know how to extract/select the ones you want.

This section is organized per class because each class behaves differently. It is therefore essential to identify the class of an object. The mode, however, is not that important in this case.

There should be enough examples and exercises in this document to learn to manipulate all objects. But do not hesitate to run through the tutorial again. You should really get fluent with these aspects before you delve deeper into R.


9.1 Vectors

This section concerns manipulation of vectors of any mode, but also of vectors of dates and factors.

Create two vectors using the function : (for the help page, type ?":"), (1) a vector x containing the integers from 1 to 50, and (2) y with integers from 50 to 1. Display them in the console. What do you think the numbers in the square brackets represent?

Answer:

x <- 1:50
x
y <- 50:1
y

The numbers in square brackets indicate the position in the vector of the element displayed right after. For x, [16] means that the element displayed directly on its right (here the number 16) is the 16th element of the vector x. This is meaningless in this case, but for y, the 16th element is the number 35, which is less obvious. The positions are called indices. These indices are at the beginning of each line.
Lines are of varying length depending on the width of your console, so different indices will be displayed from one computer to another. But this does not alter the data.

Knowing this, how would you subset any element of a vector, for example element #47 of x and y?

Answer:

x[47]
y[47]

The single square bracket subsets one or several elements of a vector. It might not look like it, but the square bracket is actually a function (type ?"[" for the help page).

Usage: type the name of the object you want to subset followed by square brackets []. Between the brackets, give a vector of indices (or names, see below).

Exercises:

  • Subset elements #44 to 47 from x and y
  • Subset elements #44, 46 and 48 from x and y
  • Subset all elements from x and y but #45 (negative indices are excluded from subsetting)
  • Subset all elements from x and y but #1 and 45
  • Subset all elements from x and y but #1 to 3
  • Compute the mean of the first 45 elements of x and y
  • Compute the mean of x and y ignoring element #8

Answers:

x[44:47]
y[44:47]

x[c(44, 46, 48)]
y[c(44, 46, 48)]

x[-45]
y[-45]

x[-c(1, 45)]
y[-c(1, 45)]

x[-(1:3)] # Brackets are essential here; try without and understand the error!
y[-(1:3)]

mean(x[1:45])
mean(y[1:45])

mean(x[-8])
mean(y[-8])
  • Names can be associated with values; these names can then be used instead of indices. Start by displaying x and its structure to see the initial state: x, str(x).
    Give names to each element of x (it is not possible to name only some elements): names(x) <- paste("n", 1:50, sep = ""). The function names() displays, sets or modifies the names of the elements of an object, and the function paste() combines characters together.
    Check the results: x, str(x)
    Subset value with the name ‘n5’ from x.
    Subset values whose names are ‘n5’, ‘n6’ and ‘n7’ from ‘x’.

Answers:

x
str(x)
names(x) <- paste("n", 1:50, sep = "")
x
str(x)
x['n5']
x["n5"]
x[c("n5", "n6", "n7")] # It is not possible to use the function ":" to create a character sequence. 
                       # But you can use 'paste()'!
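# As a small extra illustration (not part of the original exercise),
# paste() can build the vector of names directly:
x[paste("n", 5:7, sep = "")]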

9.2 Matrices and arrays

Create a matrix mat filled with integers from 1 to 10, with 5 rows and 2 columns using the function matrix(), and display it in the console.

Using the output, try to understand how to subset a matrix.

Answer:

mat <- matrix(1:10, nrow = 5, ncol = 2)
mat

The single square bracket subsets rows/columns of a matrix.

Usage: type the name of the object you want to subset followed by square brackets. Between the square brackets, start by typing a comma [ , ] (if you do not do it automatically at the beginning, you may forget it; believe me, it happens often!). Within the brackets, give a vector of indices for rows before the comma, and a vector of indices for columns after the comma. If you want to subset all rows or columns, leave it blank before or after the comma, respectively.

Exercises:

  • Subset the value in the 2nd row and 1st column of ‘mat’
  • Subset the whole 2nd row of ‘mat’, in other words, all the columns of the 2nd row. Check the structure of the output.
  • Subset the whole 1st column of ‘mat’, in other words, all the rows of the 1st column. Check the structure of the output.
  • Subset the 2nd and 3rd rows and the 1st column of ‘mat’. Check the structure of the output.
  • Subset rows 2-3 and columns 1-2 of ‘mat’. Check the structure of the output.
  • Subset all rows except the 2nd, and the 1st column of ‘mat’. Check the structure of the output.
  • Subset rows 1 and 3, and column 1 of ‘mat’. Check the structure of the output.
  • Name rows and columns of ‘mat’: dimnames(mat) <- list(paste("row", 1:5, sep = ""), paste("col", 1:2, sep = "")). Names (dimnames()) is assigned a list of 2 elements: one character vector for row names, and another one for column names.
    Subset rows named ‘row1’ and ‘row3’ from ‘mat’. Check the structure of the output.
  • Create a vector named z containing 2 values: (1) the value of the 2nd row and 1st column of ‘mat’, and (2) the value of the 3rd row and 2nd column of ‘mat’.
  • Create a matrix mat2 with 2 rows and 2 columns composed of the values (1) of the first two rows and the 1st column of ‘mat’, and (2) of rows 3-4 and column 2 of ‘mat’.

Answers:

mat[2, 1]

mat[2, ]
str(mat[2, ]) # An integer vector

mat[, 1]
str(mat[, 1]) # An integer vector

mat[2:3, 1]
str(mat[2:3, 1]) # An integer vector

mat[2:3, 1:2]
mat[2:3, ]
str(mat[2:3, ]) # An integer matrix 
                # (the lengths of the 2 dimensions are given in the square brackets after 'int')

mat[-2, 1]
str(mat[-2, 1]) # An integer vector

mat[c(1, 3), 1]
str(mat[c(1, 3), 1]) # An integer vector

dimnames(mat) <- list(paste("row", 1:5, sep = ""), paste("col", 1:2, sep = ""))
mat[c("row1", "row3"), ]
str(mat[c("row1", "row3"), ]) # An integer matrix

z <- c(mat[2, 1], mat[3, 2])
z

mat2 <- matrix(c(mat[1:2, 1], mat[3:4, 2]), nrow = 2, ncol = 2)
mat2

If possible (i.e. when a single column/row is selected), dimensions will be dropped and the output will be returned as a vector.
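
If you want to keep the matrix structure in that case, the square-bracket function has an argument drop (see ?"["):

mat[, 1, drop = FALSE]      # Remains a 5x1 matrix instead of being simplified to a vector
str(mat[, 1, drop = FALSE])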

  • Create an array arr with integers 1-20, with 5 rows, 2 columns and 2 levels in the 3rd dimension, using the function array(), and display it in the console.

Answer:

arr <- array(1:20, dim = c(5, 2, 2))
arr

Subset the 3rd row, 1st column, and 2nd level of the 3rd dimension of ‘arr’.

Answer:

arr[3, 1, 2]

It works like with matrices, but with more commas (except if you have a 1D array) to separate the indices of each dimension.


9.3 Lists

Create a list mylist composed of 3 elements: (1) a list composed of an integer vector with values 1-3 and a character vector with values a and b, (2) a matrix mat, and (3) a vector composed of the values 4, 5, 6d and 7. For this, use the function list().
Display the list and its structure.

Thanks to the output, try to understand how to subset values from a list.

Answer:

mylist <- list(list(1:3, c("a", "b")), mat, c(4, 5, "6d", 7))
mylist
str(mylist) # Note the hierarchical structure

The single square bracket extracts a list composed of one or several elements from a list. The double square bracket extracts one element from a list.

Usage: type the name of the object you want to subset followed by square brackets. Between double square brackets [[]], give the index of the element you want to extract. It is not possible to select several elements at once with double square brackets. To select several elements, the output must be a list (this guarantees it always works, even if your specific case might not require it); you then have to use single square brackets instead of double ones. Between single square brackets [], give a vector of indices.

Exercises:

  • Extract the 1st element from ‘mylist’.
  • Extract a list containing the 1st element of ‘mylist’.
  • Extract a list containing the 1st and 3rd elements of ‘mylist’.
  • Extract the 1st element from the 1st element of ‘mylist’.
  • Extract the 1st element from the 1st element of the 1st element of ‘mylist’.
  • Extract the 1st column from the 2nd element of ‘mylist’.
  • Extract the 3rd row from the 2nd element of ‘mylist’.
  • Extract the 3rd element from the 3rd element of ‘mylist’.
  • Create a vector w containing 2 elements: (1) the 2nd row and 1st column from the 2nd element of ‘mylist’, and (2) the 2nd element from the 1st element of the 1st element of ‘mylist’.
  • Name the 3 elements of ‘mylist’, as well as the 2 elements of the 1st element of ‘mylist’:
    names(mylist) <- c("List1", "MAT", "VEC") and names(mylist[[1]]) <- c("L1vec1", "L1vec2")
    Extract the element named MAT from mylist.
    Extract the element named L1vec1 from mylist.

Answers:

mylist[[1]]
mylist[1]
mylist[c(1, 3)]

mylist[[1]][[1]]
mylist[[1]][[1]][1]

mylist[[2]][, 1]
mylist[[2]][3, ]

mylist[[3]][3]

w <- c(mylist[[2]][2, 1], mylist[[1]][[1]][2])

names(mylist) <- c("L1", "MAT", "VEC")
names(mylist[[1]]) <- c("L1vec1", "L1vec2")
mylist
mylist[["MAT"]]
mylist[["L1"]][["L1vec1"]]

Lists are hierarchical and can store objects of any class. To know how to extract any element from a list, it is therefore necessary to know the class of this particular element. And this goes for any element of any level within the structure of the list. The required method to subset (i.e. single or double square brackets, number of commas…) depends on the class as seen previously.

The symbol $ can be substituted for double square brackets in a named list. You can then use the elements’ names without quotation marks. Do not think that this method is better or easier. It is rarely used in scripts/functions that can be applied to many situations, but it is handy in an interactive session.

Examples:
mylist$MAT

     col1 col2
row1    1    6
row2    2    7
row3    3    8
row4    4    9
row5    5   10

mylist$L1$L1vec1

[1] 1 2 3

This article provides an overview of the differences between [, [[ and $.


9.4 Data.frames

Data.frames are some of the most used classes in R. They are particular lists, and because of this they have specific methods for subsetting. This therefore warrants a special section.

9.4.1 Simple subsetting

Create a data.frame mydf with 3 columns and 10 rows (whose names are to be “r1”, “r2”, …, “r10”) with the function data.frame(): (1) column A composed of integers 1 to 10, (2) column B with 10 random values sampled from a normal distribution (use the function rnorm()), and (3) column D with 5 “e” and 5 “f” (the function rep(..., each = 5) can be handy). Display the data.frame and its structure.

Answer:

mydf <- data.frame(A = 1:10, B = rnorm(10), D = rep(c("e", "f"), each = 5), 
                   row.names = paste("r", 1:10, sep = ""))
mydf
str(mydf)

To extract values from a data.frame, the list method is of course applicable. But it is also possible to use the matrix method, which is more flexible and more intuitive.

Exercises:

  • Extract the 2nd column of ‘mydf’ in five different ways.
  • Extract rows 2-3 of column 2 in three different ways.

Answers:

mydf[[2]]   # List method
mydf[, 2]   # Matrix method
mydf[["B"]] # Named list method
mydf[, "B"] # Named matrix method
mydf$B      # $ method

mydf[[2]][2:3]            # List method
mydf[2:3, 2]              # Matrix method
mydf[c("r2", "r3"), "B"]  # Named matrix method

From here on, there are no more exercises as such. You can then ‘show all code’ with the ‘Code’ button at the top right of the page in order to show the code by default. This will make things easier, and you will have the color code!
Here is the legend to the colors: black = object, blue = character, light blue = comments, green = TRUE/FALSE, green bold = function, light green = number, red = argument.

But make sure you try out more than the examples given below!


9.4.2 Subsetting based on values in one or several columns or rows

You will probably often need to subset rows of a data.frame based on the values from a column.
For example, extract the rows with ‘e’ in column ‘D’.

The logic is not really intuitive but it is not complicated either. If you decompose it, it looks like: (1) in mydf, (2) select (3) rows (4) whose values are equal to (5) ‘e’ (6) in column ‘D’.
Written step by step, this gives:
(1) mydf
(2) mydf[ , ]
(3) mydf[rows, ]
(4) mydf[rows == value, ] # Equality is tested with ==, to differentiate it from the single = for arguments
(5) mydf[rows == "e", ]
(6) mydf[mydf$D == "e", ] # It is of course possible to substitute mydf$D with other methods (e.g. mydf[[3]], mydf[["D"]]).

mydf[mydf$D == "e", ]
   A          B D
r1 1  0.2319318 e
r2 2  0.3689124 e
r3 3  2.1507218 e
r4 4 -0.7026977 e
r5 5  1.1290876 e

To extract rows whose values are not ‘e’ in column ‘D’, use the symbol ‘different from’ !=:

mydf[mydf$D != "e", ]
     A          B D
r6   6  0.5111928 f
r7   7 -0.1690769 f
r8   8  1.6925666 f
r9   9  0.3234970 f
r10 10  0.1638051 f

To extract several values, you need to use the symbol %in% instead of ==. Examples:

mydf[mydf$A %in% c(1, 3, 5), ]
   A         B D
r1 1 0.2319318 e
r3 3 2.1507218 e
r5 5 1.1290876 e
mydf[row.names(mydf) %in% c("r1", "r10"), ]
     A         B D
r1   1 0.2319318 e
r10 10 0.1638051 f

To extract rows depending on values from several columns, you need to use the Boolean operators AND (&, all conditions must be true) and OR (|, at least one condition must be true):

mydf[mydf$D == "e" & mydf$A %in% c(1, 3, 5), ]
   A         B D
r1 1 0.2319318 e
r3 3 2.1507218 e
r5 5 1.1290876 e
mydf[mydf$D == "e" | mydf$A %in% c(8, 10), ]
     A          B D
r1   1  0.2319318 e
r2   2  0.3689124 e
r3   3  2.1507218 e
r4   4 -0.7026977 e
r5   5  1.1290876 e
r8   8  1.6925666 f
r10 10  0.1638051 f

These manipulations obviously work the same way on columns (after the comma within the single square brackets).

9.4.3 Add columns and do maths on columns

To add a column to ‘mydf’, you just need to assign a vector to an extra column:

mydf[[4]] <- rnorm(10)
mydf
     A          B D         V4
r1   1  0.2319318 e  1.2093829
r2   2  0.3689124 e  0.8456566
r3   3  2.1507218 e -0.9094577
r4   4 -0.7026977 e -0.6716173
r5   5  1.1290876 e  1.5026983
r6   6  0.5111928 f -0.5668415
r7   7 -0.1690769 f  0.2250237
r8   8  1.6925666 f -0.2985337
r9   9  0.3234970 f  0.4151644
r10 10  0.1638051 f -0.5843548

You can also directly name this extra column:

mydf[["E"]] <- rnorm(10)
mydf
     A          B D         V4          E
r1   1  0.2319318 e  1.2093829 -0.5513008
r2   2  0.3689124 e  0.8456566 -0.1602728
r3   3  2.1507218 e -0.9094577  0.4213416
r4   4 -0.7026977 e -0.6716173 -0.5856752
r5   5  1.1290876 e  1.5026983  1.1115035
r6   6  0.5111928 f -0.5668415 -1.1195179
r7   7 -0.1690769 f  0.2250237 -0.3262478
r8   8  1.6925666 f -0.2985337  0.2864604
r9   9  0.3234970 f  0.4151644  0.4986364
r10 10  0.1638051 f -0.5843548  1.9801920

The length of the vector should of course be equal to the number of rows of the data.frame.

To add a column equal to e.g. the difference between columns 1 and 2, you just need to do that:

mydf[[6]] <- mydf[[1]] - mydf[[2]]
mydf
     A          B D         V4          E        V6
r1   1  0.2319318 e  1.2093829 -0.5513008 0.7680682
r2   2  0.3689124 e  0.8456566 -0.1602728 1.6310876
r3   3  2.1507218 e -0.9094577  0.4213416 0.8492782
r4   4 -0.7026977 e -0.6716173 -0.5856752 4.7026977
r5   5  1.1290876 e  1.5026983  1.1115035 3.8709124
r6   6  0.5111928 f -0.5668415 -1.1195179 5.4888072
r7   7 -0.1690769 f  0.2250237 -0.3262478 7.1690769
r8   8  1.6925666 f -0.2985337  0.2864604 6.3074334
r9   9  0.3234970 f  0.4151644  0.4986364 8.6765030
r10 10  0.1638051 f -0.5843548  1.9801920 9.8361949

R will apply the subtraction row by row.
Here too, it is possible to name the new column directly during its assignment as shown above.

Subsetting methods are interchangeable ($, [[index]], [["name"]], [ , ]).

The preceding exercises should be enough to show you the different methods for subsetting a data.frame. But you should create new data.frames and try to extract any part so that you can subset easily. Practice is key here!

9.4.4 Reorganize columns and rows

To reorganize columns, you just need to specify the column order in a subsetting command.
For example, if you want columns D, A, B, V4, E, V6:

mydf2 <- mydf[, c("D", "A", "B", "V4", "E", "V6")]
mydf2
    D  A          B         V4          E        V6
r1  e  1  0.2319318  1.2093829 -0.5513008 0.7680682
r2  e  2  0.3689124  0.8456566 -0.1602728 1.6310876
r3  e  3  2.1507218 -0.9094577  0.4213416 0.8492782
r4  e  4 -0.7026977 -0.6716173 -0.5856752 4.7026977
r5  e  5  1.1290876  1.5026983  1.1115035 3.8709124
r6  f  6  0.5111928 -0.5668415 -1.1195179 5.4888072
r7  f  7 -0.1690769  0.2250237 -0.3262478 7.1690769
r8  f  8  1.6925666 -0.2985337  0.2864604 6.3074334
r9  f  9  0.3234970  0.4151644  0.4986364 8.6765030
r10 f 10  0.1638051 -0.5843548  1.9801920 9.8361949

Or with indices:

mydf3 <- mydf[, c(3, 1, 2, 4:6)]
mydf3
    D  A          B         V4          E        V6
r1  e  1  0.2319318  1.2093829 -0.5513008 0.7680682
r2  e  2  0.3689124  0.8456566 -0.1602728 1.6310876
r3  e  3  2.1507218 -0.9094577  0.4213416 0.8492782
r4  e  4 -0.7026977 -0.6716173 -0.5856752 4.7026977
r5  e  5  1.1290876  1.5026983  1.1115035 3.8709124
r6  f  6  0.5111928 -0.5668415 -1.1195179 5.4888072
r7  f  7 -0.1690769  0.2250237 -0.3262478 7.1690769
r8  f  8  1.6925666 -0.2985337  0.2864604 6.3074334
r9  f  9  0.3234970  0.4151644  0.4986364 8.6765030
r10 f 10  0.1638051 -0.5843548  1.9801920 9.8361949

It works the same way to reorganize rows.

The function order() is used to sort a data.frame (i.e. to reorganize its rows according to values from one or several columns): it returns the row indices in sorted order, which are then passed between the square brackets.
For example, to sort mydf2 based on increasing values from column D (i.e. alphabetic order) and then on decreasing values from column A:

mydf2[order(mydf2$D, -mydf2$A), ]
    D  A          B         V4          E        V6
r5  e  5  1.1290876  1.5026983  1.1115035 3.8709124
r4  e  4 -0.7026977 -0.6716173 -0.5856752 4.7026977
r3  e  3  2.1507218 -0.9094577  0.4213416 0.8492782
r2  e  2  0.3689124  0.8456566 -0.1602728 1.6310876
r1  e  1  0.2319318  1.2093829 -0.5513008 0.7680682
r10 f 10  0.1638051 -0.5843548  1.9801920 9.8361949
r9  f  9  0.3234970  0.4151644  0.4986364 8.6765030
r8  f  8  1.6925666 -0.2985337  0.2864604 6.3074334
r7  f  7 -0.1690769  0.2250237 -0.3262478 7.1690769
r6  f  6  0.5111928 -0.5668415 -1.1195179 5.4888072

Note that the function sort() works only on vectors.
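
A quick sketch of the difference (using mydf2 from above):

sort(mydf2$A, decreasing = TRUE)  # Works: column A is a vector
order(mydf2$A, decreasing = TRUE) # Returns row indices, to be used between square brackets
# sort() is not meant for whole data.frames; use order() as shown above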

The base R way of doing things gets complicated quickly. In many cases, the functions from the package dplyr make your life easier, but they cannot use indices to subset data.frames, so they are not always better. See R for Data Science for details.


10 Summary

10.1 Modes, classes, data objects, functions and arguments

It is important to be able to easily identify modes and classes, as well as to differentiate between data objects, functions and arguments, in order to be able to use R.

Here is a summary:

  • Modes: if the output is quoted, then it is character. If not: if it is made of numbers, then it is numeric; if it is TRUE/FALSE, then it is logical. Pretty easy, right? The function mode() can also be handy.

  • Classes: the output should allow you to identify the different classes. Check the previous examples. The functions class() and especially str() should be used without moderation.

  • Data objects: their names are unquoted character strings and they contain data.

  • Functions: they are also objects, but special ones. If you type the name of a function, you will display the sequence of operations that this function will run:
    mean
    But in general, you want to run these operations. In this case, use brackets to group and set arguments:
    mean(1:10)
    Brackets are therefore essential to identify functions.

  • Arguments : arguments are the options/settings of a function. They are all encased within the function’s brackets, are separated by commas, and the symbol = assigns values to arguments. They can be used through their names or in the ordered defined in the function.

Commas therefore have only one use in R: they separate arguments of functions ([ and [[ are actually functions).
I recommend again using = only for arguments; favor <- to perform assignment.
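
For example, these two calls are equivalent: the first one sets the argument probs by name, the second one by position (probs is the second argument of quantile()):

quantile(1:10, probs = 0.5)
50% 
5.5 
quantile(1:10, 0.5)
50% 
5.5 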


10.2 Brackets, square brackets and braces

Each type has a specific use.

  • Brackets (): they enclose the function’s call, and as such, the arguments. Even if no argument is necessary, do not forget the brackets! So brackets = function and vice-versa!
  • Square brackets []: a function shortcut to subset an object. Depending on the class of the object and on the desired class of the output, single [] or double [[]] square brackets are required. To get help on them, type ?"[" in the console. So subset = square brackets and vice-versa!
  • Braces {}: they are used to group several commands, especially to define functions (see section How to write a function) and for control-flow constructs (see section Control-flow constructs). All three types are combined in the small example below.
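
Here is a small made-up illustration combining all three (the function first_positive() is invented for this example only):

# Braces group the two commands of the function's body
first_positive <- function(x) {
  sorted <- sort(x[x > 0])    # square brackets subset the vector
  sorted[1]                   # ... and extract the smallest positive value
}
first_positive(c(-2, 5, 3))   # brackets call the function
[1] 3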

Sections 11-16 take a big step forward. Be sure that you have mastered the concepts of sections 6-9 before you continue. For the following sections, you will probably also need more documentation and practice than what is offered here. And you will definitely need to check the help pages of several functions, because I assume that you can do that by now.
So let us say the following sections of this tutorial are just very rough introductions to the concepts addressed!


11 Reading data into R

Before going into statistics and graphics, let us start with the most important (and difficult) requirement: getting your data into R!

Importing data into R is always a difficult, yet crucial, step. Data of any sort can be imported into R, but let us focus on tables containing variables in columns and samples in rows. For example:

   let         num
1    a  0.97709397
2    a -0.07401789
3    a -0.09351858
4    a  0.23937810
5    a  0.18704214
6    b  0.41971483
7    b  0.41105680
8    b  1.28964740
9    b -0.33542246
10   b  1.09085699

Most of the time, different columns contain data of different modes. Therefore, we are talking about data.frames here.

Tables can be imported into data.frames from different formats. Base R can read in text (*.txt) and comma-separated values (*.csv) files; the values can be separated by commas (standard CSV), semi-colons (the CSV variant used where the comma is the decimal mark), spaces or tabs (TXT)…
The functions to read in these files are read.table() for TXT files and read.csv() or read.csv2() for CSV files with comma and semi-colon as field separator, respectively. The latter two functions are based on read.table(); just the default arguments are different.


11.1 Create some data files

Obviously, in a real setting, you will have data that you import into R. But in our case, we first need data to work with. So let us create some data files.

We will look at the functions to read in the data later; here, we will use the equivalent functions to write the data to files.

For now, run the code below to create some data:

# Set seed so that we all get the same "random" values
set.seed(123)

# Create vectors of data with 10 elements each
my_sample <- paste("MON", 1:10, sep = "-")
my_layer <- rep(c("A", "B"), each = 5)
my_type <- rep(c("flake","core"), 5)
my_length <- rnorm(10, 5) 
my_width <- rnorm(10, 3) 
my_thickness <- rnorm(10, 1) 

# Combine into a data.frame
my_df <- data.frame(Sample = my_sample, Layer = my_layer, Type = my_type, Length = my_length,
                    Width = my_width, Thickness = my_thickness)
my_df
   Sample Layer  Type   Length    Width   Thickness
1   MON-1     A flake 4.439524 4.224082 -0.06782371
2   MON-2     A  core 4.769823 3.359814  0.78202509
3   MON-3     A flake 6.558708 3.400771 -0.02600445
4   MON-4     A  core 5.070508 3.110683  0.27110877
5   MON-5     A flake 5.129288 2.444159  0.37496073
6   MON-6     B  core 6.715065 4.786913 -0.68669331
7   MON-7     B flake 5.460916 3.497850  1.83778704
8   MON-8     B  core 3.734939 1.033383  1.15337312
9   MON-9     B flake 4.313147 3.701356 -0.13813694
10 MON-10     B  core 4.554338 2.527209  2.25381492
str(my_df)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : chr  "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr  "A" "A" "A" "A" ...
 $ Type     : chr  "flake" "core" "flake" "core" ...
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...

Now let us write this data.frame to three files in three formats: TXT, CSV and XLSX.

# Create folder "Data" in the current working directory
# You will get a warning if this folder already exists
dir.create("Data")
Warning in dir.create("Data"): 'Data' already exists
# Write to TXT with tabs as separator ("\t")
write.table(my_df, file = "Data/Data_tutorial.txt", sep = "\t", row.names = FALSE)

# Write to CSV
write.csv(my_df, file = "Data/Data_tutorial.csv", row.names = FALSE)

# Write to XLSX using package "writexl"
writexl::write_xlsx(my_df, path = "Data/Data_tutorial.xlsx")

# Write to ODS using package "readODS"
readODS::write_ods(my_df, path = "Data/Data_tutorial.ods")

Note that the files have been saved in a newly created “Data” folder within the working directory (see getwd() and section Data, scripting, projects and repeatability). Keep them there and do not change the working directory. If you do, make sure you edit the file/path argument of the functions so that it includes the whole path to the file.

The :: in the last two lines indicates that the functions write_xlsx() and write_ods() are included in the packages writexl and readODS, respectively. Writing it that way avoids the need to load the packages via library(writexl) and library(readODS).


11.2 read.table()

First, its most important arguments:

  • file: the file to read, with full path (or better, relative path, see section Data, scripting, projects and repeatability)
  • header: whether the columns have headers (default=FALSE for read.table(), and TRUE for read.csv() and read.csv2())
  • sep: field separator character, usually space or tab for TXT files, comma/semi-colon for CSV files
  • dec: the character used in the file for decimal points (default=. for read.table() and read.csv(), and , for read.csv2())
  • colClasses: a vector of classes to be assumed for the columns; possible values: NA (R will figure it out itself, default), logical, integer, numeric, character, factor, Date
  • skip: the number of lines of the data file to skip before beginning to read data (default=0)
  • stringsAsFactors: whether character vectors should be converted to factors (default=FALSE from R 4.0.0)

Now, some examples using the data files provided.

First with ‘Data_tutorial.txt’. This is the code to read it in:

df_txt1 <- read.table("Data/Data_tutorial.txt", header = TRUE, sep = "\t", 
                     colClasses = c(rep("character", 3), rep("numeric", 3)))

It has headers, it is tab-separated and the columns 1-3 should be characters, while columns 4-6 should be numeric. The default values for the other arguments are appropriate.
Let us check the results (str() is crucial here because data might be of a different mode than planned if there were some unwanted characters in the file):

df_txt1 
   Sample Layer  Type   Length    Width   Thickness
1   MON-1     A flake 4.439524 4.224082 -0.06782371
2   MON-2     A  core 4.769823 3.359814  0.78202509
3   MON-3     A flake 6.558708 3.400771 -0.02600445
4   MON-4     A  core 5.070508 3.110683  0.27110877
5   MON-5     A flake 5.129288 2.444159  0.37496073
6   MON-6     B  core 6.715065 4.786913 -0.68669331
7   MON-7     B flake 5.460916 3.497850  1.83778704
8   MON-8     B  core 3.734939 1.033383  1.15337312
9   MON-9     B flake 4.313147 3.701356 -0.13813694
10 MON-10     B  core 4.554338 2.527209  2.25381492
str(df_txt1)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : chr  "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr  "A" "A" "A" "A" ...
 $ Type     : chr  "flake" "core" "flake" "core" ...
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...

R is usually able to find out the best types/classes on its own:

df_txt2 <- read.table("Data/Data_tutorial.txt", header = TRUE, sep = "\t")
identical(df_txt1, df_txt2)
[1] TRUE

11.3 read.csv()

Then with ‘Data_tutorial.csv’. This is the code to read it in:

df_csv <- read.csv("Data/Data_tutorial.csv", header = TRUE)

It has headers, it is comma-separated (so read.csv(), rather than read.csv2(), is appropriate) and the columns 1-3 should be characters, while columns 4-6 should be numeric. The default values for the other arguments are appropriate.
Let us check the results:

df_csv 
   Sample Layer  Type   Length    Width   Thickness
1   MON-1     A flake 4.439524 4.224082 -0.06782371
2   MON-2     A  core 4.769823 3.359814  0.78202509
3   MON-3     A flake 6.558708 3.400771 -0.02600445
4   MON-4     A  core 5.070508 3.110683  0.27110877
5   MON-5     A flake 5.129288 2.444159  0.37496073
6   MON-6     B  core 6.715065 4.786913 -0.68669331
7   MON-7     B flake 5.460916 3.497850  1.83778704
8   MON-8     B  core 3.734939 1.033383  1.15337312
9   MON-9     B flake 4.313147 3.701356 -0.13813694
10 MON-10     B  core 4.554338 2.527209  2.25381492
str(df_csv)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : chr  "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr  "A" "A" "A" "A" ...
 $ Type     : chr  "flake" "core" "flake" "core" ...
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...

Both functions produce the same results:

identical(df_txt1, df_csv)
[1] TRUE

11.4 Reading XLS or XLSX files directly

While CSV is the best choice among open formats, it has limitations (single-sheet files only). XLSX is a good alternative in that it is kind of an open format (but see here).
There are of course also functions to read in XLS or XLSX files directly, but they are not part of base R.

One of the best is the function read_excel() from the package readxl. It can read both XLS and XLSX files. The corresponding function writexl::write_xlsx() can write to XLS/XLSX files (we used it before, see section Create some data files).
It is also very straightforward to use and does not require any other packages, plug-ins or external functionalities:

library(readxl)
df_xlsx_readxl <- readxl::read_excel("Data/Data_tutorial.xlsx")
df_xlsx_readxl
# A tibble: 10 × 6
   Sample Layer Type  Length Width Thickness
   <chr>  <chr> <chr>  <dbl> <dbl>     <dbl>
 1 MON-1  A     flake   4.44  4.22   -0.0678
 2 MON-2  A     core    4.77  3.36    0.782 
 3 MON-3  A     flake   6.56  3.40   -0.0260
 4 MON-4  A     core    5.07  3.11    0.271 
 5 MON-5  A     flake   5.13  2.44    0.375 
 6 MON-6  B     core    6.72  4.79   -0.687 
 7 MON-7  B     flake   5.46  3.50    1.84  
 8 MON-8  B     core    3.73  1.03    1.15  
 9 MON-9  B     flake   4.31  3.70   -0.138 
10 MON-10 B     core    4.55  2.53    2.25  
str(df_xlsx_readxl)
tibble [10 × 6] (S3: tbl_df/tbl/data.frame)
 $ Sample   : chr [1:10] "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr [1:10] "A" "A" "A" "A" ...
 $ Type     : chr [1:10] "flake" "core" "flake" "core" ...
 $ Length   : num [1:10] 4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num [1:10] 4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num [1:10] -0.0678 0.782 -0.026 0.2711 0.375 ...

There are some noteworthy differences though:

  • The header argument is replaced by col_names.
  • The colClasses argument is replaced by col_types. Note that factor is not a valid value anymore, i.e. character columns will be read as text (character) without an option to convert to factors directly.
  • There is a new sheet argument (XLS(X) files can contain several sheets, CSV cannot).

Note that for some functions, it might be necessary to convert the output to a ‘pure’ data.frame (read_excel() reads into a tibble, i.e. an object of classes tbl_df, tbl and data.frame at the same time):

df_xlsx_readxl_df <- data.frame(df_xlsx_readxl)
str(df_xlsx_readxl_df)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : chr  "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr  "A" "A" "A" "A" ...
 $ Type     : chr  "flake" "core" "flake" "core" ...
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...

Check the RStudio Cheat Sheet on Data import with the tidyverse.


Another interesting function is read_xlsx() from the package openxlsx2. It cannot read XLS files, only XLSX (but you should prefer XLSX files anyway because they are more or less an open format).
It is possible to specify the classes of the columns with the argument types, although it is quite cumbersome.
Like readxl::read_excel(), it cannot convert characters to factors.
The package also contains the function write_xlsx() to save an object to an XLSX file.

Here is how to use it:

library(openxlsx2)
df_xlsx_openxlsx <- openxlsx2::read_xlsx("Data/Data_tutorial.xlsx")
df_xlsx_openxlsx
   Sample Layer  Type   Length    Width   Thickness
2   MON-1     A flake 4.439524 4.224082 -0.06782371
3   MON-2     A  core 4.769823 3.359814  0.78202509
4   MON-3     A flake 6.558708 3.400771 -0.02600445
5   MON-4     A  core 5.070508 3.110683  0.27110877
6   MON-5     A flake 5.129288 2.444159  0.37496073
7   MON-6     B  core 6.715065 4.786913 -0.68669331
8   MON-7     B flake 5.460916 3.497850  1.83778704
9   MON-8     B  core 3.734939 1.033383  1.15337312
10  MON-9     B flake 4.313147 3.701356 -0.13813694
11 MON-10     B  core 4.554338 2.527209  2.25381492
str(df_xlsx_openxlsx)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : chr  "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr  "A" "A" "A" "A" ...
 $ Type     : chr  "flake" "core" "flake" "core" ...
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...

Both functions produce the same results as read.table() and read.csv(); only default row.names are different with openxlsx2::read_xlsx():

all.equal(df_txt1, df_xlsx_readxl_df)
[1] TRUE
all.equal(df_txt1, df_xlsx_openxlsx)
[1] "Attributes: < Component \"row.names\": Mean relative difference: 0.1818182 >"

Note that comparing with identical() will return FALSE, probably because of floating-point arithmetic (see R FAQ 7.31).
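
A classic illustration of this floating-point behavior (unrelated to file import itself):

0.1 + 0.2 == 0.3
[1] FALSE
all.equal(0.1 + 0.2, 0.3)
[1] TRUE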

Note that I use the calls readxl::read_excel() and openxlsx2::read_xlsx() because the package readxl also contains a function called read_xlsx() that is masked by openxlsx2::read_xlsx() when the package openxlsx2 is loaded.


In all cases, if you want to convert all character columns to factors, you can run this code:

# Identify which columns are of mode character and get their indices
char1 <- which(sapply(df_xlsx_readxl_df, is.character))

# Convert each of these columns to factor
df_xlsx_readxl_df[, char1] <- lapply(char1, function(x) factor(df_xlsx_readxl_df[[x]]))

# Check the results
str(df_xlsx_readxl_df)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : Factor w/ 10 levels "MON-1","MON-10",..: 1 3 4 5 6 7 8 9 10 2
 $ Layer    : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
 $ Type     : Factor w/ 2 levels "core","flake": 2 1 2 1 2 1 2 1 2 1
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...
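
Note that the conversion step can also be written more compactly (char1 is the same as above; the result is identical):

df_xlsx_readxl_df[char1] <- lapply(df_xlsx_readxl_df[char1], factor)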

11.5 Reading ODS files directly

Because XLSX is not really an open format, you should prefer true open formats like ODS, which can be opened with LibreOffice for example.
In R, the package readODS, which actually also includes the function write_ods() to write to an ODS file (we used it before, see section Create some data files), works well.
It is not as efficient as readxl/writexl, so this might be an issue with large data files, although it seems to have improved quite recently (see e.g. issues #49 and #71).
Unlike the functions mentioned above to read XLSX files, the function read_ods() can convert directly to factors, if this is wanted.

Here is how to use it:

library(readODS)
df_ods <- read_ods("Data/Data_tutorial.ods")
df_ods
# A tibble: 10 × 6
   Sample Layer Type  Length Width Thickness
   <chr>  <chr> <chr>  <dbl> <dbl>     <dbl>
 1 MON-1  A     flake   4.44  4.22   -0.0678
 2 MON-2  A     core    4.77  3.36    0.782 
 3 MON-3  A     flake   6.56  3.40   -0.0260
 4 MON-4  A     core    5.07  3.11    0.271 
 5 MON-5  A     flake   5.13  2.44    0.375 
 6 MON-6  B     core    6.72  4.79   -0.687 
 7 MON-7  B     flake   5.46  3.50    1.84  
 8 MON-8  B     core    3.73  1.03    1.15  
 9 MON-9  B     flake   4.31  3.70   -0.138 
10 MON-10 B     core    4.55  2.53    2.25  
str(df_ods)
tibble [10 × 6] (S3: tbl_df/tbl/data.frame)
 $ Sample   : chr [1:10] "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr [1:10] "A" "A" "A" "A" ...
 $ Type     : chr [1:10] "flake" "core" "flake" "core" ...
 $ Length   : num [1:10] 4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num [1:10] 4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num [1:10] -0.0678 0.782 -0.026 0.2711 0.375 ...

By default, the output is a tibble, just like with readxl::read_excel(). To output to a ‘pure’ data.frame, simply change the argument as_tibble to FALSE:

df_ods_df <- read_ods("Data/Data_tutorial.ods", as_tibble = FALSE)
str(df_ods_df)
'data.frame':   10 obs. of  6 variables:
 $ Sample   : chr  "MON-1" "MON-2" "MON-3" "MON-4" ...
 $ Layer    : chr  "A" "A" "A" "A" ...
 $ Type     : chr  "flake" "core" "flake" "core" ...
 $ Length   : num  4.44 4.77 6.56 5.07 5.13 ...
 $ Width    : num  4.22 3.36 3.4 3.11 2.44 ...
 $ Thickness: num  -0.0678 0.782 -0.026 0.2711 0.375 ...

Again, same results as before:

identical(df_txt1, df_ods_df) # data.frames
[1] TRUE
all.equal(df_xlsx_readxl, df_ods) # tibbles
[1] TRUE

12 Descriptive statistics

Only descriptive statistics will be addressed here. Analytical/inferential statistics require a deeper understanding of statistics that is way beyond the scope of this tutorial, as well as way beyond my knowledge. But if you know how to use R and what test you need to apply (ask a statistician for that!), then you should not have any problem applying this test in R.

This section will cover the basic descriptive statistics: sample size, mean, median, mode, quantiles, min/max, variance, and standard deviation.


12.1 Sample size

Sample size is crucial: the interpretations, and their strength, depend a lot on how many samples have been studied. But sample size is too often unreported, especially in studies with a limited number of samples, which is somewhat ironic.

In R, it is very easy to compute the sample size: basically, you just need to count the number of samples in a column of your data.
Using the example data above, you need to know how many elements the column Sample of the data.frame df_csv (or any of the above data.frames) has. This actually works on any column, because all elements of a data.frame (= columns) must have the same length. This is done with the function length():

length(df_csv[["Sample"]])
[1] 10

Alternatively, you can also count the number of rows of your data.frame:

nrow(df_csv)
[1] 10

Of course, in most cases, what is really important is the number of samples per group. In our case, we need to know how many flakes and cores per layer.
Because this question is relevant to all descriptive statistics, it is addressed in section Aggregating.
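
As a quick preview, the base function table() already counts the samples for each combination of groups:

table(df_csv$Layer, df_csv$Type)
    core flake
  A    2     3
  B    3     2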


12.2 Central tendency

In statistics, central tendency refers to a central or typical value for a dataset. This is what normal people would call ‘average’. But there are many ways to describe what this central tendency is. Here I just summarize the main ones. It is not redundant to use them all on the same dataset; on the contrary, each measure corresponds to a different property of the data and as such they complement each other very well.

The arithmetic mean is what people usually have in mind when talking about the average. It is the sum of values of a dataset divided by the number of values.
It is calculated as follows: \[\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i}\]
In R, it is computed with the function mean():

mean(df_csv[["Length"]])
[1] 5.074626

The median is the middle value separating the dataset in two halves with equal numbers of values (samples). It is calculated by sorting the values into ascending order and selecting the value in the middle.
For example with 5 ordered values “1, 1, 2, 5, 6”, the 3rd value (2) is the median, as there are 2 smaller and 2 larger values.
This works well for an odd number of values, but in case of an even number of values, the median is the arithmetic mean of the two middle values.

In R, the function median() does just that:

median(df_csv[["Length"]])
[1] 4.920165

Because there are 10 values, the median is the mean of the 5th (4.7698225) and 6th (5.0705084) values, that is, 4.9201655.

The mode is the most frequent value in a dataset. It is not often useful, which probably explains why R does not have a built-in function to compute it (see this thread for custom functions to calculate the mode). In our case, each value appears only once, so it would be completely useless.
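
For illustration, here is a minimal sketch of such a custom function (stat_mode() is just a made-up name), applied to a toy vector:

# Count the occurrences of each value and return the most frequent one(s)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}
stat_mode(c("a", "b", "b", "c"))
[1] "b"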

These plots make the distinction between mean, median and mode easier to understand.


12.3 Variability

Complementary to the central tendency, a measure of the dispersion around the central tendency is also crucial.
Both variance and standard deviation measure the dispersion of the values around the mean.

The (unbiased) sample variance is the sum of the squared differences between each value (\(x_{i}\)) and the mean (\(\bar{x}\)), divided by sample size (\(n\)) - 1: \[Var({x}) = \sigma^2({x}) = \frac{1}{n-1} \sum_{i=1}^{n}(x_{i} - \bar{x})^2\]

The (corrected) sample standard deviation is the square root of the (unbiased) sample variance: \[\sigma({x}) = \sqrt{\sigma^2({x})}\]

In R, the functions are var() and sd(), respectively:

var(df_csv[["Length"]])
[1] 0.909704
sd(df_csv[["Length"]])
[1] 0.9537841
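
To see the formula at work, the sample variance can also be computed manually; the result matches var():

sum((df_csv[["Length"]] - mean(df_csv[["Length"]]))^2) / (nrow(df_csv) - 1)
[1] 0.909704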

See also section Aggregating to learn how to compute these values per group.


12.4 Distribution

Minimum and maximum values are simple, yet valuable, statistics.

min(df_csv[["Length"]])
[1] 3.734939
max(df_csv[["Length"]])
[1] 6.715065
range(df_csv[["Length"]])
[1] 3.734939 6.715065

The nth quantile (or the n% percentile) of a dataset is the value that cuts off the first n percent of the data values when it is sorted in ascending order. The 50th quantile is the median. Quantiles give an overview of how the data are distributed across the range.
The R function quantile() can compute any quantile. By default it computes the quartiles (i.e. 0, 25, 50, 75 and 100th quantiles; or 0, 25, 50, 75 and 100% percentiles), but it can be adjusted as needed:

quantile(df_csv[["Length"]])
      0%      25%      50%      75%     100% 
3.734939 4.468228 4.920165 5.378009 6.715065 
quantile(df_csv[["Length"]], probs = c(0, 0.33, 0.67, 1))
      0%      33%      67%     100% 
3.734939 4.550894 5.139237 6.715065 

The interquartile range is the distance between the 1st and 3rd quartiles:

IQR(df_csv[["Length"]])
[1] 0.9097813
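
This is indeed the difference between the 75% and 25% quantiles computed above:

unname(quantile(df_csv[["Length"]], 0.75) - quantile(df_csv[["Length"]], 0.25))
[1] 0.9097813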

Skewness is a measure of the asymmetry of the distribution about the mean, while kurtosis is a measure of the tailedness of the distribution.

The package moments offers functions to calculate these values:

library(moments)
skewness(df_csv[["Length"]])
[1] 0.6033432
kurtosis(df_csv[["Length"]])
[1] 2.374519

The summary() function computes the most important statistics at once (“Qu.” in the output is the abbreviation for “quartile”):

summary(df_csv[["Length"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.735   4.468   4.920   5.075   5.378   6.715 

See also section Aggregating to learn how to compute these values per group.


12.5 Aggregating

As mentioned before, all of these statistics make sense only when calculated by group. It is of course possible to apply the functions to subsets of values (cf. section 9.4 Data.frames) for each level of the grouping variables, e.g.:

mean(df_csv[df_csv$Layer == "A" & df_csv$Type == "flake", "Length"])
[1] 5.37584

But R can do more than that, and it is called aggregating.

Before I go on with base R, just a side note on some very popular packages to achieve these tasks. The packages dplyr and data.table, for example, offer a whole range of very efficient functions. dplyr can often be useful and is part of the tidyverse environment, which also makes use of pipes (see section Pipes).

In good old base R, the function aggregate() is used to aggregate.

Using our data from before, let us first compute the mean of the length for each level of “Layer”. This shows how to use aggregate():

aggregate(df_csv[["Length"]], by = df_csv["Layer"], FUN = mean)
  Layer        x
1     A 5.193570
2     B 4.955681

Arguments to the function (in that case, ‘mean’) can be added too if needed, e.g.:

aggregate(df_csv[["Length"]], by = df_csv["Layer"], FUN = mean, trim = 0.5)
  Layer        x
1     A 5.070508
2     B 4.554338

Then, let us compute the mean of the length for each combination of “Layer” and “Type”:

aggregate(df_csv[["Length"]], by = list(df_csv[["Layer"]], df_csv[["Type"]]), FUN = mean)
  Group.1 Group.2        x
1       A    core 4.920165
2       B    core 5.001447
3       A   flake 5.375840
4       B   flake 4.887032

Note that the syntax gets complicated. Because the argument by needs a list, we used only one square bracket in the first example, but two square brackets within the list() call in the second example.

You can also do it that way:

aggregate(df_csv[["Length"]], by = df_csv[, c("Layer", "Type")], FUN = mean)
  Layer  Type        x
1     A  core 4.920165
2     B  core 5.001447
3     A flake 5.375840
4     B flake 4.887032

It is also possible to compute the mean for length, width and thickness for each level of “Layer” and “Type”, all at once:

aggregate(df_csv[c("Length", "Width", "Thickness")], by = list(df_csv[["Layer"]], df_csv[["Type"]]), 
          FUN = mean)
  Group.1 Group.2   Length    Width  Thickness
1       A    core 4.920165 3.235248 0.52656693
2       B    core 5.001447 2.782502 0.90683158
3       A   flake 5.375840 3.356337 0.09371086
4       B   flake 4.887032 3.599603 0.84982505

In these cases, the names of the output columns are not very meaningful, but it can be improved if you change the notation of the by argument:

aggregate(df_csv[c("Length","Width","Thickness")], by = df_csv[, c("Layer", "Type")], FUN = mean)
  Layer  Type   Length    Width  Thickness
1     A  core 4.920165 3.235248 0.52656693
2     B  core 5.001447 2.782502 0.90683158
3     A flake 5.375840 3.356337 0.09371086
4     B flake 4.887032 3.599603 0.84982505

As you can see, the selection of columns implies a lot of typing. The best way to write shorter and nicer code is to use the so-called formula notation (for details, see ?formula and the section of the help page for aggregate() related to the S3 method for class ‘formula’).
A formula gives you an easy and intuitive (IMHO) syntax to specify the columns of a data.frame. It can be used for aggregating, as shown here, but also to specify linear models (but I will not address this topic). It is constructed as follows:

  • It has two parts, separated by the tilde ~.
  • The left side of the formula lists the numerical variables on which to apply the function, bound together with cbind().
  • The right side of the formula lists the grouping variables, separated by +.
  • On each side, the variables are unquoted.
  • The argument data is the input data.frame in which the column names will be looked for.

This is the code:

aggregate(cbind(Length, Width, Thickness) ~ Layer + Type, data = df_csv, FUN = mean)
  Layer  Type   Length    Width  Thickness
1     A  core 4.920165 3.235248 0.52656693
2     B  core 5.001447 2.782502 0.90683158
3     A flake 5.375840 3.356337 0.09371086
4     B flake 4.887032 3.599603 0.84982505

Notice that the names of columns are informative.

The dot . symbol has a special use within a formula: it is used to select all columns not otherwise listed in the formula. This can be very handy:

aggregate(.~ Layer + Type, data = df_csv, FUN = mean)
Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA
[the same warning is repeated for each combination of column and group, 16 times in total]
  Layer  Type Sample Length Width Thickness
1     A  core     NA     NA    NA        NA
2     B  core     NA     NA    NA        NA
3     A flake     NA     NA    NA        NA
4     B flake     NA     NA    NA        NA

In that case though, the mean has also been applied to the column “Sample” (which is character, so mean() only returns NA with a warning), which is nonsense. The easiest way to avoid this is to deselect that column in the data argument:

aggregate(.~ Layer + Type, data = df_csv[-1], FUN = mean)
  Layer  Type   Length    Width  Thickness
1     A  core 4.920165 3.235248 0.52656693
2     B  core 5.001447 2.782502 0.90683158
3     A flake 5.375840 3.356337 0.09371086
4     B flake 4.887032 3.599603 0.84982505

An alternative I like is the function doBy::summaryBy(), which works exactly like aggregate(), but with a nicer output in some cases:

doBy::summaryBy(.~ Layer + Type, data = df_csv[-1], FUN = mean)
  Layer  Type Length.mean Width.mean Thickness.mean
1     A  core    4.920165   3.235248     0.52656693
2     A flake    5.375840   3.356337     0.09371086
3     B  core    5.001447   2.782502     0.90683158
4     B flake    4.887032   3.599603     0.84982505

The column names are, in my opinion, more informative and the sorting is different (flakes and cores for layer A first, and then for layer B, i.e. the order provided in the formula).

Using dplyr is also quite straightforward, especially when combined with pipes (see section Pipes). It works by first grouping the data based on variables using group_by(). Then, summarize() applies the function to each group. Here we specify that it should apply the mean to all numeric columns using the across(where(is.numeric), mean) construct. Last, the output is converted from a tibble to a data.frame to better compare the results between the different functions:

library(dplyr)
df_csv %>%
  group_by(Layer, Type) %>%
  summarize(across(where(is.numeric), mean)) %>%
  as.data.frame()
  Layer  Type   Length    Width  Thickness
1     A  core 4.920165 3.235248 0.52656693
2     A flake 5.375840 3.356337 0.09371086
3     B  core 5.001447 2.782502 0.90683158
4     B flake 4.887032 3.599603 0.84982505

Check the RStudio Cheat Sheet on R Syntax comparison.


Note that it is not possible to specify several functions in the FUN argument of aggregate(). One workaround is writing your own function that applies several functions (see section Another example).
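
A minimal sketch of what such a function could look like (the actual mean.sd() is defined in section Another example; the names in the returned vector produce the .Mean and .SD suffixes in the output below):

# Return a named vector with both statistics
mean.sd <- function(x) c(Mean = mean(x), SD = sd(x))

It can then be passed to the FUN argument: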

aggregate(.~ Layer + Type, data = df_csv[-1], FUN = mean.sd)
  Layer  Type Length.Mean Length.SD Width.Mean  Width.SD Thickness.Mean
1     A  core   4.9201655 0.2126170  3.2352483 0.1761623     0.52656693
2     B  core   5.0014473 1.5395513  2.7825015 1.8897429     0.90683158
3     A flake   5.3758401 1.0808914  3.3563374 0.8907930     0.09371086
4     B flake   4.8870317 0.8115953  3.5996032 0.1439001     0.84982505
  Thickness.SD
1   0.36127239
2   1.48567635
3   0.24446540
4   1.39718925

Another workaround is using doBy::summaryBy() because it can apply a list of functions:

doBy::summaryBy(.~ Layer + Type, data = df_csv[-1], FUN = list(mean, sd))
  Layer  Type Length.mean Width.mean Thickness.mean Length.sd  Width.sd
1     A  core    4.920165   3.235248     0.52656693 0.2126170 0.1761623
2     A flake    5.375840   3.356337     0.09371086 1.0808914 0.8907930
3     B  core    5.001447   2.782502     0.90683158 1.5395513 1.8897429
4     B flake    4.887032   3.599603     0.84982505 0.8115953 0.1439001
  Thickness.sd
1    0.3612724
2    0.2444654
3    1.4856763
4    1.3971892

dplyr::summarize() can do it too:

df_csv %>%
  group_by(Layer, Type) %>%
  summarize(across(where(is.numeric), list(mean = mean, sd = sd))) %>%
  as.data.frame()
  Layer  Type Length_mean Length_sd Width_mean  Width_sd Thickness_mean
1     A  core    4.920165 0.2126170   3.235248 0.1761623     0.52656693
2     A flake    5.375840 1.0808914   3.356337 0.8907930     0.09371086
3     B  core    5.001447 1.5395513   2.782502 1.8897429     0.90683158
4     B flake    4.887032 0.8115953   3.599603 0.1439001     0.84982505
  Thickness_sd
1    0.3612724
2    0.2444654
3    1.4856763
4    1.3971892

I have recently come across an issue when using aggregate(). This function has an argument na.action, which is set to na.omit by default; this means that all rows with at least one NA in any of the selected columns will be excluded before grouping and calculating the stats. In other words, it keeps only so-called complete cases. doBy::summaryBy() and dplyr::summarize() do not do this, so results might differ if you have NAs.

If you want to use aggregate() but keep all values, set the na.action argument to na.pass.
For example, let us first add NAs to our previous data set:

df_csv_NA <- df_csv
df_csv_NA[2, 4] <- df_csv_NA[3, 5] <- NA
df_csv_NA
   Sample Layer  Type   Length    Width   Thickness
1   MON-1     A flake 4.439524 4.224082 -0.06782371
2   MON-2     A  core       NA 3.359814  0.78202509
3   MON-3     A flake 6.558708       NA -0.02600445
4   MON-4     A  core 5.070508 3.110683  0.27110877
5   MON-5     A flake 5.129288 2.444159  0.37496073
6   MON-6     B  core 6.715065 4.786913 -0.68669331
7   MON-7     B flake 5.460916 3.497850  1.83778704
8   MON-8     B  core 3.734939 1.033383  1.15337312
9   MON-9     B flake 4.313147 3.701356 -0.13813694
10 MON-10     B  core 4.554338 2.527209  2.25381492

Now, let us compare na.omit (default) and na.pass in aggregate(), as well as doBy::summaryBy(). We need to add the na.rm = TRUE argument to mean(), because that function returns NA when there are any NAs in the input:

aggregate(.~ Layer + Type, data = df_csv_NA[-1], na.action = na.omit, FUN = mean, na.rm = TRUE)
  Layer  Type   Length    Width Thickness
1     A  core 5.070508 3.110683 0.2711088
2     B  core 5.001447 2.782502 0.9068316
3     A flake 4.784406 3.334120 0.1535685
4     B flake 4.887032 3.599603 0.8498251
aggregate(.~ Layer + Type, data = df_csv_NA[-1], na.action = na.pass, FUN = mean, na.rm = TRUE)
  Layer  Type   Length    Width  Thickness
1     A  core 5.070508 3.235248 0.52656693
2     B  core 5.001447 2.782502 0.90683158
3     A flake 5.375840 3.334120 0.09371086
4     B flake 4.887032 3.599603 0.84982505
doBy::summaryBy(.~ Layer + Type, data = df_csv_NA[-1], FUN = mean, na.rm = TRUE)
  Layer  Type Length.mean Width.mean Thickness.mean
1     A  core    5.070508   3.235248     0.52656693
2     A flake    5.375840   3.334120     0.09371086
3     B  core    5.001447   2.782502     0.90683158
4     B flake    4.887032   3.599603     0.84982505

13 Graphics

R has countless ways of representing data graphically. I will briefly present the most common types here: histogram, scatterplot, barplot and boxplot.
For help on choosing which type of plots suits your data best, check from Data to Viz.

You will notice that I do not address pie charts here. In my opinion, and in that of other people (just Google “pie chart why not to use”), they poorly display the data. If you have used or seen pie charts, you have probably noticed that the proportions are always written in or next to the segments. This is necessary because our brains are bad at comparing areas or angles. So in the end, you need to look at the numbers to visualize the data. Basically, a table would do a better job (no unnecessary colors and pies)! If you really want to visualize data, barplots for example are much better in that sense, because the heights of the bars can be compared much more easily, both qualitatively and quantitatively.

All graphics can be produced with the base functionalities of R (i.e. with packages and functions installed with R and loaded by default in every session). Three famous packages (lattice, plotrix, and ggplot2) offer different ways of plotting data; the functions from these packages are very powerful but require some more learning. At the end of the day, these packages cannot do more than what base R can, but they are usually more straightforward for complex plotting (assuming you know how to use them!). You will probably notice with the examples below that plotting with base R can quickly require many lines of code when you want to plot with conditions (e.g. one color per group).
For each type of graph, I will present base R and ggplot2 codes to plot. But first, let us look at how base R (section Graphical parameters) and ggplot2 (section Overview of ggplot2) work.


13.1 Graphical parameters

Each type of plot has an associated R function. But the look of the plot (as opposed to the way data are displayed) is controlled by the same set of arguments, called graphical parameters. The complete list and details of the graphical parameters can be retrieved by typing ?par. I will not explain all of them, only the ones I deem most important. Still, it is worth looking at all of them because you might need them at some point.

Keep in mind that it might not be necessary to strive for the perfect-looking plot. The basics might be enough to produce a plot that you will further edit (see section Working with vector graphics).

  • cex: a numerical value giving the amount by which plotting text and symbols should be magnified relative to the default (1)
  • cex.axis, cex.lab, cex.main and cex.sub: the magnification to be used for axis annotation, x/y labels, titles and subtitles, respectively, relative to the current setting of cex.
  • col, col.axis, col.lab, col.main and col.sub: color for plotting symbols, axis annotation, x/y labels, titles and subtitles, respectively.
    Colors can be specified in several different ways. The simplest way is with a character string giving the color name (e.g., "red"). A list of the possible colors can be obtained by typing ?colors. Alternatively, colors can be specified directly in terms of their RGB components with a string of the form #RRGGBB.
  • las: orientation of axis labels (0=parallel to axis [default], 1=horizontal, 2=perpendicular to axis, 3=vertical).
  • lty: line type (0=blank, 1=solid [default], 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash).
  • mfcol and mfrow: a vector of the form c(nr, nc) to specify the number of subplots, arranged in nr rows and nc columns. The grid will be filled by columns (mfcol) or rows (mfrow). This argument must be called before the plotting to prepare the plotting region: par(mfrow=c(nr,nc))
  • pch: plotting symbol. There are many ways to specify it, see ?points.
  • xaxt and yaxt: a character which specifies the x or y axis type. Specifying "n" suppresses plotting of the axis (which can be interesting, see section plot()).
  • xlog and ylog: a logical value to indicate whether a logarithmic scale should be used (default=FALSE).

Other parameters (not listed under par) can be found under ?plot and ?plot.default:

  • frame.plot: whether a box should encase the plot (default=TRUE).
  • main and sub: title and subtitle of the plot.
  • type: type of plot to draw, e.g. p for points, l for lines, b for both… This parameter might become irrelevant depending on the plotting function used.
  • xlab and ylab: titles for the x and y axes.
  • xlim and ylim: the x and y limits of the plot, given as vectors c(x1,x2) and c(y1,y2). x1 > x2 and y1 > y2 are allowed and lead to reversed axes. The default (NULL) indicates that the range of the finite values to be plotted should be used.

We will not try all these parameters in the examples below, but feel free to play with them!
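
As a small sketch, here is how a few of these parameters could be combined, preparing a plotting region with 1 row and 2 columns (the plotted variables and colors are just an example):

par(mfrow = c(1, 2), las = 1, cex = 0.8)  # 1x2 grid, horizontal labels, smaller text/symbols
plot(df_csv$Length, df_csv$Width, pch = 16, col = "red", main = "Width vs. Length")
plot(df_csv$Length, df_csv$Thickness, pch = 17, col = "blue", main = "Thickness vs. Length")
par(mfrow = c(1, 1))  # reset to a single plot per page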


13.2 Overview of ggplot2

The package ggplot2 is part of the tidyverse. It works completely differently: a plot is decomposed into three components (data, aesthetics and geometry). Most of the time, a ggplot plot is built incrementally by adding components/settings to the previous plot object. Once done, you print() the plot. Let us see that with an example, using the data from before, to create a barplot.

First, we create a plot with df_csv. It is important to assign the output to an object, as usual.

library(ggplot2)
p <- ggplot(data = df_csv)
print(p)

Nothing is printed because there are neither aesthetics nor geometry yet. Let us add the aesthetics with aes().

p <- p + aes(x = Layer, y = Length, fill = Type)
print(p)

x and y specify which columns should be used on the x and y axes, respectively. fill defines the color(s) of the insides of the bars; here, we will use different colors for flakes and cores, so we just need to specify the column. color would specify the color(s) of the outlines of the bars.

You see that the plot is ready, but that data are not plotted yet. This is what geometry is for: specify how the data should be presented. This is where we specify that we want a barplot:

p <- p + geom_bar(stat = "identity")
print(p)

The "identity" value to the stat argument is just there to make sure that it plots “Length” on the y-axis rather than counts the number of values.

It is of course possible to change the colors as well:

p <- p + scale_fill_manual(values = c("red", "green"))
print(p)

One last important type of add-on is the set of theme functions. They give pre-defined sets of graphic parameters like background color, grid pattern…
For example:

p <- p + theme_classic()
print(p)

Note that this example is not what we really need. See section geom_bar() for correct code for ggplot2 barplots.


It is not possible to try here all add-on functions of ggplot2, but I guess you understood how it works. We will have more examples below. Feel free to try things out too! At the very least, Google (or better, Swisscows, Brave Search or similar) is your friend! There are tons of webpages about ggplot2, so I am sure you will find what you need.
For example, the STHDA website covers most topics.
Additionally, the package esquisse will guide you with its GUI in preparing your ggplot2-plot interactively. You can then copy or even directly insert the code used to generate the plot into a script and further adjust it if needed.


13.3 Histogram

A histogram is used to visualize the distribution of the data: data are grouped into classes shown on the x-axis and the number of samples in each class is shown on the y-axis. With the sample data from above, there are not enough values per group to draw a histogram. So let us first create a new data.frame with random data drawn from a normal distribution, with two groups (“A” and “B”), with 50 samples each, having different means (2 and 5, respectively):

set.seed(123)
df_hist <- data.frame(group = rep(LETTERS[1:2], each = 50), 
                      var = c(rnorm(50, mean = 2), rnorm(50, mean = 5)))
str(df_hist)
'data.frame':   100 obs. of  2 variables:
 $ group: chr  "A" "A" "A" "A" ...
 $ var  : num  1.44 1.77 3.56 2.07 2.13 ...

The seed is set (set.seed()) to any arbitrary number so that we all get the same random numbers with rnorm().

13.3.1 hist()

hist(df_hist[df_hist$group == "A", "var"], breaks = seq(0, 8, 0.5))

The classes are defined from 0 to 8 (to include the whole range of var) with a 0.5 width, by default right-closed and left-open, i.e. (0-0.5], (0.5-1], (1-1.5]…, (7.5-8].
Obviously, some work on title and axis labels would be needed.

This function does not allow plotting a histogram by group. You would need to plot 2 histograms, one for group “A” and another one for group “B”, which could be put on the same page using mfrow or mfcol (see section Graphical parameters), as sketched below.
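
A minimal sketch of this two-histogram approach (the titles and axis labels are my own choice):

par(mfrow = c(2, 1))  # 2 rows, 1 column
hist(df_hist[df_hist$group == "A", "var"], breaks = seq(0, 8, 0.5), main = "Group A", xlab = "var")
hist(df_hist[df_hist$group == "B", "var"], breaks = seq(0, 8, 0.5), main = "Group B", xlab = "var")
par(mfrow = c(1, 1))  # reset to a single plot per page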

13.3.2 geom_histogram()

Here it is very easy to plot a histogram for each group.

p <- ggplot(data = df_hist, aes(var, fill = group)) + 
     geom_histogram(breaks = seq(0, 8, 0.5), color = "black")
print(p)

It is also possible to separate the two histograms into different plots using facet_grid() (the formula is rows ~ columns), here on top of each other in a column:

p <- ggplot(data = df_hist, aes(var)) + 
     geom_histogram(breaks = seq(0, 8, 0.5), color = "black") + 
     facet_grid(group~.)
print(p)

Or side by side on a row:

p <- ggplot(data = df_hist, aes(var)) + 
     geom_histogram(breaks = seq(0, 8, 0.5), color = "black") + 
     facet_grid(~group)
print(p)

Check this page for more details.


13.4 Scatterplot/dotplot

A scatterplot or dotplot is used to display the relationship between two continuous variables, one in x and the other in y. If one of the variables is discrete, barplots or boxplots are more appropriate.
In this case, we can use our original data (df_csv) again.

13.4.1 plot()

Let us first plot length vs. width, for all data taken together:

plot(df_csv$Length, df_csv$Width)

It would of course be nicer to have different colors or symbols to identify the samples from each layer and/or type. This is where things get complicated: dotplots are probably the most difficult plots to draw in base R. Things can be automatized through custom functions and/or for loops (it is actually a good exercise for you), but let us see how things work. The process is as follows and is a bit similar to the ggplot2 principles:

  1. Prepare the plot but do not plot data (type = "n"):
plot(df_csv$Length, df_csv$Width, type = "n")

  2. Add the data points, one layer and one type at a time, with different colors and symbols, respectively.
    [Unfortunately, for this tutorial, the call to plot() has to be in the same block as the calls to points() so I have to include the whole code at once. You would not have this problem if you do it in the console or editor.]
plot(df_csv$Length, df_csv$Width, type = "n")
points(df_csv[df_csv$Layer == "A" & df_csv$Type == "flake", "Length"], 
       df_csv[df_csv$Layer == "A" & df_csv$Type == "flake", "Width"], 
       pch = 21, bg = "red", col = "red")
points(df_csv[df_csv$Layer == "A" & df_csv$Type == "core", "Length"], 
       df_csv[df_csv$Layer == "A" & df_csv$Type == "core", "Width"], 
       pch = 22, bg = "red", col = "red")
points(df_csv[df_csv$Layer == "B" & df_csv$Type == "flake", "Length"], 
       df_csv[df_csv$Layer == "B" & df_csv$Type == "flake", "Width"], 
       pch = 21, bg = "blue", col = "blue")
points(df_csv[df_csv$Layer == "B" & df_csv$Type == "core", "Length"], 
       df_csv[df_csv$Layer == "B" & df_csv$Type == "core", "Width"], 
       pch = 22, bg = "blue", col = "blue")

  3. Add the legend(s).
    [For the reason explained above, I have to have the whole code to make it work. But for brevity, I exclude the calls to points()]
plot(df_csv$Length, df_csv$Width, type = "n")
legend(x = "topleft", legend = c("A", "B"), fill = c("red", "blue"), title = "Layer")
legend(x = "topright", legend = c("flakes", "cores"), pch = 21:22, col = "black", pt.bg = "black", 
       title = "Type")

13.4.2 geom_point()

Things are much easier with ggplot2.

p <- ggplot(data = df_csv, aes(x = Length, y = Width, color = Layer, shape = Type)) + geom_point()
print(p)

That’s it!
Check this page for more details.


13.5 Barplot

A barplot simply shows bars for each value of a discrete variable in x, the height of which corresponds to the value of a continuous variable in y.

To create a barplot, we need to work a bit on the data. We want only 1 value per group (Layer+Type), not raw data as in df_csv. So let us first aggregate the length data using the mean, as shown in section Aggregating:

df_mean <- aggregate(Length ~ Layer + Type, data = df_csv, FUN = mean)
df_mean
  Layer  Type   Length
1     A  core 4.920165
2     B  core 5.001447
3     A flake 5.375840
4     B flake 4.887032

13.5.1 barplot()

In base R, the data need to be presented differently, which is very annoying. Currently, the data are in so-called “long” format and we need a “wide” format.

library(tidyr)
df_wide <- pivot_wider(df_mean, id_cols = Type, names_from = Layer, values_from = Length)
df_wide <- as.data.frame(df_wide)

We can then plot it quite easily:

barplot(as.matrix(df_wide[, 2:3]))

The bars for each group can be drawn next to each other with beside = TRUE:

barplot(as.matrix(df_wide[, 2:3]), beside = TRUE)

A legend would be nice too:

barplot(as.matrix(df_wide[, 2:3]), beside = TRUE, legend.text = df_wide$Type, 
        args.legend = list(x = "topleft"))

13.5.2 geom_bar()

The most important things have been presented in section Overview of ggplot2. But as mentioned there, the barplot was wrong: we plotted the raw data rather than the aggregated data. So let us do it correctly now. For this we just need to plot the aggregated data:

p <- ggplot(data = df_mean) + aes(x = Layer, y = Length, fill = Type) + 
     geom_bar(stat = "identity")
print(p)

We can also draw the bars for each group next to each other using position = position_dodge():

p <- ggplot(data = df_mean) + aes(x = Layer, y = Length, fill = Type) + 
     geom_bar(stat = "identity", position = position_dodge())
print(p)

Another nice function is geom_text() to add labels to the bars:

p <- ggplot(data = df_mean) + aes(x = Layer, y = Length, fill = Type) + 
     geom_bar(stat = "identity", position = position_dodge()) + 
     geom_text(aes(label = Length), vjust = -0.5, position = position_dodge(0.9))
print(p)

vjust = -0.5 sets the vertical position (positive values for inside the bars, negative values for outside) and position = position_dodge(0.9) sets the horizontal displacement of the labels. Values (-0.5 and 0.9 respectively) are somewhat arbitrary but work well.

Shortening the labels is necessary here to make sure that they fit above the bars:

p <- ggplot(data = df_mean) + aes(x = Layer, y = Length, fill = Type) + 
     geom_bar(stat = "identity", position = position_dodge()) + 
     geom_text(aes(label = format(Length, digits = 3)), vjust = -0.5, position = position_dodge(0.9))
print(p)

The function round() could also be used, but geom_text() somehow strips the zeros after the decimal place (i.e. 5 instead of the desired 5.00).

Check this page for more details.


It is also possible to add error bars to the bars of the plot (quite easily with ggplot2::geom_errorbar()). Note that this requires aggregating a measure of deviation (e.g. standard deviation) as well, to define the error bars. However, I find boxplots (see section Boxplot) more useful for this purpose.


13.6 Boxplot

Boxplots, or box-and-whisker plots, visually summarize data. They work with a continuous variable in y and a discrete variable in x (for vertical boxes).
A lot of information is shown on a boxplot:

  • The thick horizontal line highlights the median.
  • The box represents the interquartile range (IQR), i.e. the range between the 25% and 75% quantiles, or 1st and 3rd quartiles.
  • The whiskers extend to the highest and lowest data points up to a defined threshold (in R, the default is 1.5 IQR on each side, but it can be adjusted with the argument range).
  • Data points beyond the whiskers are considered outliers and are represented by dots.
  • Notches can also be added: they show something similar to the 95% confidence interval of the median and extend to \(±\frac{1.58 \times IQR}{\sqrt{n}}\)

Because different software packages (Excel, R, SAS, SPSS…) potentially use different values for all these properties, it is important to explain them in the figure legends of a paper.

13.6.1 boxplot()

The boxplot() function can use formulas, so it is very easy to plot by group:

boxplot(Length ~ Layer + Type, data = df_csv)

But in our case we obviously do not have enough samples to do it.

Check also ?bxp (the underlying function) for a complete list of graphic parameters (especially the outcex argument to adjust the size of the outlier dots, which are a bit small by default) and ?boxplot.stats for the underlying calculations.

13.6.2 geom_boxplot()

p <- ggplot(data = df_csv, aes(x = Layer, y = Length, fill = Type)) + geom_boxplot()
print(p)

The calculation of the box is a bit different from boxplot(), although it also represents the 1st and 3rd quartile. The rest (median, whiskers, outliers, notches) is identical between base R and ggplot2.

For a simple boxplot, geom_boxplot() does not add much to boxplot(). But many more things can be done: adjusting the color, size and symbol for outliers, and adding the mean or the data points for example.

Check this page for more details.


13.7 Graphical devices

13.7.1 Overview

The following part is mainly taken from the Exploratory Data Analysis with R book by Roger D. Peng.

A graphical device is something where you can make a plot appear. This can be either a window on your computer (screen device) or a PDF, JPG, SVG… file (file device).
When you make a plot in R, it has to be “sent” to a specific graphical device. The most common place for a plot to be “sent” to is the screen device. On Mac OS the screen device is launched with the quartz() function, on Windows with windows(), and on Unix/Linux with x11().

When making a plot, you need to consider how the plot will be used in order to determine what device the plot should be sent to. The list of devices supported by your installation of R is found in ?Devices. There are also graphical devices created by users; these are available as packages on CRAN. Note that not all graphical devices are available on all platforms.

For quick visualizations and exploratory analysis, usually you want to use the screen device. Functions like plot() in base R or ggplot() in ggplot2 will default to sending a plot to the screen device. On a given platform, such as Mac, Windows, or Unix/Linux, there is only one screen device.

For plots that may be printed out or be incorporated into a document, such as papers, reports, or slide presentations, usually a file device is more appropriate. There are many different file devices to choose from and exactly which one to use in a given situation is something we discuss below (see section Useful graphical file devices). It is also possible to use the base R and ggplot2 functions to save a plot; ggplot2 uses file devices in the background anyway. But for base R, use a file device explicitly.

We have already seen how to plot to the screen device. So let us focus on how to use the file devices.
Typically, there are three steps:

  1. Start the desired graphical device and create a connection to the file that will ‘receive’ the plot
  2. Do your whole plotting as shown above
  3. Close the graphical device with the function dev.off()

For example, let us save into a PDF file.
But first let us create a folder “Plots” in your working directory:

dir.create("Plots")
Warning in dir.create("Plots"): 'Plots' already exists

Then we (1) start the graphical device, (2) plot, and (3) close the device:

pdf(file = "Plots/plot.pdf")
    boxplot(Length ~ Layer + Type, data = df_csv)
dev.off()

I like to indent the plotting commands to clearly see where the graphical device is started and closed.

As you can see, a PDF file is created, and there is no plot in the plot window of R/RStudio.

13.7.2 Useful graphical file devices

Plots can be saved either as raster images (JPG, PNG, TIFF…) or as vector graphics (EPS, PS, PDF, SVG, AI…). The former is not recommended for plots; most journals require vector graphics because they are not made of pixels and can therefore be scaled (enlarged) without losing resolution. The other advantage of vector graphics is that each point, each line, each element of the plot can be edited separately with dedicated software packages.

Unfortunately, it is very difficult to know which device works best on which computer (especially across OS). Some will produce plots with elements merged together, others will produce plots that are unreadable, and so on, all depending on the editing software (see section Working with vector graphics). Let us try a few here and see what works best for me and for you.

With base R, we can try pdf(), svg(), postscript() and win.metafile() (the latter works on Windows only). We can also try functions from other packages, like svglite::svglite().

pdf(file = "Plots/plot_base.pdf")
    boxplot(Length ~ Layer + Type, data = df_csv)
dev.off()
png 
  2 
svg(filename = "Plots/plot_baseSVG.svg")
    boxplot(Length ~ Layer + Type, data = df_csv)
dev.off()
png 
  2 
postscript(file = "Plots/plot_base.ps")
    boxplot(Length ~ Layer + Type, data = df_csv)
dev.off()
png 
  2 
win.metafile(filename = "Plots/plot_base.wmf")
    boxplot(Length ~ Layer + Type, data = df_csv)
dev.off()
png 
  2 
library(svglite)
svglite(file = "Plots/plot_SVGlite.svg")
    boxplot(Length ~ Layer + Type, data = df_csv)
dev.off()
png 
  2 

With ggplot2, the function ggsave() is made to save a plot created with ggplot(). We can try with these values for the device argument: "eps", "ps", "pdf", "svg" and "wmf".

p <- ggplot(data = df_csv, aes(x = Layer, y = Length, fill = Type)) + geom_boxplot()
print(p)

ggsave(filename = "plot_ggplot2.pdf", path = "Plots", device = "pdf")
ggsave(filename = "plot_ggplot2.eps", path = "Plots", device = "eps")
ggsave(filename = "plot_ggplot2.ps",  path = "Plots", device = "ps" )
ggsave(filename = "plot_ggplot2.svg", path = "Plots", device = "svg")
ggsave(filename = "plot_ggplot2.wmf", path = "Plots", device = "wmf")

13.7.3 Working with vector graphics

This section is not directly related to R, but I thought it might be useful to discuss which software packages can be used to edit the files produced by different graphical file devices.

Here are the three main software packages on the market to work with vector graphics:

  • Illustrator is very expensive (about 285 Euro/year currently) so let us forget about it.
  • Inkscape is great because it is open-source, therefore free.
  • Designer is a cheap (55 Euro once) but powerful alternative.

I really like Designer. What makes it appealing (besides price) is that it belongs to the Affinity Suite and is therefore perfectly integrated with Publisher (same purpose as Adobe InDesign or Scribus) and Photo (same purpose as Adobe Photoshop or GIMP).

The Adobe products are great but there is a growing number of users favoring the Affinity products, and not just because of price.

Inkscape is a great open-source tool, actively maintained and improved. But somehow, I cannot manipulate text correctly (letters overwriting themselves, line spacing not working…). Even more weirdly, everyone I talk to has the same issues, yet I cannot find anything related in forums or similar… The issue with text usually results in me rewriting every text box in every document, which really wastes a lot of time.

One major problem with these open-source software packages is that they have been developed independently from each other. They are therefore not integrated with each other and the file formats are not compatible, unlike the Adobe and Affinity Suites where files can be edited with all software packages of the Suite.

So let us try the different files created above with Designer (v1.10.5) and Inkscape (v1.1.2), on Windows:

  • pdf() and ggsave("pdf") work fine in both Designer and Inkscape (with text issues).
  • postscript(), ggsave("ps") and ggsave("eps") files cannot be opened by Inkscape. Designer can open them but text is read as curves (i.e. not editable as text and not bound into words), which is annoying.
  • win.metafile() and ggsave("wmf") files cannot be opened by Inkscape. Designer can open them without problem.
  • svg() is OK with both software packages, but text is read as curves (i.e. not editable as text and not bound into words), which is annoying. ggsave("svg") is OK with both software packages. svglite::svglite() files can be opened in Designer, but for some reason, the lines have no style (i.e. they are not visible); you would have to add styles (solid, dash…) to every line. The files are fine in Inkscape.

pdf(), ggsave("pdf") and ggsave("svg") work fine with both software packages. PDF is probably the most widespread file format and is open. SVG is an open standard.
So I use PDF, but SVG is fine too, at least for ggplot2 graphics.

sessionInfo() and devtools::session_info() show the specifications of your system and are important in such cases, where you want to know why a given device (or code) produces different results on different computers (Is it the OS? The version of R? The version of some installed packages? The locale?…). See section sessionInfo() and RStudio version.


13.8 Colors

When you plot, either in base R or with ggplot2, colors are often assigned automatically. You can also manually choose colors (see examples above). But choosing the appropriate color palette is no easy task and colors are often misused in science (see Crameri et al. 2020 for an overview).

For continuous color scales, it is important that the gradient is perceptually uniform. For all types of scales, people with a color vision deficiency should also be able to read them!

There are many alternatives to the standard, non-uniform and non-colorblind-friendly scales. In R, the packages RColorBrewer, viridis, scico and maybe also MetBrewer are good alternatives that are easy to use.
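
For example, here is a minimal sketch using the viridis scales that ship with ggplot2 (perceptually uniform and colorblind-friendly); the discrete version is used because Type is categorical:

p <- ggplot(data = df_csv, aes(x = Layer, y = Length, fill = Type)) +
    geom_boxplot() +
    scale_fill_viridis_d()   # discrete viridis palette mapped to the fill aesthetic
print(p)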


14 Control-flow constructs

Control-flow constructs are commands that define the order in which, and the conditions under which, other commands are executed.
Here we will cover the if…else…, for loop and while loop constructs. For the help page of all these functions, see ?Control.
We will also talk about pipes.


14.1 If…else…

This construct defines the conditions under which some commands are executed, or how they are executed depending on those conditions.
if can be used alone or in combination with else, which cannot be used alone. ifelse() is completely different.

14.1.1 if

This construct is very easy: if(condition) commands.
If the condition is true, then execute the commands.

Example:
You want to compute the mean of a vector, but you are not sure a priori whether this vector is of mode numeric. So you want to compute the mean only if the vector is indeed numeric (to save computing time and avoid errors/warnings).
You could do it that way:

set.seed(123) 
vec1 <- rnorm(10)
vec1
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
 [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
vec2 <- rep(c("a", "b"), each = 5)
vec2
 [1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"

vec1 is numeric and vec2 is character.

Now if the vector is numeric, compute the mean:

if (is.numeric(vec1)) mean(vec1)
[1] 0.07462564

vec1 is numeric so the mean is computed.

Try with vec2:

if (is.numeric(vec2)) mean(vec2)

vec2 is not numeric so the mean is not computed (and it looks like nothing happened).

We can expand the previous example to do more than one command, say, store the mean into an object called vec_mean1 and display this object:

if (is.numeric(vec1)) {
    vec_mean1 <- mean(vec1)
}
vec_mean1
[1] 0.07462564

Let’s try with vec2:

if (is.numeric(vec2)) {
    vec_mean2 <- mean(vec2)
}
vec_mean2
Error in eval(expr, envir, enclos): object 'vec_mean2' not found

Since the condition is not met, vec_mean2 <- mean(vec2) has not been run and vec_mean2 has not been created.

The braces {} are used to enclose a group of commands.
The opening brace should come right after the if() command, on the same line.
On the next lines are the commands to be executed if the condition is true, with indentation to make it cleaner.
Do not forget to close the braces on the final line.

Several conditions can be combined in an if statement using & and | (see section Subsetting based on values in one or several columns or rows).
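
For example, compute the mean only if the vector is numeric and contains no missing values:

if (is.numeric(vec1) & !anyNA(vec1)) mean(vec1)
[1] 0.07462564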

14.1.2 else

Now let us say that you want to do something else if the condition is not true, for example check what the mode of the vector is. This is what the else part does.

if (is.numeric(vec1)) {
    vec_out1 <- mean(vec1)
} else {
    vec_out1 <- mode(vec1)
}
vec_out1
[1] 0.07462564

Now let us do it for vec2:

if (is.numeric(vec2)) {
    vec_out2 <- mean(vec2)
} else {
    vec_out2 <- mode(vec2)
}
vec_out2
[1] "character"

Importantly, the else statement must be on the same line as the end of the if part (i.e. after the first }).

You can also write everything on the same line if you have only one statement for each part:

if (is.numeric(vec1)) mean(vec1) else mode(vec1)
[1] 0.07462564
if (is.numeric(vec2)) mean(vec2) else mode(vec2)
[1] "character"

14.1.3 ifelse()

In an if...else... statement, the condition must evaluate to a single logical value. For example, let us say you want to test whether values are positive, because you want to compute the square root of the values (and square roots of negative numbers are not doubles, they are ‘not a number’ NaN, which is not the same as NA). This will not work as a condition because the result has more than one logical value:

vec1 > 0
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

If you try to run that in an if statement, R will throw an error (since R 4.2.0; older versions issued a warning and used only the first value). This makes sense if you think about it: which one of the results should R use?

if (vec1 > 0) sqrt(vec1)
Error in if (vec1 > 0) sqrt(vec1): the condition has length > 1

In older versions of R, the command(s) would have been executed or not depending only on whether the first value was positive, which is rarely what you want.

So the if...else... construct is not appropriate here.

For those cases where you want to replace values depending on one (or several) condition(s), the ifelse() function is the correct choice.

With our previous example:

ifelse(test = vec1 > 0, yes = sqrt(vec1), no = NA)
Warning in sqrt(vec1): NaNs produced
 [1]        NA        NA 1.2484824 0.2655342 0.3595660 1.3096049 0.6789081
 [8]        NA        NA        NA

Here is what happens: the test is run for every element of the object to be tested (here vec1); if the condition is true, then return the square root of the element being tested; if not, return NA (instead of NaN).
The output is a vector of the same length as vec1, with values being either NA or the square root of the values of vec1.

You get a warning because sqrt() is applied to the whole vector (including the negative values) before ifelse() selects which results to keep; the output is nevertheless fine. To avoid the warning, construct it differently: first replace the negative values with NA using ifelse(), then apply sqrt():

sqrt(ifelse(test = vec1 > 0, yes = vec1, no = NA))
 [1]        NA        NA 1.2484824 0.2655342 0.3595660 1.3096049 0.6789081
 [8]        NA        NA        NA

As a side note, I want to emphasize that the second ifelse() construct is not necessary here. There is another way to replace values. First let us duplicate vec1, so that vec1 will show the unmodified vector, while vec3 will be modified:

vec3 <- vec1

It works like this:

vec3[vec3 <= 0] <- NA

This method seems a bit complicated but is essential in R coding. Here is what happens:

  • <= is the symbol for lower than or equal to (see ?Comparison)
  • vec3 <= 0 is a logical vector, of the same length as vec3. Each element of vec3 will be tested for the condition (i.e. lower than or equal to 0); if the condition is true, then this element of the logical vector will have the value TRUE; if not, it will be FALSE.
    Previously (cf. section 9.1 Vectors), we gave a vector of indices to subset a vector. But a logical vector can also be used to subset another vector, here vec3 (hence the single square brackets). This will build some kind of correspondence between the input vector vec3 and the sequence of TRUE/FALSE:
Vector                          Value 1      Value 2      Value 3
vec3 (before modifications)     -0.5604756   -0.2301775   1.5587083
vec3 <= 0                       TRUE         TRUE         FALSE
  • If we use this logical vector to subset vec3, only the elements corresponding to TRUE will be extracted; elements corresponding to FALSE will be ignored. vec3[vec3 <= 0] will therefore extract only the negative values of vec3.
  • <- NA replaces these values with NA.
  • Finally, compute the square roots of this new vec3 vector: sqrt(vec3)
    You can check that the result is identical to the ifelse() method above:
vec3[vec3 <= 0] <- NA
sqrt(vec3)
 [1]        NA        NA 1.2484824 0.2655342 0.3595660 1.3096049 0.6789081
 [8]        NA        NA        NA

14.2 for loops

for is used to iterate a sequence of commands in a loop. Usually, each iteration runs the same commands but on different (parts of) objects.

14.2.1 General information

The construct looks like this:

for (variable in sequence) commands

variable is an object that will contain the loop index (i.e. the value it will take at each iteration).
sequence is a vector containing all the values that will be successively assigned to variable.
As with if...else..., a group of commands can be enclosed in braces. These commands will depend on variable.

A very inefficient, thus unrecommended, way to test whether the values of vec1 are positive would be to test each value one after the other (I show it because it is very basic, so a good starting point):

for (i in vec1) print(i <= 0)
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE
[1] TRUE

i will iteratively take the values of the elements of vec1 (i.e. -0.5604756, -0.2301775, 1.5587083…).
Then, at each iteration, it will be tested whether i is lower than or equal to 0.
print() is necessary in a loop to display the results into the console.

14.2.2 Implementing your loop

As you might have already understood, the critical part here is the sequence.
Depending on what your goals are, you can let i take some values from a vector (names, numerical values, …) as we just did.
Alternatively, i could be used for indexing the elements of a vector, a data.frame… This approach is very powerful for example to apply the same set of operations to all columns of a data.frame.

Let us elaborate on the previous example. It would actually make sense to store the results iteratively in a logical vector, rather than just displaying them into the console.
So the first thing to do is to create that vector, let us call it vec_loop, that will receive the results iteratively.
Then, in this case, i should not take the values of vec1 directly, but should be a vector of integers from 1 to the number of elements of vec1 (i.e. its length). These integers will be used as indices to subset vec1 so that the test can be performed on each element, as well as to subset vec_loop, so that the result will be stored in its corresponding element.
Finally, display vec.loop:

vec_loop <- vector(mode = "logical")
for (i in 1:length(vec1)) vec_loop[i] <- vec1[i] <= 0
vec_loop
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

But let me remind you that this example should not be used in real situations. I used it only to make a parallel with the approaches shown in the section ifelse(), which should instead be used in such situations.
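
The indexing approach really shines on data.frames, though. Here is a minimal sketch applying the same operation to every column (df_num is a small data.frame created just for the example):

df_num <- data.frame(a = 1:5, b = 6:10, c = 11:15)
col_means <- vector(mode = "numeric", length = ncol(df_num))  # pre-sized receiving vector
names(col_means) <- names(df_num)
for (i in seq_along(df_num)) {       # a data.frame is a list of columns
    col_means[i] <- mean(df_num[[i]])
}
col_means
 a  b  c 
 3  8 13 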

14.2.3 Speeding up your loop

It does not seem so because the input and computation are limited, but the loop above is very slow. Actually, loops are usually quite slow in R. This is why the approach in section 14.1.3 ifelse() is much better. But sometimes, loops cannot be avoided.

In order to speed up the process as much as possible, here are two hints:

  • Do as much as possible outside of the loop. We were good on that aspect, because vec_loop was created and displayed outside of the loop. In more complicated cases it might not be as easy. But in any case, try to take as much as possible out of the loop. In general, if a command does not depend on your loop index (i in the examples above), then it can be taken out of the loop.
  • Set the length of your receiving object before the loop. In our example, vec_loop was expanded at each iteration. This slows down the computation a lot because the vector is expanded and its new element is filled at each iteration. If you set its length before the loop, each existing element will be iteratively filled. You will not notice any difference in this small case, but with larger datasets, it might considerably speed up your loops.
    The function vector() can actually be used to create empty lists of a given length (by setting the argument mode to "list").
    So we should have written, before the loop:
vec_loop <- vector(mode = "logical", length = length(vec1))

14.2.4 Using lists

This brings us to the use of lists in loops.
At some point, you will surely want to create different objects iteratively from different input data.
A classical example of this approach is this situation: you have several table files that have the same structure; you want to import them all into R and then bind them together one after the other in order to have only one object/dataset.
The first thought would be to import each of them into a different R object like this (this is no real code, the function ‘import_from_Excel’ does not exist):
object1 <- import_from_Excel(file1)
object2 <- import_from_Excel(file2)
...

If you try to do this in a loop, you will need a way to create objects with different names within the loop. There is a function called assign() to do just that. But it should not be used in this case.
In this case, you should create an empty list with a length corresponding to the number of files with the function vector(mode = "list", length = ...). You can then easily name the elements of the list using names() <- and paste(). And in the loop, each file will be imported in each corresponding element of the list. After the loop, you can then bind the datasets together.

The only thing you still need to know is the function list.files(), and here it goes (not completely real code):

vec_files <- list.files(path = "path/to/files", ...)          # vector of file names
list_df <- vector(mode = "list", length = length(vec_files))  # one element per file
names(list_df) <- paste("object", 1:length(vec_files), sep = "")
for (i in 1:length(vec_files)) {
    list_df[[i]] <- import_from_Excel(vec_files[i])           # fill each element
}
do.call(rbind, list_df)                                       # bind everything together
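
Here is a runnable version of the same pattern with read.csv(), assuming a hypothetical folder "Data" that contains only CSV files sharing the same columns:

vec_files <- list.files(path = "Data", pattern = "\\.csv$", full.names = TRUE)
list_df <- vector(mode = "list", length = length(vec_files))
names(list_df) <- basename(vec_files)
for (i in seq_along(vec_files)) {
    list_df[[i]] <- read.csv(vec_files[i])
}
df_all <- do.call(rbind, list_df)   # one data.frame containing all the datasets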

14.2.5 *apply() functions

It is beyond the scope of this tutorial, but the *apply() functions can always replace for loops, although they are not always easier to read.
All of them are listed under the ‘See Also’ section on the help page of lapply().

I prefer an explicit for loop though because it is more intuitive to me. And an efficient for loop is usually as fast as an implicit loop from *apply() functions.

Here is a real-case example (though simplified). I had some external information about which sample(s) belong to which experiment. I needed to get that into R manually (either by typing it directly in a script or by reading a CSV file). Then I had my data, only with the sample IDs.

exp <- list(ex1 = c("sample1-1", "sample1-2"), ex2 = c("sample2-1", "sample2-2" , "sample2-3"))
exp
$ex1
[1] "sample1-1" "sample1-2"

$ex2
[1] "sample2-1" "sample2-2" "sample2-3"
mydata <- data.frame(sample = c("sample2-2", "sample2-3", "sample1-1", "sample1-1", "sample1-1",
                                "sample2-1"))
mydata
     sample
1 sample2-2
2 sample2-3
3 sample1-1
4 sample1-1
5 sample1-1
6 sample2-1

Finally I wanted to add a column to mydata with the experiment ID. I found a compact (but ugly) solution with a for loop:

for (i in names(exp)) mydata[mydata[["sample"]] %in% exp[[i]], "experiment"] <- i
mydata
     sample experiment
1 sample2-2        ex2
2 sample2-3        ex2
3 sample1-1        ex1
4 sample1-1        ex1
5 sample1-1        ex1
6 sample2-1        ex2

And here would be a solution with mapply():

mydata[["experiment2"]] <- NA_character_
mapply(\(value, name, s) {
  i <- which(s %in% value)             # rows of mydata belonging to this experiment
  mydata[["experiment2"]][i] <<- name  # <<- modifies mydata in the global environment
}, exp, names(exp), MoreArgs = list(s = mydata$sample))
  ex1   ex2 
"ex1" "ex2" 
mydata
     sample experiment experiment2
1 sample2-2        ex2         ex2
2 sample2-3        ex2         ex2
3 sample1-1        ex1         ex1
4 sample1-1        ex1         ex1
5 sample1-1        ex1         ex1
6 sample2-1        ex2         ex2

Not more readable or shorter, right?

The best solution was, in that case, without any kind of loop! It uses merge(), but requires the experiment information as a data.frame (here as if it was coming from a CSV file):

expts <- read.csv(text = "expt,sample
ex1,sample1-1
ex1,sample1-2
ex2,sample2-1
ex2,sample2-2
ex2,sample2-3
", header = TRUE, as.is = TRUE)
merge(mydata, expts, by = "sample", all.x = TRUE) 
     sample experiment experiment2 expt
1 sample1-1        ex1         ex1  ex1
2 sample1-1        ex1         ex1  ex1
3 sample1-1        ex1         ex1  ex1
4 sample2-1        ex2         ex2  ex2
5 sample2-2        ex2         ex2  ex2
6 sample2-3        ex2         ex2  ex2

14.3 while loops

With a while construct, a set of commands will be run as long as the condition is true.

The construct is simple: while(condition) commands
This translates to: while the condition(s) is (are) true, execute the command(s). Again, braces can be used to enclose several commands.

Example:
Set x to 1 and incrementally add 1 to x as long as it stays smaller than 5.

x <- 1
while (x < 5) {
    print(x)
    x <- x + 1
}
[1] 1
[1] 2
[1] 3
[1] 4

In a while loop, as in a for loop, print() is needed to display output into the console.

This is not a particularly realistic example, but it should be enough to understand the logic. In real situations, while loops are often combined with if statements and/or break to control the flow of the loop.
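
As a small sketch of that combination (the stopping rule is made up for the example), let us draw random values until one exceeds 1.5:

set.seed(123)
n_draws <- 0
while (TRUE) {                 # loop "forever"...
    n_draws <- n_draws + 1
    if (rnorm(1) > 1.5) break  # ...until a draw exceeds the threshold
}
n_draws
[1] 3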

I actually basically never use while loops because I almost always know in advance how many iterations should be run; for loops and if statements are usually enough. But I can imagine things are different in e.g. simulations.


14.4 Pipes

14.4.1 Introduction

Pipes could deserve their own section, but I argue they still fit in this one!

Pipes were originally developed as part of the magrittr package.
As per the package’s website, “The magrittr package offers a set of operators which make your code more readable by:

  • structuring sequences of data operations left-to-right (as opposed to from the inside and out),
  • avoiding nested function calls,
  • minimizing the need for local variables and function definitions, and
  • making it easy to add steps anywhere in the sequence of operations.”

In other words, it streamlines your code. See R for Data Science for more examples and details.

14.4.2 Some examples

Let’s say we want to compute the mean of 10 random values from a normal distribution and round it to 2 digits. Without pipes, we would do this:

set.seed(123) # just to make sure you get the same values as I do
random_values <- rnorm(10)
mean_values <- mean(random_values)
rounded_mean <- round(mean_values, digits = 2)
rounded_mean
[1] 0.07

This creates a lot of intermediary objects. We could avoid them by doing this:

set.seed(123) # just to make sure you get the same values as I do
rounded_mean2 <- round(mean(rnorm(10)), digits = 2)
rounded_mean2
[1] 0.07
identical(rounded_mean, rounded_mean2)
[1] TRUE

But here we now have a nested structure round(mean(rnorm())) that must be read inside out, which is not very intuitive. And if you add arguments to each function’s call, it becomes unreadable:

set.seed(123) # just to make sure you get the same values as I do
rounded_mean3 <- round(mean(rnorm(10, mean = 1), trim = 0.1), digits = 2)
rounded_mean3
[1] 1.04

Of course, you could write each function on its own line, but then the arguments are not on the same line as the function itself. So it is messy whatever you do.

This is where pipes are great:

library(magrittr)
set.seed(123) # just to make sure you get the same values as I do
rounded_mean4 <- rnorm(10, mean = 1) %>% 
                 mean(trim = 0.1) %>% 
                 round(digits = 2)
rounded_mean4
[1] 1.04

The %>% symbol (RStudio shortcut on Windows = CTRL+SHIFT+m) carries the output of the left-hand side to the right-hand side (or next line here) and automatically inputs it into the function call on the right-hand side.

By default, the output of the left-hand side is assigned to the first argument of the function (usually the data argument) on the right-hand side. But sometimes, you want to assign it to another argument. For this, you have a placeholder (.) for the output of the left-hand side and you can assign it to any argument on the right-hand side. For example:

set.seed(123) # just to make sure you get the same values as I do
rounded_mean5 <- rnorm(10, mean = 1) %>% 
                 mean(trim = 0.1) %>% 
                 round(digits = 2) %>% 
                 paste("the rounded mean is:", .)
rounded_mean5
[1] "the rounded mean is: 1.04"

Sometimes, pipes do not work and I cannot tell you why! But sometimes, pipes just cannot work, especially when you have multiple inputs or outputs, or non-linear relationships. For example, names(x) <- is a replacement function: it takes two inputs (x and the new names) and modifies x, which does not fit the one-output-into-one-input logic of a pipe:

x <- 1:10
names(x) <- paste0("n", x)

14.4.3 Other tools from magrittr

magrittr can do much more than “simple” pipes. I have honestly never tried these other tools, but they look amazing. See R for Data Science for details.

14.4.4 base R

Since R 4.1.0, pipes exist in base R too. The symbol is different: |>.

set.seed(123) # just to make sure you get the same values as I do
rounded_mean6 <- rnorm(10, mean = 1) |> 
                 mean(trim = 0.1) |> 
                 round(digits = 2)
rounded_mean6
[1] 1.04

While this is great news, its use is currently limited. Since R 4.2.0, there is even a placeholder (_), but it can only be used once on the right-hand side (see here) and only with named arguments (it does not work in paste() for example). This is often enough, but not always. Additionally, the other types of pipes from magrittr (see section Other tools from magrittr) are not implemented in base R; they might come, but probably not soon.
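
For example, a minimal sketch of the _ placeholder; note that it must be bound to a named argument (here x):

"2023-01-15" |> gsub(pattern = "-", replacement = "/", x = _)
[1] "2023/01/15"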


15 How to write a function

Thousands of functions are available in R (either through the base installation or from contributed packages). But sometimes, you want a function to fit your specific needs. Custom functions are usually useful when you want to apply the same set of operations several times on different (part of) objects. In other words, this is a way to automatize your code, and is often combined with other control-flow constructs.
Nevertheless, remember that custom functions are only useful if they can be reused and shared. For this, you can define them in the script(s) where they are used, write them in a script that you source() to make the functions available, or, even better, write a package. I don’t know much about the latter, so if you are interested, maybe the book R Packages by Hadley Wickham and Jenny Bryan can help you.

This section explains the basics of how to write a custom function.


15.1 General information

Remember from the section Function that functions are objects. So writing a function means storing a sequence of commands into an object. In R, this is done with an assignment, as with any object:

name_of_function <- function(arguments) {commands}

The ‘name_of_function’ is submitted to the same rules as the name of any R object (see section Assignment). To this object, you assign a function using the reserved word function. Within the parentheses, you list the arguments that will be used by the commands, which are enclosed in braces. The commands constitute the body of the function.

For example, if you type mode into the console, you will see the commands contained within that function:

mode
function (x) 
{
    if (is.expression(x)) 
        return("expression")
    if (is.call(x)) 
        return(switch(deparse(x[[1L]])[1L], `(` = "(", "call"))
    if (is.name(x)) 
        "name"
    else switch(tx <- typeof(x), double = , integer = "numeric", 
        closure = , builtin = , special = "function", tx)
}
<bytecode: 0x000001f71cae4f60>
<environment: namespace:base>

mode() is a function with only one argument called x (this is consistent with the help page). The commands to be executed are listed within the braces. What comes after the braces is not important here.


15.2 Arguments

The critical part here is the argument list. You have to make sure that you have listed all variables that will be used by the commands and that their modes and classes correspond to the requirements of the commands.
The good thing is that you do not have to explicitly set the modes and classes of arguments in the function’s definition. But this is also why you have to be extra careful about them.

Let us work with the example we used in sections if and else:

vec1
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
 [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
vec2
 [1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"
if (is.numeric(vec1)) {  
    vec_out1 <- mean(vec1)  
} else {  
    vec_out1 <- mode(vec1)  
}
vec_out1
[1] 0.07462564

In these sections, we had edited and run these commands on both vec1 and vec2. This approach can quickly become annoying if you have many vectors to test. This is the perfect setup for a custom function.

This custom function will test whether an input vector is of mode numeric. If it is, then compute its mean; if it is not, then show its mode.

Let us do it step by step.
First, define the name of your function object (let us call it fun_vec), assign the reserved word function to it and list the arguments. Only one argument is needed here, the input vector (let us call it vec_in). So:

fun_vec <- function(vec_in) {}

Second, focus on the function’s body. In this case, we can just use what we wrote before, changing vec1 to vec_in:

fun_vec <- function(vec_in) {  
    if (is.numeric(vec_in)) {  
        vec_out <- mean(vec_in)  
    } else {  
        vec_out <- mode(vec_in)  
    } 
    vec_out
}

Indentation is very important here to make your code clean and easy to follow.
What will happen when you run this function is that the value assigned to the argument vec_in will be passed to the functions is.numeric(), mean() and mode().

By default, R returns the last evaluated expression. When you want the function to output something other than the last evaluated expression, you need the function return(). But I like to be explicit about what the function should return, and I almost always use return() (although this is advised against in the tidyverse style guide):

fun_vec <- function(vec_in) {  
    if (is.numeric(vec_in)) {  
        vec_out <- mean(vec_in)  
    } else {  
        vec_out <- mode(vec_in)  
    } 
    return(vec_out)
}

Be careful where your return() statement is; there are many braces and it should come after the if...else... statement. This could be written without all those braces for if and else, but I will leave that to you!

Now, let us try it, for both vec1 and vec2:

out1 <- fun_vec(vec_in = vec1)  
out1
[1] 0.07462564
out2 <- fun_vec(vec_in = vec2)  
out2
[1] "character"

The mode and class of the receiving object must match the output of the function. This is no problem in this case because we used new objects out1 and out2, but you will have to check this when you want to store the output of a function into an element of an existing object (for example in for loops).


As a side note, if you want to run this function on a lot of vectors, the easiest would be to put them into a list and apply the function to every element of the list (see section Using lists):

vec_list <- list(vec1, vec2) 
names(vec_list) <- c("vec1", "vec2")
vec_list
$vec1
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
 [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197

$vec2
 [1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"

This step is done manually here, which is not a good idea if you have lots of vectors. But in a real situation, you would ideally store these vectors directly into the list as you create them.

With a for loop:

# Prepare the "receiving" object outside of the loop
out_list <- vector(mode = "list", length = length(vec_list))
names(out_list) <- names(vec_list)
out_list
$vec1
NULL

$vec2
NULL
# Fill in out_list in a loop
for (i in seq_along(vec_list)) {
    out_list[[i]] <- fun_vec(vec_list[[i]])
}
out_list
$vec1
[1] 0.07462564

$vec2
[1] "character"

With lapply(), which is much simpler in this case:

out_lapply <- lapply(vec_list, fun_vec)
out_lapply
$vec1
[1] 0.07462564

$vec2
[1] "character"

Note that sapply() would try to simplify this output into a single vector, coercing the numeric mean to character in the process (the two results have different modes), so lapply() is the better choice here.

15.3 Scope and environment

Detailed explanations of environments are way beyond the scope of this tutorial, and beyond my knowledge as well. But some basics are important to understand for writing functions.
A more thorough documentation can be found in the Advanced R book by Hadley Wickham.

An environment is a storage place where values are bound (associated) to names. This is what we have been doing the whole time with assignments in the form object <- value.
A handy function for listing all objects is ls() (with an argument, envir, to specify in which environment to look).

There are lots of types of environments but let us focus on two that are of direct importance to us: the global environment and the function’s execution environment.

The global environment is basically where you work with the console and editor. Every object you create by assigning a value to it via <- will be created in the global environment, unless the assignment takes place within the body of a function.
When a function is called, i.e. executed, an environment is created to host all the objects defined within the function’s body. This is the function’s execution environment. When the execution is over, this environment is deleted. This becomes clear when you look for the objects defined within the function fun_vec() above:

vec_out
Error in eval(expr, envir, enclos): object 'vec_out' not found

This object exists only within the function’s execution environment and as such cannot be used from the global environment.
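
A small sketch makes this visible (f and inside are names created just for the example):

f <- function() {
    inside <- 1   # created in the function's execution environment
    ls()          # lists the objects visible inside the function
}
f()
[1] "inside"
ls()  # 'inside' does not appear here, only the objects of the global environment (including f)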

For some reason, you might want to create objects in the global environment from within a function. The function assign() and the operator <<- can do this. This might occasionally be a valid approach, but do NOT do it! Only very experienced users should use them; in general, you should avoid them. The R way is to create objects from the output of functions, not to create objects within functions. If you need more than one value as output from your function, then just combine these values into a vector, matrix, data.frame or list within the function and return() this object (see e.g. section Another example).


15.4 Comments

The functions we wrote are very basic, but it quickly becomes necessary to comment the different steps, especially if other persons might read your code. Actually, after a few days, weeks, months or years, you will yourself forget what you wanted to do and why you did it that way! So, do not be shy with #, it can only be beneficial!

A useful function is comment(). It adds a comment attribute to any existing object, including functions. You can then call comment(name_of_function) to display the comments associated with the function. It can be a quick and easy way to get important information about the function (like description, usage and arguments) without going through the whole code.


15.5 Another example

To show you another example, and especially how to combine values within a function in order to return a single output, say you want to compute the mean and the standard deviation (SD) of a numeric vector.

Here are the different steps:
1. Test whether the vector is numeric; if it is, compute mean and SD; if not, return an error “The input data is not of mode numeric but of mode…”
2. Compute the mean, with the possibility to set the na.rm argument to TRUE (the default)
3. Compute the SD, with the possibility to set the na.rm argument to TRUE (the default)
4. Combine mean and SD into a named vector
5. Return it
6. Add some comments to the function

mean_sd <- function(vec_in, na = TRUE) {
# Step 1
  if (!is.numeric(vec_in)) stop(paste("The input data is not of mode numeric but of mode", 
                                      mode(vec_in)))
# Step 2
  mean_test <- mean(vec_in, na.rm = na)
# Step 3
  sd_test <- sd(vec_in, na.rm = na)
# Step 4
  out <- c(mean_test, sd_test)
  names(out) <- c("Mean", "SD")
# Step 5 (note that return is necessary here)
  return(out)
}
# Step 6
comment(mean_sd) <- c("Function to compute mean and SD of a numeric vector", 
                      "Arguments: vec_in, na = TRUE")

Now let us try it:

vec3 <- 1:10
mean_sd(vec3, na = TRUE)
   Mean      SD 
5.50000 3.02765 
mean_sd(vec3, na = FALSE)
   Mean      SD 
5.50000 3.02765 
vec4 <- c(1:5, NA, 6:10)
mean_sd(vec4, na = TRUE)
   Mean      SD 
5.50000 3.02765 
mean_sd(vec4, na = FALSE)
Mean   SD 
  NA   NA 
vec5 <- rep("a", 10)
mean_sd(vec5)
Error in mean_sd(vec5): The input data is not of mode numeric but of mode character
comment(mean_sd)
[1] "Function to compute mean and SD of a numeric vector"
[2] "Arguments: vec_in, na = TRUE"                       

16 Data, scripting, projects and repeatability

Working with scripts is very powerful as it makes it possible to automate analyses, but it also requires some degree of organisation. I try to list here some good practices to make sure that you can always repeat a specific analysis on a specific dataset.
I have changed my way of dealing with data over the years, mostly by realizing (too late) that I should have done things differently. I can only speak from my own experience and your use might be different from mine, so not everything might be good for you. But in any case, give it a thought.


16.1 Data

The first thing to think about is your data. For every project I have (usually a paper), I create a directory to contain all data related to R. This directory contains sub-folders for copies of the raw data (ideally read-only), both as Excel and R files, as well as for all scripts and results (tables, graphs…). At some point, tables and graphs output from R need some editing. But I edit only copies in other folders, so that the original results are always available in the R folder for a check.


16.2 Scripting

Every analysis you want to run in R should be written as a script (see section The editor). My advice is, as you develop the code of your analysis, do so in the editor and save the script before running any line of code. This is important as some (erroneous) commands might hang forever and you would need to force quit.
Once you are done developing, make sure you close R without saving. Then restart R and run your script again. This is very important too, because you might have tried your code using objects that you defined during your tests but that are different in your final code. This would lead to unexpected results, and it is sometimes very difficult to spot the problems. Read again section Debugging.

Also, I used to have a “model” script that I then copied and edited as necessary for each analysis. Now, I only have scripts for analyses; the model script was not really useful anymore. The drawback is that it is sometimes difficult to find in which script I implemented which specific code. But in any case, it is important to save one script for each analysis, as opposed to one script that gets edited over and over (losing the record of what has been done in previous analyses).

I used to write a lot of functions to make the application more general, i.e. to make sure the same script can be applied to different datasets. I also used to use file.choose() a lot, which means that the path to the raw data file is not hard-coded but can be chosen for each analysis. But at the same time, it is important to make sure that a script contains all the information required to repeat the analysis: which input/output files, which data, which settings…
Here are three ways of dealing with that:

  1. Hard coding is of course one way of doing it, but you would have to edit the file names and/or paths for every new analysis.
  2. Another way is to leave it general but write the information as comments for each analysis, e.g. x <- file.choose() #"path/file.ext". At the end, it is not much different from hard coding; the only advantage is that the code itself is not edited (to avoid errors), only the comments are edited.
  3. A more involved but great way of doing it is by using R Markdown (packages rmarkdown and knitr). To summarize, these packages allow the creation of an html/pdf/Word… document that contains code, results and text. Some people write complete papers with that! Without going to such extremes, this allows a general code with the display of the used values and corresponding results in one file. This file can be sent easily, meaning that it is also a great way to share code, data and results with colleagues. For example, this tutorial has been prepared that way, in RStudio.
    It is beyond the scope of this tutorial to teach you R Markdown. For this, check my other tutorial Reproducible Analysis with R and Git!

Finally, none of this tells you which version of a file was used (the file might have been edited in the meantime). For this, you could add and save the file.info() of every input/output file you use in your scripts, or compute a checksum (e.g. MD5 hashes with the function md5sum()). This is very easy to do in an R Markdown project, but requires writing to a file if you use a script.
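
For example, a minimal sketch (the file name is hypothetical; md5sum() lives in the tools package):

file.info("Data/mydata.csv")       # size, modification time, etc.
tools::md5sum("Data/mydata.csv")   # MD5 checksum identifying this exact version of the file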

Additionally, consider adding sessionInfo() or devtools::session_info() at the end of your script to record which versions of the OS, R and packages were used.


16.3 Project-oriented workflow

To complement what I wrote above about making sure that you know which files have been used for, or produced by, what analysis, think about project-oriented workflows.
I try to give an overview here, but more details can be found here (and this is where I discovered the topic anyway), or here and there.

The problem with defining working directories or paths in R (in a script or in the console) is that the code is not portable: you cannot use it on another computer of yours, nor send it to R-speaking colleagues.
To deal with it, first create a folder to contain a project, as I explained before. Then you have three possibilities:

  1. When working with RStudio, rather than opening RStudio and then opening the script from within, open the script directly by double-clicking on the script file in the Windows Explorer / macOS Finder. By doing this, the working directory is set to the folder where the script is saved, i.e. every output will be saved there and R will look for input files there too. It is then no longer necessary to specify the path (except if you want sub-folders).
  2. Use the package here to define relative paths (see the sketch after this list). RStudio does a great deal already, but it might still come in handy.
  3. Work with RStudio projects. Go to File > New Project and either create a new directory or use a pre-existing folder to contain the project. RStudio will create a *.Rproj file that will contain everything related to this project. You then don’t need to think too much about file paths and directories.
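
As a small sketch of option 2, assuming an RStudio project containing a "Data" sub-folder (the file name is hypothetical):

library(here)
here()                       # path to the project root
here("Data", "mydata.csv")   # portable path to a file, built relative to the root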

17 Packages

Packages contain functions to run operations on objects. The R base package “contains the basic functions which let R function as a language” but many contributed packages are so useful that they are installed together with R (I couldn’t find a list unfortunately). There are tons of extra contributed packages that can be installed, many from the CRAN and probably as many from GitHub.

To be able to run the functions available within a package, the package should first be installed. To find out which packages are already installed on your computer, type library() in the console.
To install packages, you can use menus, the Packages panel in RStudio, or type install.packages("name_of_package") in the console.
You should also regularly update packages, either through menus or update.packages().
I advise you to do this as administrator (see section Windows for instructions).

But installing a package is not enough to use it. You should then load it in the workspace by typing library(name_of_package) in the console (with or without quoting the name of the package) or by explicitly referring to the package when calling a function with the double colon symbol name_of_package::name_of_function().
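
For example, with the svglite package used earlier:

# install.packages("svglite")   # install once (requires an internet connection)
library(svglite)                # load the package for the current session
svglite::svglite                # or refer to one of its functions explicitly with ::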

A frequent error in R is: Error: could not find function 'name_of_function'. There are two possibilities:

  • You misspelled the function’s name (remember that R is case-sensitive)
  • You have not loaded the package containing this function

For some more details, go there.


18 How to find a function?

Now that you can manipulate R objects, you should be able to transform an object as necessary to apply any function to it. So the only thing you need now is vocabulary.

Here are some ways to find a function that can perform a specific task:

  • The function apropos() finds any available function (i.e. in a loaded package) from a part of its name. Try for example apropos("read"). The search is not case-sensitive by default (see the sketch after this list)
  • If you know in which package the function is but cannot remember its name, you can try searching the help of a package: help(package = "name_of_package")
  • If you know the function’s name but cannot remember in which package it is included (and so which package should be loaded): ??function_name. With one question mark, R looks for help pages of loaded packages. With two, R looks for help pages of installed packages. With three, R throws an error. With four… I’ll let you try!
  • The function RSiteSearch() searches a database of functions and vignettes
  • Rseek uses a custom Google search for R
  • Google (or alternatives) is your friend, which correctly recognizes the letter “R” in this context!
  • Lastly, for any help related to R, there is a mailing list. But be careful: read the posting guide carefully (especially the part explaining how to send a reproducible example) before sending a request to the list. If you follow the posting guide, you will surely get help.
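
As a quick illustration of the first three options:

apropos("read")           # functions in loaded packages matching "read"
help(package = "utils")   # help index of a given package
??regression              # search the help pages of installed packages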

19 Understand the help page of a function

The help page is always organized the same way. For example with the function mean():

  • name of the function {package containing the function}, here mean {base} (‘base’ being the built-in functions).
  • Description: what the functions does, here computes the arithmetic (trimmed) mean.
  • Usage: the function, its arguments, the order of arguments and their default values. Some arguments have default values (name_of_argument = default_value), while others do not (only the name is given); with no default, a value must be assigned to the argument in the function’s call.
    In our example, argument x has no default value, while trim and na.rm do (0 and FALSE, respectively). x is thus necessary; but if you do not assign values to trim and na.rm, then default values will be used.
    There sometimes exist several ‘methods’ depending on the class of the input object (see the sketch after this list).
  • Arguments: description of arguments, their classes, modes, lengths, possible values, etc. Often complicated, but very important.
  • Value: what the function returns (output). This is also very important to know into which object (class, mode, length…) the output can be stored.
  • References: where the function or concept has been defined
  • See also: similar or associated functions
  • Examples: some examples of how the function can be used (honestly, rarely useful).
  • At the very bottom, you can find the index of functions available within the package containing this function (here ‘base’), which can be useful. This is the same as help(package = "name_of_package") (see section How to find a function?)
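
You can also display the Usage line directly in the console with args(); for the default method of mean(), it shows the arguments and default values discussed above:

args(mean.default)
function (x, trim = 0, na.rm = FALSE, ...) 
NULL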

20 Where to find documentation?

There are lots of books, blogs, tutorials… Here are two that may be relevant to a beginner:

The CRAN also lists a lot of manuals and FAQs (under ‘Documentation’) on the left.

The RStudio Cheat Sheets are really helpful.

Do not forget that R is primarily used for statistics. So you should also have some knowledge of what you intend to do!


21 To go further

Here are some suggestions to further your learning:

  • Files, folders and paths: file.choose(), list.files(), choose.dir(), dir.create(), dirname(), file_ext() (and associated functions), file.info() and source() (see also Windows FAQ 2.16)
  • Data manipulation: packages plyr, dplyr and data.table
  • Save/load data: save() and load(), and saveObject() and loadObject() from the R.utils package
  • Character manipulation : see ?regex and cited functions, paste() and paste0()
  • *apply() functions: apply(), sapply(), lapply()
  • Sequence: seq(), seq_along(), seq_len()
  • NA/NaN: is.na(), is.nan()
  • Comparison: see ?Comparison, identical(), all.equal() and R FAQ 7.31
  • Binding: cbind(), rbind(), do.call()
  • Display: cat(), print(), head(), tail(), show(), View()
  • Interaction: select.list(), menu() and package tcltk
  • Miscellaneous: which(), unlist()

You can also check my other tutorial: ReproducibleAnalysisRGit.


If you have managed to reach this point and to understand everything, then you should not need me anymore!


22 sessionInfo()

sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] grateful_0.2.4 magrittr_2.0.3 svglite_2.1.3  tidyr_1.3.0    ggplot2_3.4.4 
 [6] dplyr_1.1.4    moments_0.14.1 readODS_2.1.0  openxlsx2_1.2  readxl_1.4.3  

loaded via a namespace (and not attached):
 [1] sass_0.4.8            utf8_1.2.4            generics_0.1.3       
 [4] stringi_1.8.3         lattice_0.22-5        hms_1.1.3            
 [7] digest_0.6.33         evaluate_0.23         grid_4.3.2           
[10] fastmap_1.1.1         cellranger_1.1.0      rprojroot_2.0.4      
[13] jsonlite_1.8.8        Matrix_1.6-4          writexl_1.4.2        
[16] zip_2.3.0             backports_1.4.1       purrr_1.0.2          
[19] fansi_1.0.6           scales_1.3.0          doBy_4.6.20          
[22] textshaping_0.3.7     microbenchmark_1.4.10 jquerylib_0.1.4      
[25] cli_3.6.2             rlang_1.1.3           crayon_1.5.2         
[28] munsell_0.5.0         withr_2.5.2           cachem_1.0.8         
[31] yaml_2.3.8            tools_4.3.2           tzdb_0.4.0           
[34] colorspace_2.1-0      Deriv_4.1.3           broom_1.0.5          
[37] vctrs_0.6.5           R6_2.5.1              lifecycle_1.0.4      
[40] MASS_7.3-60           ragg_1.2.7            pkgconfig_2.0.3      
[43] pillar_1.9.0          bslib_0.6.1           gtable_0.3.4         
[46] glue_1.7.0            Rcpp_1.0.12           systemfonts_1.0.5    
[49] highr_0.10            xfun_0.41             tibble_3.2.1         
[52] tidyselect_1.2.0      rstudioapi_0.15.0     knitr_1.45           
[55] farver_2.1.1          htmltools_0.5.7       labeling_0.4.3       
[58] rmarkdown_2.25        readr_2.1.5           compiler_4.3.2       

23 Cite R packages used

library(grateful)
pkgs <- cite_packages(output = "table", include.RStudio = TRUE, out.dir = ".", 
                      bib.file = "initiationR", omit = "openxlsx")
knitr::kable(pkgs)
Package    Version  Citation
base       4.3.2    R Core Team (2023)
doBy       4.6.20   Højsgaard and Halekoh (2023)
grateful   0.2.4    Francisco Rodriguez-Sanchez and Connor P. Jackson (2023)
knitr      1.45     Xie (2014); Xie (2015); Xie (2023)
moments    0.14.1   Komsta and Novomestky (2022)
openxlsx2  1.2      Barbone and Garbuszus (2023)
readODS    2.1.0    Schutten et al. (2023)
rmarkdown  2.25     Xie, Allaire, and Grolemund (2018); Xie, Dervieux, and Riederer (2020); Allaire et al. (2023)
svglite    2.1.3    Wickham et al. (2023)
tidyverse  2.0.0    Wickham et al. (2019)
writexl    1.4.2    Ooms (2023)

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2023. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Barbone, Jordan Mark, and Jan Marvin Garbuszus. 2023. openxlsx2: Read, Write and Edit xlsx Files. https://janmarvin.github.io/openxlsx2/.
Francisco Rodriguez-Sanchez, and Connor P. Jackson. 2023. grateful: Facilitate Citation of R Packages. https://pakillo.github.io/grateful/.
Højsgaard, Søren, and Ulrich Halekoh. 2023. doBy: Groupwise Statistics, LSmeans, Linear Estimates, Utilities. https://CRAN.R-project.org/package=doBy.
Komsta, Lukasz, and Frederick Novomestky. 2022. moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests. https://CRAN.R-project.org/package=moments.
Ooms, Jeroen. 2023. writexl: Export Data Frames to Excel xlsx Format. https://CRAN.R-project.org/package=writexl.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schutten, Gerrit-Jan, Chung-hong Chan, Peter Brohan, Detlef Steuer, and Thomas J. Leeper. 2023. readODS: Read and Write ODS Files. https://CRAN.R-project.org/package=readODS.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Lionel Henry, Thomas Lin Pedersen, T Jake Luciani, Matthieu Decorde, and Lise Vaudor. 2023. svglite: An SVG Graphics Device. https://CRAN.R-project.org/package=svglite.
Xie, Yihui. 2014. “knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman & Hall/CRC.
———. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. https://yihui.org/knitr/.
———. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman & Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman & Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.