R Basics

Rüçhan Ekren
Jean Monlong
Margot Zahm

Why R?

Simple

  • Interpreted language (no compilation needed)
  • No manual memory management
  • Vectorized

Free

  • Widely used, vast community of R users
  • Good life expectancy

Flexible

  • Open-source: anyone can see/create/modify
  • Multiplatform: Windows, Mac, Unix… It works everywhere

Trendy

  • More and more packages
  • More and more popular among data scientists and (now) biologists

Lots of bioinfo packages

Workshop Setup

  • Open

Logo

Open a new R script file (File > New File > R Script)

Console

  • Where R is running
  • You can write and run the commands directly here
  • Your command executes when you press Enter

Script

  • A text file with commands, saved with the .R extension
  • Keeps a record of your analysis
  • Highly recommended
  • Send commands from the script to the console with the Run button

Tracking panel

  • Lists all the variables you created
  • Keeps a history of the commands you ran

Multipurpose panel

Browse files on your computer, view plots, manage packages, and read the help page of a function.

Caution

Write everything you do in scripts to avoid losing your work.

When you get an error

  1. Read the command, look for typos
  2. Read the error message
    1. and 2. again
  3. Raise your hand, someone will assist you

Tip

Solving errors is an important skill to learn.

Objects

Objects - Overview

Unit type

  • numeric e.g. numbers
  • logical: binary, two possible values (TRUE or FALSE)
  • character e.g. words between "
  • comment: line starting with #
# This is a comment line
# I can write everything I want

Tip

Comment your script to help you remember what you have done.

Objects - Overview

Complex type

  • vector: Ordered collection of elements of the same type
  • list: Flexible container, mixed type possible. Recursive

  • matrix: Table of elements of the same type
  • data.frame: Table of mixed type elements

Note

These are the basic complex types. Many other complex objects exist, and they combine these basic ones.

Objects - Naming conventions

  • Use letters, numbers, dot or underscore characters
  • Start with a letter (or a dot not followed by a number)
  • Some names are forbidden (e.g. if, else, TRUE, FALSE)
  • Correct: valid.name, valid_name, valid2name3
  • Incorrect: valid name, valid-name, 1valid2name3

Tip

Avoid uninformative names such as var1, var2. Use meaningful names: gene_list, nb_elements

Objects - Assign a value

Write the name of the object, followed by the assignment symbol (<- or =) and the value.
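A minimal sketch of assignment. The object names and values below are made up for illustration.

```r
# Assign a numeric value with the arrow symbol
nb_elements <- 12

# '=' also works for assignment at the top level of a script
gene_name = "BRCA1"

# Typing the name of an object prints its value
nb_elements
gene_name
```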

Objects - Arithmetic operators

You can use operators on objects to modify them. Depending on the object format, operators have different behaviors and some are forbidden.

  • addition: +
  • subtraction: -
  • multiplication: *
  • division: /
  • exponent: ^ or **
  • integer division: %/%
  • modulo: %%
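The operators above can be tried on a single numeric object. This is only an illustration; the object `x` and its value are arbitrary.

```r
x <- 7

x + 2    # addition: 9
x - 2    # subtraction: 5
x * 2    # multiplication: 14
x / 2    # division: 3.5
x ^ 2    # exponent: 49 (x ** 2 gives the same result)
x %/% 2  # integer division: 3
x %% 2   # modulo (remainder): 1
```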

Objects - Arithmetic operators

Exercise

  1. Create a numeric object
  2. Multiply it by 6
  3. Add 21
  4. Divide it by 3
  5. Subtract 1
  6. Halve it
  7. Subtract its original value

Objects - Arithmetic operators

Correction
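One possible solution, as a sketch. The name `my_number` and the starting value 10 are arbitrary choices; the steps simplify to 3 whatever the starting value.

```r
my_number <- 10                # 1. create a numeric object

result <- my_number * 6        # 2. multiply it by 6   -> 60
result <- result + 21          # 3. add 21             -> 81
result <- result / 3           # 4. divide it by 3     -> 27
result <- result - 1           # 5. subtract 1         -> 26
result <- result / 2           # 6. halve it           -> 13
result <- result - my_number   # 7. subtract the original value -> 3

result
```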

Objects - Arithmetic operators and errors

Some operations raise errors and others quietly return unexpected results.
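A short illustration of both cases. Wrapping the failing command in `try()` is only there so the rest of the example keeps running; in the console you would simply see the error message.

```r
# Adding a number to a character raises an error
res <- try("a" + 1, silent = TRUE)
inherits(res, "try-error")   # TRUE: the operation failed

# These commands run without any error, but the result may surprise you
1 / 0    # Inf
0 / 0    # NaN (not a number)
```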

Objects - Function

  • A function is a tool to create or modify an object
  • Format: function_name(object, parameter1 = ..., parameter2 = ...)
  • Read the help manual to know more about a function (help, ? or F1)

Note

Some functions are in the default installation of R. Other functions come from packages. You can also create your own functions.
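As an illustration of the call format, here is `round()` from the default installation, used with and without an optional parameter. The vector `values` is made up for the example.

```r
values <- c(4.256, 8.731, 1.52)

# Object only: optional parameters keep their default value
round(values)               # 4 9 2

# Object plus a named parameter
round(values, digits = 1)   # 4.3 8.7 1.5

# help(round) or ?round opens the help page of the function
```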

Vectors

Vectors Creation

  • c() Concatenate function is the most common way to create vectors
  • 1:10 Easy way to create a vector with numbers from 1 to 10

Extra ways to create vectors

  • seq() Create a sequence of numbers
  • rep() Repeat elements several times
  • runif() Simulate random numbers from Uniform distribution. Same for rnorm(), rpois()
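A quick sketch of these functions. The values are arbitrary; `luckyNumbers` and `tenOnes` reuse names that appear later in these slides.

```r
luckyNumbers <- c(4, 8, 15, 16, 23, 42)  # concatenate values into a vector
oneToTen <- 1:10                         # integers from 1 to 10

evens <- seq(2, 20, by = 2)              # 2, 4, ..., 20
tenOnes <- rep(1, 10)                    # the value 1 repeated 10 times
randomValues <- runif(5)                 # 5 random values between 0 and 1
```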

Exercise - Create some vectors

Instructions

  • Create a vector with 7 numeric values
  • Create a vector with 7 character values

Vectors Exploration

Using index/position between []
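For instance, with an illustrative vector (positions start at 1 in R):

```r
luckyNumbers <- c(4, 8, 15, 16, 23, 42)

luckyNumbers[3]         # third element: 15
luckyNumbers[2:4]       # elements 2 to 4: 8 15 16
luckyNumbers[c(1, 6)]   # first and last elements: 4 42
luckyNumbers[-1]        # everything except the first element

luckyNumbers[2] <- 0    # replace the second element
luckyNumbers            # 4 0 15 16 23 42
```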

Vectors Characteristics

  • length() Number of elements in the vector
  • names() Get or set the names of the vector’s values. NULL (no names) by default.
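A small sketch with made-up values and names:

```r
scores <- c(12, 7, 9)

length(scores)   # 3
names(scores)    # NULL: no names yet

# Set the names, then select an element by its name
names(scores) <- c("geneA", "geneB", "geneC")
scores["geneB"]  # the element named geneB, i.e. 7
```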

Vectors Manipulation

  • sort() Sort a vector
  • sample() Shuffle a vector
  • rev() Reverse a vector

Extra

  • log/log2/log10 Logarithm functions
  • sqrt Square-root function
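These functions return a new vector and leave the original untouched. An illustrative example:

```r
luckyNumbers <- c(16, 4, 42, 8)

sort(luckyNumbers)     # 4 8 16 42
rev(luckyNumbers)      # 8 42 4 16
sample(luckyNumbers)   # same values in a random order

log2(luckyNumbers)     # base-2 logarithm of each value
sqrt(luckyNumbers)     # square root of each value
```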

Vectors Summary

  • head()/tail() Print the first/last values
  • summary() Summary statistics
  • min()/max()/mean()/median()/var() Minimum, maximum, average, median, variance
  • sum Sum of the vector’s values
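For example, on an illustrative vector:

```r
luckyNumbers <- c(4, 8, 15, 16, 23, 42)

head(luckyNumbers, 3)   # first 3 values: 4 8 15
tail(luckyNumbers, 2)   # last 2 values: 23 42
summary(luckyNumbers)   # minimum, quartiles, mean and maximum at once

min(luckyNumbers)       # 4
max(luckyNumbers)       # 42
mean(luckyNumbers)      # 18
median(luckyNumbers)    # 15.5
sum(luckyNumbers)       # 108
```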

Vectors - Arithmetic operators

  • Simple arithmetic operations over all the values of the vector
  • Or values by values when using vectors of same length
  • Arithmetic operations: +, -, *, /
  • Others exist but let’s forget about them for now
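A minimal illustration of both behaviours, using two made-up vectors of the same length:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# One value applied to all the elements
a * 2    # 2 4 6
a + 100  # 101 102 103

# Element by element when both vectors have the same length
a + b    # 11 22 33
b / a    # 10 10 10
```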

Exercise

Instructions

  1. Create a vector with 100 random numeric values (hint: runif or rnorm)
  2. Subtract the average of those values
  3. Divide by the standard deviation
  4. Multiply all the values by 10
  5. Add 100 to all the values
  6. Compute summary statistics (minimum/maximum, median, mean)
  7. Compute the standard deviation of the new values

Matrix

Matrix - Creation

matrix(): Creates a matrix from a vector

rbind()/cbind(): Binds multiple vectors of the same length to create a matrix
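A short sketch of both approaches. Object names are illustrative. Note that `matrix()` fills the matrix column by column by default.

```r
# 3 rows and 4 columns, filled column by column with 1 to 12
mat <- matrix(1:12, nrow = 3, ncol = 4)
mat

first <- c(1, 2, 3)
second <- c(4, 5, 6)

byRow <- rbind(first, second)   # each vector becomes a row: 2 x 3
byCol <- cbind(first, second)   # each vector becomes a column: 3 x 2
```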

Matrix - Manipulation

mat[i,j]: To select element at row i and column j. i and j can be vectors to select multiple elements.

t(): To transpose the matrix

rbind()/cbind(): To concatenate matrix vertically or horizontally
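For example, on a small illustrative matrix:

```r
mat <- matrix(1:12, nrow = 3, ncol = 4)

mat[2, 3]        # row 2, column 3: 8
mat[1, ]         # the whole first row: 1 4 7 10
mat[, c(1, 4)]   # columns 1 and 4

mat[2, 3] <- 0   # replace one element

tmat <- t(mat)                     # transposed: 4 rows, 3 columns
taller <- rbind(mat, 101:104)      # adds a row at the bottom
wider <- cbind(mat, c(0, 0, 0))    # adds a column on the right
```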

Exercise

Instructions

  1. Create a matrix with 10 rows and 4 columns with numbers from 1 to 40
  2. Change the element in row 6, column 2 into value 42
  3. Fill the 3rd row with ones
  4. Remove the last column

Matrix - Manipulation

dim(): Returns the dimensions of the matrix, i.e. number of rows and columns

rownames()/colnames(): Get or set the names of the rows/columns
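An illustrative example; the row and column names are made up.

```r
mat <- matrix(1:6, nrow = 2)

dim(mat)   # 2 3: two rows, three columns

rownames(mat) <- c("gene1", "gene2")
colnames(mat) <- c("s1", "s2", "s3")

# Names can then be used instead of positions
mat["gene2", "s3"]   # 6
```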

Matrix - Operations

length(), head(), tail()

For numeric matrix: min(), max(), sum(), mean(), summary()

Arithmetic operations: +, -, *, /

Exercise

Instructions

  1. Create a matrix with 100 rows and 4 columns with random numbers
  2. Name the columns
  3. Add 2 to each element of the first column
  4. Multiply all the elements of the second column by 4
  5. Find which column has the largest mean value
  6. Find which column has the largest value

Matrix - apply() function

  • Apply a function to each row (or column) of a matrix
  • No manual iteration, the loop is implicit
  • Second parameter: 1 means rows and 2 means columns
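A minimal sketch on a small illustrative matrix:

```r
mat <- matrix(1:6, nrow = 2)   # row 1: 1 3 5, row 2: 2 4 6

rowTotals <- apply(mat, 1, sum)   # one value per row: 9 12
colMaxima <- apply(mat, 2, max)   # one value per column: 2 4 6
```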

Exercise

Instructions

  1. Create a matrix with 100 rows and 100 columns with random numbers
  2. Compute the median value of each column
  3. What is the minimal median value? Maximal?

Matrix - shortcut to apply

Some functions are shortcuts for common apply() calls on a matrix x:

  • rowSums(x) equivalent to apply(x, 1, sum)
  • colSums(x) equivalent to apply(x, 2, sum)
  • rowMeans(x) equivalent to apply(x, 1, mean)
  • colMeans(x) equivalent to apply(x, 2, mean)
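The equivalence can be checked on a small illustrative matrix:

```r
mat <- matrix(1:6, nrow = 2)

# Row-wise shortcuts match apply() with 1
all(rowSums(mat) == apply(mat, 1, sum))     # TRUE
all(rowMeans(mat) == apply(mat, 1, mean))   # TRUE

# Column-wise shortcuts match apply() with 2
all(colSums(mat) == apply(mat, 2, sum))     # TRUE
all(colMeans(mat) == apply(mat, 2, mean))   # TRUE
```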

Reminder of last session

Objects

  • There are three basic types of objects: numeric, character and boolean
  • Basic objects can be combined into more complex objects, such as vectors, lists, matrices and data.frames
  • Objects need to be ‘stored’ in variables so they can be easily retrieved and modified
  • Basic and complex numeric objects can handle arithmetic operations
  • Functions can be applied to basic and/or complex objects depending on the type of data they are designed to process

Reminder of last session

Vectors

  • A vector is a sequence of elements that are all of the same type
  • There are multiple ways to create it
  • [] are used to manipulate elements of a vector
  • Numeric vectors support arithmetic operations
  • Elements in a vector can be named
  • Various functions can be used with vectors to modify or summarize them

Reminder of last session

Matrix

  • A matrix is a table containing elements of one data type
  • matrix(): function to create a matrix
  • [i,j]: used to manipulate and replace elements
  • rownames(), colnames(): get/set the row or column names
  • Numeric tables support arithmetic operations
  • apply(): run a function on each row or column

Data Frames

Data frames - Creation

A data frame is like a matrix but it can be composed of different data types.

  • data.frame() To create a data frame

  • as.data.frame() To transform a matrix into a data frame
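A small sketch; the column names and values are invented for the example.

```r
# Each argument becomes a column; columns can have different types
df <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 expression = c(2.5, 0.1, 7.3),
                 detected = c(TRUE, FALSE, TRUE))
df

# Convert an existing matrix into a data frame
mat <- matrix(1:6, nrow = 3)
dfFromMat <- as.data.frame(mat)
```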

Data frames - Manipulation

  • [] or $name To select elements of a data.frame
  • t() To transpose
  • head()/tail() To show parts of a data frame
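For example, with an illustrative data frame:

```r
df <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 expression = c(2.5, 0.1, 7.3))

df$expression          # one column, selected by name
df[2, ]                # second row, all columns
df[2, "expression"]    # one element: 0.1
df[, 1]                # first column, selected by position

head(df, 2)            # first two rows
```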

Data frames - Summarize and concatenation

dim(): Returns the dimensions of the data frame, i.e. number of rows and columns

rownames()/colnames(): Get or set the names of the rows/columns

rbind()/cbind(): To concatenate data frames vertically or horizontally

Exercise

Instructions

  1. Create a data frame of 100 rows with a character column and two numeric columns
  2. Display the first few rows of this data frame
  3. Calculate the mean of the two numeric columns
  4. Add a new column containing boolean values

Data frames - Calculation

You can use arithmetic operations (+-*/) only on numeric columns. Functions like apply() can be used as with a matrix.
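A minimal sketch with made-up data. The character column is left out before calling `apply()`, so only numeric values are processed.

```r
df <- data.frame(gene = c("geneA", "geneB"),
                 sample1 = c(1, 2),
                 sample2 = c(3, 6))

# Arithmetic on one numeric column
df$sample1 <- df$sample1 * 10   # now 10 20

# Mean of each row, using the numeric columns only
rowAverages <- apply(df[, c("sample1", "sample2")], 1, mean)   # 6.5 13

# Mean of each numeric column
colAverages <- colMeans(df[, 2:3])   # sample1: 15, sample2: 4.5
```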

Exercise

Instructions

  1. Create a data frame with 2000 random gene names, and 3 columns with random numerics between 0 and 1
  2. Multiply numeric column 1 by 100, numeric column 2 by 200 and numeric column 3 by 300
  3. Find the column with the highest average expression
  4. Find the gene with the lowest median expression

Import/Export data

Import/Export data - Where is my data ?

File path

To tell R where to find your data, you need to specify the path to the file. There are two types of paths for the same file:

  • Absolute Path: This is the full path to the file, which you can see in the address bar when you are in a folder.
  • Relative Path: This describes how to navigate from one folder to another to reach the file.

Paths are character strings, so they are always enclosed in quotes (").

Import/Export data - Where is my data ?

Working directory

Change the directory where R is working
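A sketch of the usual commands. The folder in the `setwd()` line is a placeholder, which is why that line is commented out; replace it with a folder that exists on your computer.

```r
# Where is R working right now?
currentDir <- getwd()
currentDir

# Change the working directory (placeholder path, adapt before running)
# setwd("path/to/my/folder")

# List the files in the working directory
list.files()
```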


Import/Export data - Read a text file

Easy but important

  • What data structure is more appropriate? vector, matrix?
  • Does R read/write the file the way you want?
  • The extra parameters of the functions are your allies

Import/Export data - Read a text file

read.table()

To read a data.frame from a multi-column file

  • file=the path to the file
  • header= Set to TRUE if the 1st line corresponds to column names
  • as.is= Set to TRUE to keep character values as plain characters (not factors), recommended
  • sep= The character that separates the columns, e.g. , or \t (tabs)
  • row.names= The column number to use as row names
input.data = read.table("path/to/my/file.txt", as.is = TRUE,
                        header = TRUE, sep = "\t", row.names = 1)

Exercise

Instructions

Import dataForBasicPlots.tsv into an object called mat.ge

  1. How many genes are there?
  2. How many samples are there?
  3. Print the first 5 rows and columns

Import/Export data - Write a text file

write.table()

To write a data.frame in a multi-column file

  • df the data.frame to write
  • file= the file name
  • col.names= TRUE print the column names in the first line
  • row.names= TRUE print the row names in the first column
  • quote= TRUE surround character values with double quotes
  • sep= the character that separates each column. " " (a single space) by default.
write.table(resToWrite, file = "path/to/file.txt", col.names = TRUE,
            row.names = FALSE, quote = FALSE, sep = "\t")
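To check that R writes and reads the file the way you want, you can write a small table and read it back. This sketch uses a temporary file (`tempfile()`) and made-up data so that it runs anywhere.

```r
resToWrite <- data.frame(gene = c("geneA", "geneB"),
                         expression = c(1.5, 3))

# Temporary file used only for this example
out.file <- tempfile(fileext = ".txt")

write.table(resToWrite, file = out.file, col.names = TRUE,
            row.names = FALSE, quote = FALSE, sep = "\t")

# Read the file back with matching parameters
input.data <- read.table(out.file, as.is = TRUE,
                         header = TRUE, sep = "\t")
dim(input.data)   # 2 2
```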

Import/Export data - R objects

  • saveRDS() Save one R object into a file. Use .rds extension
  • save() Save multiple R objects into a file. Use .RData as extension
  • save.image() Save the entire R environment
  • readRDS() Read an R object from a .rds file
  • load() Load R objects from a .RData file
saveRDS(luckyNumbers, file = "my_vector.rds")
new_luckyNumber = readRDS("my_vector.rds")

save(luckyNumbers, tenOnes, mat, file = "uselessData.RData")
load("uselessData.RData")

Import/Export data - Save plots

Easy way

Import/Export data - Save plots

Automatic way

  1. Open the connection to an output file (pdf(), png(), jpeg()…)
  2. Plot as usual
  3. Close the connection with dev.off()
pdf("/path/to/myNicePlot.pdf")
plot(...)
dev.off()

Basic plotting

Basic plotting - Functions

hist(x)

Plot a histogram, i.e. the distribution of the values of a vector.

x The vector with the values to plot

Basic plotting - Functions

barplot(x)

Plot a barplot, i.e. one bar for each value of a vector.

x The vector with the values to plot

Basic plotting - Functions

plot(x, y)

Plot one vector against the other using points for each element.

x The first vector to plot (x-axis)

y The second vector to plot (y-axis)

type How the data are drawn. “p” for points, “l” for lines, “b” for both points and lines

Basic plotting - Functions

boxplot(x)

Plot the distribution of variables.

x The matrix with the values to plot (one box per column)

Basic plotting - Common parameters

main= A title for the plot

xlab=/ylab= A title for the x/y axis

xlim=/ylim= A vector of size two defining the desired limits on the x/y axis

Basic plotting - Extra

Extra parameters

col= The colour of the points/lines

pch= The shape of the points

lty= The shape of the lines

Extra functions

lines() Same as plot() but superimposed on the existing plot

abline() Draw a vertical/horizontal line
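A sketch combining these parameters and functions on simulated data. The values, titles and colours are arbitrary, and the plots are saved into a temporary PDF file so the example runs without a display.

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

plot.file <- tempfile(fileext = ".pdf")   # temporary file for this example
pdf(plot.file)

# Scatter plot with a title, axis labels, axis limits, colour and point shape
plot(x, y, main = "y against x", xlab = "x values", ylab = "y values",
     xlim = c(-4, 4), ylim = c(-4, 4), col = "blue", pch = 2)
lines(c(-4, 4), c(-4, 4), col = "red")   # superimpose a line on the plot
abline(v = mean(x), lty = 2)             # dashed vertical line at the mean of x
abline(h = 0, lty = 3)                   # dotted horizontal line at 0

# Histogram of the same values, on a second page of the PDF
hist(x, main = "Distribution of x", xlab = "x values")

dev.off()
```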

Exercise

Instructions

Generate plots using the dataForBasicPlots.tsv file and save them:

  1. a boxplot of columns 1 to 10.
  2. the distribution of the median gene expression. Add a vertical dotted line to mark the average of the median gene expression.
  3. the expression of gene 333 against gene 666. Superimpose in red triangles the expression of gene 333 against gene 667.

Reminder of last session

“Table” objects

  • A matrix is a table containing elements of one data type

  • A data.frame is a table with columns of different data types

  • matrix(), data.frame(): functions to create a matrix or a data frame

  • [i,j]: used to manipulate and replace elements

  • rownames(), colnames(): get/set the row or column names

  • Numeric columns support arithmetic operations

  • apply(): run a function on each row or column

Reminder of last session

Read and write files in R

  • Set your working directory, i.e. the folder where you are working (Session > Set Working Directory)

  • You can get the path of your file if you copy/paste it on R

  • read.table(): Basic function to read a table

  • write.table(): Basic function to write a table

  • pdf()/png()/jpeg() ... dev.off(): Save a plot in PDF/PNG/JPEG format automatically

  • You can save R objects/environments (saveRDS(), save(), save.image())

  • You can load R objects/environments (readRDS(), load())

Reminder of last session

Plot

  • There are different functions to plot data

  • barplot()

  • hist()

  • boxplot()

  • plot() : Scatter plot with points and/or lines

  • Some parameters can be used to customize your plot: main=, xlab=, xlim=

Conditions

Conditions - Logical values

Logical type

TRUE / FALSE values

Conditions - Logical tests

== Are both values equal?

> or >= Is the left value greater than (or equal to) the right value?

< or <= Is the left value smaller than (or equal to) the right value?

! NOT operator, negates the value

| OR operator, returns TRUE if either is TRUE

& AND operator, returns TRUE if both are TRUE
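Each test returns a logical value. An illustration with an arbitrary number:

```r
x <- 5

x == 5          # TRUE
x > 7           # FALSE
x <= 5          # TRUE
!(x == 5)       # FALSE: negation of TRUE
x < 3 | x > 4   # TRUE: the second test is TRUE
x > 3 & x < 5   # FALSE: the second test is FALSE
```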

Conditions - Vectorized operations

Any logical test can be vectorized.

which() returns the indices of the elements that are TRUE
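For example, a test applied to a whole vector returns one logical value per element, which can be used to select elements. The vector is illustrative.

```r
values <- c(2, 9, 4, 7)

values > 3                   # FALSE TRUE TRUE TRUE
which(values > 3)            # 2 3 4: positions where the test is TRUE

selected <- values[values > 3]   # keep only the elements passing the test
selected                         # 9 4 7
```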

Exercise

Instructions

  1. Create a vector of random integer numbers between 0 and 10
  2. Remove values below 3
  3. Change to 8 any value higher than 8
  4. On mat.ge, remove all genes with median expression lower than 1

Conditions - Testing conditions

if else

Test a condition. If TRUE, runs some instructions. If FALSE, runs something else or nothing.

# example 1
if( Condition ){
... Instructions
} else {
... Instructions
}
# example 2
ifelse(condition, instruction if TRUE, instruction if FALSE)
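A runnable version of both forms, with made-up values and labels. `if` tests a single condition, while `ifelse()` tests every element of a vector.

```r
# if / else on a single value
score <- 7
if (score > 5) {
  label <- "pass"
} else {
  label <- "fail"
}
label   # "pass"

# ifelse() on a whole vector
scores <- c(2, 7, 9)
labels <- ifelse(scores > 5, "pass", "fail")
labels  # "fail" "pass" "pass"
```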

Exercise

Instructions

Write an if block that automatically classifies the expression of the first gene of mat.ge into:

  • ‘high’ if its maximum value is higher than 4
  • ‘low’ if not

Functions

Functions - Definition

  • Name of the function with parameters between parentheses
  • Takes input(s) and returns something, e.g. mean(luckyNumbers)
  • Can have mandatory and optional parameters

Functions - Creation

  • Start with function() to define a function
  • All the objects created within the function are temporary
  • return() specifies what will be returned by the function
myFunctionName <- function(input.obj1, second.input.obj) {
  # Instructions on 'input.obj1' and 'second.input.obj'
  # to generate 'my.output.obj'
  my.output.obj <- input.obj1 + second.input.obj  # placeholder instruction

  return(my.output.obj)
}

myFunctionName(1, c(2, 4, 5))

Functions - Concept

Functions - Example

This function takes a vector as input and:

  • removes values lower than 3
  • replaces values higher than 8 by 8
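One way such a function could be written, as a sketch. The function name `removeLowCapHigh` and the test vector are made up for the illustration.

```r
removeLowCapHigh <- function(vec) {
  # Keep only the values that are not lower than 3
  vec <- vec[vec >= 3]
  # Replace values higher than 8 by 8
  vec[vec > 8] <- 8
  return(vec)
}

removeLowCapHigh(c(1, 5, 10, 3, 9))   # 5 8 3 8
```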

Exercise

Instructions

Create a function that classifies the average value of a vector. It returns:

  • small if the average is below 3
  • medium if the average is between 3 and 7
  • high if the average is above 7

Test your function on vectors with random numbers from 0 to 10.

Final exercise

Instructions

  1. Load metadata.RData file. It has a groups vector with either case/control status for the mat.ge samples.
  2. Write a function that would compute the difference between the gene expression of cases and controls.
  3. Apply this function to each gene in mat.ge
  4. Plot the distribution of the results

Loops

Loops - Definition

for loops

Iterates over the elements of a vector and runs instructions for each of them.

for(v in vec){
... Instruction
}

while loops

Runs instructions as long as a condition is TRUE.

while( CONDITION ){
... Instruction
}
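Two small runnable loops following the templates above. The values are arbitrary.

```r
# for loop: add up the elements of a vector one by one
total <- 0
for (v in c(1, 2, 3)) {
  total <- total + v
}
total   # 6

# while loop: double a number as long as it is below 100
i <- 1
while (i < 100) {
  i <- i * 2
}
i       # 128
```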

Exercise

Instructions

Write a function that computes the mean values of the columns:

  • using the apply function
  • using a for loop
  • using a while loop

Extra

Extra - Character operations

paste() Pastes several characters into one

grep() Searches for a pattern in a vector and returns the index when matched

grepl() Searches a pattern in a vector and returns TRUE if found

strsplit() Splits a string
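An illustration of these four functions; the gene names and the string to split are only examples.

```r
genes <- c("BRCA1", "TP53", "BRCA2")

pasted <- paste("gene", genes, sep = "_")   # "gene_BRCA1" "gene_TP53" "gene_BRCA2"

grep("BRCA", genes)    # 1 3: positions of the matching elements
grepl("BRCA", genes)   # TRUE FALSE TRUE

# strsplit() returns a list with one element per input string
pieces <- strsplit("chr1:1000", split = ":")
pieces[[1]]            # "chr1" "1000"
```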

Extra - Type coercion

  • Automatic conversion of an object to another type, e.g. numeric -> character, logical -> numeric
  • Keep it in mind when debugging
  • Sometimes useful
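A few illustrative cases of automatic and explicit conversion:

```r
# Mixing types in a vector: everything becomes character
mixed <- c(1, "a", TRUE)
mixed                        # "1" "a" "TRUE"

# Logical values are converted to 1 (TRUE) and 0 (FALSE) in arithmetic
nbTrue <- sum(c(TRUE, FALSE, TRUE))
nbTrue                       # 2, a handy way to count TRUE values

# Explicit conversion with the as.*() functions
as.numeric("42")             # 42
as.character(3.5)            # "3.5"
```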

Extra - One-liner quiz

Instructions

Write R commands to address each question. Only one-line commands are allowed. The shorter the better.

  1. From a numeric matrix, compute the proportion of columns with an average value higher than 0.
  2. From a numeric matrix, print the name of the column with the highest value.
  3. From a numeric matrix, print the rows with only positive values.