Day 3. Data Analysis and Graphics with R.

R is a powerful programming language and environment for statistical computing, data analysis, and graphics. It is typically used to explore and understand data in an open-ended, highly interactive, iterative way. Learning R will give you the freedom to experiment and solve problems during data analysis, which is exactly what we need as bioinformaticians and data scientists.

Before getting our hands dirty with real data in R, we need to learn the basics of the R language. Even if you’ve poked around in R and seen these concepts before, I still recommend that you follow along and complete the free online interactive learning tutorial. It will take you through a gentle introduction to R syntax and some of the major R data structures (vectors, matrices, data.frames, and lists) that we will cover in more detail in class.
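As a small preview of those four data structures, here is a minimal sketch in base R. The gene names and values are made up for illustration and do not come from the course datasets.

```r
# A vector: an ordered collection of values that all share one type
expr <- c(2.5, 3.1, 4.8)

# A matrix: a two-dimensional arrangement of values of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data.frame: a table whose columns may hold different types
# (hypothetical gene names, for illustration only)
df <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 expr = expr,
                 stringsAsFactors = FALSE)

# A list: a container whose elements can be of any type or length
lst <- list(values = expr, counts = m, table = df)

# str() prints a compact summary of the structure of any R object
str(lst)
```

Running `str()` on each object separately is a quick way to see how R stores it.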


Schedule:

Session  Time                 Topics
I        9:00-10:15 AM        Introduction to R
         10:15-10:30 AM       Coffee Break
II       10:30 AM-12:00 PM    R Control Structures and Functions
         12:00-1:00 PM        Lunch
III      1:00-2:15 PM         Data Exploration and Visualization in R
         2:15-2:30 PM         Coffee Break
IV       2:30-4:00 PM         Working with R packages from CRAN & Bioconductor


Instructors:

Armand Bankhead (AB)


Topics:

I) Introduction to R [1.25 hr] slides

  • What is R and Why Use it?
  • Ways to Use R
  • R as a Statistical Programming Language
  • Writing and Running R Scripts
  • Data Types
  • Data Structures
  • Vector and Matrix Operations

--- Coffee Break [15 mins] ---

II) R Control Structures and Functions [1.5 hr] slides

  • Working Directory
  • Reading and Writing Data in R
  • Factors
  • Using Indexes
  • Merging Data Frames
  • Functions
  • Program Control Structures

--- Lunch Break [1 hr] ---

III) Data Exploration and Visualization in R [1.25 hr] slides

  • Summarizing Data in R
  • Creating Plots in R

--- Coffee Break [15 mins] ---

IV) Working with R packages from CRAN & Bioconductor [1.5 hr] slides

  • CRAN (Comprehensive R Archive Network)
  • Bioconductor, a bioinformatics package repository
  • Package Installation
  • Package Documentation
  • Package Source Code
  • Tidyverse
  • Example: biomaRt Bioconductor Package

--- End/Wrap-Up ---


Datasets

Pedersen Log2RPKM Gene Expression Data: file1 file2

Reference material

RStudio cheatsheet: A well-designed reference card for RStudio features

ggplot2 cheatsheet: A pragmatic reference for creating ggplot2 visualizations

R for Data Science: An O’Reilly book, available free online, that teaches you how to do data science with R

Class notes on R language basics

Class notes on useful R functions for working with strings