BIOSTAT/BIOINF/HUMGEN606 (2019) Day 5

Preamble

This is an R-version of Biocomputing Bootcamp Dat 5 material, which was written in python notebook first at https://colab.research.google.com/drive/10toqlqt1HYluzFLVW5a13cWzbWoXBuob. It is recommended to finish the python version first, unless the student definitely wants to skip the python part. Many parts of of this notebook assumes that you’ve already finished the python part, although it keeps redundant information for the R-only users.

Overview of single cell RNA-seq (scRNA-seq) data

Droplet-based single-cell RNA-seq is a new technology that allows us to profile thousands of single cell transcriptomes with a single library preparation. We will use publicly available dataset to perform basic analysis using the skillsets we have learned during this week. See the overview lecture (posted in the course web site) to have some basic understandings of scRNA-seq technology

Downloading PBMC 3k data

Download the PBMC 3k data

We will download 2,700 PBMC cells publicly available from 10X Chromium web site. Please use a web browser to download the file.

Uncompress the downloaded file.

Uncompress the downloaded file using a file browser (Windows Explorer or Finder). Extract the file, then you will see a directory named filtered_gene_bc_matrices. Here we will assume that you downloaded the files to a directory named ~/Downloads (~ represents your current home directory in most systems). If you downloaded the file in another directory, you will need to make a corresponding change.

[DIY] Check whether the file exists or not.

Use the following codes to make sure that the file exists. You may need to modify indir in the code below. We will read first ten tokens to see how the files look like.

## [DIY] Make corresponding changes in the code below if needed.
indir = "~/Downloads/filtered_gene_bc_matrices/hg19"
print(scan(paste0(indir,"/barcodes.tsv"),what=character(),nmax=10))

##  [1] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1"
##  [4] "AAACCGTGCTTCCG-1" "AAACCGTGTATGCG-1" "AAACGCACTGGTAC-1"
##  [7] "AAACGCTGACCAGT-1" "AAACGCTGGTTCTT-1" "AAACGCTGTAGCCA-1"
## [10] "AAACGCTGTTTCTG-1"

print(scan(paste0(indir,"/genes.tsv"),what=character(),nmax=10))

##  [1] "ENSG00000243485" "MIR1302-10"      "ENSG00000237613"
##  [4] "FAM138A"         "ENSG00000186092" "OR4F5"          
##  [7] "ENSG00000238009" "RP11-34P13.7"    "ENSG00000239945"
## [10] "RP11-34P13.8"

print(scan(paste0(indir,"/matrix.mtx"),what=character(),nmax=10))

##  [1] "%%MatrixMarket" "matrix"         "coordinate"     "real"          
##  [5] "general"        "%"              "32738"          "2700"          
##  [9] "2286884"        "32709"

Q Do you understand what how the file look like? If unclear, use text editor to load the file, or use read.table() function instead to read the data in a tabular form.

Reading the input file into a matrix

[DIY] Write read_dge() function

Your first assignment is to write a R function that takes three input files - bcdf (e.g. barcode.tsv), genef (e.g. genes.tsv), mtx (e.g. matrix.mtx) - to return a numpy matrix and associated lists. The detailed specification of the function is given in the code segment below.

You may use the read.table() or scan() functions to read three files

#' read_dge() : function to read digital expression matrix in 10x genomics format
#' 
#' This function takes three files as input to return a digital expression matrix.
#' Each input file should contain information of cell barcodes, genes, and UMI counts with 
#' specific formats described below. The function will return a matrix where each row
#' represents a droplet barcode and each column represents a gene.
#' 
#' @param bcdf  Barcode file (typically barcodes.tsv) containing one cell barcode at each line
#' @param genef Gene file (typically genes.tsv) containing [GENE_ID] and [GENE_NAME] at each line.
#' @param mtxf  Sparse matrix file (typically matrix.mtx) containing [GENE_INDEX] [BARCODE_INDEX] #'              [UMI_COUNT] from 4-th line for any non-zero entry.
#' @return A matrix of ([TOTAL_BARCODE_COUNTS] x [TOTAL_GENE_COUNTS]) containing a digital
#'         expression matrix. Each row name represents barcodes and column name represents
#'         gene name (not gene id)
read_dge <- function(bcdf, genef, mtxf) {
    bcds <- read.table(bcdf)
    genes <- read.table(genef)
    mtxs <- read.table(mtxf,skip=3)
    nbcds <- nrow(bcds)
    ngenes <- nrow(genes)
    mat <- matrix(0, nbcds, ngenes)
    rownames(mat) <- bcds[,1]
    colnames(mat) <- genes[,2]
    for(i in 1:nrow(mtxs)) {
      mat[mtxs[i,2],mtxs[i,1]] = mtxs[i,3]
    }
    return(mat)
}

Using read_dge() function to load digital expression matrix

Once you implemented the read_dge() function, you can run the function using the following specific arguments.

# Run read_dge() function. This may take a while
dge <- read_dge(paste0(indir,"/barcodes.tsv"),paste0(indir,"/genes.tsv"),paste0(indir,"/matrix.mtx"))

Sanity check of the return values

Use the following codes to see where the returned matrix conforms to the expected format. It is expected to be a 2700 x 32738 matrix, with first row name AAACATACAACCAC-1 and the first column name MIR1302-10.

print(dim(dge))

## [1]  2700 32738

print(dge[1:5,1:5])

##                  MIR1302-10 FAM138A OR4F5 RP11-34P13.7 RP11-34P13.8
## AAACATACAACCAC-1          0       0     0            0            0
## AAACATTGAGCTAC-1          0       0     0            0            0
## AAACATTGATCAGC-1          0       0     0            0            0
## AAACCGTGCTTCCG-1          0       0     0            0            0
## AAACCGTGTATGCG-1          0       0     0            0            0

Summarize and visualize basic statistics

Using ggplot2 package, we will summarize and visualize the basic statistics from digital expression matrix.

[DIY] Asking basic questions about the digital expression matrix

We do not know much about the data we have just loaded yet. Let’s try to find out the answer for the following questions.

What is the total number of UMis across all genes and barcodes?
How sparse is the digital matrix? How many barcode/gene pairs have nonzero elements?
Are theres gene that are not expressed at all? If so, how many genes as such?

Each of these questions can be answered with one or two lines of simple R code.

# DIY -- Write simple lines of codes to answer to each of these basic questions

# 1. What is the total number of UMis across all genes and barcodes?
# Write your code here
print(sum(dge))

## [1] 6390631

# 2. How sparse is the digital matrix? How much proportion of barcode/gene pairs have nonzero elements?
# Write your code here
print(sum(dge>0) / length(dge))

## [1] 0.02587189

# 3. Are theres gene that are not expressed at all? If so, how many such genes exist?
# Write your code here
print(sum(colSums(dge)>0))

## [1] 16634

[DIY] Collect summary statistics and visualize them for each barcoded droplet

The next, task is to collect basic summary statistics for each barcoded droplet. * Create a dataframe named df_bcd with the following two summary statistics for each barcode. * nUMI : The total sum of UMI counts across all genes for given barcoded droplet. * nGene : The number of expressed genes (genes with positive UMI counts) for given barcoded droplet. * The row names of dataframe should be the droplet barcode. * Upon successful creation, the first five rows should look like the following example.

print(head(df_bcd))
                    nUMI  nGene
AAACATACAACCAC-1  2421.0    781
AAACATTGAGCTAC-1  4903.0   1352
AAACATTGATCAGC-1  3149.0   1131
AAACCGTGCTTCCG-1  2639.0    960
AAACCGTGTATGCG-1   981.0    522

# Summarize basic statistics for each barcode 
# [DIY] Here, create a dataframe named df_bcd, with nUMI and nGene. 
df_bcd <- data.frame(nUMI=rowSums(dge), nGene=rowSums(dge>0), row.names=rownames(dge))
print(head(df_bcd))

##                  nUMI nGene
## AAACATACAACCAC-1 2421   781
## AAACATTGAGCTAC-1 4903  1352
## AAACATTGATCAGC-1 3149  1131
## AAACCGTGCTTCCG-1 2639   960
## AAACCGTGTATGCG-1  981   522
## AAACGCACTGGTAC-1 2164   782

Summarize the summary statistics for each barcoded droplet

Use summary() function to summarize each attribute in the dataframe created above.

# Calculate the basic summary statistics
print(summary(df_bcd))

##       nUMI           nGene       
##  Min.   :  548   Min.   : 212.0  
##  1st Qu.: 1758   1st Qu.: 690.0  
##  Median : 2197   Median : 817.0  
##  Mean   : 2367   Mean   : 847.0  
##  3rd Qu.: 2763   3rd Qu.: 953.2  
##  Max.   :15844   Max.   :3422.0

Draw the histogram of each summary statistics.

You may use ggplot2 and geom_histogram to draw the distribution of each variable. as shown in the examples below.

library(ggplot2)
# draw a histogram of nUMI and nGene
# Check the output and see if they make sense.
print(ggplot(data=df_bcd) + aes(x=nGene) + geom_histogram(binwidth=100,color='black',fill='grey'))

print(ggplot(data=df_bcd) + aes(x=nUMI) + geom_histogram(binwidth=100,color='black',fill='grey'))

[DIY] Visualize the relationship between the summary statistics

Draw a scatterplot where x-axis is nGene and y-axis is nUMI using the dataframe df_bcd. Using ggplot2 package is recommended.

# Show the relationship between nUMI and nGene by cells
# [DIY] Here, create a ggplot as specified.
print(ggplot(data=df_bcd) + aes(x=nGene,y=nUMI) + geom_point())

[DIY] Add transparency to the plots

The plots you create will have area with too much overlaps with points to understand which area is more densely populated with points. Add alpha parameter (e.g. value of 0.1) to add transparency in the points.

# Add alpha = 0.1 parameter to add transparency
# [DIY] Here, enter the modified line of ggplot command
print(ggplot(data=df_bcd) + aes(x=nGene,y=nUMI) + geom_point(alpha=0.1))

[DIY] Change the axis in log scale.

You may have noticed that the distribution may look more symmetric if it was plotted in log-scale. Change both x-axis and y-axis into log-scale.

Note that it is safer to plot nGene+1 and nUMI+1 to avoid losing data points with zero values.

# Plot nGene+1 and nUMI+1 in log10 scale in both axis
# [DIY] Here, enter the modified line of ggplot command
print(ggplot(data=df_bcd) + aes(x=nGene+1,y=nUMI+1) + geom_point(alpha=0.1)+scale_x_log10()+scale_y_log10())

[DIY] Collect summary statistics and visualize them for each gene

Similarly, we can collect basic summary statistics for each gene, across all barcodes. * Create a dataframe named df_gene with the following two summary statistics for each gene. * nUMI : The total sum of UMI counts across all barcodes for the given gene. * nBCD : The number of droplet barcodes expressing the gene (with positive UMI counts). * The dataframe should be indexed by each gene name. * Upon successful creation, the first five rows should look like the following example.

print(head(df_gene))
              nUMI  nBCD
MIR1302-10     0       0
FAM138A        0       0
OR4F5          0       0
RP11-34P13.7   0       0
RP11-34P13.8   0       0
AL627309.1     9       9

# Summarize basic statistics for each gene
# [DIY] Here, create a dataFrame named df_gene, with nUMI and nBCD as attributes 
# This task can be done in as few as one line
# Check out pandas cheatsheet and tutorial if needed.
df_gene  = data.frame(gene=colnames(dge), nUMI = colSums(dge), nBCD = colSums(dge>0))
print(head(df_gene)) # This command will allow you to check whether the output is the same as expected.

##           gene nUMI nBCD
## 1   MIR1302-10    0    0
## 2      FAM138A    0    0
## 3        OR4F5    0    0
## 4 RP11-34P13.7    0    0
## 5 RP11-34P13.8    0    0
## 6   AL627309.1    9    9

Summarize the summary statistics for each gene

Use summary() function to summarize each attribute in the pandas dataframe created above.

Q Compared to df_bcd examples, what is the outstanding difference in the summary statistics by gene?

# calculate the basic summary statistics for each gene
print(summary(df_gene))

##        gene            nUMI               nBCD        
##  7SK     :    4   Min.   :     0.0   Min.   :   0.00  
##  NPIPA7  :    3   1st Qu.:     0.0   1st Qu.:   0.00  
##  Y_RNA   :    3   Median :     1.0   Median :   1.00  
##  ACE     :    2   Mean   :   195.2   Mean   :  69.85  
##  ALG9    :    2   3rd Qu.:    37.0   3rd Qu.:  35.00  
##  ANKRD18A:    2   Max.   :161685.0   Max.   :2700.00  
##  (Other) :32722

[DIY] Draw the histogram of each summary statistics.

You may use geom_histogram to draw the distribution of each variable in df_gene.

# draw a histogram of nUMI and nBCD
# Use geom_histogram, similar to the examples in df_bcd.
print(ggplot(data=df_gene) + aes(x=nBCD) + geom_histogram(binwidth=100,color='black',fill='grey'))

print(ggplot(data=df_gene) + aes(x=nUMI) + geom_histogram(binwidth=1000,color='black',fill='grey'))

[Freeform] Explore the dataset more

You may want to explore the dataset we loaded in various other ways. Use this area as a sandbox to create any exploratory analysis on the summary statistics you’ve generated.

# This is a sandbox area. Feel free to add any additional analysis 
# to better understand the data you've loaded.

Normalization and manifold learning.

Next, we will perform a simple normalization of the expression matrix and will perform manifold learning on the normalized expressions.

[DIY] Log-normalization

As shown in the summary statistics of each gene, there are relatively a small number of genes that are much highly expressed than other genes, so they may explain most of the variations between the cells. To alleviate such effects, one common practice is to convert the data using a simple log-normalization.

Suppose \(X \in \{0,1,2,\ldots\}^{B \times G}\) be the matrix of UMI counts across \(B \in \{1,2,\ldots\}\) barcoded droplets and \(G \in \{1,2,\ldots\}\) genes. Then the sum of UMI counts across the gene for each barcoded droplet is

\[ C_b = \sum_{g=1}^G X_{bg} \]

Then the normalized expression matrix \(Y \in [0,\infty)^{B \times G}\) can be represented as

\[ Y_{bg} = \log\left[1+X_{bg}\frac{F}{C_b}\right] \]

where \(F > 0\) is a fixed scaling factor. We will use \(F = 10,000\) for our analysis.

Now, your task is create a new matrix norm_dge from the UMI count matrix dge, using the normalization equation given above using \(F = 10,000\). Use log1p() function if needed.

# Log-normalize the digital expression matrix
# The input matrix is dge
# The output matrix is to be named as norm_dge
# [DIY] Calculate norm_dge from dge
norm_dge <- log1p(dge / matrix(df_bcd$nUMI/10000,nrow(dge),ncol(dge),byrow=FALSE))

Manifold learning using UMAP

UMAP (Uniform Manifold Approximation and Projection) is an emerging method to create low-dimensional manifold from high-dimensional data, and it is one of the most popularly used method for understanding the cell types of single cell genomic data. For more information, refer to the following resource. * https://umap-learn.readthedocs.io/en/latest/ * https://arxiv.org/abs/1802.03426 We will use uwot package, which is an R implementation of UMAP to take norm_dge as input to create 2-dimensional manifold umap_dge for each barcoded droplet. First, we will make sure that uwot is installed in your R environment. If not,

## Check if uwot package is installed.
if ( require('uwot') == FALSE ) {
  install.packages('uwot') ## install if not already installed
  library('uwot')
} else {
  print("uwot package is already installed")
}

## Loading required package: uwot

## Loading required package: Matrix

## [1] "uwot package is already installed"

We will compute UMAP coordinates using uwot package, with default parameter, except that we will first compute 20 principal components (PCs) and perform UMAP on the PCs, to save the computational time.

## Calculate umap from normalized dge, using 20 PCs
## Report the elapsed time
print( system.time( umap_dge <- umap(norm_dge, pca=20) ) )

##    user  system elapsed 
##  24.903   1.903  30.436

Let’s take a peek of the output from the UMAP procedure above.

print(dim(umap_dge))

## [1] 2700    2

print(head(umap_dge))

##           [,1]        [,2]
## [1,]  3.072517  -3.0550035
## [2,] -6.050866 -10.0906606
## [3,]  3.684618   0.1082747
## [4,] -2.380720  11.6514998
## [5,]  4.270111  -9.2690547
## [6,]  3.822707  -0.9010277

[DIY] Add attributes to the existing dataframe

Modify the dataframe df_bcd by adding two additional attributes - UMAP1 and UMAP2 - using umap_dge computed above. This will allow us to visualize the UMAP manifolds with respect to other summary statistics.

# [DIY] Add 'UMAP1' and 'UMAP2' attributes to the DataFrame df_bcd
# Use two columns from umap_mtx as 'UMAP1' and 'UMAP2', respectively.
df_bcd$UMAP1 = umap_dge[,1]
df_bcd$UMAP2 = umap_dge[,2]

# Just to peek the beginning of the modified dataframe. 
# Make sure UMAP1 and UMAP2 exist in the output
print(head(df_bcd))

##                  nUMI nGene     UMAP1       UMAP2
## AAACATACAACCAC-1 2421   781  3.072517  -3.0550035
## AAACATTGAGCTAC-1 4903  1352 -6.050866 -10.0906606
## AAACATTGATCAGC-1 3149  1131  3.684618   0.1082747
## AAACCGTGCTTCCG-1 2639   960 -2.380720  11.6514998
## AAACCGTGTATGCG-1  981   522  4.270111  -9.2690547
## AAACGCACTGGTAC-1 2164   782  3.822707  -0.9010277

[DIY] Visualize the UMAP manifolds

Create a scatterplot where x-axis is UMAP1, and y-axis is UMAP2 and each point is colored by nUMI attributes, where each point represents a barcoded droplet. Make sure that points are reasonably distinguishable by modifying the size and transparency of the points.

Q Does the output plot make sense to you? Can you interpret how they are clustered?

# [DIY] Create a scatterplot, with the following specs: 
# * x-axis is UMAP1 
# * y-axis is UMAP2
# * color of each point is nUMI
# * adjust the size and transparency of the points for better presentation
print(ggplot(data=df_bcd) +aes(x=UMAP1,y=UMAP2,color=nUMI) + geom_point(size=0.5,alpha=0.3))

Cell-type-specific analysis

The UMAP manifolds we generated above give hints about cell-type-specific nature of the single-cell transcriptomic profiles.

Using existing knowledges of these PBMC cell types, we will visualize the distribution of gene expressions across the manifold space.

As suggested https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html , these are (incomplete) examples biomarkers representing known cell types.

Marker Genes	Cell Types (in the tutorial)	Cell Types from Other Sources
IL7R	CD4+ T cell	T helper cell
CCR7	Naive CD4+ T cell	Naive T cell
S100A4	Memory CD4+ T cell	Memory T-cell
CD14	CD14+ Monocyte	Monocytes, Macrophages, and Dentritic Cell
LYZ	CD14+ Monocyte	Monocytes, Macrophages, and Dentritic Cell
MS4A1	B cell	B-cell
CD8A	CD8+ T cell	Cytotoxic T-cell
FCGR3A	FCGR3A+ Monocyte	CD16+ Monocytes and NK cell
MS4A7	FCGR3A+ Monocyte	CD16+ Monocytes
GNLY	NK cell	Natural Killer (NK) cell
NKG7	NK cell	Natural Killer (NK) cell
FCER1A	Dentritic cell	Dendritic cell
CST	Dentritic cell	Dendritic cell
PPBP	Platelet	Platelet

[DIY] Visualize the distribution of marker gene expressions on UMAP manifolds

In the code segment below marker_genes and marker_types contains the gene names and cell types enriched for the particular cell types. For each marker gene, perform the following tasks

Add an additional column to df_bcd for each marker gene, by using the gene name as the column name and taking the values from corresponding genes in norm_mtx.
Visualize the expression levels of each marker gene by using UMAP1 attribute as x-axis, and UMAP2 attributes as y-axis, and the normalized expression of the marker gene as the color.
Use scale_color_gradient() function to chose an effective color scheme.
Use ggtitle() to give informative title (e.g. gene name and enriched cell types) for each of the plot.

marker_genes = c('IL7R', 'CCR7', 'S100A4', 'CD14', 'LYZ', 'MS4A1', 'CD8A', 'FCGR3A', 'MS4A7', 'GNLY', 'NKG7', 'FCER1A', 'CST3', 'PPBP')
marker_types = c('CD4+ T','Naive CD4+ T','Memory CD4+ T','CD14+ Mono','CD14+ Mono','B','CD8+ T','FCGR3A+ Mono','FCGR3A+ Mono','NK','NK','Dendritic','Dendritic','Platelet')

# [DIY] For each marker gene, add additional column to df_bcd to represent normalized gene expression levels.
for (i in 1:length(marker_genes)) {
  ig = which(colnames(norm_dge) == marker_genes[i])
  df_bcd[[marker_genes[i]]] = norm_dge[,ig]
}

print(head(df_bcd)) # This line will give an idea whether each marker gene was properly added as individual columns

##                  nUMI nGene     UMAP1       UMAP2     IL7R     CCR7
## AAACATACAACCAC-1 2421   781  3.072517  -3.0550035 2.594626 1.635208
## AAACATTGAGCTAC-1 4903  1352 -6.050866 -10.0906606 1.625141 0.000000
## AAACATTGATCAGC-1 3149  1131  3.684618   0.1082747 1.429261 0.000000
## AAACCGTGCTTCCG-1 2639   960 -2.380720  11.6514998 0.000000 0.000000
## AAACCGTGTATGCG-1  981   522  4.270111  -9.2690547 0.000000 0.000000
## AAACGCACTGGTAC-1 2164   782  3.822707  -0.9010277 0.000000 0.000000
##                    S100A4 CD14      LYZ    MS4A1     CD8A   FCGR3A
## AAACATACAACCAC-1 2.594626    0 1.635208 0.000000 1.635208 0.000000
## AAACATTGAGCTAC-1 0.000000    0 1.962726 2.583047 0.000000 0.000000
## AAACATTGATCAGC-1 2.826014    0 1.994867 0.000000 0.000000 0.000000
## AAACCGTGCTTCCG-1 4.121133    0 4.521175 0.000000 0.000000 1.566387
## AAACCGTGTATGCG-1 3.950636    0 0.000000 0.000000 0.000000 0.000000
## AAACGCACTGGTAC-1 3.751611    0 1.726522 0.000000 0.000000 0.000000
##                     MS4A7     GNLY     NKG7 FCER1A     CST3     PPBP
## AAACATACAACCAC-1 0.000000 0.000000 0.000000      0 0.000000 0.000000
## AAACATTGAGCTAC-1 0.000000 0.000000 1.111715      0 1.111715 0.000000
## AAACATTGATCAGC-1 0.000000 1.429261 0.000000      0 1.429261 0.000000
## AAACCGTGCTTCCG-1 2.515108 0.000000 1.566387      0 4.435152 1.566387
## AAACCGTGTATGCG-1 0.000000 3.452557 4.728542      0 0.000000 0.000000
## AAACGCACTGGTAC-1 0.000000 0.000000 0.000000      0 0.000000 0.000000

# [DIY] For each marker gene, create a scatter plot of x-axis:UMAP1, y-axis:UMAP2, color:normalized expression levels.
#       Choose an effective visualization scheme yourself.
for(i in 1:length(marker_genes)) {
  g = marker_genes[i]
  t = marker_types[i]
  print(ggplot(data=df_bcd) + aes(x=UMAP1,y=UMAP2,color=get(g)) + geom_point(size=0.5,alpha=0.3) + scale_colour_gradient(low='#cccccc',high='#0000ff') + labs(color=g) + ggtitle(paste0('Gene : ',g,' Type : ',t)))
}

Clustering by cell types

We can cluster each barcoded droplets by cell types. While there are clearly better ways to cluster cell types, here we will use a simplistic approach - k-means clustering based on the UMAP coordinates - as a quick approach.

We will use kmeans() function to perform a k-means clustering, and the example codes are given below. Because there are four major clusters observed, we will identify 4 clusters.

set.seed(606)
umap_kmeans <- kmeans(umap_dge, 4)
df_bcd$kmeans <- umap_kmeans$cluster
print(ggplot(data=df_bcd) +aes(x=UMAP1,y=UMAP2,color=factor(kmeans)) + geom_point(size=0.5,alpha=0.3) + labs(color='Cluster'))

Note that the clusters do not look exactly the same to the python examples due to many differences in the details of UMAP and kmeans algorithm. The four clusters roughly corresponds to (1) Monocyte + Dendritic cells, (2) B cells and platelets, (3) CD8+ T & NK cells, (4) CD4+ T cells.

[DIY] Visualize the expression levels of marker genes by clusters

Now, repeat the visualization of marker genes expressions performed previously, but now separating the plots by the clusters from k-means clustering. You may use facet_wrap() to achieve this functionality.

# [DIY] For each marker gene, create scatter plots of barcoded droplets.
#   x-axis : UMAP1
#   y-axis : UMAP2
#   color  : marker gene expression
#   facet  : cluster
#   Use gene name and cell types in the title of each plot.
for(i in 1:length(marker_genes)) {
  g = marker_genes[i]
  t = marker_types[i]
  print(ggplot(data=df_bcd) +aes(x=UMAP1,y=UMAP2,color=get(g)) + geom_point(size=0.2,alpha=0.3) + scale_colour_gradient(low='#cccccc',high='#0000ff') + facet_wrap(~factor(kmeans),nrow=1) + labs(color=g) + ggtitle(paste0('Gene : ',g,' Type : ',t)) )
}

[DIY] Identify genes strongly enriched for each cluster

Finally, we will identify genes that are differentially expressed between different cell types (i.e. clusters). Specifically, we will identify marker genes that shows statistically significant enrichment in each particular cluster. To achieve this, you need to implement a function ttest_mtx() that performs t-tests between two sets of rows across each individual columns.

For \(X = (X_1,\ldots,X_{n_X})\) and \(Y = (Y_1,\ldots,Y_{n_Y})\), the equation for unpaired student t-test statistic follows

\[ T = \frac{\overline{X} - \overline{Y}}{s_p\sqrt{\frac{1}{n_X}+\frac{1}{n_Y}}} \]

where

\[ s_p^2 = \frac{\sum_{i=1}^{n_X}(X_{i}-\overline{X})^2 + \sum_{j=1}^{n_Y}(Y_{j}-\overline{Y})^2}{n_X+n_Y-2} \]

The resulting \(T\) is expected to follow a Student’s t-distribution with \(n_X+n_Y-2\) degrees of freedom under the null hypothesis.

You will need to calculate \(T\) and corresponding p-values for each column separately, and return a dataframe containing the following three attributes:

T-test statistic
p-values corresponding to the t-test statistic. Use scipy.stats.t.sf() to compute the p-value.
Log fold-change, calculated as \(\log_2 \left[\frac{\overline{X}}{\overline{Y}}\right]\)

The number of rows in the dataframe should be the same to the number of columns in the input matrix.

Avoid corner cases such as \(s_p = 0\), \(\overline{X} = 0\), or \(\overline{Y} = 0\) by adding a very small number (e.g. \(10^{-100}\)) when overflow/underflow may occur.

#' ttest_mtx : Performs t-test between two sets of rows in a matrix, separately for each column
#'
#' This function takes a numpy matrix and two lists of row indices to perform 
#' column-wise t-tests between two subsets of rows.  
#'  
#' @param mtx : A matrix containing normalized expression matrix (# barcodes x # genes)
#' @param rows_a : A list of integer containing the row indices of one group to be used for t-test
#' @param rows_b : A list of integer containing the row indices of the other group to be used for t-test
#' @return : A dataframe containing the following attributes
#'  * gene   : gene name
#'  * tstat  : T test statistics
#'  * log10p : Two-sided -log10 p-values corresponding to the t-test statistics 
#'  * fc     : fold changes of average expression between the two groups as mean(X)/mean(Y)
#'  * avg_a  : Average values of rows_a for each gene
#'  * avg_b  : Average values of rows_b for each gene
ttest_mtx <- function(mtx, rows_a, rows_b) {
  mtx_a = mtx[rows_a,]
  mtx_b = mtx[rows_b,]
  n_a = length(rows_a)
  n_b = length(rows_b)
  m_a = colMeans(mtx_a)
  m_b = colMeans(mtx_b)
  s_p = sqrt( (apply(mtx_a,2,var)*(n_a-1) + apply(mtx_b,2,var)*(n_b-1))/(n_a+n_b-2) )
  tstat = (m_a - m_b)/(s_p+1e-100)/sqrt(1/n_a+1/n_b)
  fc = (m_a + 1e-100) / (m_b + 1e-100)
  log10p = 0-(pt(abs(tstat),df=n_a+n_b-2,lower.tail=FALSE,log.p=TRUE)+log(2))/log(10)
  return(data.frame(gene=colnames(mtx),tstat=tstat, log10p=log10p, fc=fc, avg_a=m_a, avg_b=m_b))
}

[DIY] Use ttest_mtx() function to identify strongly enriched genes for each cluster

For each cluster, identify top 20 genes, ordered by highest fc, among the genes with p-values smaller than \(10^{-6}\) a significant Bonferroni threshold. These genes are representatives of the strongly enriched genes for the particular cluster.

You may want to use https://amp.pharm.mssm.edu/Enrichr/ to evaluate which cell types those genes are enriched for.

# For each cluster 1, 2, 3, 4
# 1. Identify genes with t-test pvalue < 1e-6 and avg_a > 1
# 2. Order the significant genes from highest fc, and select the top 20
# 3. Print the genes with t-statistics, log10 p-values and fc
for(clust in 1:4) {
  rst = ttest_mtx(norm_dge, which(df_bcd$kmeans == clust), which(df_bcd$kmeans != clust))
  rst_top = rst[(rst$log10p > 6) & (rst$avg_a > 1),]
  print(rst_top[order(rst_top$fc,decreasing=TRUE)[1:20],])
}

##            gene    tstat    log10p        fc    avg_a      avg_b
## 31699    VPREB3 42.17725 298.30909 82.519108 1.076508 0.01304556
## 23927 LINC00926 47.67651 359.67676 77.016735 1.297919 0.01685243
## 23289     TCL1A 51.50504 402.81935 44.260894 1.772238 0.04004071
## 30657     CD79A 96.06616 872.50860 36.159167 2.792092 0.07721671
## 18945     MS4A1 70.45460 613.21001 24.467855 2.171395 0.08874482
## 7899      BANK1 33.61736 206.52200 16.395620 1.025644 0.06255599
## 27508     CD79B 61.05114 510.06282 10.638597 2.657399 0.24978845
## 10685  HLA-DQA2 31.39814 183.99242 10.544694 1.156673 0.10969240
## 10681  HLA-DQA1 52.80331 417.46748 10.522989 2.449022 0.23273061
## 21419     HVCN1 29.02845 160.73990  9.548051 1.050463 0.11001853
## 10682  HLA-DQB1 48.94069 373.90387  8.895701 2.361175 0.26542876
## 10694   HLA-DMB 24.49438 119.08430  5.489059 1.186735 0.21620002
## 10696   HLA-DMA 25.07792 124.20902  4.260188 1.649692 0.38723457
## 10679  HLA-DRB5 27.01131 141.70625  3.673761 2.242455 0.61039755
## 10680  HLA-DRB1 30.90212 179.05210  3.315986 3.244007 0.97829353
## 10678   HLA-DRA 36.89516 240.89490  3.300208 4.486114 1.35934276
## 10701  HLA-DPB1 33.76140 208.00674  3.265403 3.536743 1.08309547
## 10700  HLA-DPA1 29.93968 169.57444  3.111292 3.221487 1.03541770
## 16793     SNHG7 13.88351  41.65585  2.627234 1.050308 0.39977714
## 9799       CD74 38.14482 254.28472  2.232412 4.950227 2.21743437
##           gene    tstat    log10p        fc    avg_a      avg_b
## 22589     GZMH 40.03756 274.80997 34.314471 1.427112 0.04158922
## 7345    FGFBP2 41.64481 292.43591 31.732282 1.513175 0.04768567
## 22590     GZMB 41.92950 295.57425 25.815156 1.805253 0.06992996
## 17450     PRF1 49.07920 375.46417 18.904971 1.834687 0.09704784
## 28833     CST7 67.63934 582.69560 18.552820 2.368951 0.12768686
## 8890      GZMA 63.57977 538.12300 16.323033 2.370504 0.14522446
## 26840     CCL4 32.29559 193.02232 14.958437 1.222396 0.08171949
## 31076     NKG7 83.97400 754.41294 12.069899 3.957020 0.32784200
## 4047      GNLY 37.91376 251.79852 11.717988 2.154341 0.18384906
## 8887      GZMK 23.90130 113.95450  7.868179 1.109920 0.14106444
## 7587      HOPX 27.30240 144.40598  7.440028 1.139513 0.15315972
## 26825     CCL5 49.78182 383.38253  6.979398 3.334606 0.47777840
## 19180     CTSW 46.74063 349.16439  6.764457 2.364328 0.34952226
## 29406     GZMM 31.90405 189.06866  5.491018 1.542283 0.28087381
## 7224      LYAR 20.60280  86.98492  4.445071 1.027226 0.23109326
## 2301     CD247 23.18345 107.85497  3.815640 1.365672 0.35791429
## 32017 APOBEC3G 19.49816  78.59332  3.748558 1.018984 0.27183358
## 2238    FCGR3A 16.59790  58.25954  3.639044 1.127840 0.30992758
## 19539     CTSC 15.55194  51.57065  2.724871 1.055996 0.38753983
## 21043    RAP1B 14.47384  45.05539  2.220766 1.243979 0.56015743
##           gene     tstat    log10p       fc    avg_a      avg_b
## 30115    IFI30  49.23862  377.2603 75.19287 1.182536 0.01572670
## 9634      CD14  41.02185  285.5851 46.19765 1.149775 0.02488817
## 440        CDA  40.98333  285.1622 44.38290 1.003595 0.02261221
## 31959   LGALS2  62.38750  524.9180 40.64937 2.012480 0.04950827
## 18938   MS4A6A  41.80989  294.2551 33.61999 1.201869 0.03574864
## 29429      CFD  66.99198  575.6310 31.91044 1.891644 0.05927978
## 26287     CD68  56.52615  459.4096 22.12496 1.616658 0.07306941
## 1958    S100A8  67.49051  581.0730 21.24052 3.070813 0.14457331
## 13509      CFP  48.64243  370.5449 18.48908 1.410348 0.07628007
## 2483      NCF2  39.57790  269.8006 17.76441 1.075129 0.06052150
## 18773     SPI1  49.65047  381.9018 17.68410 1.460989 0.08261596
## 16745     FCN1  77.17353  684.5459 16.59471 2.667224 0.16072737
## 18149   IFITM3  49.38551  378.9155 16.14298 1.729314 0.10712489
## 23264 SERPINA1  53.71164  427.7129 14.49279 1.696494 0.11705778
## 1956    S100A9  84.81491  762.8816 14.27356 3.830041 0.26833120
## 12571     BRI3  41.33746  289.0530 12.63839 1.219273 0.09647373
## 10609     LST1  92.17622  835.3832 12.04173 3.093488 0.25689734
## 27141      GRN  45.98947  340.7425 11.22910 1.506885 0.13419466
## 28818     CST3 114.50953 1037.6725 11.11242 3.915089 0.35231634
## 10611     AIF1  87.64681  791.1209 10.28272 3.148724 0.30621508
##           gene    tstat    log10p       fc    avg_a     avg_b
## 19802     CD3D 38.86684 262.08192 3.801846 2.094871 0.5510143
## 8764      IL7R 27.59258 147.11361 3.766821 1.440688 0.3824678
## 19801     CD3E 30.10199 171.16243 3.239243 1.675256 0.5171750
## 30976    NOSIP 23.28854 108.74029 2.867435 1.332223 0.4646044
## 20398     LDHB 40.13779 275.90430 2.661710 2.400268 0.9017769
## 24758     IL32 28.40130 154.74251 2.636750 2.190333 0.8306944
## 687        LCK 17.61133  65.07381 2.514199 1.034972 0.4116510
## 27884      CD7 16.26144  56.06913 2.343425 1.108243 0.4729159
## 29533      AES 18.43955  70.87793 2.093202 1.375246 0.6570058
## 10608      LTB 28.89421 159.45043 2.092303 2.798037 1.3373004
## 19620  TMEM123 14.32029  44.15960 1.927135 1.080111 0.5604750
## 13126   GIMAP7 15.18029  49.28053 1.766998 1.463028 0.8279738
## 23336      EVL 14.60454  45.82431 1.724737 1.388695 0.8051632
## 2760  TRAF3IP3 12.59349  34.65300 1.688485 1.196732 0.7087606
## 9465     HINT1 16.30733  56.36579 1.632188 1.719157 1.0532837
## 14620   TMEM66 20.07951  82.96723 1.626411 2.074055 1.2752339
## 18147   IFITM1 11.37413  28.58691 1.610993 1.234452 0.7662674
## 32399     SOD1 11.81773  30.73041 1.587203 1.249588 0.7872896
## 10642  C6orf48 11.79417  30.61475 1.550436 1.341331 0.8651315
## 3668   ZFP36L2 14.42680  44.78009 1.538502 1.771961 1.1517443

BIOSTAT/BIOINF/HUMGEN606 (2019) Day 5 - with spoilers