This is an R-version of Biocomputing Bootcamp Dat 5 material, which was written in python notebook first at https://colab.research.google.com/drive/10toqlqt1HYluzFLVW5a13cWzbWoXBuob. It is recommended to finish the python version first, unless the student definitely wants to skip the python part. Many parts of of this notebook assumes that you’ve already finished the python part, although it keeps redundant information for the R-only users.
Droplet-based single-cell RNA-seq is a new technology that allows us to profile thousands of single cell transcriptomes with a single library preparation. We will use publicly available dataset to perform basic analysis using the skillsets we have learned during this week. See the overview lecture (posted in the course web site) to have some basic understandings of scRNA-seq technology
We will download 2,700 PBMC cells publicly available from 10X Chromium web site. Please use a web browser to download the file.
Uncompress the downloaded file using a file browser (Windows Explorer or Finder). Extract the file, then you will see a directory named filtered_gene_bc_matrices
. Here we will assume that you downloaded the files to a directory named ~/Downloads
(~
represents your current home directory in most systems). If you downloaded the file in another directory, you will need to make a corresponding change.
Use the following codes to make sure that the file exists. You may need to modify indir
in the code below. We will read first ten tokens to see how the files look like.
## [DIY] Make corresponding changes in the code below if needed.
indir = "~/Downloads/filtered_gene_bc_matrices/hg19"
print(scan(paste0(indir,"/barcodes.tsv"),what=character(),nmax=10))
## [1] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1"
## [4] "AAACCGTGCTTCCG-1" "AAACCGTGTATGCG-1" "AAACGCACTGGTAC-1"
## [7] "AAACGCTGACCAGT-1" "AAACGCTGGTTCTT-1" "AAACGCTGTAGCCA-1"
## [10] "AAACGCTGTTTCTG-1"
print(scan(paste0(indir,"/genes.tsv"),what=character(),nmax=10))
## [1] "ENSG00000243485" "MIR1302-10" "ENSG00000237613"
## [4] "FAM138A" "ENSG00000186092" "OR4F5"
## [7] "ENSG00000238009" "RP11-34P13.7" "ENSG00000239945"
## [10] "RP11-34P13.8"
print(scan(paste0(indir,"/matrix.mtx"),what=character(),nmax=10))
## [1] "%%MatrixMarket" "matrix" "coordinate" "real"
## [5] "general" "%" "32738" "2700"
## [9] "2286884" "32709"
Q Do you understand what how the file look like? If unclear, use text editor to load the file, or use read.table()
function instead to read the data in a tabular form.
Your first assignment is to write a R function that takes three input files - bcdf (e.g. barcode.tsv), genef (e.g. genes.tsv), mtx (e.g. matrix.mtx) - to return a numpy matrix and associated lists. The detailed specification of the function is given in the code segment below.
You may use the read.table()
or scan()
functions to read three files
#' read_dge() : function to read digital expression matrix in 10x genomics format
#'
#' This function takes three files as input to return a digital expression matrix.
#' Each input file should contain information of cell barcodes, genes, and UMI counts with
#' specific formats described below. The function will return a matrix where each row
#' represents a droplet barcode and each column represents a gene.
#'
#' @param bcdf Barcode file (typically barcodes.tsv) containing one cell barcode at each line
#' @param genef Gene file (typically genes.tsv) containing [GENE_ID] and [GENE_NAME] at each line.
#' @param mtxf Sparse matrix file (typically matrix.mtx) containing [GENE_INDEX] [BARCODE_INDEX] #' [UMI_COUNT] from 4-th line for any non-zero entry.
#' @return A matrix of ([TOTAL_BARCODE_COUNTS] x [TOTAL_GENE_COUNTS]) containing a digital
#' expression matrix. Each row name represents barcodes and column name represents
#' gene name (not gene id)
read_dge <- function(bcdf, genef, mtxf) {
bcds <- read.table(bcdf)
genes <- read.table(genef)
mtxs <- read.table(mtxf,skip=3)
nbcds <- nrow(bcds)
ngenes <- nrow(genes)
mat <- matrix(0, nbcds, ngenes)
rownames(mat) <- bcds[,1]
colnames(mat) <- genes[,2]
for(i in 1:nrow(mtxs)) {
mat[mtxs[i,2],mtxs[i,1]] = mtxs[i,3]
}
return(mat)
}
Once you implemented the read_dge()
function, you can run the function using the following specific arguments.
# Run read_dge() function. This may take a while
dge <- read_dge(paste0(indir,"/barcodes.tsv"),paste0(indir,"/genes.tsv"),paste0(indir,"/matrix.mtx"))
Use the following codes to see where the returned matrix conforms to the expected format. It is expected to be a 2700 x 32738 matrix, with first row name AAACATACAACCAC-1
and the first column name MIR1302-10
.
print(dim(dge))
## [1] 2700 32738
print(dge[1:5,1:5])
## MIR1302-10 FAM138A OR4F5 RP11-34P13.7 RP11-34P13.8
## AAACATACAACCAC-1 0 0 0 0 0
## AAACATTGAGCTAC-1 0 0 0 0 0
## AAACATTGATCAGC-1 0 0 0 0 0
## AAACCGTGCTTCCG-1 0 0 0 0 0
## AAACCGTGTATGCG-1 0 0 0 0 0
Using ggplot2
package, we will summarize and visualize the basic statistics from digital expression matrix.
We do not know much about the data we have just loaded yet. Let’s try to find out the answer for the following questions.
Each of these questions can be answered with one or two lines of simple R code.
# DIY -- Write simple lines of codes to answer to each of these basic questions
# 1. What is the total number of UMis across all genes and barcodes?
# Write your code here
print(sum(dge))
## [1] 6390631
# 2. How sparse is the digital matrix? How much proportion of barcode/gene pairs have nonzero elements?
# Write your code here
print(sum(dge>0) / length(dge))
## [1] 0.02587189
# 3. Are theres gene that are not expressed at all? If so, how many such genes exist?
# Write your code here
print(sum(colSums(dge)>0))
## [1] 16634
The next, task is to collect basic summary statistics for each barcoded droplet. * Create a dataframe named df_bcd
with the following two summary statistics for each barcode. * nUMI
: The total sum of UMI counts across all genes for given barcoded droplet. * nGene
: The number of expressed genes (genes with positive UMI counts) for given barcoded droplet. * The row names of dataframe should be the droplet barcode. * Upon successful creation, the first five rows should look like the following example.
print(head(df_bcd))
nUMI nGene
AAACATACAACCAC-1 2421.0 781
AAACATTGAGCTAC-1 4903.0 1352
AAACATTGATCAGC-1 3149.0 1131
AAACCGTGCTTCCG-1 2639.0 960
AAACCGTGTATGCG-1 981.0 522
# Summarize basic statistics for each barcode
# [DIY] Here, create a dataframe named df_bcd, with nUMI and nGene.
df_bcd <- data.frame(nUMI=rowSums(dge), nGene=rowSums(dge>0), row.names=rownames(dge))
print(head(df_bcd))
## nUMI nGene
## AAACATACAACCAC-1 2421 781
## AAACATTGAGCTAC-1 4903 1352
## AAACATTGATCAGC-1 3149 1131
## AAACCGTGCTTCCG-1 2639 960
## AAACCGTGTATGCG-1 981 522
## AAACGCACTGGTAC-1 2164 782
Use summary()
function to summarize each attribute in the dataframe created above.
# Calculate the basic summary statistics
print(summary(df_bcd))
## nUMI nGene
## Min. : 548 Min. : 212.0
## 1st Qu.: 1758 1st Qu.: 690.0
## Median : 2197 Median : 817.0
## Mean : 2367 Mean : 847.0
## 3rd Qu.: 2763 3rd Qu.: 953.2
## Max. :15844 Max. :3422.0
You may use ggplot2
and geom_histogram
to draw the distribution of each variable. as shown in the examples below.
library(ggplot2)
# draw a histogram of nUMI and nGene
# Check the output and see if they make sense.
print(ggplot(data=df_bcd) + aes(x=nGene) + geom_histogram(binwidth=100,color='black',fill='grey'))
print(ggplot(data=df_bcd) + aes(x=nUMI) + geom_histogram(binwidth=100,color='black',fill='grey'))
Draw a scatterplot where x-axis is nGene
and y-axis is nUMI
using the dataframe df_bcd
. Using ggplot2
package is recommended.
# Show the relationship between nUMI and nGene by cells
# [DIY] Here, create a ggplot as specified.
print(ggplot(data=df_bcd) + aes(x=nGene,y=nUMI) + geom_point())
The plots you create will have area with too much overlaps with points to understand which area is more densely populated with points. Add alpha
parameter (e.g. value of 0.1) to add transparency in the points.
# Add alpha = 0.1 parameter to add transparency
# [DIY] Here, enter the modified line of ggplot command
print(ggplot(data=df_bcd) + aes(x=nGene,y=nUMI) + geom_point(alpha=0.1))
You may have noticed that the distribution may look more symmetric if it was plotted in log-scale. Change both x-axis and y-axis into log-scale.
Note that it is safer to plot nGene+1
and nUMI+1
to avoid losing data points with zero values.
# Plot nGene+1 and nUMI+1 in log10 scale in both axis
# [DIY] Here, enter the modified line of ggplot command
print(ggplot(data=df_bcd) + aes(x=nGene+1,y=nUMI+1) + geom_point(alpha=0.1)+scale_x_log10()+scale_y_log10())
Similarly, we can collect basic summary statistics for each gene, across all barcodes. * Create a dataframe named df_gene
with the following two summary statistics for each gene. * nUMI
: The total sum of UMI counts across all barcodes for the given gene. * nBCD
: The number of droplet barcodes expressing the gene (with positive UMI counts). * The dataframe should be indexed by each gene name. * Upon successful creation, the first five rows should look like the following example.
print(head(df_gene))
nUMI nBCD
MIR1302-10 0 0
FAM138A 0 0
OR4F5 0 0
RP11-34P13.7 0 0
RP11-34P13.8 0 0
AL627309.1 9 9
# Summarize basic statistics for each gene
# [DIY] Here, create a dataFrame named df_gene, with nUMI and nBCD as attributes
# This task can be done in as few as one line
# Check out pandas cheatsheet and tutorial if needed.
df_gene = data.frame(gene=colnames(dge), nUMI = colSums(dge), nBCD = colSums(dge>0))
print(head(df_gene)) # This command will allow you to check whether the output is the same as expected.
## gene nUMI nBCD
## 1 MIR1302-10 0 0
## 2 FAM138A 0 0
## 3 OR4F5 0 0
## 4 RP11-34P13.7 0 0
## 5 RP11-34P13.8 0 0
## 6 AL627309.1 9 9
Use summary()
function to summarize each attribute in the pandas dataframe created above.
Q Compared to df_bcd
examples, what is the outstanding difference in the summary statistics by gene?
# calculate the basic summary statistics for each gene
print(summary(df_gene))
## gene nUMI nBCD
## 7SK : 4 Min. : 0.0 Min. : 0.00
## NPIPA7 : 3 1st Qu.: 0.0 1st Qu.: 0.00
## Y_RNA : 3 Median : 1.0 Median : 1.00
## ACE : 2 Mean : 195.2 Mean : 69.85
## ALG9 : 2 3rd Qu.: 37.0 3rd Qu.: 35.00
## ANKRD18A: 2 Max. :161685.0 Max. :2700.00
## (Other) :32722
You may use geom_histogram
to draw the distribution of each variable in df_gene
.
# draw a histogram of nUMI and nBCD
# Use geom_histogram, similar to the examples in df_bcd.
print(ggplot(data=df_gene) + aes(x=nBCD) + geom_histogram(binwidth=100,color='black',fill='grey'))
print(ggplot(data=df_gene) + aes(x=nUMI) + geom_histogram(binwidth=1000,color='black',fill='grey'))
You may want to explore the dataset we loaded in various other ways. Use this area as a sandbox to create any exploratory analysis on the summary statistics you’ve generated.
# This is a sandbox area. Feel free to add any additional analysis
# to better understand the data you've loaded.
Next, we will perform a simple normalization of the expression matrix and will perform manifold learning on the normalized expressions.
As shown in the summary statistics of each gene, there are relatively a small number of genes that are much highly expressed than other genes, so they may explain most of the variations between the cells. To alleviate such effects, one common practice is to convert the data using a simple log-normalization.
Suppose \(X \in \{0,1,2,\ldots\}^{B \times G}\) be the matrix of UMI counts across \(B \in \{1,2,\ldots\}\) barcoded droplets and \(G \in \{1,2,\ldots\}\) genes. Then the sum of UMI counts across the gene for each barcoded droplet is
\[ C_b = \sum_{g=1}^G X_{bg} \]
Then the normalized expression matrix \(Y \in [0,\infty)^{B \times G}\) can be represented as
\[ Y_{bg} = \log\left[1+X_{bg}\frac{F}{C_b}\right] \]
where \(F > 0\) is a fixed scaling factor. We will use \(F = 10,000\) for our analysis.
Now, your task is create a new matrix norm_dge
from the UMI count matrix dge
, using the normalization equation given above using \(F = 10,000\). Use log1p()
function if needed.
# Log-normalize the digital expression matrix
# The input matrix is dge
# The output matrix is to be named as norm_dge
# [DIY] Calculate norm_dge from dge
norm_dge <- log1p(dge / matrix(df_bcd$nUMI/10000,nrow(dge),ncol(dge),byrow=FALSE))
UMAP (Uniform Manifold Approximation and Projection) is an emerging method to create low-dimensional manifold from high-dimensional data, and it is one of the most popularly used method for understanding the cell types of single cell genomic data. For more information, refer to the following resource. * https://umap-learn.readthedocs.io/en/latest/ * https://arxiv.org/abs/1802.03426 We will use uwot
package, which is an R implementation of UMAP to take norm_dge
as input to create 2-dimensional manifold umap_dge
for each barcoded droplet. First, we will make sure that uwot
is installed in your R environment. If not,
## Check if uwot package is installed.
if ( require('uwot') == FALSE ) {
install.packages('uwot') ## install if not already installed
library('uwot')
} else {
print("uwot package is already installed")
}
## Loading required package: uwot
## Loading required package: Matrix
## [1] "uwot package is already installed"
We will compute UMAP coordinates using uwot
package, with default parameter, except that we will first compute 20 principal components (PCs) and perform UMAP on the PCs, to save the computational time.
## Calculate umap from normalized dge, using 20 PCs
## Report the elapsed time
print( system.time( umap_dge <- umap(norm_dge, pca=20) ) )
## user system elapsed
## 24.903 1.903 30.436
Let’s take a peek of the output from the UMAP procedure above.
print(dim(umap_dge))
## [1] 2700 2
print(head(umap_dge))
## [,1] [,2]
## [1,] 3.072517 -3.0550035
## [2,] -6.050866 -10.0906606
## [3,] 3.684618 0.1082747
## [4,] -2.380720 11.6514998
## [5,] 4.270111 -9.2690547
## [6,] 3.822707 -0.9010277
Modify the dataframe df_bcd
by adding two additional attributes - UMAP1
and UMAP2
- using umap_dge
computed above. This will allow us to visualize the UMAP manifolds with respect to other summary statistics.
# [DIY] Add 'UMAP1' and 'UMAP2' attributes to the DataFrame df_bcd
# Use two columns from umap_mtx as 'UMAP1' and 'UMAP2', respectively.
df_bcd$UMAP1 = umap_dge[,1]
df_bcd$UMAP2 = umap_dge[,2]
# Just to peek the beginning of the modified dataframe.
# Make sure UMAP1 and UMAP2 exist in the output
print(head(df_bcd))
## nUMI nGene UMAP1 UMAP2
## AAACATACAACCAC-1 2421 781 3.072517 -3.0550035
## AAACATTGAGCTAC-1 4903 1352 -6.050866 -10.0906606
## AAACATTGATCAGC-1 3149 1131 3.684618 0.1082747
## AAACCGTGCTTCCG-1 2639 960 -2.380720 11.6514998
## AAACCGTGTATGCG-1 981 522 4.270111 -9.2690547
## AAACGCACTGGTAC-1 2164 782 3.822707 -0.9010277
Create a scatterplot where x-axis is UMAP1
, and y-axis is UMAP2
and each point is colored by nUMI
attributes, where each point represents a barcoded droplet. Make sure that points are reasonably distinguishable by modifying the size and transparency of the points.
Q Does the output plot make sense to you? Can you interpret how they are clustered?
# [DIY] Create a scatterplot, with the following specs:
# * x-axis is UMAP1
# * y-axis is UMAP2
# * color of each point is nUMI
# * adjust the size and transparency of the points for better presentation
print(ggplot(data=df_bcd) +aes(x=UMAP1,y=UMAP2,color=nUMI) + geom_point(size=0.5,alpha=0.3))
The UMAP manifolds we generated above give hints about cell-type-specific nature of the single-cell transcriptomic profiles.
Using existing knowledges of these PBMC cell types, we will visualize the distribution of gene expressions across the manifold space.
As suggested https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html , these are (incomplete) examples biomarkers representing known cell types.
Marker Genes | Cell Types (in the tutorial) | Cell Types from Other Sources |
---|---|---|
IL7R | CD4+ T cell | T helper cell |
CCR7 | Naive CD4+ T cell | Naive T cell |
S100A4 | Memory CD4+ T cell | Memory T-cell |
CD14 | CD14+ Monocyte | Monocytes, Macrophages, and Dentritic Cell |
LYZ | CD14+ Monocyte | Monocytes, Macrophages, and Dentritic Cell |
MS4A1 | B cell | B-cell |
CD8A | CD8+ T cell | Cytotoxic T-cell |
FCGR3A | FCGR3A+ Monocyte | CD16+ Monocytes and NK cell |
MS4A7 | FCGR3A+ Monocyte | CD16+ Monocytes |
GNLY | NK cell | Natural Killer (NK) cell |
NKG7 | NK cell | Natural Killer (NK) cell |
FCER1A | Dentritic cell | Dendritic cell |
CST | Dentritic cell | Dendritic cell |
PPBP | Platelet | Platelet |
In the code segment below marker_genes
and marker_types
contains the gene names and cell types enriched for the particular cell types. For each marker gene, perform the following tasks
df_bcd
for each marker gene, by using the gene name as the column name and taking the values from corresponding genes in norm_mtx
.UMAP1
attribute as x-axis, and UMAP2
attributes as y-axis, and the normalized expression of the marker gene as the color.scale_color_gradient()
function to chose an effective color scheme.ggtitle()
to give informative title (e.g. gene name and enriched cell types) for each of the plot.marker_genes = c('IL7R', 'CCR7', 'S100A4', 'CD14', 'LYZ', 'MS4A1', 'CD8A', 'FCGR3A', 'MS4A7', 'GNLY', 'NKG7', 'FCER1A', 'CST3', 'PPBP')
marker_types = c('CD4+ T','Naive CD4+ T','Memory CD4+ T','CD14+ Mono','CD14+ Mono','B','CD8+ T','FCGR3A+ Mono','FCGR3A+ Mono','NK','NK','Dendritic','Dendritic','Platelet')
# [DIY] For each marker gene, add additional column to df_bcd to represent normalized gene expression levels.
for (i in 1:length(marker_genes)) {
ig = which(colnames(norm_dge) == marker_genes[i])
df_bcd[[marker_genes[i]]] = norm_dge[,ig]
}
print(head(df_bcd)) # This line will give an idea whether each marker gene was properly added as individual columns
## nUMI nGene UMAP1 UMAP2 IL7R CCR7
## AAACATACAACCAC-1 2421 781 3.072517 -3.0550035 2.594626 1.635208
## AAACATTGAGCTAC-1 4903 1352 -6.050866 -10.0906606 1.625141 0.000000
## AAACATTGATCAGC-1 3149 1131 3.684618 0.1082747 1.429261 0.000000
## AAACCGTGCTTCCG-1 2639 960 -2.380720 11.6514998 0.000000 0.000000
## AAACCGTGTATGCG-1 981 522 4.270111 -9.2690547 0.000000 0.000000
## AAACGCACTGGTAC-1 2164 782 3.822707 -0.9010277 0.000000 0.000000
## S100A4 CD14 LYZ MS4A1 CD8A FCGR3A
## AAACATACAACCAC-1 2.594626 0 1.635208 0.000000 1.635208 0.000000
## AAACATTGAGCTAC-1 0.000000 0 1.962726 2.583047 0.000000 0.000000
## AAACATTGATCAGC-1 2.826014 0 1.994867 0.000000 0.000000 0.000000
## AAACCGTGCTTCCG-1 4.121133 0 4.521175 0.000000 0.000000 1.566387
## AAACCGTGTATGCG-1 3.950636 0 0.000000 0.000000 0.000000 0.000000
## AAACGCACTGGTAC-1 3.751611 0 1.726522 0.000000 0.000000 0.000000
## MS4A7 GNLY NKG7 FCER1A CST3 PPBP
## AAACATACAACCAC-1 0.000000 0.000000 0.000000 0 0.000000 0.000000
## AAACATTGAGCTAC-1 0.000000 0.000000 1.111715 0 1.111715 0.000000
## AAACATTGATCAGC-1 0.000000 1.429261 0.000000 0 1.429261 0.000000
## AAACCGTGCTTCCG-1 2.515108 0.000000 1.566387 0 4.435152 1.566387
## AAACCGTGTATGCG-1 0.000000 3.452557 4.728542 0 0.000000 0.000000
## AAACGCACTGGTAC-1 0.000000 0.000000 0.000000 0 0.000000 0.000000
# [DIY] For each marker gene, create a scatter plot of x-axis:UMAP1, y-axis:UMAP2, color:normalized expression levels.
# Choose an effective visualization scheme yourself.
for(i in 1:length(marker_genes)) {
g = marker_genes[i]
t = marker_types[i]
print(ggplot(data=df_bcd) + aes(x=UMAP1,y=UMAP2,color=get(g)) + geom_point(size=0.5,alpha=0.3) + scale_colour_gradient(low='#cccccc',high='#0000ff') + labs(color=g) + ggtitle(paste0('Gene : ',g,' Type : ',t)))
}
We can cluster each barcoded droplets by cell types. While there are clearly better ways to cluster cell types, here we will use a simplistic approach - k-means clustering based on the UMAP coordinates - as a quick approach.
We will use kmeans()
function to perform a k-means clustering, and the example codes are given below. Because there are four major clusters observed, we will identify 4 clusters.
set.seed(606)
umap_kmeans <- kmeans(umap_dge, 4)
df_bcd$kmeans <- umap_kmeans$cluster
print(ggplot(data=df_bcd) +aes(x=UMAP1,y=UMAP2,color=factor(kmeans)) + geom_point(size=0.5,alpha=0.3) + labs(color='Cluster'))
Note that the clusters do not look exactly the same to the python examples due to many differences in the details of UMAP and kmeans algorithm. The four clusters roughly corresponds to (1) Monocyte + Dendritic cells, (2) B cells and platelets, (3) CD8+ T & NK cells, (4) CD4+ T cells.
Now, repeat the visualization of marker genes expressions performed previously, but now separating the plots by the clusters from k-means clustering. You may use facet_wrap()
to achieve this functionality.
# [DIY] For each marker gene, create scatter plots of barcoded droplets.
# x-axis : UMAP1
# y-axis : UMAP2
# color : marker gene expression
# facet : cluster
# Use gene name and cell types in the title of each plot.
for(i in 1:length(marker_genes)) {
g = marker_genes[i]
t = marker_types[i]
print(ggplot(data=df_bcd) +aes(x=UMAP1,y=UMAP2,color=get(g)) + geom_point(size=0.2,alpha=0.3) + scale_colour_gradient(low='#cccccc',high='#0000ff') + facet_wrap(~factor(kmeans),nrow=1) + labs(color=g) + ggtitle(paste0('Gene : ',g,' Type : ',t)) )
}
Finally, we will identify genes that are differentially expressed between different cell types (i.e. clusters). Specifically, we will identify marker genes that shows statistically significant enrichment in each particular cluster. To achieve this, you need to implement a function ttest_mtx()
that performs t-tests between two sets of rows across each individual columns.
For \(X = (X_1,\ldots,X_{n_X})\) and \(Y = (Y_1,\ldots,Y_{n_Y})\), the equation for unpaired student t-test statistic follows
\[ T = \frac{\overline{X} - \overline{Y}}{s_p\sqrt{\frac{1}{n_X}+\frac{1}{n_Y}}} \]
where
\[ s_p^2 = \frac{\sum_{i=1}^{n_X}(X_{i}-\overline{X})^2 + \sum_{j=1}^{n_Y}(Y_{j}-\overline{Y})^2}{n_X+n_Y-2} \]
The resulting \(T\) is expected to follow a Student’s t-distribution with \(n_X+n_Y-2\) degrees of freedom under the null hypothesis.
You will need to calculate \(T\) and corresponding p-values for each column separately, and return a dataframe containing the following three attributes:
scipy.stats.t.sf()
to compute the p-value.The number of rows in the dataframe should be the same to the number of columns in the input matrix.
Avoid corner cases such as \(s_p = 0\), \(\overline{X} = 0\), or \(\overline{Y} = 0\) by adding a very small number (e.g. \(10^{-100}\)) when overflow/underflow may occur.
#' ttest_mtx : Performs t-test between two sets of rows in a matrix, separately for each column
#'
#' This function takes a numpy matrix and two lists of row indices to perform
#' column-wise t-tests between two subsets of rows.
#'
#' @param mtx : A matrix containing normalized expression matrix (# barcodes x # genes)
#' @param rows_a : A list of integer containing the row indices of one group to be used for t-test
#' @param rows_b : A list of integer containing the row indices of the other group to be used for t-test
#' @return : A dataframe containing the following attributes
#' * gene : gene name
#' * tstat : T test statistics
#' * log10p : Two-sided -log10 p-values corresponding to the t-test statistics
#' * fc : fold changes of average expression between the two groups as mean(X)/mean(Y)
#' * avg_a : Average values of rows_a for each gene
#' * avg_b : Average values of rows_b for each gene
ttest_mtx <- function(mtx, rows_a, rows_b) {
mtx_a = mtx[rows_a,]
mtx_b = mtx[rows_b,]
n_a = length(rows_a)
n_b = length(rows_b)
m_a = colMeans(mtx_a)
m_b = colMeans(mtx_b)
s_p = sqrt( (apply(mtx_a,2,var)*(n_a-1) + apply(mtx_b,2,var)*(n_b-1))/(n_a+n_b-2) )
tstat = (m_a - m_b)/(s_p+1e-100)/sqrt(1/n_a+1/n_b)
fc = (m_a + 1e-100) / (m_b + 1e-100)
log10p = 0-(pt(abs(tstat),df=n_a+n_b-2,lower.tail=FALSE,log.p=TRUE)+log(2))/log(10)
return(data.frame(gene=colnames(mtx),tstat=tstat, log10p=log10p, fc=fc, avg_a=m_a, avg_b=m_b))
}
For each cluster, identify top 20 genes, ordered by highest fc
, among the genes with p-values smaller than \(10^{-6}\) a significant Bonferroni threshold. These genes are representatives of the strongly enriched genes for the particular cluster.
You may want to use https://amp.pharm.mssm.edu/Enrichr/ to evaluate which cell types those genes are enriched for.
# For each cluster 1, 2, 3, 4
# 1. Identify genes with t-test pvalue < 1e-6 and avg_a > 1
# 2. Order the significant genes from highest fc, and select the top 20
# 3. Print the genes with t-statistics, log10 p-values and fc
for(clust in 1:4) {
rst = ttest_mtx(norm_dge, which(df_bcd$kmeans == clust), which(df_bcd$kmeans != clust))
rst_top = rst[(rst$log10p > 6) & (rst$avg_a > 1),]
print(rst_top[order(rst_top$fc,decreasing=TRUE)[1:20],])
}
## gene tstat log10p fc avg_a avg_b
## 31699 VPREB3 42.17725 298.30909 82.519108 1.076508 0.01304556
## 23927 LINC00926 47.67651 359.67676 77.016735 1.297919 0.01685243
## 23289 TCL1A 51.50504 402.81935 44.260894 1.772238 0.04004071
## 30657 CD79A 96.06616 872.50860 36.159167 2.792092 0.07721671
## 18945 MS4A1 70.45460 613.21001 24.467855 2.171395 0.08874482
## 7899 BANK1 33.61736 206.52200 16.395620 1.025644 0.06255599
## 27508 CD79B 61.05114 510.06282 10.638597 2.657399 0.24978845
## 10685 HLA-DQA2 31.39814 183.99242 10.544694 1.156673 0.10969240
## 10681 HLA-DQA1 52.80331 417.46748 10.522989 2.449022 0.23273061
## 21419 HVCN1 29.02845 160.73990 9.548051 1.050463 0.11001853
## 10682 HLA-DQB1 48.94069 373.90387 8.895701 2.361175 0.26542876
## 10694 HLA-DMB 24.49438 119.08430 5.489059 1.186735 0.21620002
## 10696 HLA-DMA 25.07792 124.20902 4.260188 1.649692 0.38723457
## 10679 HLA-DRB5 27.01131 141.70625 3.673761 2.242455 0.61039755
## 10680 HLA-DRB1 30.90212 179.05210 3.315986 3.244007 0.97829353
## 10678 HLA-DRA 36.89516 240.89490 3.300208 4.486114 1.35934276
## 10701 HLA-DPB1 33.76140 208.00674 3.265403 3.536743 1.08309547
## 10700 HLA-DPA1 29.93968 169.57444 3.111292 3.221487 1.03541770
## 16793 SNHG7 13.88351 41.65585 2.627234 1.050308 0.39977714
## 9799 CD74 38.14482 254.28472 2.232412 4.950227 2.21743437
## gene tstat log10p fc avg_a avg_b
## 22589 GZMH 40.03756 274.80997 34.314471 1.427112 0.04158922
## 7345 FGFBP2 41.64481 292.43591 31.732282 1.513175 0.04768567
## 22590 GZMB 41.92950 295.57425 25.815156 1.805253 0.06992996
## 17450 PRF1 49.07920 375.46417 18.904971 1.834687 0.09704784
## 28833 CST7 67.63934 582.69560 18.552820 2.368951 0.12768686
## 8890 GZMA 63.57977 538.12300 16.323033 2.370504 0.14522446
## 26840 CCL4 32.29559 193.02232 14.958437 1.222396 0.08171949
## 31076 NKG7 83.97400 754.41294 12.069899 3.957020 0.32784200
## 4047 GNLY 37.91376 251.79852 11.717988 2.154341 0.18384906
## 8887 GZMK 23.90130 113.95450 7.868179 1.109920 0.14106444
## 7587 HOPX 27.30240 144.40598 7.440028 1.139513 0.15315972
## 26825 CCL5 49.78182 383.38253 6.979398 3.334606 0.47777840
## 19180 CTSW 46.74063 349.16439 6.764457 2.364328 0.34952226
## 29406 GZMM 31.90405 189.06866 5.491018 1.542283 0.28087381
## 7224 LYAR 20.60280 86.98492 4.445071 1.027226 0.23109326
## 2301 CD247 23.18345 107.85497 3.815640 1.365672 0.35791429
## 32017 APOBEC3G 19.49816 78.59332 3.748558 1.018984 0.27183358
## 2238 FCGR3A 16.59790 58.25954 3.639044 1.127840 0.30992758
## 19539 CTSC 15.55194 51.57065 2.724871 1.055996 0.38753983
## 21043 RAP1B 14.47384 45.05539 2.220766 1.243979 0.56015743
## gene tstat log10p fc avg_a avg_b
## 30115 IFI30 49.23862 377.2603 75.19287 1.182536 0.01572670
## 9634 CD14 41.02185 285.5851 46.19765 1.149775 0.02488817
## 440 CDA 40.98333 285.1622 44.38290 1.003595 0.02261221
## 31959 LGALS2 62.38750 524.9180 40.64937 2.012480 0.04950827
## 18938 MS4A6A 41.80989 294.2551 33.61999 1.201869 0.03574864
## 29429 CFD 66.99198 575.6310 31.91044 1.891644 0.05927978
## 26287 CD68 56.52615 459.4096 22.12496 1.616658 0.07306941
## 1958 S100A8 67.49051 581.0730 21.24052 3.070813 0.14457331
## 13509 CFP 48.64243 370.5449 18.48908 1.410348 0.07628007
## 2483 NCF2 39.57790 269.8006 17.76441 1.075129 0.06052150
## 18773 SPI1 49.65047 381.9018 17.68410 1.460989 0.08261596
## 16745 FCN1 77.17353 684.5459 16.59471 2.667224 0.16072737
## 18149 IFITM3 49.38551 378.9155 16.14298 1.729314 0.10712489
## 23264 SERPINA1 53.71164 427.7129 14.49279 1.696494 0.11705778
## 1956 S100A9 84.81491 762.8816 14.27356 3.830041 0.26833120
## 12571 BRI3 41.33746 289.0530 12.63839 1.219273 0.09647373
## 10609 LST1 92.17622 835.3832 12.04173 3.093488 0.25689734
## 27141 GRN 45.98947 340.7425 11.22910 1.506885 0.13419466
## 28818 CST3 114.50953 1037.6725 11.11242 3.915089 0.35231634
## 10611 AIF1 87.64681 791.1209 10.28272 3.148724 0.30621508
## gene tstat log10p fc avg_a avg_b
## 19802 CD3D 38.86684 262.08192 3.801846 2.094871 0.5510143
## 8764 IL7R 27.59258 147.11361 3.766821 1.440688 0.3824678
## 19801 CD3E 30.10199 171.16243 3.239243 1.675256 0.5171750
## 30976 NOSIP 23.28854 108.74029 2.867435 1.332223 0.4646044
## 20398 LDHB 40.13779 275.90430 2.661710 2.400268 0.9017769
## 24758 IL32 28.40130 154.74251 2.636750 2.190333 0.8306944
## 687 LCK 17.61133 65.07381 2.514199 1.034972 0.4116510
## 27884 CD7 16.26144 56.06913 2.343425 1.108243 0.4729159
## 29533 AES 18.43955 70.87793 2.093202 1.375246 0.6570058
## 10608 LTB 28.89421 159.45043 2.092303 2.798037 1.3373004
## 19620 TMEM123 14.32029 44.15960 1.927135 1.080111 0.5604750
## 13126 GIMAP7 15.18029 49.28053 1.766998 1.463028 0.8279738
## 23336 EVL 14.60454 45.82431 1.724737 1.388695 0.8051632
## 2760 TRAF3IP3 12.59349 34.65300 1.688485 1.196732 0.7087606
## 9465 HINT1 16.30733 56.36579 1.632188 1.719157 1.0532837
## 14620 TMEM66 20.07951 82.96723 1.626411 2.074055 1.2752339
## 18147 IFITM1 11.37413 28.58691 1.610993 1.234452 0.7662674
## 32399 SOD1 11.81773 30.73041 1.587203 1.249588 0.7872896
## 10642 C6orf48 11.79417 30.61475 1.550436 1.341331 0.8651315
## 3668 ZFP36L2 14.42680 44.78009 1.538502 1.771961 1.1517443