In this section we perform Differential Expression following our previous analysis on alignment. Differential Expression is basically focused on finding out which genes are either being upregulated or down-regulated between the cases compared to controls.
In our analyis we had 6 samples (samples 37-42), for which samples 37-39 were normal samples and samples 40-42 were diseased samples.
Counts to be used have been generated from two approaches of alignment, ie;
HISAT2
, andKallisto
library(DESeq2)
library(GenomicFeatures) #For creating the tx2gene file for kallisto counts import
library(tximport) #Importing kallisto counts
library(pheatmap) #For generating heatmaps
library(vsn) #for plotting dispersion plot
library("RColorBrewer")
require(genefilter)
require(calibrate)
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db) #For including gene symbols
#BiocManager::install("reactome.db")
library(reactome.db) #For gene function annotations
library(AnnotationDbi)
#Reading the metadata file
meta <- read.csv("practice.dataset.metadata.tsv", sep="\t", row.names = 1, header = T, stringsAsFactors = T)
meta
levels(meta$Condition)
## [1] "disease" "normal"
Looking at our metadata file, the levels for the condition are set as disease
followed by normal
because by default R follows alphabetical order to arrange the levels of a factor. However, when using DESeq the first level is treated as the control (in our case its normal), and the second is taken as the case group (in our case the disease).
So we will have to rearrange our levels so that normal is picked as the reference/Control group and disease as the case group.
To achieve this will use the relevel
function and specify normal as our reference.
#Releveling the metadata Conditions
meta$Condition <- relevel(meta$Condition, ref = "normal")
#Check our levels again
levels(meta$Condition)
## [1] "normal" "disease"
We now proceed to import the counts from the 2 programs we used during our alignment in 2 different ways as specified by the DESeq2 manual.
DESeq2 has basically 4 ways of importing counts to create a DESeqDataSet
object that can be used for downstream analysis. These inlcude;
tximport
.htseq-count
files.SummarizedExperiment
object.Featurecounts produces a single matrix file containing the counts for all the samples used. To import feature counts matrix, we use the DESeqDataSetFromMatrix
function from DESeq2 to create an object that can be converted into a DESeq object.
# Importing HISAT2's featurecounts
countdata <- read.csv("Hisat_featurecounts/hisat_counts.txt", sep="\t", header=T, row.names=1, comment.char = "#")
#countdata <- read.csv("hisat2/hisat2_counts.txt", sep="\t", header=T, row.names=1, comment.char = "#")
head(countdata)
#Check the column names
colnames(countdata)
## [1] "Chr" "Start"
## [3] "End" "Strand"
## [5] "Length" "sample37_hisat_sorted.bam"
## [7] "sample38_hisat_sorted.bam" "sample39_hisat_sorted.bam"
## [9] "sample40_hisat_sorted.bam" "sample41_hisat_sorted.bam"
## [11] "sample42_hisat_sorted.bam"
#Remove the unwanted columns
countdata[c("Chr", "Start", "End", "Strand", "Length")] <- NULL
#Renaming the Colnames (to remove the suffix "_hisat_sorted.bam")
#colnames(countdata) <- gsub("hisat2", "", colnames(countdata))
#colnames(countdata) <- gsub("_hisat2_sorted.bam", "", colnames(countdata))
colnames(countdata) <- gsub("_hisat_sorted.bam", "", colnames(countdata))
head(countdata)
#Check if samples in the countdata matrix have a corresponding entry in the metadata file
all(rownames(meta) %in% colnames(countdata))
## [1] TRUE
#Check if the order of the samples in the matrix is similar to that in the metadata file.
all(colnames(countdata) == rownames(meta))
## [1] TRUE
#In case the Order is not the same, we shall rearrange our count matrix using rownames of meta using;
#countdata <- countdata[, rownames(meta)]
#Importing the data into DESeqDataSet Object
dds <- DESeqDataSetFromMatrix(countData = countdata,
colData = meta,
design = ~ Condition)
dds
## class: DESeqDataSet
## dim: 60683 6
## metadata(1): version
## assays(1): counts
## rownames(60683): ENSG00000223972 ENSG00000227232 ... ENSG00000277475
## ENSG00000268674
## rowData names(0):
## colnames(6): sample37 sample38 ... sample41 sample42
## colData names(1): Condition
Kallisto unlike featurecounts, it outputs each sample’s count matrix in a separate file. We then use tximport to help us gather all these sample files from the provided directory so that DESeq2 can concatenate them into a single DESeqdataset object used for downstream analysis.
One extra file required by tximport is a file that maps the transcript IDs in our alignment to the gene IDs, which we shall call tx2gene
. This file is created using the GenomicFeatures
Bioconductor package, which requires the annotation of the tracscriptome used during the pseudo alignment.
# Kallisto
#Download the annotation file (Run this in case the file is not yet downloaded.)
url="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gff3.gz"
#download.file(url = url, "gencode.v34.annotation.gff3.gz")
#Create a table that links transcripts to genes for this dataset.
#We use the GenomicFeatures for this to create the tx2gene
# We create the TxDb obeject from the transcript annotation in the gff3 format
txdb <- makeTxDbFromGFF("gencode.v34.annotation.gff3.gz")
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
## stop_codon. This information was ignored.
## OK
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: gencode.v34.annotation.gff3.gz
## # Organism: NA
## # Taxonomy ID: NA
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 228048
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2020-08-26 22:47:52 +0300 (Wed, 26 Aug 2020)
## # GenomicFeatures version at creation time: 1.41.2
## # RSQLite version at creation time: 2.2.0
## # DBSCHEMAVERSION: 1.2
gene_id <- keys(txdb, keytype = "GENEID") #Extracct the GeneIDs
#Matching the transcripts with the geneIDs
tx2gene <- AnnotationDbi::select(txdb, keys = gene_id, keytype = "GENEID", columns = "TXNAME")
## 'select()' returned 1:many mapping between keys and columns
Now that we have out tx2gene
object ready, we go ahead to read the counts using tximport. We use the DESeqDataSetFromTximport
function from DESeq2 to achive this.
#Dircetory for Kallisto counts
kal_dir <- "Kallisto"
files <- file.path(kal_dir, rownames(meta),"abundance.h5" )
names(files) <- rownames(meta)
files
## sample37 sample38
## "Kallisto/sample37/abundance.h5" "Kallisto/sample38/abundance.h5"
## sample39 sample40
## "Kallisto/sample39/abundance.h5" "Kallisto/sample40/abundance.h5"
## sample41 sample42
## "Kallisto/sample41/abundance.h5" "Kallisto/sample42/abundance.h5"
#Reading the files for the different samples
txi <- tximport(files, type="kallisto",tx2gene = tx2gene, txOut = T)
## 1 2 3 4 5 6
names(txi)
## [1] "abundance" "counts" "infReps"
## [4] "length" "countsFromAbundance"
#Construct the DESeqDataSet object
kallisto_dds <- DESeqDataSetFromTximport(txi,
colData = meta,
design = ~ Condition)
## using counts and average transcript lengths from tximport
kallisto_dds
## class: DESeqDataSet
## dim: 228048 6
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(228048):
## ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
## ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|
## ...
## ENST00000387460.2|ENSG00000210195.2|-|-|MT-TT-201|MT-TT|66|Mt_tRNA|
## ENST00000387461.2|ENSG00000210196.2|-|-|MT-TP-201|MT-TP|68|Mt_tRNA|
## rowData names(0):
## colnames(6): sample37 sample38 ... sample41 sample42
## colData names(1): Condition
In our analysis, we shall focus more on the counts produced by HISAT2 and featurecounts as our primary results for the differential expression analysis. If one is interested in usisng the output from kallisto as the primary results, then simply rename the names of the kallisto object to that being used here (ie that generated by the hisat2)
We shall pass our DESeqDataset object through the DESeq() function which will first normalize our counts, then performs an estimation of size factors (which control for differences in the library size of the sequencing experiments) and finally the estimation of dispersion for each gene by fitting the object to a generalized linear model,glm and producing the corresponding pvalues at a 90% confidence.
Normalization of counts is such an import step because of 2 main reasons;
We then use the result()
function to extract out the results from the DESeq() output.
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#Extracting the results from the formed dds object
res <- results(dds)
head(res, 10)
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000223972 0.000000 NA NA NA NA
## ENSG00000227232 71.877192 -0.617388 0.547883 -1.1268609 0.2598013
## ENSG00000278267 9.328531 -2.539808 1.067946 -2.3782174 0.0173966
## ENSG00000243485 0.152211 0.780782 4.080473 0.1913459 0.8482546
## ENSG00000284332 0.000000 NA NA NA NA
## ENSG00000237613 0.333464 1.771850 4.030219 0.4396412 0.6601970
## ENSG00000268020 0.000000 NA NA NA NA
## ENSG00000240361 0.000000 NA NA NA NA
## ENSG00000186092 0.000000 NA NA NA NA
## ENSG00000238009 1.382485 -0.177357 3.283507 -0.0540145 0.9569236
## padj
## <numeric>
## ENSG00000223972 NA
## ENSG00000227232 0.574020
## ENSG00000278267 0.137254
## ENSG00000243485 NA
## ENSG00000284332 NA
## ENSG00000237613 NA
## ENSG00000268020 NA
## ENSG00000240361 NA
## ENSG00000186092 NA
## ENSG00000238009 NA
The res object generated has 6 columns with the rownames being the individual genes.
And for values assigned as NAs are based on the criteria below;
#Generate a summary of res object
summary(res)
##
## out of 38260 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1497, 3.9%
## LFC < 0 (down) : 1004, 2.6%
## outliers [1] : 287, 0.75%
## low counts [2] : 12931, 34%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
The summary of res above gives us an idea of the possible number of differentially expressed genes (both upregulated and downregulated), as well possible counts of outliers in the our data at 0.1 significance level by default. From this we can see that we have 1497 upregulated genes (constituting about 3.9%) and 1004 downregulated genes (constituting about 2.6%) whereas 287 (0.75) and 12931 (34%) of the counts are considered as outliers and low counts respectively.
A quality assessment of our the results is one of the most crucial steps reccommended to be done before any further analysis is performed. This typically informs us how good or bad our data is and how best it is suited for the analysis to be performed and whether any results need to filtered out.
However with DESeq2, this is not so much of a major concern since the quality control is already implemented in the package function DESeq() that caters for any low counts and outliers.
Here we shall simply use plots to give us an idea of the data we are planning to use for our downstream analysis.
The 2 common plots to achieve this include the heatmap and PCA.
For generating heatmaps and more quality plots it is reccommended to first transform our data.
Testing for DE operates on raw counts, however for visualizations and clusterings, its better to work with transformed count data. There are 2 common waays of transformation; which produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.
The goal of transformation is to eliminate any variance arising when the mean is low.
One common argument used during transformation is blind which tells whether transformation should be blind to sample information specified in the the design (ie in an unbiased manner), hence using only the intercept. However this is not the best choice especially if the counts difference are attributed to the design and if one is to use the transformed data for downstream analysis. If blind = FALSE
is set, we take into account already estimated dispersion due to design as we are transforming the data, and make the process much faster.
In Comparison: VST runs faster than rlog. If the library size of the samples and therefore their size factors vary widely, the rlog transformation is a better option than VST. Both options produce log2 scale data which has been normalized by the DESeq2 method with respect to library size.
#Count data Transformations
vsd <- vst(dds, blind=FALSE) #computing for Variance Stabilizing Transformation
rld <- rlog(dds, blind=FALSE) #Computing for regularised logarithm
#A look at the generated matrices in contrast to the original
#Original count matrix
head(countdata)
#Count matrix for vsd
head(assay(vsd)) #Assay extracts a matrix of normalized counts
## sample37 sample38 sample39 sample40 sample41 sample42
## ENSG00000223972 5.281749 5.281749 5.281749 5.281749 5.281749 5.281749
## ENSG00000227232 7.177809 6.902050 7.635498 6.856577 7.026329 6.915044
## ENSG00000278267 6.020474 6.088279 6.404361 5.593753 5.698105 5.694927
## ENSG00000243485 5.281749 5.281749 5.281749 5.502584 5.281749 5.281749
## ENSG00000284332 5.281749 5.281749 5.281749 5.281749 5.281749 5.281749
## ENSG00000237613 5.281749 5.281749 5.281749 5.502584 5.522688 5.281749
#Count matrix produced by rlog
head(assay(rld))
## sample37 sample38 sample39 sample40 sample41 sample42
## ENSG00000223972 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## ENSG00000227232 6.210644 5.891107 6.701883 5.835198 6.038031 5.905750
## ENSG00000278267 3.095538 3.190424 3.618631 2.546669 2.670309 2.660245
## ENSG00000243485 -1.979000 -1.982334 -1.988459 -1.960492 -1.982850 -1.986563
## ENSG00000284332 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## ENSG00000237613 -1.503514 -1.510533 -1.523192 -1.464545 -1.460042 -1.519438
Effects of transformation on the variance We will use the dispersion plot to explore the effects of transformation on the variance in our data. This plots the standard deviation of the transformed data across samples, against the mean, using the shifted logarithm transformation, the regularized log transformation and the variance stabilizing transformation.
What we expect: The shifted logarithm has elevated standard deviation in the lower count range, and the regularized log to a lesser extent, while for the variance stabilized data the standard deviation is roughly constant along the whole dynamic range.
library(vsn)
#Dispersion Plots
#for the untransformed dds
ntd <- normTransform(dds)
meanSdPlot(assay(ntd))
#for vst
meanSdPlot(assay(vsd))
#For rlog
meanSdPlot(assay(rld))
We generally do not see much change between rlog and vst in this case, however there is a noticiable improvement in the the variance as compared to the untransformed data, shown by a fairly lower fitted line.
We will use the vst transformed data obtained from above for the plots, since its the most reccomended approach.
We will first use the heatmap to explore the quality for the count matrix.
select <- order(rowMeans(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds))
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
This heatmap gives us an overview of the similarities and dissimilarities between the samples.
#Calculate sample-2-sample distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists) #Converting the dist object to matrix
rownames(sampleDistMatrix) <- vsd$Condition
colnames(sampleDistMatrix) <- rownames(meta)
#Defining the colors
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
From this matrix we can generally observe a larger simarity between intra-sample groups eg more similarity between samples 40-41 and samples 37-39, and we can also note major dissimilarities between disease group compared to normal samples shown by much lighter colors.
However we do note one unexpected difference here. Sample37 which belongs to the normal samples seems to be very different from the other samples both in the same category and different category as well, which is one thing we can’t have a conclusive answer for yet as to wether the changes were generated after the experiment or they are phsyiological changes.
This plot is similar to the distane matrix above which will basically cluster the samples basing on their properties.
This type of plot is useful for visualizing the overall effect of experimental covariates, relatedness and batch effects.
plotPCA(vsd, intgroup="Condition")
From this plot we generally see disease samples clustering together, and 2 of the normal samples clustering together apart from one.
We will redraw the plots to include their sample labels to be able to tell which sample is generally different among the normal.
#Including samples in the pca plot
#PCA Analysis
mycols <- brewer.pal(8, "Dark2")[1:length(unique(meta$Condition))]
rld_pca <- function (rld, intgroup = "Condition", ntop = 500, colors=NULL, legendpos="bottomleft", main="PCA Biplot", textcx=1, ...) {
require(RColorBrewer)
rv = rowVars(assay(rld))
select = order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
pca = prcomp(t(assay(rld)[select, ]))
fac = factor(apply(as.data.frame(colData(rld)[, intgroup, drop = FALSE]), 1, paste, collapse = " : "))
if (is.null(colors)) {
if (nlevels(fac) >= 3) {
colors = brewer.pal(nlevels(fac), "Paired")
} else {
colors = c("black", "red")
}
}
pc1var <- round(summary(pca)$importance[2,1]*100, digits=1)
pc2var <- round(summary(pca)$importance[2,2]*100, digits=1)
pc1lab <- paste0("PC1 (",as.character(pc1var),"%)")
pc2lab <- paste0("PC1 (",as.character(pc2var),"%)")
plot(PC2~PC1, data=as.data.frame(pca$x), bg=colors[fac], pch=21, xlab=pc1lab, ylab=pc2lab, main=main, ...)
with(as.data.frame(pca$x), textxy(PC1, PC2, labs=rownames(as.data.frame(pca$x)), cex=textcx))
legend(legendpos, legend=levels(fac), col=colors, pch=20)
}
rld_pca(rld, colors=mycols, intgroup="Condition", xlim=c(-75, 35))
After adding labels, we can clearly tell that indeed it was sample37 that was clustering differently compared to the rest.
This plot generally show how gene expression data differs across samples in the same treatment group.
#Dispersion plot
plotDispEsts(dds, main="Dispersion plot for DESeq")
This plot is a quick visual for genes that are upregulated as well those downregulated, possible outliers, and those condidered not significant. Its a better practice to plot the shrinked values for this this plot as compared to the initial results.
To generate plots and ranking of genes, its better we first perform a Shrinkage of effect size (ie the LFC estimates) that helps in shrinking any low counts/ low effect size that are close to zero towards the zero line. This helps in cleaning out the plots (very useful for the MA plot) for easier interpretation without even the need of filtering out low counts in the results.
There are generally 3 methods for performing effect size shrinkage ie using the
apeglm
which is usally the default method,ashr
andnormal
which was the initial method implemented by DESeq2. Among this the best recommended is the apeglm and most discouraged for use is normal.
For exploration purposes we will test all of the above methods on our results to better understand the difference it makes (if any).
One thing we need to provide to the lfcshrink()
function on top of the method to use is the coefficient obtained from the resultsNames()
which basically tells the shrink function what Condition is to be compared against what.
#Obtain the coefficient to use for shrinkage
resultsNames(dds)
## [1] "Intercept" "Condition_disease_vs_normal"
In our case its Condition_disease_vs_normal
or the second object in the output which basically tells us that we are comparing diseased samples against normal samples.
#Computing for size shrikage using apeglm
resLFC <- lfcShrink(dds, coef="Condition_disease_vs_normal", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
head(resLFC)
## log2 fold change (MAP): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 6 rows and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000223972 0.000000 NA NA NA NA
## ENSG00000227232 71.877192 -0.3271584 0.433315 0.2598013 0.574020
## ENSG00000278267 9.328531 -1.1171275 1.567943 0.0173966 0.137254
## ENSG00000243485 0.152211 0.0341144 0.535381 0.8482546 NA
## ENSG00000284332 0.000000 NA NA NA NA
## ENSG00000237613 0.333464 0.0441123 0.536771 0.6601970 NA
Lets now generate the MA plot for the shrinked value.
#MA plot for shrinked values
plotMA(resLFC, ylim=c(-2,2))
This plot basically has 3 features of our interest explained below. Each dot represented here is a gene.
grey
dots represent the genes whose LFC is considered not signicant to our condition.blue
dots represent genes whose LFC is considered significant. Dots above the line represent upregulated genes, and dots below the line represent downregulated genesWe shall now visualize the MA plot with first the initial res object where we hadn’t performed the size shrinkage and compare it to when we have performed shrinkage using apeglm.
# comparison of MA plot for unshrinked values and shrinked valued
par(mfrow=c(1,2))
plotMA(res, ylim=c(-2,2), main="Unshrunken") #Unshrinked values
plotMA(resLFC, ylim=c(-2,2), main="Shrunken ") #Shrinked values
One quick observation we can give from the plots above is that we have alot of unsignificants genes being represented in the unshrunken plot as compared to the shrunken. This is the exact reason why we perform size shrinkage, ie clear the plots and make it more easily interpretable.
Compare the shrinkage methods
We will now compute the size shrinkage using the other 2 methods and determine the difference it would make.
#Alternative shrinkage estimators
#using the ashr method
resAsh <- lfcShrink(dds, coef=2, type="ashr")
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
#Using the normal method
resNorm <- lfcShrink(dds, coef=2, type="normal")
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
##
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
#Generating MA plots comparing the 3 shrinkage estimators
#mar sets the margin sizes in the following order: bottom, left, top, right
par(mfrow=c(1,3), mar=c(4,4,2,1))
xlim <- c(1,1e5); ylim <- c(-3,3)
plotMA(resLFC, xlim=xlim, ylim=ylim, main="apeglm")
plotMA(resNorm, xlim=xlim, ylim=ylim, main="normal")
plotMA(resAsh, xlim=xlim, ylim=ylim, main="ashr")
Among these plots, the plot for shrinakge using the normal estimate contains more unsignificant genes followed by ashr which would probably explain why apeglm is a better prefered estimator.
Lets first make a histogram plot of p-value frequency.
#Exploring p-value frequency
hist(res$pvalue, breaks=50, col="grey")
We can see from this that we have some good number of genes whose pvalue is significant (ie fall below 0.1), however most of the pvalues for the genes is not considered significant.
From our res object we will go head to only subset out the significant results. We will consider a gene’s expression significant if its pvalue is below 0.05, ie 95% confidence.
#Reorder res based on pvalues
resOrdered <- res[order(res$pvalue),]
head(resOrdered)
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000039537 431.823 6.92551 0.731458 9.46809 2.85007e-21
## ENSG00000124237 241.613 7.98886 0.855855 9.33435 1.01608e-20
## ENSG00000160401 507.498 6.03790 0.658602 9.16775 4.82988e-20
## ENSG00000007908 1332.119 6.21421 0.693271 8.96362 3.14198e-19
## ENSG00000188817 409.116 5.82887 0.667047 8.73832 2.36594e-18
## ENSG00000168658 330.530 11.73927 1.400477 8.38234 5.18865e-17
## padj
## <numeric>
## ENSG00000039537 7.13715e-17
## ENSG00000124237 1.27223e-16
## ENSG00000160401 4.03166e-16
## ENSG00000007908 1.96703e-15
## ENSG00000188817 1.18496e-14
## ENSG00000168658 1.78339e-13
#Only subsettting genes with Significant padj value
resSig <- subset(resOrdered, padj < 0.05)
#Sum of significant genes
nrow(resSig)
## [1] 1671
This plots the normalized counts for each sample in the particular gene
Lets start by plotting the counts of the most significant gene ie the one with the lowest pvalue.
#For the gene with the minimum padj
plotCounts(dds, gene=which.min(res$padj), intgroup="Condition")
From this we can see that normal samples are having a fairly equal count for the gene as well as the disease samples, however there is a clear cut difference between the counts for this gene in the 2 categories.
Our interests will now be generating a count for the top 6 significant genes, the lowest 6 significant genes, as well as the unsignificant genes for better contrast.
#Top 6 significant genes
head(resSig)
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000039537 431.823 6.92551 0.731458 9.46809 2.85007e-21
## ENSG00000124237 241.613 7.98886 0.855855 9.33435 1.01608e-20
## ENSG00000160401 507.498 6.03790 0.658602 9.16775 4.82988e-20
## ENSG00000007908 1332.119 6.21421 0.693271 8.96362 3.14198e-19
## ENSG00000188817 409.116 5.82887 0.667047 8.73832 2.36594e-18
## ENSG00000168658 330.530 11.73927 1.400477 8.38234 5.18865e-17
## padj
## <numeric>
## ENSG00000039537 7.13715e-17
## ENSG00000124237 1.27223e-16
## ENSG00000160401 4.03166e-16
## ENSG00000007908 1.96703e-15
## ENSG00000188817 1.18496e-14
## ENSG00000168658 1.78339e-13
#Plot counts for the top 6
par(mfrow=c(2,3))
plotCounts(dds, gene="ENSG00000039537", intgroup="Condition")
plotCounts(dds, gene="ENSG00000124237", intgroup="Condition")
plotCounts(dds, gene="ENSG00000160401", intgroup="Condition")
plotCounts(dds, gene="ENSG00000007908", intgroup="Condition")
plotCounts(dds, gene="ENSG00000188817", intgroup="Condition")
plotCounts(dds, gene="ENSG00000168658", intgroup="Condition")
#Lowest 6 Significant genes
tail(resSig)
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000221829 333.2119 -1.73864 0.591901 -2.93738 0.00331003
## ENSG00000196205 10767.8096 1.30891 0.445730 2.93655 0.00331885
## ENSG00000277791 1475.5847 -1.75008 0.596127 -2.93575 0.00332746
## ENSG00000146700 46.0253 -2.40770 0.820261 -2.93529 0.00333235
## ENSG00000179715 505.7102 -1.81465 0.618222 -2.93527 0.00333258
## ENSG00000178999 88.2522 -2.81577 0.959294 -2.93526 0.00333273
## padj
## <numeric>
## ENSG00000221829 0.0497538
## ENSG00000196205 0.0498565
## ENSG00000277791 0.0499451
## ENSG00000146700 0.0499451
## ENSG00000179715 0.0499451
## ENSG00000178999 0.0499451
par(mfrow=c(2,3))
plotCounts(dds, gene="ENSG00000221829", intgroup="Condition")
plotCounts(dds, gene="ENSG00000196205", intgroup="Condition")
plotCounts(dds, gene="ENSG00000277791", intgroup="Condition")
plotCounts(dds, gene="ENSG00000146700", intgroup="Condition")
plotCounts(dds, gene="ENSG00000179715", intgroup="Condition")
plotCounts(dds, gene="ENSG00000178999", intgroup="Condition")
#Confirming the plots with that of unsignificant genes
tail(na.omit(resOrdered))
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000070759 119.6107 -3.45224e-04 0.552954 -0.000624328 0.999502
## ENSG00000236204 43.0285 -3.28937e-04 0.819029 -0.000401618 0.999680
## ENSG00000119403 737.9646 2.70902e-04 0.733961 0.000369096 0.999706
## ENSG00000198416 25.1621 -2.94144e-04 1.266748 -0.000232204 0.999815
## ENSG00000157625 1484.0039 4.87119e-05 0.346662 0.000140517 0.999888
## ENSG00000107679 2538.6277 3.49144e-05 0.559391 0.000062415 0.999950
## padj
## <numeric>
## ENSG00000070759 0.999701
## ENSG00000236204 0.999825
## ENSG00000119403 0.999825
## ENSG00000198416 0.999895
## ENSG00000157625 0.999928
## ENSG00000107679 0.999950
par(mfrow=c(2,3))
plotCounts(dds, gene="ENSG00000070759", intgroup="Condition")
plotCounts(dds, gene="ENSG00000236204", intgroup="Condition")
plotCounts(dds, gene="ENSG00000119403", intgroup="Condition")
plotCounts(dds, gene="ENSG00000198416", intgroup="Condition")
plotCounts(dds, gene="ENSG00000157625", intgroup="Condition")
plotCounts(dds, gene="ENSG00000107679", intgroup="Condition")
After us discovering which genes are differentially expressed in the 2 connditions, our next goal will now be to determine the function they play and determine whether the biological process played by the gene has an attachment to the condition of study.
To achieve this we will first need to attach gene IDs to the pre-existing Ensembl gene IDs. These gene IDs will later be used to access the corresponding entry information for a particular gene including its function.
We will make use of the functions obtained from the tutorial here.
We shall use the org.Hs.eg.db
package that has information stored in as AnnotationDbi database. This contains information linking to a number of databases however our primary interest is with the Ensembl database.
#list of all available key types
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "IPI" "MAP" "OMIM"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
## [26] "UNIPROT"
convertIDs <- function( ids, fromKey, toKey, db, ifMultiple=c( "putNA", "useFirst" ) ) {
stopifnot( inherits( db, "AnnotationDb" ) )
ifMultiple <- match.arg( ifMultiple )
suppressWarnings( selRes <- AnnotationDbi::select(
db, keys=ids, keytype=fromKey, columns=c(fromKey,toKey) ) )
if( ifMultiple == "putNA" ) {
duplicatedIds <- selRes[ duplicated( selRes[,1] ), 1 ]
selRes <- selRes[ ! selRes[,1] %in% duplicatedIds, ] }
return( selRes[ match( ids, selRes[,1] ), 2 ] )
}
res1 <- res #making a backup
res$hgnc_symbol <- convertIDs( row.names(res), "ENSEMBL", "SYMBOL", org.Hs.eg.db )
res$entrezid <- convertIDs( row.names(res), "ENSEMBL", "ENTREZID", org.Hs.eg.db )
head(res, 10)
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 10 rows and 8 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000223972 0.000000 NA NA NA NA
## ENSG00000227232 71.877192 -0.617388 0.547883 -1.1268609 0.2598013
## ENSG00000278267 9.328531 -2.539808 1.067946 -2.3782174 0.0173966
## ENSG00000243485 0.152211 0.780782 4.080473 0.1913459 0.8482546
## ENSG00000284332 0.000000 NA NA NA NA
## ENSG00000237613 0.333464 1.771850 4.030219 0.4396412 0.6601970
## ENSG00000268020 0.000000 NA NA NA NA
## ENSG00000240361 0.000000 NA NA NA NA
## ENSG00000186092 0.000000 NA NA NA NA
## ENSG00000238009 1.382485 -0.177357 3.283507 -0.0540145 0.9569236
## padj hgnc_symbol entrezid
## <numeric> <character> <character>
## ENSG00000223972 NA DDX11L1 100287102
## ENSG00000227232 0.574020 NA NA
## ENSG00000278267 0.137254 MIR6859-1 102466751
## ENSG00000243485 NA NA NA
## ENSG00000284332 NA MIR1302-2 100302278
## ENSG00000237613 NA FAM138A 645520
## ENSG00000268020 NA NA NA
## ENSG00000240361 NA NA NA
## ENSG00000186092 NA OR4F5 79501
## ENSG00000238009 NA LOC100996442 100996442
We now have the hgnc_symbol
added to our genes. We will use this to search for the corresponding gene function in the reactome database.
The goal of this is to examine whether the observed genes with a strong/significant up- or down-regulation have either something in common or have something to do with the biological process causing the condition.
#Extracting only genes that have corresponding data in reactome db and padj value is not an NA
res2 <- res[ res$entrezid %in% keys( reactome.db, "ENTREZID" ) & !is.na( res$padj ) , ]
head(res2)
## log2 fold change (MLE): Condition disease vs normal
## Wald test p-value: Condition disease vs normal
## DataFrame with 6 rows and 8 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000188976 1426.2300 -0.472664 0.395878 -1.193964 0.23249214
## ENSG00000187642 10.3347 -1.514005 1.689232 -0.896268 0.37010940
## ENSG00000187608 458.4554 -1.542325 0.662844 -2.326828 0.01997440
## ENSG00000188157 5994.5542 -0.601310 0.341325 -1.761694 0.07812093
## ENSG00000162571 77.5936 2.363003 0.796479 2.966812 0.00300905
## ENSG00000186891 119.8685 -2.817709 0.929541 -3.031289 0.00243512
## padj hgnc_symbol entrezid
## <numeric> <character> <character>
## ENSG00000188976 0.5427853 NOC2L 26155
## ENSG00000187642 0.6758754 PERM1 84808
## ENSG00000187608 0.1492830 ISG15 9636
## ENSG00000188157 0.3184607 AGRN 375790
## ENSG00000162571 0.0469611 TTLL10 254173
## ENSG00000186891 0.0411751 TNFRSF18 8784
#Next we use select for AnnotationDbi to map entrez_id to Reactome_IDs
reactomeTable <- AnnotationDbi::select( reactome.db,
keys=as.character(res2$entrezid), keytype="ENTREZID",
columns=c("ENTREZID","REACTOMEID") )
head(reactomeTable, 10)
#We then tell which genes are members of which Reactome Paths.
incm <- do.call( rbind, with(reactomeTable, tapply(
ENTREZID, factor(REACTOMEID), function(x) res2$entrezid %in% x ) ))
colnames(incm) <- res2$entrez
str(incm)
## logi [1:2230, 1:9089] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2230] "R-HSA-1059683" "R-HSA-109581" "R-HSA-109582" "R-HSA-109606" ...
## ..$ : NULL
#We remove all rows corresponding to Reactome Paths with less than 20 or more than 80 assigned genes.
within <- function(x, lower, upper) (x>=lower & x<=upper)
incm <- incm[ within(rowSums(incm), lower=20, upper=80), ]
#We test whether the genes in a Reactome Path behave in a special way in our experiment,
testCategory <- function(reactomeID) {
isMember <- incm[ reactomeID, ]
data.frame(
reactomeID = reactomeID,
numGenes = sum( isMember ),
avgLFC = mean( res2$log2FoldChange[isMember] ),
sdLFC = sd( res2$log2FoldChange[isMember] ),
zValue = mean( res2$log2FoldChange[isMember] ) /sd( res2$log2FoldChange[isMember] ),
strength = sum( res2$log2FoldChange[isMember] ) / sqrt(sum(isMember)),
pvalue = t.test( res2$log2FoldChange[ isMember ] )$p.value,
reactomeName = reactomePATHID2NAME[[reactomeID]],
stringsAsFactors = FALSE ) }
#We call the function for all Paths in our incidence matrix and collect the results in a data frame:
reactomeResult <- do.call( rbind, lapply( rownames(incm), testCategory ) )
head(reactomeResult)
#Now we perform Multiple Testing Adjustment again since we had to compare multiple elememts
reactomeResult$padjust <- p.adjust( reactomeResult$pvalue, "BH" )
#We obtain the reactomeID with significant p.adj values in our comparison of normal vs disease
reactomeResultSignif <- reactomeResult[ reactomeResult$padjust < 0.05, ]
head( reactomeResultSignif[ order(-reactomeResultSignif$strength), ] ,20)[c("reactomeID", "numGenes" ,"reactomeName")]
Since the samples that were used for this analysis to develop the script had no accompayning metedata details about the nature of the disease under investigation, we can not really do much going forward. However in case one knew the condition under study as well an idea of the biological processes probably affecetd by the condition, they would go ahead and subset out the genes that have a function associated with that.
As a Conluding step, we will export our significant results both before the gene annotation and after annotation to include the gene functions to a local file.
We will use the write.table()
function to save a tab delimited file of a dataframe version of the results.
#Exporting the resSig genes
write.table(as.data.frame(resSig),
file="Condition_significant_results.csv")
#Exporting the genes containing their annotation
write.table(as.data.frame(reactomeResultSignif),
file="Gene_function_annotated_significant_results.csv")