Question

Filtering Low Count Genes

0

Entering edit mode

6.2 years ago

gokce.ouz ▴ 70

Hi All,

I am working on TCGA Breast Cancer dataset. I downloaded the Count data using TCGA2STAT.

counts.brca <- getTCGA(disease="BRCA", data.type="RNASeq", type="count")

Extracted the mutation data and clinical data using library("cgdsr")

For downstream analysis I am using DESeq2.

ddsMat<- DESeqDataSetFromMatrix(tcga_comb, sampletable, design=~mut.status)
dds<-DESeq(ddsMat)
dds<-DESeq(dds)
res<-results(dds, contrast= c("mutvswt", "Mut", "WT"))
resOrdered <- res[order(res$log2FoldChange,decreasing=TRUE),]
resOrdered   
log2 fold change (MLE): mutvswt Mut vs WT 
Wald test p-value: mutvswt Mut vs WT 
DataFrame with 20502 rows and 6 columns
               baseMean log2FoldChange     lfcSE      stat       pvalue
              <numeric>      <numeric> <numeric> <numeric>    <numeric>
RPL13AP17      3.544641       2.310630 0.3485938  6.628432 3.392722e-11
PNLIP          1.872278       1.916453 0.4180853  4.583879 4.564283e-06
CST4         616.052645       1.806587 0.1851868  9.755485 1.747628e-22
LOC100287718   2.395814       1.653026 0.3204551  5.158369 2.491100e-07
SEMG2          1.345911       1.647315 0.4141256  3.977814 6.955174e-05
                      padj
                <numeric>
RPL13AP17    1.446313e-09
PNLIP        4.743272e-05
CST4         4.715005e-20
LOC100287718 3.851038e-06
SEMG2        4.909754e-04

However, when I plotted heatmap of the upregulated genes. I did not see clear clustering between Mutant vs WT so I checked the raw count of top upregulated gene(RPL13AP17), I observed only 4 patients have expression higher than 1000 and the rest is super low which I think shouldn't pop-up as significantly upregulated gene. Also the 2nd top upregulated gene (PNLIP) was similar.

select<- rownames(subset(res, padj < 0.001 & log2FoldChange >1))
ntd <- normTransform(dds)
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE)

summary(res)
        out of 20490 with nonzero total read count
    adjusted p-value < 0.1
    LFC > 0 (up)     : 1906, 9.3% 
    LFC < 0 (down)   : 6424, 31% 
    outliers [1]     : 0, 0% 
    low counts [2]   : 795, 3.9% 
    (mean count < 0)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results

table(assay(dds)['RPL13AP17',])

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
 452   73   75   16   19    2    4    2    4    2    1    2    2    1    1    2 
  19   25   26   27   47   89   93  132  228  369  488  577  731 1737 1754 1983 
   1    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
6852 
   1 
table(assay(dds)['PNLIP',])

  0   1   2   3   4   5   6   7   8   9  10  12  13  15  16  17  18  20  22  25 
526  29  50   4  13   1   9   1   4   1   3   5   2   2   1   1   2   2   1   1 
 26  30  36  42  45  50  51  60  70  73  76  79  83 110 111 131 423 
  2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

I thought independent filtering was taking care of these low count genes in results() function. I will be really glad if you could suggest me how can I remove these genes to get the distinctly differentially expressed genes.

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

Matrix products: default
BLAS: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRblas.so
LAPACK: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] hexbin_1.27.1              TCGA2STAT_1.2             
 [3] pheatmap_1.0.8             genefilter_1.58.1         
 [5] gplots_3.0.1               RColorBrewer_1.1-2        
 [7] vsn_3.44.0                 org.Hs.eg.db_3.4.1        
 [9] DESeq2_1.16.1              BiocParallel_1.10.1       
[11] GenomicAlignments_1.12.2   SummarizedExperiment_1.6.5
[13] DelayedArray_0.2.7         matrixStats_0.52.2        
[15] GenomicFeatures_1.28.5     AnnotationDbi_1.38.2      
[17] Biobase_2.36.2             Rsamtools_1.28.0          
[19] Biostrings_2.44.2          XVector_0.16.0            
[21] GenomicRanges_1.28.5       GenomeInfoDb_1.12.2       
[23] IRanges_2.10.3             S4Vectors_0.14.5          
[25] BiocGenerics_0.22.0        BiocInstaller_1.28.0      

loaded via a namespace (and not attached):
 [1] bit64_0.9-7             splines_3.4.1           gtools_3.5.0           
 [4] Formula_1.2-2           affy_1.54.0             latticeExtra_0.6-28    
 [7] blob_1.1.0              GenomeInfoDbData_0.99.0 RSQLite_2.0            
[10] backports_1.1.1         lattice_0.20-35         limma_3.32.7           
[13] digest_0.6.12           checkmate_1.8.4         colorspace_1.3-2       
[16] preprocessCore_1.38.1   htmltools_0.3.6         Matrix_1.2-11          
[19] R.oo_1.21.0             plyr_1.8.4              pkgconfig_2.0.1        
[22] XML_3.98-1.9            biomaRt_2.32.1          zlibbioc_1.22.0        
[25] xtable_1.8-2            scales_0.5.0            gdata_2.18.0           
[28] affyio_1.46.0           htmlTable_1.9           tibble_1.3.4           
[31] annotate_1.54.0         ggplot2_2.2.1           nnet_7.3-12            
[34] lazyeval_0.2.0          survival_2.41-3         magrittr_1.5           
[37] memoise_1.1.0           R.methodsS3_1.7.1       foreign_0.8-69         
[40] tools_3.4.1             data.table_1.10.4       stringr_1.2.0          
[43] munsell_0.4.3           locfit_1.5-9.1          cluster_2.0.6          
[46] compiler_3.4.1          caTools_1.17.1          rlang_0.1.2            
[49] grid_3.4.1              RCurl_1.95-4.8          htmlwidgets_0.9        
[52] labeling_0.3            bitops_1.0-6            base64enc_0.1-3        
[55] gtable_0.2.0            DBI_0.7                 gridExtra_2.3          
[58] knitr_1.17              rtracklayer_1.36.4      bit_1.1-12             
[61] Hmisc_4.0-3             KernSmooth_2.23-15      stringi_1.1.5          
[64] Rcpp_0.12.13            geneplotter_1.54.0      rpart_4.1-11           
[67] acepack_1.4.1

DESeq2 TCGA2STAT RNA-Seq R • 4.3k views

ADD COMMENT • link 6.2 years ago by gokce.ouz ▴ 70

0

Entering edit mode

Try basing the independent filtering on the median count rather than mean or sum.

ADD REPLY • link 6.2 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon,

Thanks for your reply. I tried your suggestion, but I have doubts about my method. I made the change at the results() function as :

res<-results(dds, contrast= c("mutvswt", "Mut", "WT"), filter= rowMedians(counts(dds)))

Is this the correct approach ?

res[order(res$log2FoldChange, res$padj, decreasing=TRUE), ]
log2 fold change (MLE): mutvswt Mut vs WT 
Wald test p-value: mutvswt Mut vs WT 
DataFrame with 20502 rows and 6 columns
               baseMean log2FoldChange     lfcSE      stat       pvalue
              <numeric>      <numeric> <numeric> <numeric>    <numeric>
RPL13AP17      3.544641       2.310630 0.3485938  6.628432 3.392722e-11
PNLIP          1.872278       1.916453 0.4180853  4.583879 4.564283e-06
CST4         616.052645       1.806587 0.1851868  9.755485 1.747628e-22
LOC100287718   2.395814       1.653026 0.3204551  5.158369 2.491100e-07
SEMG2          1.345911       1.647315 0.4141256  3.977814 6.955174e-05
...                 ...            ...       ...       ...          ...
SNORD115-16           0             NA        NA        NA           NA
SNORD115-31           0             NA        NA        NA           NA
SNORD115-37           0             NA        NA        NA           NA
SNORD115-39           0             NA        NA        NA           NA
SNORD115-3            0             NA        NA        NA           NA
                     padj
                <numeric>
RPL13AP17              NA
PNLIP                  NA
CST4         4.785417e-20
LOC100287718           NA
SEMG2                  NA
...                   ...
SNORD115-16            NA
SNORD115-31            NA
SNORD115-37            NA
SNORD115-39            NA
SNORD115-3             NA

Even though the padj became NA, I still get the same genes as top hits. What is more, I plotted the top upregulated genes after omitting the NAs, but my heatmap still doesnt show the differential gene expression between Mutant vs wt.How can I improve the heatmap ?

enter image description here

ADD REPLY • link 6.2 years ago by gokce.ouz ▴ 70

0

Entering edit mode

How are you plotting the heatmap? I bet the problem is with the plot, not the analysis.

This dataset seems to contain something between 50-100 samples, with this number of samples there is statistical power even for genes with low counts, so unless you are definitely not interested in low counts genes, i would keep them.

ADD REPLY • link 6.2 years ago by h.mon 35k

0

Entering edit mode

Hi h.mon,

We are trying to find genes that are significantly differentially expressed to identify as signature genes so we need set of genes that show distinct expressional difference between Mutant vs WT.

There is 676 patients in that heatmap:

dim(assay(dds)
[1] 20502   676

This is how I plot the heatmap. I used both normTrnasform and vsd normalization but it did not change the heatmap:

select<- rownames(subset(res, padj < 0.001 & log2FoldChange >1))
ntd <- normTransform(dds)
df <- as.data.frame(colData(dds)[,c("mutvswt","IHC_HER2")])

pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE, annotation_col=df)

NTD Heatmap

vsd<- varianceStabilizingTransformation(dds, blind= FALSE)
df <- as.data.frame(colData(dds)[,c("mutvswt","IHC_HER2")])
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE, annotation_col=df)

VSD Heatmap

ADD REPLY • link 6.2 years ago by gokce.ouz ▴ 70

1

Entering edit mode

Try this:

select<- rownames( subset( res, padj < 0.001 & abs( log2FoldChange >1 ) ) )

ADD REPLY • link 6.2 years ago by h.mon 35k