Question: Filtering Low Count Genes
0
gravatar for gokce.ouz
21 months ago by
gokce.ouz50
Singapore
gokce.ouz50 wrote:

Hi All,

I am working on TCGA Breast Cancer dataset. I downloaded the Count data using TCGA2STAT.

counts.brca <- getTCGA(disease="BRCA", data.type="RNASeq", type="count")

Extracted the mutation data and clinical data using library("cgdsr")

For downstream analysis I am using DESeq2.

ddsMat<- DESeqDataSetFromMatrix(tcga_comb, sampletable, design=~mut.status)
dds<-DESeq(ddsMat)
dds<-DESeq(dds)
res<-results(dds, contrast= c("mutvswt", "Mut", "WT"))
resOrdered <- res[order(res$log2FoldChange,decreasing=TRUE),]
resOrdered   
log2 fold change (MLE): mutvswt Mut vs WT 
Wald test p-value: mutvswt Mut vs WT 
DataFrame with 20502 rows and 6 columns
               baseMean log2FoldChange     lfcSE      stat       pvalue
              <numeric>      <numeric> <numeric> <numeric>    <numeric>
RPL13AP17      3.544641       2.310630 0.3485938  6.628432 3.392722e-11
PNLIP          1.872278       1.916453 0.4180853  4.583879 4.564283e-06
CST4         616.052645       1.806587 0.1851868  9.755485 1.747628e-22
LOC100287718   2.395814       1.653026 0.3204551  5.158369 2.491100e-07
SEMG2          1.345911       1.647315 0.4141256  3.977814 6.955174e-05
                      padj
                <numeric>
RPL13AP17    1.446313e-09
PNLIP        4.743272e-05
CST4         4.715005e-20
LOC100287718 3.851038e-06
SEMG2        4.909754e-04

However, when I plotted heatmap of the upregulated genes. I did not see clear clustering between Mutant vs WT so I checked the raw count of top upregulated gene(RPL13AP17), I observed only 4 patients have expression higher than 1000 and the rest is super low which I think shouldn't pop-up as significantly upregulated gene. Also the 2nd top upregulated gene (PNLIP) was similar.

select<- rownames(subset(res, padj < 0.001 & log2FoldChange >1))
ntd <- normTransform(dds)
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE)

summary(res)
        out of 20490 with nonzero total read count
    adjusted p-value < 0.1
    LFC > 0 (up)     : 1906, 9.3% 
    LFC < 0 (down)   : 6424, 31% 
    outliers [1]     : 0, 0% 
    low counts [2]   : 795, 3.9% 
    (mean count < 0)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results

table(assay(dds)['RPL13AP17',])

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
 452   73   75   16   19    2    4    2    4    2    1    2    2    1    1    2 
  19   25   26   27   47   89   93  132  228  369  488  577  731 1737 1754 1983 
   1    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
6852 
   1 
table(assay(dds)['PNLIP',])

  0   1   2   3   4   5   6   7   8   9  10  12  13  15  16  17  18  20  22  25 
526  29  50   4  13   1   9   1   4   1   3   5   2   2   1   1   2   2   1   1 
 26  30  36  42  45  50  51  60  70  73  76  79  83 110 111 131 423 
  2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

I thought independent filtering was taking care of these low count genes in results() function. I will be really glad if you could suggest me how can I remove these genes to get the distinctly differentially expressed genes.

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

Matrix products: default
BLAS: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRblas.so
LAPACK: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] hexbin_1.27.1              TCGA2STAT_1.2             
 [3] pheatmap_1.0.8             genefilter_1.58.1         
 [5] gplots_3.0.1               RColorBrewer_1.1-2        
 [7] vsn_3.44.0                 org.Hs.eg.db_3.4.1        
 [9] DESeq2_1.16.1              BiocParallel_1.10.1       
[11] GenomicAlignments_1.12.2   SummarizedExperiment_1.6.5
[13] DelayedArray_0.2.7         matrixStats_0.52.2        
[15] GenomicFeatures_1.28.5     AnnotationDbi_1.38.2      
[17] Biobase_2.36.2             Rsamtools_1.28.0          
[19] Biostrings_2.44.2          XVector_0.16.0            
[21] GenomicRanges_1.28.5       GenomeInfoDb_1.12.2       
[23] IRanges_2.10.3             S4Vectors_0.14.5          
[25] BiocGenerics_0.22.0        BiocInstaller_1.28.0      

loaded via a namespace (and not attached):
 [1] bit64_0.9-7             splines_3.4.1           gtools_3.5.0           
 [4] Formula_1.2-2           affy_1.54.0             latticeExtra_0.6-28    
 [7] blob_1.1.0              GenomeInfoDbData_0.99.0 RSQLite_2.0            
[10] backports_1.1.1         lattice_0.20-35         limma_3.32.7           
[13] digest_0.6.12           checkmate_1.8.4         colorspace_1.3-2       
[16] preprocessCore_1.38.1   htmltools_0.3.6         Matrix_1.2-11          
[19] R.oo_1.21.0             plyr_1.8.4              pkgconfig_2.0.1        
[22] XML_3.98-1.9            biomaRt_2.32.1          zlibbioc_1.22.0        
[25] xtable_1.8-2            scales_0.5.0            gdata_2.18.0           
[28] affyio_1.46.0           htmlTable_1.9           tibble_1.3.4           
[31] annotate_1.54.0         ggplot2_2.2.1           nnet_7.3-12            
[34] lazyeval_0.2.0          survival_2.41-3         magrittr_1.5           
[37] memoise_1.1.0           R.methodsS3_1.7.1       foreign_0.8-69         
[40] tools_3.4.1             data.table_1.10.4       stringr_1.2.0          
[43] munsell_0.4.3           locfit_1.5-9.1          cluster_2.0.6          
[46] compiler_3.4.1          caTools_1.17.1          rlang_0.1.2            
[49] grid_3.4.1              RCurl_1.95-4.8          htmlwidgets_0.9        
[52] labeling_0.3            bitops_1.0-6            base64enc_0.1-3        
[55] gtable_0.2.0            DBI_0.7                 gridExtra_2.3          
[58] knitr_1.17              rtracklayer_1.36.4      bit_1.1-12             
[61] Hmisc_4.0-3             KernSmooth_2.23-15      stringi_1.1.5          
[64] Rcpp_0.12.13            geneplotter_1.54.0      rpart_4.1-11           
[67] acepack_1.4.1
rna-seq deseq2 R tcga2stat • 1.3k views
ADD COMMENTlink written 21 months ago by gokce.ouz50

Try basing the independent filtering on the median count rather than mean or sum.

ADD REPLYlink written 21 months ago by Devon Ryan92k

Hi Devon,

Thanks for your reply. I tried your suggestion, but I have doubts about my method. I made the change at the results() function as :

res<-results(dds, contrast= c("mutvswt", "Mut", "WT"), filter= rowMedians(counts(dds)))

Is this the correct approach ?

res[order(res$log2FoldChange, res$padj, decreasing=TRUE), ]
log2 fold change (MLE): mutvswt Mut vs WT 
Wald test p-value: mutvswt Mut vs WT 
DataFrame with 20502 rows and 6 columns
               baseMean log2FoldChange     lfcSE      stat       pvalue
              <numeric>      <numeric> <numeric> <numeric>    <numeric>
RPL13AP17      3.544641       2.310630 0.3485938  6.628432 3.392722e-11
PNLIP          1.872278       1.916453 0.4180853  4.583879 4.564283e-06
CST4         616.052645       1.806587 0.1851868  9.755485 1.747628e-22
LOC100287718   2.395814       1.653026 0.3204551  5.158369 2.491100e-07
SEMG2          1.345911       1.647315 0.4141256  3.977814 6.955174e-05
...                 ...            ...       ...       ...          ...
SNORD115-16           0             NA        NA        NA           NA
SNORD115-31           0             NA        NA        NA           NA
SNORD115-37           0             NA        NA        NA           NA
SNORD115-39           0             NA        NA        NA           NA
SNORD115-3            0             NA        NA        NA           NA
                     padj
                <numeric>
RPL13AP17              NA
PNLIP                  NA
CST4         4.785417e-20
LOC100287718           NA
SEMG2                  NA
...                   ...
SNORD115-16            NA
SNORD115-31            NA
SNORD115-37            NA
SNORD115-39            NA
SNORD115-3             NA

Even though the padj became NA, I still get the same genes as top hits. What is more, I plotted the top upregulated genes after omitting the NAs, but my heatmap still doesnt show the differential gene expression between Mutant vs wt.How can I improve the heatmap ?

enter image description here

ADD REPLYlink modified 21 months ago • written 21 months ago by gokce.ouz50

How are you plotting the heatmap? I bet the problem is with the plot, not the analysis.

This dataset seems to contain something between 50-100 samples, with this number of samples there is statistical power even for genes with low counts, so unless you are definitely not interested in low counts genes, i would keep them.

ADD REPLYlink written 21 months ago by h.mon28k

Hi h.mon,

We are trying to find genes that are significantly differentially expressed to identify as signature genes so we need set of genes that show distinct expressional difference between Mutant vs WT.

There is 676 patients in that heatmap:

dim(assay(dds)
[1] 20502   676

This is how I plot the heatmap. I used both normTrnasform and vsd normalization but it did not change the heatmap:

select<- rownames(subset(res, padj < 0.001 & log2FoldChange >1))
ntd <- normTransform(dds)
df <- as.data.frame(colData(dds)[,c("mutvswt","IHC_HER2")])

pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE, annotation_col=df)

NTD Heatmap

vsd<- varianceStabilizingTransformation(dds, blind= FALSE)
df <- as.data.frame(colData(dds)[,c("mutvswt","IHC_HER2")])
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE, annotation_col=df)

VSD Heatmap

ADD REPLYlink modified 21 months ago • written 21 months ago by gokce.ouz50

Try this:

select<- rownames( subset( res, padj < 0.001 & abs( log2FoldChange >1 ) ) )
ADD REPLYlink written 21 months ago by h.mon28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1229 users visited in the last hour