Question

Enough counts for RNA-Seq differential expression analysis

0

Entering edit mode

7.2 years ago

skhan ▴ 10

I have 2x75b TruSeq RNA-Seq data (paired end, stranded) collected on an Illumina instrument, aligned with STAR, and counted with htseq-count (which agreed with STAR's --quantMode GeneCounts option). These are rat samples (2 conditions, three biological replicates), which has ~32,754 genes. For one of my samples, here is a binned list of raw (htseq-count) counts by gene:

Counts             Genes
0                 19,136
1-10               3,699
10-100             3,722
100-1,000          3,784
1,000-10,000       2,089
10,000-100,000       309
100,000-1,000,000     15
1,000,000+             0

Total:
20,399,575        32,754

As you can see, only ~2,400 genes have counts of 1,000 or more. Do I have enough data to perform differential expression analysis with confidence? What counts cutoff, if any, do you use for DE analysis?

As an example, for one specific gene across six samples (first three control, next three test condition) I have counts of 4, 8, 6, 53, 78, 216 and, after normalization, an adjusted p value of 1.28E-06 indicating differential expression at a log2FC of 3.703.

RNA-Seq htseq-count • 2.3k views

ADD COMMENT • link 7.2 years ago by skhan ▴ 10

2

Entering edit mode

Can you tell us how many reads are in each sample? What was the alignment percentage?

Generally for eukaryotic data 25-30 M reads per sample are enough for doing DE analysis.

ADD REPLY • link 7.2 years ago by GenoMax 152k

0

Entering edit mode

I have nine samples covering three conditions. Here are the raw counts from htseq-count, as well as normalized counts from SARTools' DESeq2 pipeline:

Sample     STAR Alignment %        Raw_Counts   Normalized_Counts
        Unique  Multiple    Sum                               
  c1     90.64    8.17     98.81   20,399,575       39,052,552
  c2     90.52    8.72     99.24   25,728,157       18,218,156
  c3     71.63    6.53     78.16   20,115,612       22,051,132
  h1     90.27    8.79     99.06   22,376,562       34,936,802
  h2     90.90    8.20     99.10   19,079,631       17,365,524
  h3     91.52    7.69     99.21   31,587,549       14,407,188
  l1     90.05    8.92     98.97   28,416,462       30,317,636
  l2     90.77    8.43     99.20   35,559,866       19,150,777
  l3     90.69    8.58     99.27   14,865,923       27,040,614

The details in my original post is for sample c1.

ADD REPLY • link 7.2 years ago by skhan ▴ 10

1

Entering edit mode

Raw_Counts = Aligned reads or Aligned Fragments? (does that number include multi-mappers which are about 9% for most)?

You have enough data to try the DE analysis.

ADD REPLY • link 7.2 years ago by GenoMax 152k

0

Entering edit mode

Here is a more detailed summary:

                      STAR alignment values                                                          htseq-count values
         -------------------------------------------------      --------------------------------------------------------------------------------------------------------            
Sample   Input reads    Uniquely mapped    Reads mapped to         Reads       __no_feature    __ambiguous    __too_low_aQual    __not_aligned    __alignment_not_unique
                             reads          multiple loci       (Raw_Counts)                                                                                                      

  c1      37,435,898       33,931,305         3,060,283          20,399,575      13,498,699        27,517           5,514             444,310             7,026,535 
  c2      44,015,057       39,843,026         3,838,719          25,728,157      14,060,510        52,531           1,828             333,312             9,084,619 
  c3      48,208,727       34,534,214         3,149,229          20,115,612      14,380,428        36,582           1,592          10,525,284             7,622,347 
  h1      38,138,800       34,427,481         3,350,713          22,376,562      12,012,882        34,054           3,983             360,606             7,664,658 
  h2      35,886,906       32,622,424         2,941,698          19,079,631      13,502,657        38,435           1,701             322,784             6,999,292 
  h3      51,374,736       47,020,081         3,953,013          31,587,549      15,360,981        68,648           2,903             401,642             9,513,155 
  l1      48,526,339       43,695,810         4,326,623          28,416,462      15,224,382        49,556           5,410             503,906             9,910,686 
  l2      62,271,697       56,523,433         5,247,441          35,559,866      20,888,404        72,072           3,091             500,823            12,373,703 
  l3      27,158,846       24,629,495         2,331,214          14,865,923       9,737,646        24,407           1,519             198,137             5,388,878

If I have done what I think I have done (I used htseq-count's --mode intersection-strict), then Raw_Counts = Aligned reads and does not include multi mappers (multi mappers would be in the --ambiguous column).

ADD REPLY • link 7.2 years ago by skhan ▴ 10

1

Entering edit mode

not directly related to your question but I would think there are more (protein coding) exons in the rat genome. How do you get to that number?

ADD REPLY • link 7.2 years ago by lieven.sterck 15k

0

Entering edit mode

Here's a quick check I just performed:

cd ${preferred_directory}
wget ftp://ftp.ensembl.org/pub/release-86/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.86.gtf.gz
awk '$3 == "gene"' <(gzip -dc Rattus_norvegicus.Rnor_6.0.86.gtf.gz) | wc -l
# output is 32754

32,754 refers to genes I am looking at, not exons. I've adjusted my question.

ADD REPLY • link 7.2 years ago by skhan ▴ 10

0

Entering edit mode

OK, that sounds more like it indeed. thx

ADD REPLY • link 7.2 years ago by lieven.sterck 15k