Gene types to use for RNA counts case-control
Entering edit mode
7 weeks ago
Kermit ▴ 80

The RNA-seq gene count TPM files that I have contain 60,000 genes. Below is a breakdown of their types.

In a case-control analysis, should I use either: (a) only the protein coding genes or (b) all of the gene types?

gene_type                           | count
protein_coding                        19962
lncRNA                                16901
processed_pseudogene                  10167
unprocessed_pseudogene                2614 
misc_RNA                              2212 
snRNA                                 1901 
miRNA                                 1881 
TEC                                   1057 
snoRNA                                943  
transcribed_unprocessed_pseudogene    939  
transcribed_processed_pseudogene      500  
rRNA_pseudogene                       497  
IG_V_pseudogene                       187  
IG_V_gene                             145  
transcribed_unitary_pseudogene        138  
TR_V_gene                             106  
unitary_pseudogene                    98   
TR_J_gene                             79   
scaRNA                                49   
polymorphic_pseudogene                48   
rRNA                                  47   
IG_D_gene                             37   
TR_V_pseudogene                       33   
Mt_tRNA                               22   
IG_J_gene                             18   
pseudogene                            18   
IG_C_gene                             14   
IG_C_pseudogene                       9    
ribozyme                              8    
TR_C_gene                             6    
sRNA                                  5    
TR_J_pseudogene                       4    
TR_D_gene                             4    
IG_J_pseudogene                       3    
Mt_rRNA                               2    
translated_processed_pseudogene       2    
vault_RNA                             1    
translated_unprocessed_pseudogene     1    
scRNA                                 1    
IG_pseudogene                         1
rna rnaseq tpm • 440 views
Entering edit mode
7 weeks ago

The correct yet complicated answer is that you should use only the features that have produced RNA in your data.

Of course, this also implies that you should remove transcripts that did not express in any of the conditions as those do not add information in any way.

But then figuring out what is not-expressed is only possible after attempting to classify against everything. So it is a chicken and egg problem of sorts.

To solve the conundrum, people usually start with known transcripts then filter out those transcripts that show no data in any of the conditions.

Entering edit mode

I didn't want to bias initial responses, but I saw this paper that said:

"For most RNA-Seq sequencing projects, only mRNAs are presumably enriched and sequenced, and there is no point in mapping sequence reads to RNAs such as miRNAs or lincRNAs.",correspond%20to%20protein%20coding%20genes.

Entering edit mode

Most RNAseq library prep protocols won't capture RNA shorter than 200nt. This would immediately mean that it shouldn't capture miRNA, snRNA or snoRNA (although you do sometimes get reads mapping to snRNA or snoRNA). Or some subsets of transcripts in each of the the rest of the categories.

Many RNA-seq experiments also enrich for poly-adenylated transcripts, so you should see few reads from transcripts that are no poly-adenylated. Most protein-coding genes are poly-adenylated (although there are a subset of histone genes that use a terminal triplex formation rather than a poly-A tail). miRNA, snoRNA, snRNA and rRNAs are definately not poly-adenylated. But for the rest of the categories - some are, some arn't. The link you posted is making the common, but incorrect assertion that lncRNAs are not poly-adenylated, which is incorrect. Some lncRNAs are poly-adenylated, others aren't.

In general, the only reason not to include genes in an analysis is the extra multiple testing burden in imposes: you are doing 60,000 tests rather than 20,000. In practice, I'm not sure how much difference that has every made to the overall conclusions of analyses i've done.

Entering edit mode

I am in the middle of writing a tutorial on this situation, where I observed upon analysis the number of differentially expressed genes changes by about 30%. It goes from 150 to about 200 after removing low expression genes.

I suspect that the effect of filtering is hard to predict when it comes to FDR since producing an FDR relies on the distribution of the p-values. Those p-values could vary more substantially for lowly expressed genes because the data is less reliable.

Entering edit mode

DESeq2 automatically removes low expression genes, no?

And it does that by picking the the expression threshold that maximises the number of genes passing the FDR threshold.

Entering edit mode

thank you! wow, so RNAseq is still relatively unexplored.

I suppose all "no poly-adenylated" transcripts would at least comparable to each other in that they are all unenriched. i normalize/scale all my features so that should help reduce the bias of poly-adenylation.

Entering edit mode

Its not that we don't know which transcripts are poly-adenylated and which arn't - its just that that information is not encoded in the gene/transcript biotype.


Login before adding your answer.

Traffic: 1951 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6