Gene types to use for RNA counts case-control
1
0
Entering edit mode
7 weeks ago
Kermit ▴ 80

The RNA-seq gene count TPM files that I have contain 60,000 genes. Below is a breakdown of their types.

In a case-control analysis, should I use either: (a) only the protein coding genes or (b) all of the gene types?

gene_type                           | count
------------------------------------+----------------
protein_coding                        19962
lncRNA                                16901
processed_pseudogene                  10167
unprocessed_pseudogene                2614
misc_RNA                              2212
snRNA                                 1901
miRNA                                 1881
TEC                                   1057
snoRNA                                943
transcribed_unprocessed_pseudogene    939
transcribed_processed_pseudogene      500
rRNA_pseudogene                       497
IG_V_pseudogene                       187
IG_V_gene                             145
transcribed_unitary_pseudogene        138
TR_V_gene                             106
unitary_pseudogene                    98
TR_J_gene                             79
scaRNA                                49
polymorphic_pseudogene                48
rRNA                                  47
IG_D_gene                             37
TR_V_pseudogene                       33
Mt_tRNA                               22
IG_J_gene                             18
pseudogene                            18
IG_C_gene                             14
IG_C_pseudogene                       9
ribozyme                              8
TR_C_gene                             6
sRNA                                  5
TR_J_pseudogene                       4
TR_D_gene                             4
IG_J_pseudogene                       3
Mt_rRNA                               2
translated_processed_pseudogene       2
vault_RNA                             1
translated_unprocessed_pseudogene     1
scRNA                                 1
IG_pseudogene                         1

rna rnaseq tpm • 440 views
0
Entering edit mode
7 weeks ago

The correct yet complicated answer is that you should use only the features that have produced RNA in your data.

Of course, this also implies that you should remove transcripts that did not express in any of the conditions as those do not add information in any way.

But then figuring out what is not-expressed is only possible after attempting to classify against everything. So it is a chicken and egg problem of sorts.

To solve the conundrum, people usually start with known transcripts then filter out those transcripts that show no data in any of the conditions.

0
Entering edit mode

I didn't want to bias initial responses, but I saw this paper that said:

"For most RNA-Seq sequencing projects, only mRNAs are presumably enriched and sequenced, and there is no point in mapping sequence reads to RNAs such as miRNAs or lincRNAs."

1
Entering edit mode

Most RNAseq library prep protocols won't capture RNA shorter than 200nt. This would immediately mean that it shouldn't capture miRNA, snRNA or snoRNA (although you do sometimes get reads mapping to snRNA or snoRNA). Or some subsets of transcripts in each of the the rest of the categories.

Many RNA-seq experiments also enrich for poly-adenylated transcripts, so you should see few reads from transcripts that are no poly-adenylated. Most protein-coding genes are poly-adenylated (although there are a subset of histone genes that use a terminal triplex formation rather than a poly-A tail). miRNA, snoRNA, snRNA and rRNAs are definately not poly-adenylated. But for the rest of the categories - some are, some arn't. The link you posted is making the common, but incorrect assertion that lncRNAs are not poly-adenylated, which is incorrect. Some lncRNAs are poly-adenylated, others aren't.

In general, the only reason not to include genes in an analysis is the extra multiple testing burden in imposes: you are doing 60,000 tests rather than 20,000. In practice, I'm not sure how much difference that has every made to the overall conclusions of analyses i've done.

1
Entering edit mode

I am in the middle of writing a tutorial on this situation, where I observed upon analysis the number of differentially expressed genes changes by about 30%. It goes from 150 to about 200 after removing low expression genes.

I suspect that the effect of filtering is hard to predict when it comes to FDR since producing an FDR relies on the distribution of the p-values. Those p-values could vary more substantially for lowly expressed genes because the data is less reliable.

0
Entering edit mode

DESeq2 automatically removes low expression genes, no?

And it does that by picking the the expression threshold that maximises the number of genes passing the FDR threshold.

0
Entering edit mode

thank you! wow, so RNAseq is still relatively unexplored.

I suppose all "no poly-adenylated" transcripts would at least comparable to each other in that they are all unenriched. i normalize/scale all my features so that should help reduce the bias of poly-adenylation.

0
Entering edit mode

Its not that we don't know which transcripts are poly-adenylated and which arn't - its just that that information is not encoded in the gene/transcript biotype.

Traffic: 1951 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.