I see that many RNA-seq libraries are prepared by polyA selection. When I analyze differential gene expression, which gene annotation database is better to use? As far as I know, UCSC, RefSeq, GENCODE, etc. each provide a list of RNAs. But if I am comparing a polyA-selected library with a non-polyA-selected library, I wonder whether it is better to use a list of polyadenylated RNAs as the annotation file. Thanks.
Yes, many RNA-seq libraries are created by first performing a polyA+ selection of total RNA. This has the effect of enriching for transcripts that are polyadenylated and therefore assumed to be enriched for mRNAs. Remember that after transcription of an immature RNA by RNA polymerase and processing by the splicing machinery, transcripts are polyadenylated and exported from the nucleus to the cytoplasm before translation of proteins from the mRNA template can occur.
Total RNA is 95-98% ribosomal RNA (rRNA), and rRNAs are NOT polyadenylated. Because RNA sequencing samples fragments at random, sequencing total RNA directly would mean sequencing mostly rRNAs to incredible depth and obtaining sequence from almost nothing else. PolyA selection is one method of preventing that situation.
It is fairly common to consider the polyA+ genes to be the same set as the protein coding mRNAs. For a variety of reasons, people doing RNA-seq analysis sometimes focus on the subset of genes that are protein coding. In Ensembl you can obtain this set of genes by identifying those that have a biotype of 'protein_coding'.
For example, you can use Ensembl BioMart: after selecting the database and species, set the filter Gene type -> protein_coding.
At the time of writing, 22719 of 62252 human genes in the latest version of Ensembl are annotated as protein coding. With perhaps a few exceptions, all of these should be polyadenylated.
You can also obtain GTF files from the Ensembl FTP server. Within these files, attributes such as 'gene_biotype' are again defined, so you can easily limit the annotation to particular types of genes such as 'protein_coding', 'miRNA', 'lincRNA', etc. To do that you will need to understand how GTF files work.
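As a rough illustration of the GTF filtering described above, here is a minimal Python sketch. It assumes the usual tab-separated GTF layout with key "value" pairs in the ninth column. The function names and the example records are made up for illustration and are not taken from a real Ensembl file. Ensembl GTFs use the attribute 'gene_biotype', while GENCODE GTFs call it 'gene_type', so the attribute name is left as a parameter.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(attr_field))


def filter_gtf_by_biotype(lines, biotypes=("protein_coding",), attr_key="gene_biotype"):
    """Yield header lines plus the records whose biotype is in `biotypes`."""
    wanted = set(biotypes)
    for line in lines:
        if line.startswith("#"):
            # Keep header/comment lines so the output is still a valid GTF.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        if parse_attributes(fields[8]).get(attr_key) in wanted:
            yield line


# Hypothetical records, for illustration only.
example = [
    "#!example header\n",
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "A"; gene_biotype "lincRNA";\n',
    '1\tsrc\tgene\t2000\t5000\t.\t-\t.\tgene_id "G2"; gene_name "B"; gene_biotype "protein_coding";\n',
]
kept = list(filter_gtf_by_biotype(example))
```

For a real annotation you would pass an open file handle instead of the list (use gzip.open(path, "rt") if the file is compressed) and write the yielded lines to a new GTF.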
For reference to some of the terms above refer to the following diagram: