Question

Picking the right genome annotation for RNA-seq

6.0 years ago

Thomas ▴ 30

Hi Everyone,

I am aligning raw RNA-seq data from mouse samples for downstream analysis of differentially expressed genes. I would like to do this entirely based on UCSC data, to avoid mix-ups in the nomenclature and to be able to view bedGraph files in the UCSC Genome Browser later on. I used HISAT2 for the alignment, with default parameters and the mm10 genome index from HISAT2.

So here is my question: which genome annotation should I use for assembling/quantifying the expressed genes (StringTie)?

I have retrieved the RefSeq (all) data, as well as the UCSC Genes (knownGene) data, as GTF files from the UCSC Table Browser, but I get extremely high rates of ambiguous gene mappings with both of these (55-65%, based on analysis with QoRTs). I have previously gone through the same process using the ENSEMBL GRCm38 genome indexed by HISAT2 and the corresponding GRCm38.91 annotation provided by the ENSEMBL website, which yielded only around 10% ambiguous mappings.

Can anyone tell me which annotation people generally use for UCSC genes? Or is this an issue with the UCSC genome index from HISAT2?

I'd appreciate any feedback on this.

Thank you!

Thomas

RNA-Seq gtf annotation • 3.0k views

When you say:

"I have previously gone through the same process using the ENSEMBL GRCm38 genome indexed by HISAT2 and the corresponding GRCm38.91 annotation provided by the ENSEMBL website"

do you mean the same dataset, or a different one? I ask because the problem might be your data, not the annotations. Did you run this problematic dataset with the Ensembl genome + annotation?

6.0 years ago by h.mon 35k

Sorry, I should have been more explicit there. Yes, I have run the same data set with Ensembl and got around 90% unique gene assignments. I should also mention that the actual alignment rate was virtually identical between the Ensembl and UCSC HISAT2 alignments, ranging from 75% to 88%. So I think the problem is probably with the annotation rather than the data set or the alignment, but I'm not sure.

6.0 years ago by Thomas ▴ 30

More questions, just to rule out some other possibilities: are you using a stranded library prep protocol? When running QoRTs, are you passing the --stranded flag?

"I have retrieved the RefSeq (all) data, as well as the UCSC Genes (knownGene) data as GTF files from the UCSC Table Browser"

Could you paste some lines of your UCSC GTF here (and have you sorted it by position)? Maybe it has redundant annotations, i.e. the same gene under different names. How did you create the GTF from UCSC?
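One quick way to check for that kind of redundancy is to compare the gene_id and transcript_id attributes in the GTF. The sketch below is only an illustration: the function name gene_id_summary and the file name in the usage comment are made up, and it assumes tab-separated GTF lines with the usual `key "value";` attribute syntax. If nearly every gene_id is identical to its transcript_id, each transcript is effectively being treated as its own gene, and overlapping isoforms of one gene would then be expected to show up as ambiguous gene assignments.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def gene_id_summary(gtf_lines):
    """Count transcripts, distinct gene_ids, and transcripts whose
    gene_id is identical to their transcript_id."""
    tx_to_gene = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            tx_to_gene[attrs["transcript_id"]] = attrs["gene_id"]
    same = sum(1 for tx, gene in tx_to_gene.items() if tx == gene)
    return {
        "transcripts": len(tx_to_gene),
        "genes": len(set(tx_to_gene.values())),
        "gene_id_equals_transcript_id": same,
    }


# Hypothetical usage (file name is an assumption):
# with open("knownGene.gtf") as handle:
#     print(gene_id_summary(handle))
```

A file with a sensible gene-level grouping should report far fewer genes than transcripts, and few or no transcripts whose gene_id equals the transcript_id.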

I get updated annotations from UCSC using the procedures from this GenomeWiki page. In general, I follow the "Use genePredToGtf with a downloaded genePred table" instructions. I have never had this problem of ambiguous reads, but the organisms I work with have more sparsely (meaning worse) annotated genomes.
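If regenerating the file is not convenient, a rough alternative (not the GenomeWiki procedure, just a sketch) is to rewrite the gene_id attribute of an existing GTF from a transcript-to-gene table. Everything named here is an assumption: the function rewrite_gene_ids is made up, and the mapping is assumed to have been loaded separately, for example from a table that pairs transcript IDs with gene symbols, such as UCSC's kgXref for knownGene.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]*)";')
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]*)";')


def rewrite_gene_ids(gtf_lines, tx_to_gene):
    """Yield GTF lines with gene_id replaced by the mapped gene name.

    tx_to_gene is a dict {transcript_id: gene_name}; lines whose
    transcript is not in the mapping are passed through unchanged.
    """
    for line in gtf_lines:
        match = TRANSCRIPT_ID_RE.search(line)
        if match and match.group(1) in tx_to_gene:
            gene = tx_to_gene[match.group(1)]
            # Replace only the first gene_id attribute on the line.
            line = GENE_ID_RE.sub(lambda _m: f'gene_id "{gene}";', line, count=1)
        yield line
```

Transcripts missing from the mapping keep their original gene_id, so it is worth counting how many lines were left unchanged before trusting the result.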

6.0 years ago by h.mon 35k

score 1 · Answer 1 · 2018-05-07

6.0 years ago

lakhujanivijay 5.8k

I have never had a hiccup with Ensembl data. Well-documented, easy-to-access information is available on this page.

Yes, the Ensembl annotations have worked well, but I can't use them to view the read counts as histograms in the UCSC Genome Browser, due to the formatting differences. Unless there is a way to do this? :-)
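
One possible way, sketched here under some assumptions: mm10 and GRCm38 are the same assembly, so the coordinates agree and the main difference for the primary chromosomes is the sequence name (Ensembl uses 1, X, MT where UCSC uses chr1, chrX, chrM). A bedGraph made from the Ensembl alignment could therefore be renamed before upload. The function names below are made up, the input is assumed to be tab-separated, and unplaced scaffolds are simply dropped unless an explicit mapping is supplied, because their UCSC names do not follow a simple prefix rule.

```python
def ensembl_to_ucsc_chrom(name, extra=None):
    """Map an Ensembl-style mouse sequence name to a UCSC-style name.

    Returns None when no mapping is known. `extra` is an optional dict
    of additional names (e.g. scaffolds) supplied by the caller.
    """
    if extra and name in extra:
        return extra[name]
    if name == "MT":
        return "chrM"
    # Mouse has autosomes 1-19 plus X and Y.
    if name in {"X", "Y"} or (name.isdigit() and 1 <= int(name) <= 19):
        return "chr" + name
    return None


def convert_bedgraph(lines, extra=None):
    """Yield bedGraph lines with the first column renamed.

    Header lines are passed through; data lines on sequences without a
    known mapping are dropped.
    """
    for line in lines:
        if line.startswith(("track", "browser", "#")):
            yield line
            continue
        fields = line.split("\t")
        chrom = ensembl_to_ucsc_chrom(fields[0], extra)
        if chrom is not None:
            fields[0] = chrom
            yield "\t".join(fields)
```

Dropping unmapped scaffolds loses whatever coverage falls on them, which is usually a small fraction of reads but worth checking for the data set at hand.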