Question: Picking the right genome annotation for RNA-seq
Written 18 months ago by Thomas (Columbia University):

Hi Everyone,

I am aligning raw RNA-seq data from mouse samples for downstream analysis of differentially expressed genes. I would like to do this entirely based on UCSC data, to avoid mix-ups in the nomenclature and to be able to view bedGraph files in the UCSC Genome Browser later on. I used HISAT2 for the alignment, with default parameters and the prebuilt mm10 genome index provided by HISAT2.

So here is my question: which genome annotation should I use for assembling/quantifying the expressed genes (StringTie)?

I have retrieved the RefSeq (all) data, as well as the UCSC Genes (knownGene) data, as GTF files from the UCSC Table Browser, but I get extremely high rates of ambiguous gene mappings with both of these (55-65%, based on analysis with QoRTs). I have previously gone through the same process using the Ensembl GRCm38 genome index provided by HISAT2 and the corresponding GRCm38.91 annotation from the Ensembl website, which yielded only around 10% ambiguous mappings.

Can anyone tell me which annotation people generally use for UCSC genes? Or is this an issue with the UCSC genome index from HISAT2?

I'd appreciate any feedback on this.

Thank you!

Thomas

Tags: rna-seq, annotation, gtf

When you say:

I have previously gone through the same process using the ENSEMBL GRCm38 genome indexed by HISAT2 and the corresponding GRCm38.91 annotation provided by the ENSEMBL website

Do you mean the same dataset, or a different one? The problem might be your data rather than the annotations. Did you run this problematic dataset with the Ensembl genome + annotation?

Reply written 18 months ago by h.mon

Sorry, I should have been more explicit there. Yes, I have run the same data set with Ensembl and got around 90% unique gene assignments. I should also mention that the actual alignment rate was virtually identical between the Ensembl and UCSC HISAT2 alignments, ranging from 75% to 88%. So I think the problem is probably with the annotation rather than the data set or the alignment, but I'm not sure.

Reply written 18 months ago by Thomas

More questions, just to rule out some other possibilities: are you using a stranded library prep protocol? When running QoRTs, are you passing the --stranded flag?

I have retrieved the RefSeq (all) data, as well as the UCSC Genes (knownGenes) data as GTF files from the UCSC Table Browser

Could you paste some lines of your UCSC GTF here (and have you sorted it by position)? Maybe it has redundant annotations, such as the same gene under different names. How did you create the GTF from UCSC?

I get updated annotations from UCSC using the procedures from this GenomeWiki page. In general, I follow the "Use genePredToGtf with a downloaded genePred table" instructions. I have never had this problem of ambiguous reads, but the organisms I work with have more sparsely (meaning worse) annotated genomes.
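
One quick way to test the "redundant annotations" idea is to look at how gene_id and transcript_id relate in the GTF. As far as I know, GTF files exported from the Table Browser set gene_id to the transcript name, so every isoform counts as its own "gene" and overlapping isoforms of one gene look ambiguous to a counting tool. The sketch below is only a rough check, not anything taken from QoRTs or StringTie; the function name and the IDs in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf_ids(lines):
    """Return (number of gene_ids, number of gene_ids whose only
    transcript_id is identical to the gene_id itself)."""
    transcripts_per_gene = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            continue
        transcripts_per_gene[gene_id].add(transcript_id)

    n_genes = len(transcripts_per_gene)
    n_self_named = sum(
        1 for gene, tx in transcripts_per_gene.items() if tx == {gene}
    )
    return n_genes, n_self_named
```

Called as `summarize_gtf_ids(open("annotation.gtf"))` (a placeholder file name), a result where the two numbers are nearly equal would suggest that the file has no real gene-level grouping and that a GTF with proper gene IDs, for example one made with genePredToGtf, is worth trying.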

Reply written 18 months ago by h.mon (modified 18 months ago)
Answer written 18 months ago by lakhujanivijay (India):

I have never had a hiccup with Ensembl data. Well-documented, easy-to-access information is available on this page.


Yes, the Ensembl annotations have worked well, but I can't use them to view the read counts as histograms in the UCSC Genome Browser, due to the formatting differences. Unless there is a way to do this? :-)
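
If the formatting difference in question is the chromosome naming (Ensembl uses "1", "X", "MT" where UCSC uses "chr1", "chrX", "chrM"), which is an assumption here, then renaming the first column of the bedGraph may be enough, since GRCm38 and mm10 are the same assembly. A minimal sketch, with hypothetical function names, that handles only the primary mouse chromosomes and drops unplaced scaffolds:

```python
# Primary mouse chromosomes in Ensembl naming: 1-19, X, Y (MT handled below).
PRIMARY = {str(i) for i in range(1, 20)} | {"X", "Y"}


def ensembl_to_ucsc_chrom(name):
    """Map an Ensembl-style chromosome name to UCSC style, or return None."""
    if name in PRIMARY:
        return "chr" + name
    if name == "MT":
        return "chrM"
    # Scaffolds and patches need a proper lookup table; not handled here.
    return None


def convert_bedgraph_lines(lines):
    """Yield bedGraph lines with UCSC-style chromosome names.

    Header lines (track/browser/comments) pass through unchanged;
    records on sequences that cannot be mapped are skipped.
    """
    for line in lines:
        if line.startswith(("track", "browser", "#")):
            yield line
            continue
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ucsc_name = ensembl_to_ucsc_chrom(fields[0])
        if ucsc_name is None:
            continue
        fields[0] = ucsc_name
        yield "\t".join(fields) + "\n"
```

Writing the converted lines to a new file, for example with `out.writelines(convert_bedgraph_lines(src))`, should give a track whose sequence names match what the UCSC browser expects for mm10.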

Reply written 18 months ago by Thomas

Ensembl allows you to use custom tracks just like UCSC.

Reply written 18 months ago by genomax