Question: Salmon annotation of transcripts in quasi-mapping step
Iara Souza (Brazil) wrote, 12 months ago:

I just started working with pseudoalignment tools and am now trying Salmon. I am trying to obtain a transcript count table from RNA-seq samples.

I built the index with the transcriptome FASTA provided by Ensembl, using the following command:

salmon index -t emsembl_human.fa -i ensembl_human

Then I ran the quantification step, for example:

salmon quant -p 12 -i ensembl_human --gcBias -o sample -1 sample_1.fa.gz -2 sample_2.fa.gz

After obtaining the count table, I noticed that it contains transcripts (with Ensembl transcript IDs) that are not present in the GTF file from the same source and version as the transcriptome. Many of these transcripts are already annotated by Ensembl and show up in the Ensembl database with a quick query on their website. I may be missing something very obvious here, but I don't understand how Salmon annotates those transcripts if this information is not in the GTF. I am worried because about 6k transcripts appear only in the count table and not in the GTF.

I would appreciate it if someone could clarify this issue for me.

modified 12 months ago • written 12 months ago by Iara Souza
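A discrepancy like the one described in the question can be checked directly by comparing the `Name` column of Salmon's `quant.sf` with the `transcript_id` attributes in the GTF. The following is a minimal sketch, not part of the original thread. It assumes that the FASTA headers (and therefore `quant.sf`) carry versioned IDs such as `ENST00000632828.1` while the GTF stores the bare ID, so versions are stripped before comparing. The function names are hypothetical.

```python
import re

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def strip_version(transcript_id):
    """Drop a trailing version suffix, e.g. 'ENST00000632828.1' -> 'ENST00000632828'."""
    return transcript_id.split(".", 1)[0]


def quant_ids(quant_lines):
    """Collect transcript IDs from the first column of a quant.sf file.

    The first line is assumed to be the header (Name, Length, ...).
    """
    ids = set()
    for i, line in enumerate(quant_lines):
        if i == 0 or not line.strip():
            continue
        ids.add(strip_version(line.split("\t", 1)[0].strip()))
    return ids


def gtf_ids(gtf_lines):
    """Collect transcript IDs from GTF lines, skipping '#' comment lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = TRANSCRIPT_ID_RE.search(line)
        if match:
            ids.add(strip_version(match.group(1)))
    return ids


def missing_from_gtf(quant_lines, gtf_lines):
    """Return the sorted IDs that are quantified but absent from the GTF."""
    return sorted(quant_ids(quant_lines) - gtf_ids(gtf_lines))
```

With real files, the two arguments could be `open("sample/quant.sf")` and `gzip.open("Homo_sapiens.GRCh38.99.gtf.gz", "rt")`. The path `sample/quant.sf` assumes the `-o sample` output directory from the command above.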

Thank you for helping me clarify how these tools work. However, in the past few days I repeated the same process with the new version of the reference transcriptome (GRCh38/hg38, Ensembl release 99). After quantification, I still see transcripts in the transcriptome that are not present in the GTF file.

For example, the transcript "ENST00000632828" is present in the transcriptome file, but not in the GTF file of the same release.

Transcriptome link: ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

GTF link: ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz

I have still found thousands of such cases.

I had assumed that every transcript in the transcriptome file would also be annotated in the GTF file.

written 12 months ago by Iara Souza

Did you notice that ENST00000632828 is located on a funky chromosome? That might be the source of the discrepancy.

written 12 months ago by swbarnes2 (9.6k)
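To see which sequence regions the transcripts in a cDNA FASTA sit on, the headers can be tallied. This is a rough sketch under one assumption: each header contains a location token of the form `coord_system:assembly:name:start:end:strand` (for example `chromosome:GRCh38:1:11869:14409:1`). If the file at hand uses a different header layout, the regular expression needs adjusting. The set of "primary" names and all function names here are illustrative choices, not anything defined by Ensembl or Salmon.

```python
import re
from collections import Counter

# Names treated as primary chromosomes in this sketch (an assumption).
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}

# Assumed location token: coord_system:assembly:name:start:end:strand
LOCATION_RE = re.compile(r"^[^:\s]+:[^:\s]+:([^:\s]+):\d+:\d+:-?1$")


def region_of(header):
    """Return the sequence region name from a FASTA header, or None if absent."""
    for token in header.lstrip(">").split():
        match = LOCATION_RE.match(token)
        if match:
            return match.group(1)
    return None


def tally_regions(fasta_lines):
    """Count transcripts per sequence region, reading only '>' header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            counts[region_of(line) or "unknown"] += 1
    return counts


def non_primary(counts):
    """Keep only regions outside PRIMARY (headers with no location count as 'unknown')."""
    return {region: n for region, n in counts.items() if region not in PRIMARY}
```

Running `non_primary(tally_regions(gzip.open("Homo_sapiens.GRCh38.cdna.all.fa.gz", "rt")))` would then list how many transcripts fall on each region outside the primary chromosomes, which can be compared against the regions that appear in the first column of the GTF.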

I didn't. Thank you!

written 12 months ago by Iara Souza

@Iara, this has been known and discussed for some time. See this twitter link.

modified 12 months ago • written 12 months ago by GenoMax (96k)

That was helpful! Thanks!

written 12 months ago by Iara Souza
Devon Ryan (98k; Freiburg, Germany) wrote, 12 months ago:

Salmon doesn't annotate anything and never sees a GTF file. Those sequences are present in the FASTA file you gave to Salmon, so it's quantifying them. It sounds like you downloaded the GTF and transcriptome FASTA files from different Ensembl releases.

written 12 months ago by Devon Ryan (98k)
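Because Salmon quantifies whatever sequences are in the FASTA it was indexed on, one way to keep the count table and the annotation in sync is to restrict the FASTA to annotated transcripts before running `salmon index`. The sketch below is one possible approach, not something recommended in the thread. `filter_fasta` is a hypothetical helper, and `keep_ids` is assumed to be a set of unversioned transcript IDs taken from the GTF. Whether dropping the unannotated transcripts is appropriate depends on the analysis.

```python
def filter_fasta(fasta_lines, keep_ids):
    """Yield the lines of FASTA records whose ID is in keep_ids.

    The record ID is the first whitespace-delimited token of the header,
    with any '.version' suffix removed before the lookup.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0].split(".", 1)[0] if tokens else ""
            keep = record_id in keep_ids
        if keep:
            yield line
```

The yielded lines can be written to a new file with `out.writelines(filter_fasta(handle, keep_ids))`, and that file is then passed to `salmon index -t` in place of the full transcriptome.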