Salmon annotation of transcripts in quasi-mapping step
4.2 years ago
Iara Souza ▴ 10

I just started working with pseudoalignment tools and am now trying salmon. I am trying to obtain a transcript count table from RNA-seq samples.

I built the index with the transcriptome FASTA provided by Ensembl, using the following command:

salmon index -t ensembl_human.fa -i ensembl_human

Then I ran the quantification step, for example:

salmon quant -p 12 -i ensembl_human --gcBias -o sample -1 sample_1.fa.gz -2 sample_2.fa.gz

After obtaining the count table, I noticed that it contains transcripts (with Ensembl transcript IDs) that are not present in the GTF file from the same source and version as the transcriptome. Many of those transcripts are already annotated by Ensembl and show up in the Ensembl database with a quick query on their website. I may be missing something very obvious here, but I don't understand how salmon annotates those transcripts if this information is not in the GTF. I am worried because about 6k transcripts appear only in the count table and not in the GTF.

I would appreciate it if someone could clarify this issue for me.
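For anyone wanting to list exactly which quantified transcripts are missing from the GTF, here is a minimal Python sketch. It assumes the count table is salmon's `quant.sf` (tab-separated, with a `Name` column holding the first token of each FASTA header) and that the GTF attributes contain `transcript_id "..."`. The function names are made up for this example, and stripping everything after the first dot is an assumption that suits Ensembl IDs, where the GTF stores the version separately.

```python
import csv
import gzip
import re

# Matches the transcript_id attribute in the last GTF column.
GTF_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def strip_version(transcript_id):
    """Drop the version suffix, e.g. ENST00000632828.1 -> ENST00000632828."""
    return transcript_id.split(".")[0]


def quant_ids(quant_sf):
    """Unversioned transcript IDs from the Name column of a quant.sf file."""
    with open_text(quant_sf) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {strip_version(row["Name"]) for row in reader}


def gtf_ids(gtf):
    """Unversioned transcript IDs found in a GTF file (comment lines skipped)."""
    ids = set()
    with open_text(gtf) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            match = GTF_TRANSCRIPT_ID.search(line)
            if match:
                ids.add(strip_version(match.group(1)))
    return ids


def missing_from_gtf(quant_sf, gtf):
    """Sorted IDs that were quantified but have no record in the GTF."""
    return sorted(quant_ids(quant_sf) - gtf_ids(gtf))
```

With the commands above, a call would look like `missing_from_gtf("sample/quant.sf", "Homo_sapiens.GRCh38.99.gtf.gz")`, where the GTF file name is whatever was downloaded.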

salmon pseudoalignment mapping • 1.8k views

Thank you for helping me understand how those tools work. However, in the past few days I repeated the same process with the new version of the reference transcriptome (GRCh38, Ensembl release 99). After the quantification, I still see transcripts in the transcriptome that are not present in the GTF file.

For example, the transcript "ENST00000632828" is present in the transcriptome file, but it is not present in the GTF file of the same version.

Transcriptome link: ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

GTF link: ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz

I still found thousands of such cases.

I had assumed that every transcript in the transcriptome would be annotated in the GTF file.

Did you notice that ENST00000632828 is located on a funky chromosome, i.e. not one of the standard chromosomes? That might be the source of the discrepancy.
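One way to check this for every transcript at once is to read the location out of the FASTA headers. The sketch below assumes Ensembl-style cDNA headers, where the third whitespace-separated field looks like `chromosome:GRCh38:<region>:<start>:<end>:<strand>`. The function names and the `STANDARD` set (1-22, X, Y, MT) are choices made for this example. It only reports where a transcript sits; it does not tell you whether a given GTF covers that region.

```python
import gzip

# Regions treated as "standard" here; an assumption for this sketch.
STANDARD = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def header_region(header):
    """Return the region name from one FASTA header line, or None if absent."""
    fields = header.lstrip(">").split()
    if len(fields) < 3:
        return None
    location = fields[2].split(":")
    return location[2] if len(location) >= 3 else None


def off_chromosome_transcripts(fasta):
    """Map transcript ID -> region for records not on a standard chromosome."""
    opener = gzip.open if str(fasta).endswith(".gz") else open
    found = {}
    with opener(fasta, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            region = header_region(line)
            if region is not None and region not in STANDARD:
                # The ID is the first token of the header, version included.
                found[line[1:].split()[0]] = region
    return found
```

Running `off_chromosome_transcripts("Homo_sapiens.GRCh38.cdna.all.fa.gz")` and comparing its keys with the IDs missing from the GTF would show how much of the gap this explains.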

I didn't. Thank you!

@Iara this has been known and discussed for some time. See this twitter link.

That was helpful! Thanks!

4.2 years ago

Salmon doesn't annotate anything and never sees a GTF file. Those sequences are present in the fasta file you gave to salmon, so it's quantifying them. It sounds like you downloaded GTF and transcriptome fasta files from different Ensembl releases.
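Since salmon quantifies whatever is in the FASTA, one way to make the count table line up with a particular GTF is to subset the FASTA before indexing. This is not something proposed in the thread, just a minimal sketch of that idea. `filter_fasta` and `keep_ids` are hypothetical names, and IDs are compared without their version suffix, which assumes Ensembl-style `ENST...` identifiers.

```python
import gzip


def filter_fasta(fasta_in, fasta_out, keep_ids):
    """Copy records whose unversioned ID is in keep_ids; return how many were kept."""
    opener = gzip.open if str(fasta_in).endswith(".gz") else open
    kept = 0
    writing = False
    with opener(fasta_in, "rt") as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                tokens = line[1:].split()
                # First header token is the ID; drop the ".N" version suffix.
                record_id = tokens[0].split(".")[0] if tokens else ""
                writing = record_id in keep_ids
                if writing:
                    kept += 1
            if writing:
                # Header and sequence lines of a kept record are copied as-is.
                dst.write(line)
    return kept
```

Here `keep_ids` would be the set of transcript IDs taken from the GTF, and the filtered file would then be passed to `salmon index -t` in place of the full transcriptome.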
