I have just started working with pseudoalignment tools and am now trying salmon. I want to obtain a transcript count table from RNA-seq samples.
I built the index from the transcriptome FASTA provided by Ensembl, using the following command:
salmon index -t ensembl_human.fa -i ensembl_human
Then I ran the quantification step, for example:
salmon quant -p 12 -i ensembl_human --gcBias -o sample -1 sample_1.fa.gz -2 sample_2.fa.gz
After obtaining the count table, I noticed that it contains transcripts (with Ensembl transcript IDs) that are not present in the GTF file from the same source and version as the transcriptome. Many of those transcripts are already annotated by Ensembl and show up in the Ensembl database if you run a quick query on their website. I may be missing something very obvious here, but I don't understand how salmon annotates those transcripts if this information is not present in the GTF. I am worried because about 6k transcripts appear only in the count table and not in the GTF.
I would appreciate it if someone could clarify this issue for me.
Thank you for helping me understand how those tools work. However, in the past few days I have repeated the same process with the new version of the reference transcriptome (GRCh38/hg38, Ensembl release 99). After the quantification, I still see transcripts in the transcriptome that are not present in the GTF file.
For example, the transcript "ENST00000632828" is present in the transcriptome file, but it is not present in the GTF file of the same version.
Transcriptome link: ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
GTF link: ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
I have still found thousands of such cases.
I was assuming that every transcript in the transcriptome would be annotated in the GTF file.
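For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two ID sets could be compared. It assumes that the transcript ID is the first whitespace-delimited token of each FASTA header (possibly with a version suffix such as ".1") and that the GTF stores it in a transcript_id "..." attribute. The function names are made up for illustration and are not part of salmon or any Ensembl tool.

```python
import gzip
import re

# Matches the transcript_id attribute in the 9th GTF column.
GTF_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_transcript_ids(path):
    """Collect transcript IDs from FASTA headers, dropping any version suffix."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    # e.g. "ENST00000632828.1" -> "ENST00000632828"
                    ids.add(tokens[0].split(".")[0])
    return ids


def gtf_transcript_ids(path):
    """Collect transcript IDs from the transcript_id attribute of a GTF file."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            match = GTF_TRANSCRIPT_ID.search(line)
            if match:
                ids.add(match.group(1).split(".")[0])
    return ids


def fasta_only_ids(fasta_path, gtf_path):
    """Return transcript IDs present in the FASTA but absent from the GTF."""
    return fasta_transcript_ids(fasta_path) - gtf_transcript_ids(gtf_path)
```

Calling fasta_only_ids("Homo_sapiens.GRCh38.cdna.all.fa.gz", "Homo_sapiens.GRCh38.99.gtf.gz") on the two files linked above should return the set of transcripts in question. Versions are stripped on both sides so that "ENST00000632828.1" and "ENST00000632828" compare as equal.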
Did you notice that ENST00000632828 is located on a funky chromosome? That might be the source of the discrepancy.
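One way to check this for all the affected transcripts at once is to tally the sequence region named in each FASTA header. The sketch below assumes that Ensembl cDNA headers carry a location field of the form type:assembly:name:start:end:strand (for example chromosome:GRCh38:1:100:200:1). The helper names are hypothetical, and wanted_ids stands for whatever set of FASTA-only transcript IDs you have already collected.

```python
import gzip
from collections import Counter


def region_of(header):
    """Return the sequence-region name from a FASTA header line, or None."""
    # Look for a field shaped like type:assembly:name:start:end:strand.
    for field in header.split()[1:]:
        parts = field.split(":")
        if len(parts) == 6 and parts[3].isdigit() and parts[4].isdigit():
            return parts[2]
    return None


def count_regions(fasta_path, wanted_ids):
    """Count, per sequence region, the FASTA transcripts whose ID is in wanted_ids."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    counts = Counter()
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            header = line[1:].strip()
            if not header:
                continue
            # Drop the version suffix, e.g. "ENST00000632828.1" -> "ENST00000632828"
            transcript_id = header.split()[0].split(".")[0]
            if transcript_id in wanted_ids:
                counts[region_of(header) or "unknown"] += 1
    return counts
```

If most of the counts land on regions other than the usual chromosome names (1-22, X, Y, MT), that would support the explanation above.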
I didn't. Thank you!
@lara this has been known and discussed for some time. See this Twitter link.
That was helpful! Thanks!