Question

Need help understanding reference transcriptome and where to download

15 months ago

Daniel ▴ 30

Hello,

Apologies for a pretty elementary question. I tried my best to answer it using resources online but I find many tutorials/explanations out there difficult to understand.

I am trying to quantify human RNA-seq data using Salmon. I am using Salmon because I would like to perform RNA isoform quantification. I am using Salmon's mapping-based mode (building an index and then quantifying).

I have already built an index and quantified samples, only to find that my quant.sf files are 639 rows long. I checked everything and noticed that the log file said:

[2023-01-23 11:44:52.070] [jointLog] [info] Index contained 639 targets.

Q1: Does this mean that my index contained only 639 transcripts in total?

Q2: Did this occur because I accidentally used GRCh38.p13.genome.fa (from GENCODE) instead of gencode.v42.transcripts.fa, or would this not be the reason?

I am currently re-running it with the latter file, but am not confident it will work. I am not sure if what I found is correctly "the reference transcriptome".

Q3: If not, could anyone advise where to download this file, and what exactly the file contains?

Q4: Could this file (or a suggested alternative) be used for finding transcripts from non-protein-coding genes?

Thank you so much, and I'm sorry for all the questions. I have two more small questions, but please feel free to ignore these if you are busy:

My log file for index building (when using the former .genome.fa file) had many of the following statements:

[2023-01-20 16:32:24.228] [puff::index::jointLog] [warning] Entry with header [GL000256.2] was longer than 400000 nucleotides. This is probably a chromosome instead of a transcript.

Could anyone advise on what they mean? I couldn't find information online or in Salmon's handbook.

Thanks so so much.

salmon rnaseq index • 1.6k views

score 1 · Answer 1 · 2023-01-24

Hi,

The index for Salmon is built from a FASTA file containing transcript sequences, not chromosome sequences, so the genome FASTA file is not the correct input. The 639 targets in your log are therefore most likely the chromosome, scaffold and patch sequences in that file, each treated as a single "transcript", which is also what the length warning is pointing out. If you are working with human data, you could download pre-built indices from here. It is also worthwhile to read this again (if you did already).

If you want to build your own: on the GENCODE page for the current human release, under the section for FASTA files, the first option is the FASTA for the total set of 'known' transcripts. That is one option for a reference transcriptome FASTA.

Another option is the transcriptome FASTA provided by Ensembl. On this page, follow the link to download FASTA under 'gene annotation' and, on the FTP page, open the cDNA folder. The 'cdna.all.fa.gz' file would be the one (check the README file).
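As a quick sanity check before building an index, you can count the records in a FASTA file and flag any that are longer than the 400000-nucleotide threshold quoted in the warning. Below is a minimal Python sketch of that check. It is not part of Salmon, and the helper names (`open_text`, `fasta_lengths`, `summarize`) are hypothetical. I would expect a transcriptome FASTA to contain many thousands of records, with few if any over the threshold, whereas a genome FASTA gives a small number of very long records.

```python
import gzip
from pathlib import Path

# Threshold taken from the Salmon warning quoted in the question.
LONG_ENTRY = 400_000


def open_text(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_lengths(path):
    """Yield (record_id, sequence_length) for every record in a FASTA file."""
    name, length = None, 0
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length
                # Keep only the first whitespace-delimited token of the header.
                parts = line[1:].split()
                name, length = (parts[0] if parts else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        yield name, length


def summarize(path, threshold=LONG_ENTRY):
    """Return (number_of_records, ids_of_records_longer_than_threshold)."""
    lengths = list(fasta_lengths(path))
    too_long = [name for name, n in lengths if n > threshold]
    return len(lengths), too_long
```

For example, `n_records, too_long = summarize("gencode.v42.transcripts.fa")` gives the record count to compare against the "Index contained ... targets" line in the Salmon log, plus the list of suspiciously long entries.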

For a well-annotated genome such as human, the reference transcriptome is the set of all known transcripts. Tools like Salmon need a FASTA file of the transcript sequences. The reference transcriptome contains both protein-coding and non-coding transcripts, so you would be able to quantify non-coding transcripts too. It would also be worthwhile to go through the GENCODE FAQ to understand the terminology involved, especially transcript support level (TSL).
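If you want to confirm that non-coding transcripts are present, you could tally the transcript types recorded in the FASTA headers. The sketch below is in Python and rests on an assumption: that the headers are pipe-delimited in the GENCODE style, with the transcript type in the eighth field. Please check a few real headers from your file before relying on it. The function names and the toy headers are hypothetical and only illustrate the layout.

```python
from collections import Counter


def transcript_type(header):
    """Extract the transcript type from a pipe-delimited FASTA header.

    Assumption: GENCODE-style layout where field 8 (index 7) holds the type.
    Headers with fewer fields are reported as "unknown".
    """
    fields = header.lstrip(">").strip().split("|")
    return fields[7] if len(fields) > 7 else "unknown"


def count_types(headers):
    """Count how many headers fall into each transcript type."""
    return Counter(transcript_type(h) for h in headers)


# Toy headers (made up for illustration, not real annotation records).
toy_headers = [
    ">ENST_A.1|ENSG_A.1|-|-|GENEA-201|GENEA|1200|protein_coding|",
    ">ENST_B.1|ENSG_B.1|-|-|GENEB-201|GENEB|800|lncRNA|",
    ">ENST_C.2|ENSG_A.1|-|-|GENEA-202|GENEA|950|protein_coding|",
    ">some_other_header",
]

counts = count_types(toy_headers)
print(counts)
```

On a real file, you would collect the header lines (those starting with `>`) and pass them to `count_types`. Seeing types other than `protein_coding` in the result would indicate that non-coding transcripts are included.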