warning when creating salmon index from GENCODE transcriptome
21 months ago
vaushev ▴ 10

I was trying to create an index for the salmon tool (v1.0) using the following command:

salmon index -t gencode.v33.transcripts.fa -i index_salmon_v33


It seems to work but gives a warning:

[puff::index::jointLog] [warning] It appears that this may be a GENCODE transcriptome (from analyzing the separators in the FASTA header). However, you have not set '|' as a header separator. If this is a GENCODE transcriptome, consider passing --gencode to the pufferfish index command.

So I started wondering whether it is normal to use a GENCODE transcriptome at all, or whether it is not the recommended default. I found that this warning comes not from salmon itself but from the pufferfish module, but I don't know whether it is important or whether I should just ignore it.

salmon RNA-Seq • 1.3k views

FASTA headers are essentially free-form text with no enforced format, so it is not unusual for a tool to suggest special handling like this; it is actually a nice feature. If you want to read about how the choice of reference affects your analysis, see resources such as https://books.google.com/books?hl=en&lr=&id=TGqQDwAAQBAJ&oi=fnd&pg=PA427&ots=hJiXFaFsat&sig=7e5kz312_DPIJsYosAGAtfIIwls#v=onepage&q&f=false and https://bioinformatics.stackexchange.com/questions/21/feature-annotation-refseq-vs-ensembl-vs-gencode-whats-the-difference

21 months ago
Rob 5.0k

Salmon gives this message because the GENCODE transcripts have (for the purposes most people want to use them for) unnecessarily long and convoluted names. Specifically, each record name contains the transcript id followed by a considerable amount of metadata, with the fields separated by |. If you use these names for quantification, every quant.sf file ends up containing these absurdly long names. Because GENCODE is such a common source of transcriptomes, salmon supports special handling of GENCODE transcriptomes, in which transcript record names are truncated at the first |. This means that all quant.sf files generated via the index will contain just the transcript ids, not the long names encoded in the input FASTA file. Thus, unless you have a good reason not to, you can just pass --gencode to the index command to enable this functionality.
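For the command in the question, that would mean appending the flag: salmon index -t gencode.v33.transcripts.fa -i index_salmon_v33 --gencode. To make the truncation concrete, here is a minimal Python sketch of what "truncate at the first |" does to a record name. This is only an illustration, not salmon's actual code: the function name truncate_at_first_pipe is made up, and the header is a sample written in the GENCODE style.

```python
def truncate_at_first_pipe(header: str) -> str:
    """Return the part of a FASTA record name that precedes the first '|'.

    Hypothetical helper for illustration only; it mimics the effect
    described above and is not taken from salmon or pufferfish.
    """
    # Drop the leading '>' that marks a FASTA header line, if present.
    name = header[1:] if header.startswith(">") else header
    # Keep only the text before the first '|' (the transcript id).
    return name.split("|", 1)[0].strip()


# Sample header in the GENCODE style (assumed for illustration).
example = (
    ">ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
    "OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"
)

print(truncate_at_first_pipe(example))
```

With names shortened this way, the transcript ids in quant.sf are easier to match against the transcript_id field of an annotation file in downstream steps.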


Thank you Rob! I didn't realize I could add the --gencode option directly to the salmon call (it's not described in the salmon docs).