GTF based salmon index file for GRCh38
2.0 years ago

I am trying to create a salmon index for GRCh38 using Gencode v35. When I ran the quantification, quant.genes.sf showed only transcript IDs, not gene IDs, even though I supplied a GTF file. Can you please tell me how to solve this issue? Is there a way to create a GTF-based salmon index for GRCh38? I am using salmon version 0.9.1.

salmon index -t /salmon/GRCh38/gencode.v35.pc_transcripts.fa -i /salmon/GRCh38/salmon_index --type quasi -k 31 --gencode

salmon quant -i /salmon/GRCh38/salmon_index -l A -1 ${FASTQ1} -2 ${FASTQ2} -o transcripts_quant -g ${GTF} --seqBias --validateMappings --useVBOpt --numBootstraps 100


Thanks Parvathi.

RNA-Seq Salmon index • 2.0k views

1) Sure, I will upgrade the tool. I was using the old version because it was already available on the HPC cluster.

2) I will use the full transcript fasta. I didn't completely understand the need for a decoy. If I am just looking for gene-level and transcript-level expression, is there any need to use one? Also, which GTF file from Gencode would be better to use: gencode.v35.annotation.gtf, gencode.v35.basic.annotation.gtf or gencode.v35.primary_assembly.annotation.gtf? When we make the salmon index, is there a way to create a GTF-based index? I read somewhere that Kallisto has an option to create a GTF-based index, but I didn't see any tutorial for salmon. Will using a decoy help in this case?

3) I will use tximport.

Thanks Parvathi


2a) The decoy strategy aims to remove false-positive alignments/quantifications. The idea, e.g. when using the whole genome as decoy, is that if a given read matches a sequence in the genome better than any sequence in the transcriptome, then it is not counted towards the transcriptome. This can account for genomic DNA contamination and random background transcription. It might be beneficial in some cases (check the recent salmon papers) but it is not strictly necessary. The main findings will probably be similar with and without decoy. It is a nice feature. I personally simply used the entire genome as decoy by appending its fasta file to the transcriptome with cat, as described in the manual, but again, this is optional. If you find it too cumbersome then skip it.
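For illustration, the cat-based approach could be sketched roughly as below. This is a minimal sketch, not a definitive recipe: the function name build_gentrome and the output file names are made up here, the inputs are assumed to be uncompressed fasta files, and the indexing command in the trailing comment assumes a recent salmon release that supports the -d/--decoys option (v0.9.1 does not).

```shell
#!/bin/sh
# Sketch: build a decoy-aware "gentrome" from a transcriptome and a genome fasta.
# build_gentrome is a hypothetical helper name; inputs are uncompressed fasta files.
build_gentrome() {
  transcripts_fa=$1   # e.g. the Gencode transcripts fasta (path is an assumption)
  genome_fa=$2        # e.g. the matching primary assembly genome fasta
  out_dir=$3

  mkdir -p "$out_dir"

  # Decoy names are the genome sequence names: the header up to the first
  # space, without the leading '>'.
  grep '^>' "$genome_fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$out_dir/decoys.txt"

  # Transcripts must come first and the decoy (genome) sequences after them.
  cat "$transcripts_fa" "$genome_fa" > "$out_dir/gentrome.fa"
}

# Possible follow-up with a recent salmon version (not run here):
#   salmon index -t out/gentrome.fa -d out/decoys.txt -i salmon_index -k 31 --gencode
```

Calling build_gentrome transcripts.fa genome.fa out would write out/decoys.txt and out/gentrome.fa, which are then passed to the indexing step.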

2b) You only need the transcriptome for the indexing, not the GTF. I am not familiar with kallisto, so I cannot comment on it. In salmon you index the transcriptome and then later use e.g. tximport to summarize the transcript-level counts to the gene level, in case you want a gene-level analysis, e.g. with DESeq2 or edgeR. If you want transcript-level differential analysis, check e.g. the swish method from the fishpond paper.
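The GTF still comes in handy for the summarization step, since tximport expects a transcript-to-gene table (its tx2gene argument). One way to derive such a table from a Gencode-style GTF is sketched below. The function name gtf_to_tx2gene is made up for this example, and the sketch assumes the usual attribute layout of gene_id "..."; transcript_id "..."; in column 9.

```shell
#!/bin/sh
# Sketch: extract a two-column transcript_id <TAB> gene_id table from a GTF.
# gtf_to_tx2gene is a hypothetical helper name, not part of salmon or tximport.
gtf_to_tx2gene() {
  awk -F '\t' '
    # Only "transcript" features carry one line per transcript.
    $3 == "transcript" {
      tx = ""; gene = ""
      # The literal prefix transcript_id " is 15 characters long; drop it
      # together with the closing quote.
      if (match($9, /transcript_id "[^"]+"/))
        tx = substr($9, RSTART + 15, RLENGTH - 16)
      # The literal prefix gene_id " is 9 characters long.
      if (match($9, /gene_id "[^"]+"/))
        gene = substr($9, RSTART + 9, RLENGTH - 10)
      if (tx != "" && gene != "") print tx "\t" gene
    }' "$1"
}

# Example call (file names are assumptions):
#   gtf_to_tx2gene gencode.v35.annotation.gtf > tx2gene.tsv
```

If the index was built with --gencode, the transcript names in quant.sf are cut at the first | and should then line up with the transcript_id values in this table, version suffix included.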


I understood. Thank you!

2.0 years ago
ATpoint 65k

A couple of things:

1) v0.9.1 is very old (2017 I think), lots of improvements were made since then. Consider upgrading.

2) It is probably better to use the full transcript fasta file rather than only the protein-coding transcripts. The reason is straightforward: if a gene that is not in the reference is transcribed, the tool will still try to find the best match for its reads in the reference, and that will lead to false positives. I routinely use the full (...).transcripts.fa.gz from Gencode. You can always filter later for the genes you are interested in, e.g. prior to differential analysis.

3) I cannot say why the tool does not produce the file the way you want, but I would simply use the tximport package from Bioconductor to sum the counts from the transcript level to the gene level. That is also recommended for downstream analysis, e.g. with edgeR or DESeq2. Check its manual: https://bioconductor.org/packages/release/bioc/html/tximport.html
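To make the idea of summing transcripts to genes concrete, here is a rough sketch that adds up the NumReads column of a salmon quant.sf file per gene. It assumes a two-column, tab-separated transcript-to-gene table (called tx2gene.tsv here, a made-up name) and the standard quant.sf columns Name, Length, EffectiveLength, TPM, NumReads. The function name sum_to_gene is also made up. This only shows the plain summation; tximport additionally handles things like length offsets, so it remains the better choice for real analyses.

```shell
#!/bin/sh
# Sketch: sum salmon NumReads per gene.
# $1 = transcript<TAB>gene table, $2 = quant.sf; sum_to_gene is a hypothetical name.
sum_to_gene() {
  awk -F '\t' '
    # First file: remember which gene each transcript belongs to.
    NR == FNR { gene[$1] = $2; next }
    # Second file: skip the quant.sf header line.
    FNR == 1 { next }
    # Add NumReads (column 5) to the total of the owning gene.
    ($1 in gene) { reads[gene[$1]] += $5 }
    END { for (g in reads) printf "%s\t%.3f\n", g, reads[g] }
  ' "$1" "$2" | sort
}

# Example call (file names are assumptions):
#   sum_to_gene tx2gene.tsv transcripts_quant/quant.sf > gene_counts.tsv
```

Transcripts missing from the table are silently dropped in this sketch, so it is worth checking that the IDs in both files use the same format.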


Sorry, I realized that I posted the reply under my initial question.


You posted it as a new answer, which I moved to a comment on the original question.

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.