Question: GTF based salmon index file for GRCh38
3 months ago by
United States
parvathi.sudha20 wrote:

I am trying to create a salmon index for GRCh38 using Gencode v35. When I ran the quantification, quant.genes.sf showed only transcript IDs, not gene IDs, even though I supplied the GTF file. Can you please tell me how to solve this issue? Is there a way to create a GTF-based salmon index for GRCh38? I am using salmon version 0.9.1.

salmon index -t /salmon/GRCh38/gencode.v35.pc_transcripts.fa -i /salmon/GRCh38/salmon_index --type quasi -k 31 --gencode

salmon quant -i /salmon/GRCh38/salmon_index -l A -1 ${FASTQ1} -2 ${FASTQ2} -o transcripts_quant -g ${GTF} --seqBias --validateMappings --useVBOpt --numBootstraps 100

Thanks Parvathi.

index rna-seq salmon • 379 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by parvathi.sudha20

Thank you for the reply.

1) Sure, I will upgrade the tool. I was using the old version because it was already available on the HPC cluster.

2) I will use the full transcript fasta. I didn't completely understand the need for a decoy. If I am just looking for gene-level and transcript-level expression, is there any need to use one? Also, which GTF file from Gencode would be better to use: gencode.v35.annotation.gtf, gencode.v35.basic.annotation.gtf or gencode.v35.primary_assembly.annotation.gtf? When we make the salmon index, is there a way to create a GTF-based index? I read somewhere that Kallisto has an option to create a GTF-based index, but I didn't see any tutorial for salmon. Will using a decoy help in this case?

3) I will use tximport.

Thanks Parvathi

ADD REPLYlink written 3 months ago by parvathi.sudha20

For decoys see : C: How does salmon deal with decoy?

ADD REPLYlink written 3 months ago by GenoMax94k

2a) The decoy strategy aims to remove false-positive alignments/quantifications. The idea, e.g. when using the whole genome as decoy, is that a read which matches a sequence in the genome better than any sequence in the transcriptome is not counted towards the transcriptome. This can account for genomic DNA contamination and random background transcription. It might be beneficial in some cases (check the recent salmon papers) but it is not strictly necessary. The main findings will probably be similar with and without decoy. It is a nice feature. I personally simply used the entire genome as decoy by adding the genome fasta file to the transcriptome with cat as described in the manual, but again, this is optional. If you find it too cumbersome then skip it.
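As a rough sketch of the cat approach mentioned above (not a definitive recipe): the two helper functions below are hypothetical names introduced for illustration, and the file names in the commented usage are assumed Gencode downloads, not files from this thread. The salmon index call at the end assumes a recent salmon release with decoy support; v0.9.1 does not have it.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file names are illustrative assumptions.

# Write the names of all genome sequences (the decoys) to a text file,
# one per line, without the leading ">" and without the description.
make_decoys() {
    local genome_fa_gz=$1 out=$2
    gunzip -c "$genome_fa_gz" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > "$out"
}

# Concatenate transcriptome and genome into one file.
# The transcripts must come first, the decoy (genome) sequences last.
# Concatenated gzip files are still a valid gzip stream.
make_gentrome() {
    local transcripts_fa_gz=$1 genome_fa_gz=$2 out=$3
    cat "$transcripts_fa_gz" "$genome_fa_gz" > "$out"
}

# Example usage (commented out; needs the downloaded files and a recent salmon):
# make_decoys GRCh38.primary_assembly.genome.fa.gz decoys.txt
# make_gentrome gencode.v35.transcripts.fa.gz GRCh38.primary_assembly.genome.fa.gz gentrome.fa.gz
# salmon index -t gentrome.fa.gz -d decoys.txt -i salmon_index -k 31 --gencode
```

Indexing the whole genome as decoy needs considerably more memory and time than indexing the transcriptome alone, which is one reason to treat it as optional.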

2b) You only need the transcriptome for the indexing, not the GTF. I am not familiar with kallisto, so I cannot comment on it. In salmon you index the transcriptome and then later use e.g. tximport to summarize the transcript-level counts to the gene level in case you want a gene-level analysis, e.g. with DESeq2 or edgeR. If you want a transcript-level differential analysis, check e.g. the swish method from the fishpond paper.
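To illustrate where the GTF comes in after quantification, here is a minimal sketch that pulls a transcript-to-gene table out of an uncompressed Gencode GTF and then sums NumReads from quant.sf per gene. The function names are hypothetical. The plain sum is only a sanity check: tximport does more than this (for example it can also produce length-scaled counts and offsets), so use it for the real analysis. The sketch assumes the index was built with --gencode so that the Name column of quant.sf holds bare transcript IDs.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# Extract "transcript_id <TAB> gene_id" from the transcript records
# of an uncompressed GTF (gunzip the Gencode file first).
make_tx2gene() {
    local gtf=$1 out=$2
    awk -F '\t' '$3 == "transcript" {
        tx = ""; gene = ""
        # The prefix transcript_id " is 15 characters, gene_id " is 9.
        if (match($9, /transcript_id "[^"]+"/)) tx = substr($9, RSTART + 15, RLENGTH - 16)
        if (match($9, /gene_id "[^"]+"/)) gene = substr($9, RSTART + 9, RLENGTH - 10)
        if (tx != "" && gene != "") print tx "\t" gene
    }' "$gtf" > "$out"
}

# Sum the NumReads column (5th column of quant.sf) per gene.
# Transcripts missing from the table are silently skipped.
sum_to_gene() {
    local tx2gene=$1 quant_sf=$2
    awk -F '\t' '
        NR == FNR { gene[$1] = $2; next }                    # first file: mapping
        FNR > 1 && ($1 in gene) { reads[gene[$1]] += $5 }    # second file: skip header
        END { for (g in reads) printf "%s\t%.3f\n", g, reads[g] }
    ' "$tx2gene" "$quant_sf" | sort
}

# Example usage (commented out; paths are placeholders):
# make_tx2gene gencode.v35.annotation.gtf tx2gene.tsv
# sum_to_gene tx2gene.tsv transcripts_quant/quant.sf > gene_numreads.tsv
```

The same two-column tx2gene.tsv can be read into R and passed to tximport as its tx2gene argument.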

ADD REPLYlink written 3 months ago by ATpoint44k

I understood. Thank you!

ADD REPLYlink written 3 months ago by parvathi.sudha20
3 months ago by
ATpoint44k wrote:

A couple of things:

1) v0.9.1 is very old (2017 I think), and lots of improvements have been made since then. Consider upgrading.

2) It is probably better to use the full transcript fasta file rather than only the protein-coding transcripts. The reason is straightforward: if a gene that is not in the reference is transcribed, the tool will still try to find the best match for its reads in the reference, and that will lead to false positives. I routinely use the full (...).transcripts.fa.gz from Gencode. You can always filter for the genes you are interested in later, e.g. prior to differential analysis.

3) I cannot help with why the tool does not produce the file in the way you want, but I would simply use the tximport package from Bioconductor to sum the counts from the transcript level to the gene level. That is also recommended for downstream analysis, e.g. with edgeR or DESeq2, check its manual:

ADD COMMENTlink modified 3 months ago • written 3 months ago by ATpoint44k

Sorry, I realized that I posted the reply under my initial question.

ADD REPLYlink written 3 months ago by parvathi.sudha20

You posted it as a new answer, which I moved to a comment on the original question.

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.

ADD REPLYlink modified 3 months ago • written 3 months ago by GenoMax94k