Question

How can I produce gene level quantification using Salmon pseudo-aligner?


5.8 years ago

Angelique ▴ 10

Hi !

I am using Salmon to perform pseudo-alignment on paired-end RNA-seq data. I want gene-level quantification, but I obtain files with transcript-level quantification. Command line used:

salmon quant -i Transcriptome_GH38_release_92/Homo_sapiens.GRCh38.92.cdna.ncrna.fa_quasi_index/ -l A -1 Test/SRX2264036_1.fastq.gz -2 Test/SRX2264036_2.fastq.gz -o test_quanti_36 -p 8

Extract of the resulting quantification file:

Name    Length  EffectiveLength TPM NumReads
ENST00000434970.2   9   5.093   0.000000    0.000000
ENST00000448914.1   13  6.885   0.000000    0.000000
ENST00000415118.1   8   4.489   0.000000    0.000000
ENST00000632684.1   12  6.443   0.000000    0.000000
ENST00000430425.1   17  9.050   0.000000    0.000000
ENST00000390578.1   31  15.313  0.000000    0.000000
ENST00000450276.1   17  9.050   0.000000    0.000000
ENST00000431870.1   16  8.504   0.000000    0.000000
ENST00000390567.1   20  10.664  0.000000    0.000000
ENST00000390590.1   31  15.313  0.000000    0.000000

I tried to use the -g option to provide a GTF annotation file, but the resulting file is still at the transcript level.

How can I produce gene level quantification using Salmon ?

Thank you & Have a good day

Salmon quantification Gene RNA-Seq • 5.0k views

Updated 2.4 years ago by GenoMax 141k • written 5.8 years ago by Angelique ▴ 10

score 3 · Answer 1 · 2018-06-22


5.8 years ago

ATpoint 81k

I have never used Salmon with -g, but there is the tximport package, which aggregates transcript quantifications to the gene level. It was developed for exactly this purpose.
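To make the idea of aggregation concrete, here is a rough Python sketch of the simplest form of it: summing TPM and NumReads over all transcripts that belong to the same gene. This is not tximport (which is an R package and also handles things like effective lengths); the function name and the toy transcript/gene IDs are made up for illustration, and the only assumption about the input is that it has the Name, TPM and NumReads columns shown in the question.

```python
from collections import defaultdict


def aggregate_to_genes(quant_rows, tx2gene):
    """Sum TPM and NumReads over the transcripts of each gene.

    quant_rows: iterable of dicts with the quant.sf columns Name, TPM, NumReads.
    tx2gene: dict mapping transcript ID -> gene ID (hypothetical helper input).
    """
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for row in quant_rows:
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue  # transcript absent from the mapping: ignored in this sketch
        totals[gene]["TPM"] += float(row["TPM"])
        totals[gene]["NumReads"] += float(row["NumReads"])
    return dict(totals)


# Toy data with made-up IDs, only to show the shape of the input and output.
toy_rows = [
    {"Name": "txA.1", "TPM": "1.5", "NumReads": "10"},
    {"Name": "txA.2", "TPM": "2.5", "NumReads": "30"},
    {"Name": "txB.1", "TPM": "4.0", "NumReads": "8"},
]
toy_map = {"txA.1": "geneA", "txA.2": "geneA", "txB.1": "geneB"}

gene_totals = aggregate_to_genes(toy_rows, toy_map)
for gene, values in sorted(gene_totals.items()):
    print(gene, values["TPM"], values["NumReads"])
```

For a real quant.sf you could feed the rows from csv.DictReader(handle, delimiter="\t") into the same function.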


score 0 · Answer 2 · 2018-06-23

It is also possible to get counts aggregated at the gene level with Salmon directly. I am not aware of the exact command-line invocation, since I run Salmon on the Galaxy platform, but it is possible to provide a table matching each transcript to a gene and get separate outputs for counts at the transcript and the gene level.

Note that depending on which tool you will use for your downstream analysis you will need either TPM or raw counts.

score 0 · Answer 3 · 2021-11-23

Generate a simple tab-delimited text file in the following format:

transcript_id   gene_id
ENST00000456328.2   ENSG00000223972.5
ENST00000461467.1   ENSG00000237613.2

Then pass it to the -g option instead of your annotation file. You can use the following Python code to convert a GTF file (downloaded from GENCODE) to this format:

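# Convert a GENCODE GTF into a two-column transcript-to-gene table.
seen = set()  # transcript IDs already written, so each pair appears once
with open(file='input file downloaded from GENCODE.gtf', mode='r') as f_in, \
     open(file='output file for salmon.txt', mode='w') as f_out:
    f_out.write('transcript_id\tgene_id\n')
    for line in f_in:
        if line.startswith('#'):
            continue  # skip GTF header comments
        columns = line.rstrip('\n').split('\t')
        if len(columns) < 9:
            continue  # not a complete GTF record
        attributes = columns[8].split(';')
        if len(attributes) < 2:
            continue
        gene_id, tr_id = attributes[0], attributes[1]
        # GENCODE lists gene_id first and transcript_id second; gene-level
        # records have no transcript_id in second position and are skipped.
        if 'gene_id' in gene_id and 'transcript_id' in tr_id:
            gene_id = gene_id.replace('gene_id', '').replace('"', '').strip()
            tr_id = tr_id.replace('transcript_id', '').replace('"', '').strip()
            if tr_id not in seen:
                seen.add(tr_id)
                f_out.write(tr_id + '\t' + gene_id + '\n')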
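Once the mapping file exists, the original command only needs the extra -g argument. The sketch below just assembles and prints that command in Python rather than running it; the helper name build_salmon_command is made up, the paths are the ones from the question, and the mapping file name is the placeholder used above. As far as I know, Salmon then writes a gene-level quant.genes.sf next to the usual quant.sf in the output directory. The transcript IDs in the mapping file need to match the names in quant.sf exactly, including version suffixes.

```python
import shlex


def build_salmon_command(index_dir, reads_1, reads_2, out_dir, tx2gene, threads=8):
    """Return the 'salmon quant' argument list with a transcript-to-gene map (-g)."""
    return [
        "salmon", "quant",
        "-i", index_dir,
        "-l", "A",            # automatic library type detection, as in the question
        "-1", reads_1,
        "-2", reads_2,
        "-g", tx2gene,        # tab-delimited transcript/gene table
        "-o", out_dir,
        "-p", str(threads),
    ]


command = build_salmon_command(
    index_dir="Transcriptome_GH38_release_92/Homo_sapiens.GRCh38.92.cdna.ncrna.fa_quasi_index/",
    reads_1="Test/SRX2264036_1.fastq.gz",
    reads_2="Test/SRX2264036_2.fastq.gz",
    out_dir="test_quanti_36",
    tx2gene="output file for salmon.txt",
)

# Print a copy-pasteable shell command; file names with spaces get quoted.
print(shlex.join(command))
```

To actually run it from Python, the same list can be passed to subprocess.run(command, check=True).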