Question

Add salmon quant.sf data to Trinity headers

0

Entering edit mode

18 months ago

kmkocot • 0

Hi all,

I routinely check my transcriptomes (from microscopic invertebrates) for exogenous contamination by blasting for COI and 18S and then looking to see if I get non-target hits and from what organism they came from. I'd like to establish a pipeline where I run salmon to get a quant.sf file for each assembly and then add the TPM and read count to the Trinity fasta header.

Here's what my Trinity fasta headers look like

TRINITY_DN8_c0_g1_i2
TRINITY_DN8_c0_g1_i1
TRINITY_DN8_c0_g1_i4
TRINITY_DN8_c0_g1_i3
TRINITY_DN8_c0_g1_i6
TRINITY_DN8_c0_g1_i7
TRINITY_DN18_c0_g1_i5
TRINITY_DN18_c0_g1_i2
TRINITY_DN18_c0_g1_i3
TRINITY_DN18_c0_g1_i6

Here's what the salmon quant.sf file looks like:

TRINITY_DN8_c0_g1_i2    510 285.832 7.037047    218.113
TRINITY_DN8_c0_g1_i1    1155    926.015 13.435710   1349.141
TRINITY_DN8_c0_g1_i4    1390    1161.015    1.133691    142.729
TRINITY_DN8_c0_g1_i3    698 469.444 12.193151   620.695
TRINITY_DN8_c0_g1_i6    563 336.582 6.744577    246.164
TRINITY_DN8_c0_g1_i7    1076    847.015 7.938752    729.159
TRINITY_DN18_c0_g1_i5   773 544.151 1.097282    64.747
TRINITY_DN18_c0_g1_i2   338 134.304 0.205994    3.000
TRINITY_DN18_c0_g1_i3   1452    1223.015    6.587060    873.579

I know there should be an easy script solution to add the last two fields from each line of the the quant.sf file to the corresponding fasta header but I haven't been able to figure this out. For what it's worth the is the exact same number of sequences in the fasta file as lines in the quant.sf file and they are in the same order.

Anyone who has figured this out or any scripting gurus who could lend some assistance would be greatly appreciated.

Thanks!

Kevin

salmon fasta header expression trinity • 438 views

ADD COMMENT • link updated 18 months ago by iraun 6.2k • written 18 months ago by kmkocot • 0

score 0 · Answer 1 · 2022-10-06

Hi! Would something like this work for you?

awk 'FNR==NR{split($0,a,"\t"); b[a[1]]=$4"\t"$5; next} NR>1 {sub(/>/,"",$0); if ($0 in b) { print ">"$0"\t"b[$0] } else { print } }' quant.sf fasta.fa

Explanation of the main sections of the command:

Parse quant.sf file and create a dictionary with the first field (sequence ID) as key and the last two fields as value.

awk 'FNR==NR{split($0,a,"\t"); b[a[1]]=$4"\t"$5; next}

Parse the fasta file, extract the header ID and extract the corresponding values of that ID in the previously generated dictionary

NR>1 {sub(/>/,"",$0); if ($0 in b) { print ">"$0"\t"b[$0] } else { print } }'