Question: How to match fasta header with the first column of a text file and append the second column of the text file to the end of the fasta header
0
gravatar for timmers
7 months ago by
timmers10
timmers10 wrote:

Aloha!

I have a fasta file that looks like the following:

>FFSA34B_100_M7_ID10014
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>ART01B_100_M7_ID10021
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>PHR01B_100_M7_ID100236
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG

I have a text file that looks like the following:

FFSA34B_100_M7_ID10014  2
ART01B_100_M7_ID10021 454
PHR01B_100_M7_ID100236 43

And I would like to append the second column of the text file to the matching fasta header to produce the following output:

>FFSA34B_100_M7_ID10014_2
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>ART01B_100_M7_ID10021_454
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>PHR01B_100_M7_ID100236_43
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG

The fasta headers and the text file in the real example will not be in the same order so the command will have to match the fasta header to the first column of the text file and then append the second column value to the fasta header. Anyone know of a why to easily do this?

Thanks in advance!

sequence • 463 views
ADD COMMENTlink modified 7 months ago by cpad011211k • written 7 months ago by timmers10

please, validate your previous questions: C: Inserting a string after placing the last underscore in a header if a fasta file ; Removing everything after the last underscore in the header of a fastafile

ADD REPLYlink modified 7 months ago • written 7 months ago by Pierre Lindenbaum120k
3
gravatar for Pierre Lindenbaum
7 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

linearize, sort and join (assuming delimiter is tab)

$ join -t $'\t' -1 1 -2 1 \
   <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa | cut -c 2- | sort -t $'\t' -k1,1) \
   <(sort -t $'\t' -k1,1 list.txt) |\
   awk -F '\t' '{printf(">%s_%s\n%s\n",$1,$3,$2);}'

>ART01B_100_M7_ID10021_454
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>FFSA34B_100_M7_ID10014_2
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>PHR01B_100_M7_ID100236_43
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG
ADD COMMENTlink written 7 months ago by Pierre Lindenbaum120k

This is great if tab delimited! Question, how would you modify this if the text delimiter is a space?

ADD REPLYlink written 7 months ago by timmers10
2

Please run man join, man sort and man awk and look for the options to change delimiters. Or even easier, given that Pierre has explicitly mentioned those options and set the delimiter in all of his commands, read the commands carefully.

ADD REPLYlink modified 7 months ago • written 7 months ago by RamRS22k
0
gravatar for cpad0112
7 months ago by
cpad011211k
India
cpad011211k wrote:
$  awk '/>/ {getline seq} {sub (">","",$0);print $0, seq}' test.fa | sort -k1 | join -1 1 -2 1 - <(sort -k1 head.txt) -o 1.1,2.2,1.2 | awk '{print ">"$1"_"$2"\n"$3}'

>ART01B_100_M7_ID10021_454
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>FFSA34B_100_M7_ID10014_2
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>PHR01B_100_M7_ID100236_43
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG
ADD COMMENTlink modified 7 months ago • written 7 months ago by cpad011211k

How would you modify this command if the fasta is formatted as below to avoid returning just the first line of the sequence?

 >FFSA34B_100_M7_ID10014
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGGTTCAGTGGATTTGGTAATTT
TTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGCTATGAATTTTATTACGACAATAATAAATATGAGGGCGCCA
GGGATAACTTTTGATAGAATGCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGATCCTGCAGGAGGAGGGGACC
CCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATTTAGCTGGGGTTTCAAGTATT
CTTGGTGCTATTAATTTTATCACTACAACTATCAACATACGTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGT
CTGGTCTGTTAAATTAACTGCTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT
>PHR01B_100_M7_ID100236
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATTTAGCGGGTATTTCTTCTCTA
TTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGCGCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCC
TTGATCCGTGCTAATTACTGCAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACCAGCATCTGTTT
ADD REPLYlink modified 7 months ago • written 7 months ago by timmers10

I would linearize the fasta sequence. If that is not an option, one can use seqkit to linearize and then format it back to 60 chars per line.

code:

$ seqkit seq -w 0 test.fa | awk '/>/ {getline seq} {sub (">","",$0);print $0, seq}' | sort -k1 | join -1 1 -2 1 - <(sort -k1 ids.txt) -o 1.1,2.2,1.2| awk '{print ">"$1"_"$2"\n"$3}' | seqkit seq -w 60

output:

>ART01B_100_M7_ID10021_454
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATT
TAGCTGGGGTTTCAAGTATTCTTGGTGCTATTAATTTTATCACTACAACTATCAACATAC
GTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGTCTGGTCTGTTAAATTAACTG
CTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATC
AACATTTATTT
>FFSA34B_100_M7_ID10014_2
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGG
TTCAGTGGATTTGGTAATTTTTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGC
TATGAATTTTATTACGACAATAATAAATATGAGGGCGCCAGGGATAACTTTTGATAGAAT
GCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGA
TCCTGCAGGAGGAGGGGACCCCATCTTATATCAACATTTATTT
>PHR01B_100_M7_ID100236_43
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATT
TAGCGGGTATTTCTTCTCTATTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGC
GCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCCTTGATCCGTGCTAATTACTG
CAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACC
AGCATCTGTTT

Input:

$ cat test.fa 
>FFSA34B_100_M7_ID10014
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGGTTCAGTGGATTTGGTAATTT
TTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGCTATGAATTTTATTACGACAATAATAAATATGAGGGCGCCA
GGGATAACTTTTGATAGAATGCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGATCCTGCAGGAGGAGGGGACC
CCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATTTAGCTGGGGTTTCAAGTATT
CTTGGTGCTATTAATTTTATCACTACAACTATCAACATACGTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGT
CTGGTCTGTTAAATTAACTGCTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT
>PHR01B_100_M7_ID100236
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATTTAGCGGGTATTTCTTCTCTA
TTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGCGCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCC
TTGATCCGTGCTAATTACTGCAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACCAGCATCTGTTT

$ cat ids.txt 
FFSA34B_100_M7_ID10014  2
ART01B_100_M7_ID10021   454
PHR01B_100_M7_ID100236  43
ADD REPLYlink modified 7 months ago • written 7 months ago by cpad011211k

If you are okay with 2 step solution, here is one (a simpler one) with seqkit without linearization of sequence:

$ awk -v OFS="\t" '{print $1,$1"_"$2}' ids.txt > ids2.txt
$ seqkit replace  -p '(.+)$' -r '{kv}' test.fa -k ids2.txt -o test2.fa

output:

$ cat test2.fa 
>FFSA34B_100_M7_ID10014_2
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGG
TTCAGTGGATTTGGTAATTTTTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGC
TATGAATTTTATTACGACAATAATAAATATGAGGGCGCCAGGGATAACTTTTGATAGAAT
GCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGA
TCCTGCAGGAGGAGGGGACCCCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021_454
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATT
TAGCTGGGGTTTCAAGTATTCTTGGTGCTATTAATTTTATCACTACAACTATCAACATAC
GTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGTCTGGTCTGTTAAATTAACTG
CTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATC
AACATTTATTT
>PHR01B_100_M7_ID100236_43
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATT
TAGCGGGTATTTCTTCTCTATTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGC
GCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCCTTGATCCGTGCTAATTACTG
CAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACC
AGCATCTGTTT

$ cat ids2.txt 
FFSA34B_100_M7_ID10014  FFSA34B_100_M7_ID10014_2
ART01B_100_M7_ID10021   ART01B_100_M7_ID10021_454
PHR01B_100_M7_ID100236  PHR01B_100_M7_ID100236_43

input:

$ cat test.fa 
>FFSA34B_100_M7_ID10014
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGGTTCAGTGGATTTGGTAATTT
TTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGCTATGAATTTTATTACGACAATAATAAATATGAGGGCGCCA
GGGATAACTTTTGATAGAATGCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGATCCTGCAGGAGGAGGGGACC
CCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATTTAGCTGGGGTTTCAAGTATT
CTTGGTGCTATTAATTTTATCACTACAACTATCAACATACGTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGT
CTGGTCTGTTAAATTAACTGCTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT
>PHR01B_100_M7_ID100236
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATTTAGCGGGTATTTCTTCTCTA
TTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGCGCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCC
TTGATCCGTGCTAATTACTGCAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACCAGCATCTGTTT

$ cat ids.txt 
FFSA34B_100_M7_ID10014  2
ART01B_100_M7_ID10021   454
PHR01B_100_M7_ID100236  43
ADD REPLYlink written 7 months ago by cpad011211k

One step is enough:

seqkit replace -p '(.+)' -r '${1}_{kv}' -k ids.txt test.fa > out.fa

$ grep '>' out.fa 
>FFSA34B_100_M7_ID10014_2
>ART01B_100_M7_ID10021_454
>PHR01B_100_M7_ID100236_43
ADD REPLYlink written 6 months ago by shenwei3564.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 928 users visited in the last hour