bioawk looping to retrieve fasta sequences by seq ID pattern
1
2
Entering edit mode
2.8 years ago
shaharbra ▴ 20

Hi, I have a list of partial seq IDs, and a fasta file, and I want to get a fasta file that contains all the sequences that match the IDs in the list.

Example:

list_ID.txt

AF306937.1
S56354.1

Requested output: proteins_output.fasta

lcl|AF306937.1_prot_AAG42454.1_1 MGMDPIALQAGFDLLGDGRPETLWLGIGTLLMIIGTFYFIAQGWGVTDKEAREYYAITILVPGIASAAYLAMFFGIGVTEVELASGAVLDIYYARYADWLFTTPLLLLDLALLAKVDRVSIGTLIGVDALMIVTGLIGALSKTPLARYTWWLFSTIAFLFVLYYLLTSLRSAAAQRSEEVQSTFNTLTALVAVLWTAYPILWIVGTEGAGVVGLGVETLAFMVLDVTAKVGFGFALLRSRAILGETEAPEPSAGT

lcl|S56354.1_prot_AAB19870.2_1 MDPIALQAGFDLLNDGRPETLWLGIGTLLMLIGTFYFIARGWGVTDKEAREYYAITILVPGIASAAYLAMFFGIGVTEVELASGTVLDIYYARYADWLFTTPLLLLDLALLAKVDRVTIGTLIGVDALMIVTGLIGALSKTPLARYTWWLFSTIAFLFVLYYLLTSLRSAAAKRSEEVRSTFNTLTALVAVLWTAYPILWIVGTEGAGVVGLGIETLAFMVLDVTAKVGFGFVLLRSRAILGETEAPEPSAGADASAAD

I wrote this loop for bioawk, but it isn't quite working as I want.

When I run this loop, I get an empty "proteins_outout.fasta" file:

while read p; do
   bioawk -c fastx '$name ~ /$p/ { print ">"$name"\n"$seq"\n"; }' proteins_db.fasta > proteins_outout.fasta; 
done <list_ID.txt

Although when I try to run the bioawk line for a single seq ID, I do get the right output:

bioawk -c fastx '$name ~ /AF306937.1/ { print ">"$name"\n"$seq"\n"; }'
proteins_db.fasta

lcl|AF306937.1_prot_AAG42454.1_1 MGMDPIALQAGFDLLGDGRPETLWLGIGTLLMIIGTFYFIAQGWGVTDKEAREYYAITILVPGIASAAYLAMFFGIGVTEVELASGAVLDIYYARYADWLFTTPLLLLDLALLAKVDRVSIGTLIGVDALMIVTGLIGALSKTPLARYTWWLFSTIAFLFVLYYLLTSLRSAAAQRSEEVQSTFNTLTALVAVLWTAYPILWIVGTEGAGVVGLGVETLAFMVLDVTAKVGFGFALLRSRAILGETEAPEPSAGT

Why is my loop not working as I want?

Would appreciate any help!

Thanks

regex fasta bioawk loop • 1.4k views
ADD COMMENT
2
Entering edit mode
2.8 years ago
ATpoint 82k

I think a) you have to pass the shell variable $p to bioawk properly using -v and b) you are currently overwriting your output file in every iteration as you use > rather than >>:

while read p
    do
    bioawk -v var="${p}" -c fastx '$name ~ var {print ">"$name"\n"$seq}' proteins_db.fasta >> proteins_outout.fasta
    done < list_ID.txt
ADD COMMENT
0
Entering edit mode

Yessssss thank you! I didn't know about -v for passing shell variables to bioawk. Working perfectly now!

ADD REPLY
1
Entering edit mode

Cool! The -v is actually from "standard" awk, and afaik bioawk can do all awk things plus bioinformatics-specific ones.

ADD REPLY

Login before adding your answer.

Traffic: 2611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6