Hi, I have a list of partial seq IDs, and a fasta file, and I want to get a fasta file that contains all the sequences that match the IDs in the list.
Example:
list_ID.txt
AF306937.1 S56354.1
Requested output: proteins_output.fasta
lcl|AF306937.1_prot_AAG42454.1_1 MGMDPIALQAGFDLLGDGRPETLWLGIGTLLMIIGTFYFIAQGWGVTDKEAREYYAITILVPGIASAAYLAMFFGIGVTEVELASGAVLDIYYARYADWLFTTPLLLLDLALLAKVDRVSIGTLIGVDALMIVTGLIGALSKTPLARYTWWLFSTIAFLFVLYYLLTSLRSAAAQRSEEVQSTFNTLTALVAVLWTAYPILWIVGTEGAGVVGLGVETLAFMVLDVTAKVGFGFALLRSRAILGETEAPEPSAGT
lcl|S56354.1_prot_AAB19870.2_1 MDPIALQAGFDLLNDGRPETLWLGIGTLLMLIGTFYFIARGWGVTDKEAREYYAITILVPGIASAAYLAMFFGIGVTEVELASGTVLDIYYARYADWLFTTPLLLLDLALLAKVDRVTIGTLIGVDALMIVTGLIGALSKTPLARYTWWLFSTIAFLFVLYYLLTSLRSAAAKRSEEVRSTFNTLTALVAVLWTAYPILWIVGTEGAGVVGLGIETLAFMVLDVTAKVGFGFVLLRSRAILGETEAPEPSAGADASAAD
I wrote this loop for bioawk, but it isn't quite working as I want.
When I run this loop, I get an empty "proteins_outout.fasta" file:
while read p; do
bioawk -c fastx '$name ~ /$p/ { print ">"$name"\n"$seq"\n"; }' proteins_db.fasta > proteins_outout.fasta;
done <list_ID.txt
Although when I try to run the bioawk line for a single seq ID, I do get the right output:
bioawk -c fastx '$name ~ /AF306937.1/ { print ">"$name"\n"$seq"\n"; }'
proteins_db.fasta
lcl|AF306937.1_prot_AAG42454.1_1 MGMDPIALQAGFDLLGDGRPETLWLGIGTLLMIIGTFYFIAQGWGVTDKEAREYYAITILVPGIASAAYLAMFFGIGVTEVELASGAVLDIYYARYADWLFTTPLLLLDLALLAKVDRVSIGTLIGVDALMIVTGLIGALSKTPLARYTWWLFSTIAFLFVLYYLLTSLRSAAAQRSEEVQSTFNTLTALVAVLWTAYPILWIVGTEGAGVVGLGVETLAFMVLDVTAKVGFGFALLRSRAILGETEAPEPSAGT
Why is my loop not working as I want?
Would appreciate any help!
Thanks
Yessssss thank you! I didn't know about
-v
for passing shell variables to bioawk. Working perfectly now!Cool! The
-v
is actually from "standard"awk
, and afaik bioawk can do all awk things plus bioinformatics-specific ones.