Extract specific sequence from FASTA file
1
0
Entering edit mode
3.1 years ago
ysas ▴ 10

I am trying to extract several sequences from a Fasta file using IDs partially matching with the header. I have written a script to perform it, but I only get one sequence.

grep -Fwf ID_list.txt -A1 input.fasta >> output.fa

Here is input.fasta

>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
>xxx|Issori2|100388|CE99641_51257
ATGTCACAAGAAAAACATTGGAACTATACCAAAGATATTGTCAGGACATCGATTTCTGGTGTCTGTGC......

Here is my ID_list.txt

CE101211_3315 
CE99767_31939
CE99607_9185
CE99543_15407

Here is output.fa

>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....

Somehow I only get the sequence matched with the ID listed at the end of text file. Could you please point out how can I get all sequences matched with ID list?

Thank you for your help.

fasta • 3.8k views
ADD COMMENT
6
Entering edit mode
3.1 years ago

code you posted works on my system as expected (mxlinux 19 - 64bit, gnu grep 3.3)

input:

    $ cat test.fa 
    >xxx|Issori2|100290|CE99543_15407
    ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
    >xxx|Issori2|100354|CE99607_9185
    ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
    >xxx|Issori2|100388|CE99641_51257
    ATGTCACAAGAAAAACATTGGAACTATACCAAAGATATTGTCAGGACATCGATTTCTGGTGTCTGTGC......

$ cat ids.txt 
    CE101211_3315 
    CE99767_31939
    CE99607_9185
    CE99543_15407

output:

$ grep -Fwf ids.txt -A1 test.fa 
>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......

with seqkit:

# EDIT: 
# Original post uses "seqkit grep -nrif" which it's PARTLY matching full FASTA
# header by regular expression and case-ignored, with patterns from file.
# For this case, it's not necessary and may produce "false positive", and also slower.
# For example, `seq_1` matches `seq_10` with `-nri`.

$ seqkit grep -i -f ids.txt test.fa 
>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAA
TCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCT
G......

Please use dedicated tools for the job. for eg. seqtk, seqkit etc. @ ysas

ADD COMMENT
0
Entering edit mode

With seqkit, it works. Thank you for your help!

ADD REPLY
0
Entering edit mode

Hello YusukeSasaki ,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY

Login before adding your answer.

Traffic: 1452 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6