Question: Extract specific sequence from FASTA file
6 months ago by
ysas0
ysas0 wrote:

I am trying to extract several sequences from a FASTA file using IDs that partially match the headers. I wrote the command below to do this, but I only get one sequence.

grep -Fwf ID_list.txt -A1 input.fasta >> output.fa

Here is input.fasta

>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
>xxx|Issori2|100388|CE99641_51257
ATGTCACAAGAAAAACATTGGAACTATACCAAAGATATTGTCAGGACATCGATTTCTGGTGTCTGTGC......

Here is my ID_list.txt

CE101211_3315 
CE99767_31939
CE99607_9185
CE99543_15407

Here is output.fa

>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....

Somehow I only get the sequence matching the ID listed at the end of the text file. Could you please point out how I can get all sequences matching the ID list?

Thank you for your help.

fasta • 308 views
modified 6 months ago by h.mon29k • written 6 months ago by ysas0
6 months ago by
cpad011213k
India
cpad011213k wrote:

The code you posted works as expected on my system (MX Linux 19, 64-bit, GNU grep 3.3).

input:

    $ cat test.fa 
    >xxx|Issori2|100290|CE99543_15407
    ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
    >xxx|Issori2|100354|CE99607_9185
    ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
    >xxx|Issori2|100388|CE99641_51257
    ATGTCACAAGAAAAACATTGGAACTATACCAAAGATATTGTCAGGACATCGATTTCTGGTGTCTGTGC......

    $ cat ids.txt
    CE101211_3315 
    CE99767_31939
    CE99607_9185
    CE99543_15407

output:

    $ grep -Fwf ids.txt -A1 test.fa
    >xxx|Issori2|100290|CE99543_15407
    ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
    >xxx|Issori2|100354|CE99607_9185
    ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
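
If the same command returns only the record for the last ID in the list, as in the question, one possible cause (an assumption; nothing in this thread confirms it) is that the ID file has Windows (CRLF) line endings or trailing whitespace, so every pattern except the final, unterminated one carries an invisible extra character. Below is a minimal sketch that reproduces this and works around it. The sequences are shortened, and the file names under `$tmp` are made up for illustration.

```shell
# Demo files (illustrative names; sequences shortened from the post).
tmp=$(mktemp -d)
printf '>xxx|Issori2|100290|CE99543_15407\nATGGCTGTC\n' > "$tmp/test.fa"
printf '>xxx|Issori2|100354|CE99607_9185\nATGTCCCAT\n' >> "$tmp/test.fa"
printf '>xxx|Issori2|100388|CE99641_51257\nATGTCACAA\n' >> "$tmp/test.fa"

# Hypothetical ID list saved with CRLF line endings; the last line has no terminator.
printf 'CE101211_3315 \r\nCE99767_31939\r\nCE99607_9185\r\nCE99543_15407' > "$tmp/ids_crlf.txt"

# Count carriage returns in the ID list; anything above 0 suggests CRLF endings.
cr_count=$(tr -cd '\r' < "$tmp/ids_crlf.txt" | wc -c | tr -d ' ')
echo "carriage returns in ID list: $cr_count"

# Using the list as-is: only the last ID (which has no trailing \r) can match.
grep -Fwf "$tmp/ids_crlf.txt" -A1 "$tmp/test.fa" > "$tmp/broken.fa"

# Strip carriage returns, trailing blanks, and empty lines, then retry.
tr -d '\r' < "$tmp/ids_crlf.txt" | sed 's/[[:space:]]*$//' | grep -v '^$' > "$tmp/ids_clean.txt"
# grep -A prints "--" between non-adjacent match groups; drop those lines.
grep -Fwf "$tmp/ids_clean.txt" -A1 "$tmp/test.fa" | grep -v '^--$' > "$tmp/fixed.fa"

grep -c '^>' "$tmp/broken.fa"   # number of records found with the raw list
grep -c '^>' "$tmp/fixed.fa"    # number of records found with the cleaned list
```

Where it is installed, `dos2unix ids.txt` also removes the carriage returns.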

with seqkit:

    $ seqkit grep -nrif ids.txt test.fa
    >xxx|Issori2|100290|CE99543_15407
    ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAA
    TCAGTTGTCC....
    >xxx|Issori2|100354|CE99607_9185
    ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCT
    G......

Please use dedicated tools for the job, e.g. seqtk or seqkit. @YusukeSasaki
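
One reason dedicated tools are safer: `grep -A1` assumes every sequence sits on a single line, while FASTA files are often wrapped (the seqkit output above wraps at 60 characters). Where seqkit or seqtk cannot be installed, a rough awk sketch along these lines keeps whole records. It assumes the ID is the last `|`-separated field of the header, as in the example headers, and it matches that field exactly. `extract_by_id` is a made-up helper name, not part of any tool.

```shell
# Hypothetical helper: print every FASTA record whose last "|"-separated
# header field appears in the ID list. Usage: extract_by_id IDS FASTA
# Caveat: the NR == FNR idiom assumes the ID list is not empty.
extract_by_id() {
    awk '
        NR == FNR {                    # first file: the ID list
            gsub(/[ \t\r]+$/, "")      # drop trailing blanks and carriage returns
            if ($0 != "") want[$0] = 1
            next
        }
        /^>/ {                         # header line: decide whether to keep this record
            n = split($0, field, "|")
            keep = (field[n] in want)
        }
        keep                           # print the header and all its sequence lines
    ' "$1" "$2"
}
```

Called as `extract_by_id ids.txt test.fa > output.fa`, this would print the matching headers together with all of their sequence lines, wrapped or not.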

modified 6 months ago • written 6 months ago by cpad011213k

With seqkit, it works. Thank you for your help!

written 6 months ago by ysas0

Hello YusukeSasaki,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 6 months ago by finswimmer13k