Question: extract sequences based on ids file
Mehmet410 (Japan) wrote 9 months ago:

Dear all,

I have a list of IDs in arbitrary order and a FASTA file that contains those IDs. I want to extract the sequence for each ID from the FASTA file.

I tried, but each time the output comes out in a different order from the order in the ID file.

How can I extract the sequences in the order given in the ID file?

Alex Reynolds24k (Seattle, WA USA) wrote 9 months ago:

Via bash shell and awk:

$ while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' sequences.fa; done < patterns.txt
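
One possible refinement of the loop above, offered as a sketch rather than a definitive replacement: the loop treats each ID as a regular expression tested against the whole record, so an ID such as seq1 would also match a header such as seq10, and the FASTA file is re-read once per ID. The version below reads the FASTA file once, keys each record by the first word of its header, and then prints the records in ID-file order using exact matches. The function name extract_in_order is hypothetical, and the sketch assumes that the ID is the first whitespace-delimited word after ">" and that all records fit in memory.

```shell
# Hypothetical helper: print FASTA records in the order listed in an ID file.
# $1 = FASTA file, $2 = ID file (one ID per line).
# Usage: extract_in_order sequences.fa patterns.txt > output.fa
extract_in_order() {
    awk '
        # First file (FASTA): store every record under its header ID.
        NR == FNR {
            if (/^>/) id = substr($1, 2)      # ID = first word, minus ">"
            if (id != "") rec[id] = rec[id] $0 "\n"
            next
        }
        # Second file (IDs): print the stored record on an exact ID match.
        ($1 in rec) { printf "%s", rec[$1] }
    ' "$1" "$2"
}
```

IDs with no exact match in the FASTA file are skipped silently, and matching stays case-sensitive.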

Thank you very much. It was what I really wanted.

written 9 months ago by Mehmet410

I tried it by typing the same command in the terminal, but the result file is empty.

written 5 months ago by majeedaasim20

That's too bad. Post your files somewhere and maybe I can take a look.

written 5 months ago by Alex Reynolds24k

Alex, I have a FASTA file containing many sequences, like this:

>Seq1
atgccaaagtagatacagatagac
>seq2
atattagacagatacaatagacag
>seq3
aggagatacagatacagatac
>seq4
atgacagatacagatacagatacagat
>seq5
agtagataacacagatagacagat
>seq6
agtaacagtacagatacagatacagat

I also have an ID file containing a list of sequence IDs, like this:

seq6
seq3
seq1

Now I want to extract these sequences from the FASTA file, but in the same order as in the ID file. I tried different tools, but they do not maintain the order of the extracted sequences. All of them extract the sequences in the order found in the FASTA file (seq1, seq3, seq6), but I need them in the order above. Thanks.

modified 5 months ago • written 5 months ago by majeedaasim20

Worked fine for me:

$ cat > sequences.fa
>Seq1
atgccaaagtagatacagatagac
>seq2
atattagacagatacaatagacag
>seq3
aggagatacagatacagatac
>seq4
atgacagatacagatacagatacagat
>seq5
agtagataacacagatagacagat
>seq6
agtaacagtacagatacagatacagat

Then:

$ cat > patterns.txt
seq6
seq3
seq1

Then:

$ while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' sequences.fa; done < patterns.txt
>seq6
agtaacagtacagatacagatacagat
>seq3
aggagatacagatacagatac

The output follows the order in patterns.txt, for the IDs where a match is found.

Note that Seq1 is uppercase in your sequences.fa file and lowercase seq1 in patterns.txt.

If case-sensitivity is an issue, pre-process your data to fix patterns or sequence headers. Or perhaps modify the test in awk to apply toupper() or similar on both the pattern and sequence header, before testing for a pattern match.
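
The toupper() idea could look something like the sketch below, which upper-cases both the record and the pattern before testing, so that seq1 in the ID file matches Seq1 in the FASTA file. The function name extract_ignore_case is hypothetical. This keeps the same regular-expression match against the whole record as the loop above, so an ID can still match a longer ID that contains it.

```shell
# Hypothetical helper: same loop as above, but with a case-insensitive test.
# $1 = FASTA file, $2 = ID file (one ID per line).
# Usage: extract_ignore_case sequences.fa patterns.txt > output.fa
extract_ignore_case() {
    while read -r line; do
        # Skip blank lines: an empty pattern would match every record.
        [ -n "$line" ] || continue
        awk -v pattern="$line" -v RS=">" '
            # Compare upper-cased copies; print the record unchanged.
            toupper($0) ~ toupper(pattern) { printf(">%s", $0) }
        ' "$1"
    done < "$2"
}
```

The records are printed with their original case; only the comparison is case-insensitive.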

modified 5 months ago • written 5 months ago by Alex Reynolds24k

Do I need to use cat even if I already have the sequences and the ID list in their own files?

written 5 months ago by majeedaasim20

I just typed this in the terminal, but I am getting nothing. Where is the result file produced?

while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' my_final_seq.fa; done < my_final_Ids.txt
written 5 months ago by majeedaasim20

Please make sure that the ID names in your ID file are exactly the same as those in your FASTA file, as Alex suggested. The command prints its result to the terminal; to save it, redirect the output to a file:

while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' seq.fa; done < idfile.txt > output.fa

The output.fa file contains:

>seq6
agtaacagtacagatacagatacagat
>seq3
aggagatacagatacagatac
>seq1
atgccaaagtagatacagatagac
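
To check for ID mismatches before extracting, something like the sketch below could list every ID in the ID file that has no exactly matching header in the FASTA file. The function name missing_ids is hypothetical, and the sketch assumes the ID is the first whitespace-delimited word after ">" in each header.

```shell
# Hypothetical helper: list IDs that have no exact header match.
# $1 = FASTA file, $2 = ID file (one ID per line).
# Usage: missing_ids seq.fa idfile.txt
missing_ids() {
    awk '
        # First file (FASTA): remember the ID from each header line.
        NR == FNR { if (/^>/) seen[substr($1, 2)] = 1; next }
        # Second file (IDs): report non-blank IDs that were never seen.
        $1 != "" && !($1 in seen) { print $1 }
    ' "$1" "$2"
}
```

Empty output means every ID has an exact, case-sensitive match.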
written 5 months ago by Mehmet410