Question

Extract reads from fasta file (specific read_names) and make a new fasta file

0

7.6 years ago

KVC_bioinfo ▴ 610

Hello all,

I have a huge fasta file and need to extract a few reads from it. I have the names of the reads I want to extract, and I want to write those reads to a new fasta file. I am not sure how to proceed. Could someone help me here?

Thank you in advance.

fasta • 7.0k views

ADD COMMENT • link updated 7.6 years ago by Joseph Hughes ★ 3.0k • written 7.6 years ago by KVC_bioinfo ▴ 610

1

7.6 years ago

Joe 22k

Here's a script I wrote to retrieve fasta records by key.

https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It assumes your list of keys is a simple line-separated column of strings:

key1
key2
key3

Then just invoke:

python fastafetcher.py -f myseqs.fasta -k mykeys.txt

NB: I've never tested this on very large files, so I can't vouch for its speed or memory efficiency.
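For anyone who cannot install Biopython, or who is worried about memory on a huge file, the general idea of fetching records by key can be sketched with the standard library alone. This is not the linked fastafetcher.py script, just a minimal illustration under a few assumptions: each key is the first whitespace-delimited word of a header line, and the names read_keys, fetch_records and extract are made up for this example.

```python
def read_keys(path):
    """Read one key per line, ignoring blank lines (hypothetical helper)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def fetch_records(fasta_lines, keys):
    """Yield every line of each record whose ID is in `keys`.

    Assumption: the ID is the first word after '>' on the header line.
    Lines are streamed one at a time, so the whole file is never held in memory.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keys
        if keep:
            yield line


def extract(fasta_path, keys_path, out_path):
    """Write the selected records from fasta_path to out_path."""
    keys = read_keys(keys_path)
    with open(fasta_path) as src, open(out_path, "w") as dst:
        dst.writelines(fetch_records(src, keys))
```

Something like extract("myseqs.fasta", "mykeys.txt", "subset.fasta") would then write the matching records, including multi-line sequences, to subset.fasta.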

ADD COMMENT • link 7.6 years ago by Joe 22k

0

Thank you very much, I will try it out.

ADD REPLY • link 7.6 years ago by KVC_bioinfo ▴ 610

0

Hey,

I am trying to use the Python script to extract a bunch of sequences from a fasta file. I get this error:

[yeserin@rackham3 outliers]$ python fastafetcher.py -f myseqs.fasta -k mykeys.txt
Traceback (most recent call last):
  File "fastafetcher.py", line 9, in <module>
    from Bio import SeqIO
ImportError: No module named Bio

Do I also need to install another module called Bio?

ADD REPLY • link 4.8 years ago by yeserin • 0

0

You need Biopython installed; it provides the Bio module. Use one of the other options in this thread if you can't install it.

ADD REPLY • link 4.8 years ago by GenoMax 152k

0

Ah, I found out that the code needs Biopython, not just plain Python. The problem is solved.

ADD REPLY • link 4.8 years ago by yeserin • 0

score 5 · Accepted Answer · 2017-11-21

7.6 years ago

GenoMax 152k

Use faSomeRecords from Jim Kent's utilities at UCSC.

This is also one of the FAQs on BioStars, which you should have found by searching:
Extract A Group Of Fasta Sequences From A File
How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File

ADD COMMENT • link 7.6 years ago by GenoMax 152k

score 2 · Accepted Answer · 2017-11-22

7.6 years ago

Joseph Hughes ★ 3.0k

First make sure the fasta sequences in sample1.fa are each on a single line rather than wrapped across multiple lines:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa
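If awk is unfamiliar, the same linearizing step could be sketched in Python roughly as follows. This is an illustration rather than a drop-in replacement for the awk one-liner: the function name linearize is hypothetical, blank lines are dropped, and any sequence text appearing before the first header is ignored.

```python
def linearize(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    Output lines carry no trailing newline; add "\n" when writing them out.
    A record with no sequence still yields an empty sequence line, so the
    output always alternates header, sequence, header, sequence.
    """
    seq_parts = []
    have_record = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if have_record:
                # Flush the sequence of the previous record.
                yield "".join(seq_parts)
            yield line
            seq_parts = []
            have_record = True
        elif line:
            seq_parts.append(line)
    if have_record:
        # Flush the sequence of the final record.
        yield "".join(seq_parts)
```

To use it on a file, open sample1.fa for reading, pass the file object to linearize, and write each yielded line followed by a newline to sample1_singleline.fa.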

Then simply grep for the identifier/name of the sequence you want, plus the following line (-A 1):

grep -A 1 "idX" sample1_singleline.fa > filtered.fa

If you have multiple identifiers in a file id.txt (one per line):

grep -A 1 --no-group-separator -f id.txt sample1_singleline.fa > multiple_filtered.fa
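Two things are worth knowing about grep -f. A blank line in id.txt acts as an empty pattern, which matches every line of the input, so the output ends up identical to the input. Patterns are also matched as substrings, so idX would also select idX2. Below is a rough Python sketch of an exact-match alternative for the single-line fasta produced above. The names load_ids and filter_single_line are hypothetical, and the sketch assumes the identifier is the first word after ">" on each header line.

```python
def load_ids(lines):
    """Collect identifiers, skipping blank lines and tolerating a leading '>'
    or Windows line endings in the ID file."""
    ids = set()
    for line in lines:
        fields = line.strip().lstrip(">").split()
        if fields:
            ids.add(fields[0])
    return ids


def filter_single_line(fasta_lines, ids):
    """Yield header and sequence lines for records whose ID is exactly in `ids`.

    Assumes strictly alternating header/sequence lines (single-line FASTA).
    """
    it = iter(fasta_lines)
    # zip(it, it) pairs each header line with the sequence line after it.
    for header, seq in zip(it, it):
        fields = header[1:].split()
        if header.startswith(">") and fields and fields[0] in ids:
            yield header
            yield seq
```

With id.txt and sample1_singleline.fa opened as files, passing them through load_ids and filter_single_line and writing the result out gives the equivalent of multiple_filtered.fa, but with whole-identifier matching.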

ADD COMMENT • link 7.6 years ago by Joseph Hughes ★ 3.0k

0

Hi Joseph,

I've tried your commands. They worked very well with a single identifier/name, but I have a problem extracting multiple identifiers. I typed the desired IDs into a .txt file, and after running the third command the output is the same as the input, with nothing filtered out. I wonder whether something is wrong with how I organized the identifiers in the .txt file. Could you please provide an example of such a .txt file?

Sincerely,

Yeyan

ADD REPLY • link 6.6 years ago by yeyan.qiu • 0