Extract reads from fasta file (specific read_names) and make a new fasta file
6.4 years ago
KVC_bioinfo ▴ 590

Hello all,

I have a huge fasta file. I need to extract a few reads from it. I have the names of the reads I want, and I would like to write those reads to a new fasta file. I am not sure how to proceed with this. Could someone help me here?

Thank you in advance.

6.4 years ago
GenoMax 141k

Use faSomeRecords from Jim Kent's utilities at UCSC.

This is also one of the FAQs on Biostars, which you should have found by searching:
Extract A Group Of Fasta Sequences From A File
How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File

6.4 years ago
Joseph Hughes ★ 3.0k

First make sure each fasta sequence in sample1.fa is on a single line rather than wrapped across multiple lines:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa
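For anyone who would rather do this step in Python than awk, the same unwrapping can be sketched roughly as follows. This is only an illustration, not part of the answer above; the function name unwrap_fasta and the sample records are made up for the example.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with every sequence joined onto a single line.

    `lines` can be any iterable of strings, for example an open file.
    """
    seq_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header: flush the sequence collected for the previous record.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:
            # Sequence line (blank lines are skipped).
            seq_parts.append(line)
    # Flush the last record.
    if seq_parts:
        yield "".join(seq_parts)


# Small in-memory demonstration with made-up records.
wrapped = [">seq1", "ACGT", "TTGA", ">seq2", "GG"]
print(list(unwrap_fasta(wrapped)))
# prints ['>seq1', 'ACGTTTGA', '>seq2', 'GG']
```

For a real file, pass an open handle (for example open("sample1.fa")) to unwrap_fasta and write each yielded line, followed by a newline, to the output file.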

Then simply grep for the identifier/name of the sequence you want; -A 1 also prints the line that follows each match, i.e. the sequence:

grep -A 1 "idX" sample1_singleline.fa > filtered.fa

If you have multiple identifiers in a file id.txt (one per line):

grep -A 1 --no-group-separator -f id.txt sample1_singleline.fa > multiple_filtered.fa
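Two things to keep in mind with grep -f: patterns match substrings, so id1 also selects id10, and a blank line in id.txt is an empty pattern that matches every line of the input. A rough Python sketch that compares whole identifiers instead is given below. It is an illustration under assumptions, not the method from the answer above: the names read_ids and filter_fasta are made up here, and the identifier is assumed to be the header text up to the first whitespace.

```python
def read_ids(lines):
    """Collect identifiers, one per line, ignoring blank lines and a leading '>'."""
    ids = set()
    for line in lines:
        name = line.strip().lstrip(">")
        if name:
            ids.add(name)
    return ids


def filter_fasta(lines, wanted):
    """Yield the lines of every record whose identifier is in `wanted`.

    Assumption: the identifier is the header text between '>' and the
    first whitespace. Wrapped sequences are handled, so the input does
    not have to be unwrapped first.
    """
    keep = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep and line:
            yield line


# Small in-memory demonstration with made-up records.
fasta = [">id1 sample", "ACGT", "GGCC", ">id10", "TTTT", ">id2", "AAAA"]
wanted = read_ids(["id1\n", "\n", "id2\n"])
print("\n".join(filter_fasta(fasta, wanted)))
```

The demonstration prints the id1 and id2 records and skips id10. With real files, build the set with read_ids(open("id.txt")) and write the lines yielded by filter_fasta(open("sample1.fa"), wanted) to the output file.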

Hi Joseph,

I've tried your commands. They worked very well with a single identifier/name, but there is a problem with extracting multiple identifiers. I typed the desired IDs into a .txt file, and after running the third command the result is the same as the input, with nothing filtered out. I wonder whether something is wrong with how I organized the identifiers in the .txt file. Could you please provide an example of such a .txt file?

Sincerely,

Yeyan

6.4 years ago
Joe 21k

Here's the script I created to retrieve fasta records by key.

https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It assumes your list of keys is a simple line-separated column of strings:

key1
key2
key3

Then just invoke:

python fastafetcher.py -f myseqs.fasta -k mykeys.txt

NB I've never tested this on very large files, so can't vouch too much for its speed or memory efficiency.


Thank you very much, I will try it out.


Hey,

I am trying to use the Python script to extract a bunch of sequences from a fasta file. I get this error:

[yeserin@rackham3 outliers]$ python fastafetcher.py -f myseqs.fasta -k mykeys.txt
Traceback (most recent call last):
  File "fastafetcher.py", line 9, in <module>
    from Bio import SeqIO
ImportError: No module named Bio

Do I also need another module called Bio?


You need biopython installed. Use one of the other options above if you can't install that.
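
The Bio module named in the traceback is provided by the Biopython package, which is installed under the name biopython. A quick way to check whether it is visible to the interpreter you are running is sketched below, using only the standard library and assuming Python 3. The function name biopython_available is made up for this example.

```python
import importlib.util


def biopython_available():
    """Return True if the `Bio` package (Biopython) can be found by this interpreter."""
    return importlib.util.find_spec("Bio") is not None


if biopython_available():
    print("Biopython found")
else:
    # Install with the pip that belongs to the interpreter running the script.
    print("Biopython missing; try: python -m pip install biopython")
```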


Ah, I found out that the code needs Biopython, not just plain Python. The problem is solved.
