Question: Extract reads from fasta file (specific read_names) and make a new fasta file
2.8 years ago, KVC_bioinfo (Boston) wrote:

Hello all,

I have a huge fasta file. I need to extract a few reads from it, and I have the names of the reads I want. I would like to write the extracted reads to a new fasta file, but I am not sure how to proceed. Could someone help me here?

Thank you in advance.

2.8 years ago, genomax (United States) wrote:

Use faSomeRecords from Jim Kent's utilities at UCSC.

This is also one of the FAQs on Biostars, which you should have found by searching:
Extract A Group Of Fasta Sequences From A File
How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File

2.8 years ago, Joseph Hughes (Scotland, UK) wrote:

First make sure each fasta sequence in sample1.fa is on a single line rather than wrapped across multiple lines:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa
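
For readers who prefer to stay in Python, the same unwrapping step can be sketched as below. This is a minimal illustration rather than a replacement for the awk one-liner; the function name `linearize_fasta` is made up for this example, and blank lines are assumed to be safe to drop.

```python
from typing import Iterable, Iterator


def linearize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield each header line followed by its sequence joined onto one line."""
    seq_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if seq_parts:                 # flush the previous record's sequence
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:                        # assumption: blank lines can be skipped
            seq_parts.append(line)
    if seq_parts:                         # flush the final record
        yield "".join(seq_parts)


if __name__ == "__main__":
    wrapped = [">id1 sample", "ACGT", "TTGA", ">id2", "GGCC"]
    for out in linearize_fasta(wrapped):
        print(out)
```

Because it works on any iterable of lines, an open file handle can be passed in directly, so a large file is processed one line at a time.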

Then grep for the identifier/name of the sequence you want, using -A 1 to also print the line that follows it:

grep -A 1 "idX" sample1_singleline.fa > filtered.fa

If you have multiple identifiers in a file id.txt (one per line):

grep -A 1 --no-group-separator -f id.txt sample1_singleline.fa > multiple_filtered.fa
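
One caveat with the grep commands above is that grep matches substrings, so a pattern such as id1 would also pick up a header like >id10. A minimal Python sketch of exact-ID matching follows. The function name `extract_records` is hypothetical, and the ID is assumed to be the header text after ">" up to the first whitespace, which may not suit every file.

```python
from typing import Iterable, Iterator, Set


def extract_records(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield the lines of every record whose ID is in `wanted`.

    Assumption: the ID is the header text after '>' up to the first
    whitespace. Wrapped sequences are handled, so no unwrapping is needed.
    """
    keep = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            fields = line[1:].split()
            # Exact match on the ID, unlike grep's substring match
            keep = bool(fields) and fields[0] in wanted
        if keep and line:
            yield line


if __name__ == "__main__":
    fasta = [">id1 first", "ACGT", ">id10", "TTTT", "GGGG", ">id2", "CCCC"]
    for out in extract_records(fasta, {"id1", "id2"}):
        print(out)
```

The wanted IDs are held in a set, so the lookup cost per header stays roughly constant however long the ID list is.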

Hi Joseph,

I've tried your commands. They worked very well with a single identifier/name, but there is a problem when extracting multiple identifiers. I typed the desired IDs into a .txt file, and after running the third command the output is identical to the input, with nothing filtered out. I wonder whether something is wrong with how I organized the identifiers in the .txt file. Could you please provide an example of such a .txt file?

Sincerely,

Yeyan

Reply written 21 months ago by yeyan.qiu
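
The thread does not record what went wrong here. One known way to get this symptom is a blank line in the ID file: grep treats an empty pattern as matching every line, so the output ends up identical to the input. Assuming that (or stray whitespace or Windows line endings) is the cause, a small clean-up step along these lines may help. The function name `clean_ids` is made up for this sketch.

```python
from typing import Iterable, List


def clean_ids(lines: Iterable[str]) -> List[str]:
    """Return de-duplicated IDs with whitespace, blank lines and '>' removed."""
    seen = set()
    ids = []
    for raw in lines:
        name = raw.strip()            # strip() also removes a trailing '\r'
        if name.startswith(">"):      # tolerate IDs pasted with the '>' prefix
            name = name[1:].strip()
        if name and name not in seen:
            seen.add(name)
            ids.append(name)
    return ids


if __name__ == "__main__":
    raw_ids = ["id1\r\n", "\n", ">id2\n", "  id1 \n"]
    print("\n".join(clean_ids(raw_ids)))
```

Writing the cleaned list back out, one ID per line, gives a file in the same shape as the key1/key2/key3 example further down the thread.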
2.8 years ago, Joe (United Kingdom) wrote:

Here's the script I created to retrieve fasta records by key.

https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It assumes your list of keys is a simple newline-separated column of strings:

key1
key2
key3

Then just invoke:

python fastafetcher.py -f myseqs.fasta -k mykeys.txt

NB I've never tested this on very large files, so can't vouch too much for its speed or memory efficiency.


Thank you very much, I will try it out.

Reply written 2.8 years ago by KVC_bioinfo

Hey,

I am trying to use the Python code to extract a bunch of sequences from a fasta file. I get this error:

[yeserin@rackham3 outliers]$ python fastafetcher.py -f myseqs.fasta -k mykeys.txt
Traceback (most recent call last):
  File "fastafetcher.py", line 9, in <module>
    from Bio import SeqIO
ImportError: No module named Bio

Do I also need to use another module called Bio?

Reply written 14 days ago by yeserin

You need biopython installed. Use one of the other options above if you can't install that.

Reply written 14 days ago by genomax
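
A quick way to check for the missing dependency before running the script is sketched below. It assumes Python 3 (the traceback above looks like Python 2, where `importlib.util.find_spec` is not available), and the helper name `has_module` is made up for this example.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Biopython is imported under the package name "Bio"
    if has_module("Bio"):
        print("Biopython is available")
    else:
        print("Biopython is missing; `pip install biopython` is the usual way to install it")
```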

Ah, I found out that the code needs Biopython, not just plain Python. The problem is solved.

Reply written 14 days ago by yeserin