Question: Extract reads from fasta file (specific read_names) and make a new fasta file
2.2 years ago by
KVC_bioinfo410
Boston
KVC_bioinfo410 wrote:

Hello all,

I have a huge fasta file and need to extract a few reads from it. I have the names of the reads I want, and I would like to write those reads to a new fasta file. I am not sure how to proceed. Could someone help me here?

Thank you in advance.

fasta • 2.1k views
modified 2.2 years ago by Joseph Hughes2.8k • written 2.2 years ago by KVC_bioinfo410
2.2 years ago by
genomax78k
United States
genomax78k wrote:

Use faSomeRecords from Jim Kent's utilities at UCSC.

This is also one of the FAQs on Biostars, which you should have found by searching:
Extract A Group Of Fasta Sequences From A File
How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File

modified 2.2 years ago • written 2.2 years ago by genomax78k
2.2 years ago by
Joseph Hughes2.8k
Scotland, UK
Joseph Hughes2.8k wrote:

First make sure each fasta sequence in sample1.fa is on a single line rather than wrapped across multiple lines:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa

Then simply grep for the identifier/name of the sequence you want, using -A 1 to also print the line that follows it (the sequence):

grep -A 1 "idX" sample1_singleline.fa > filtered.fa

If you have multiple identifiers in a file id.txt (one per line):

grep -A 1 --no-group-separator -f id.txt sample1_singleline.fa > multiple_filtered.fa
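
For readers who prefer Python, the two steps above (join wrapped sequence lines, then keep the selected records) can be sketched roughly as follows. This is a minimal illustration and not part of the original answer; the function names `linearize` and `select` are hypothetical. It assumes the ID is the first word after `>` and compares it as a whole, so asking for `id1` does not also return `id10`, which a plain substring match would do.

```python
def linearize(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select(records, wanted):
    """Keep records whose ID (first word after '>') is in `wanted`."""
    for header, seq in records:
        fields = header[1:].split()
        if fields and fields[0] in wanted:
            yield header, seq


# Small in-memory example; with a real file, pass the open file object instead.
example = [">id1 sample", "ACGT", "TTGA", ">id10", "GGGG", ">id2", "CC", "AA"]
for header, seq in select(linearize(example), {"id1", "id2"}):
    print(header)
    print(seq)
```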
written 2.2 years ago by Joseph Hughes2.8k

Hi Joseph,

I've tried your commands. They work very well with a single identifier/name, but I have a problem when extracting multiple identifiers. I typed the desired IDs into a .txt file, and after running the third command the output is identical to the input, with nothing filtered out. Could something be wrong with how I organized the query identifiers in the .txt file? Could you please provide an example of such a .txt file?

Sincerely,

Yeyan

written 14 months ago by yeyan.qiu0
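
One possible cause, offered as a guess since the id.txt in question is not shown: with `grep -f`, an empty line in the pattern file acts as an empty pattern, which matches every line, so a blank line in id.txt (often a trailing one) would return the whole input. The file should hold one identifier per line and nothing else. A minimal Python sketch for tidying such a list before use follows; `clean_ids` is a hypothetical helper name, not something from the answers above.

```python
def clean_ids(text):
    """Return one clean ID per non-empty line of `text`."""
    ids = []
    for line in text.splitlines():
        name = line.strip()          # drops spaces and a Windows '\r'
        if name.startswith(">"):
            name = name[1:].strip()  # tolerate IDs pasted with the FASTA '>'
        if name:                     # skip blank lines
            ids.append(name)
    return ids


# Example input with a Windows line ending, blank lines and a stray '>'.
raw = "id1\r\n\n>id2 \nid3\n\n"
print("\n".join(clean_ids(raw)))
```

Writing the cleaned list back to id.txt gives a file with exactly one identifier per line and no blank lines.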
2.2 years ago by
Joe16k
United Kingdom
Joe16k wrote:

Here's a script I wrote to retrieve fasta records by key.

https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It assumes your list of keys is a simple newline-separated column of strings:

key1
key2
key3

Then just invoke:

python fastafetcher.py -f myseqs.fasta -k mykeys.txt

NB I've never tested this on very large files, so can't vouch too much for its speed or memory efficiency.
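
For very large files, the lookup can be done in a single pass that holds only the key set in memory. The sketch below is not the linked fastafetcher.py script but a generic illustration of that idea; `fetch_by_keys` and its arguments are hypothetical names, and a key is assumed to equal the first word of a header line.

```python
import io
import sys


def fetch_by_keys(fasta_lines, keys, out):
    """Copy records whose ID is in `keys` from an iterable of lines to `out`.

    Lines are passed through unchanged, so wrapped sequences stay wrapped.
    Returns the set of keys that were actually found.
    """
    keys = set(keys)
    found = set()
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep = record_id in keys      # decide once per record
            if keep:
                found.add(record_id)
        if keep:
            out.write(line)
    return found


# In-memory demo; for real data, pass open file objects for input and output.
demo = io.StringIO(">key1 desc\nACGT\nAC\n>other\nTTTT\n>key3\nGG\n")
wanted = ["key1", "key3", "key9"]
found = fetch_by_keys(demo, wanted, sys.stdout)
missing = set(wanted) - found             # keys with no matching record
```

Returning the set of found keys makes it easy to report any names in the key list that had no matching record.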

modified 2.2 years ago • written 2.2 years ago by Joe16k

Thank you very much, I will try it out.

written 2.2 years ago by KVC_bioinfo410