Question: Extract reads from a FASTA file (specific read names) and make a new FASTA file
KVC_bioinfo wrote:

Hello all,

I have a huge FASTA file and need to extract a few reads from it. I have the names of the reads I want, and I would like to write those reads to a new FASTA file. I am not sure how to proceed. Could someone help me here?

Thank you in advance.

fasta • 1.3k views
written 17 months ago by KVC_bioinfo • modified 17 months ago by Joseph Hughes
genomax wrote:

Use faSomeRecords from Jim Kent's utilities at UCSC.
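
As a rough sketch of how faSomeRecords is usually invoked: it takes the input FASTA file, a file of record names (one per line), and the output path. The demo file names below are made up for illustration, and the guard around the call is only there so the snippet does nothing when the tool is not installed.

```shell
# Hypothetical demo input: three short records and a list of two names to keep
printf '>read1\nACGT\n>read2\nGGCC\n>read3\nTTAA\n' > demo_reads.fa
printf 'read1\nread3\n' > demo_names.txt

# Usage: faSomeRecords in.fa listFile out.fa
if command -v faSomeRecords >/dev/null 2>&1; then
    faSomeRecords demo_reads.fa demo_names.txt demo_subset.fa
else
    echo "faSomeRecords not found on PATH; install the UCSC utilities first" >&2
fi
```

If the tool is available, demo_subset.fa should then contain only the records named in demo_names.txt.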

This is also one of the FAQs on Biostars, which you should have found by searching:
Extract A Group Of Fasta Sequences From A File
How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File

written 17 months ago by genomax • modified 17 months ago
Joseph Hughes wrote:

First make sure each FASTA sequence in sample1.fa is on a single line rather than wrapped across multiple lines:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa
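
To see what the awk command above does, here is a small sketch that runs it on a made-up wrapped file. The file names demo_wrapped.fa and demo_singleline.fa are hypothetical; the awk program itself is unchanged.

```shell
# Hypothetical demo input: two records whose sequences are wrapped over two lines each
printf '>id1 sample\nACGT\nTTGA\n>id2\nGGCC\nAA\n' > demo_wrapped.fa

# Sequence lines are printed without a newline, so they are joined together;
# each header line first closes the previous sequence (n holds the pending newline).
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' demo_wrapped.fa > demo_singleline.fa

cat demo_singleline.fa
```

Each record in demo_singleline.fa now occupies exactly two lines: the header and the full sequence.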

Then grep for the identifier/name of the sequence you want, using -A 1 to also print the line that follows it (the sequence):

grep -A 1 "idX" sample1_singleline.fa > filtered.fa

If you have multiple identifiers, put them one per line in a file id.txt:

grep -A 1 --no-group-separator -f id.txt sample1_singleline.fa > multiple_filtered.fa
written 17 months ago by Joseph Hughes
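
Note that grep matches substrings, so an identifier such as id1 would also select a record named id10. As an alternative sketch (not part of the answer above), the awk program below matches the whole first word of each header against the list and also works on wrapped sequences, so no single-line conversion is needed. The demo file names are hypothetical, and the ID list is assumed to be non-empty with one ID per line.

```shell
# Hypothetical demo input: id1 and id10 share a prefix; id1 is wrapped over two lines
printf '>id1 first\nACGT\nTTGA\n>id10 second\nGGCC\n>id2 third\nAATT\n' > demo_sample.fa
printf 'id1\nid2\n' > demo_id.txt

# First file (the ID list): remember each ID, ignoring blank lines and trailing CRs.
# Second file (the FASTA): on each header, decide whether to keep the record,
# then print every line while keep is true.
awk 'NR == FNR { sub(/\r$/, ""); if (NF) wanted[$1] = 1; next }
     /^>/      { name = substr($1, 2); keep = (name in wanted) }
     keep' demo_id.txt demo_sample.fa > demo_filtered.fa

cat demo_filtered.fa
```

Here only the id1 and id2 records are written to demo_filtered.fa; id10 is skipped because the comparison is on the full name rather than a substring.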

Hi Joseph,

I've tried your commands. They worked very well with a single identifier/name, but there is a problem with extracting multiple identifiers. I typed the desired IDs into a .txt file, and after running the third command the result was identical to the input, with nothing filtered out. I wonder whether something is wrong with how I organized the identifiers in the .txt file. Could you please provide an example of such a .txt file?

Sincerely,

Yeyan

written 4 months ago by yeyan.qiu
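
One possible explanation, offered as an assumption since the id.txt in question is not shown: with grep -f, a blank line in the pattern file acts as an empty pattern, which matches every line, so the whole input comes back unfiltered. Windows line endings can cause the opposite problem, where no ID matches at all. The sketch below cleans a made-up list so that it holds exactly one ID per line; the file names are hypothetical.

```shell
# Hypothetical ID list with Windows line endings and a stray blank line
printf 'id1\r\n\r\nid2\r\n' > demo_id_raw.txt

# Strip carriage returns, then drop empty or whitespace-only lines
tr -d '\r' < demo_id_raw.txt | grep -v '^[[:space:]]*$' > demo_id_clean.txt

cat demo_id_clean.txt
```

The cleaned file can then be passed to grep -f in place of the original list.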
jrj.healey wrote:

Here's the script I created to retrieve FASTA records by key.

https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It assumes your list of keys is a plain text file with one key per line:

key1
key2
key3

Then just invoke:

python fastafetcher.py -f myseqs.fasta -k mykeys.txt

NB: I've never tested this on very large files, so I can't vouch for its speed or memory efficiency.

written 17 months ago by jrj.healey • modified 17 months ago

Thank you very much, I will try it out.

written 17 months ago by KVC_bioinfo