Question: How do I extract Fasta Sequences based on a list of IDs?
a.rex wrote, 8 weeks ago:

I have a fasta sequence file that looks like this:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ 
 .........

I have another file that has a subset of headers for this file:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
.......

How can I use the header file to extract only the FASTA sequences that match these headers?

Thank you in advance!

sequence grep • 174 views
genomax (United States) wrote, 8 weeks ago:

This is one of the FAQs on Biostars. You should search for additional threads, but here are some to get you started:

Extracting specific IDs + sequence from multifasta
extract sequences based on ids file
Extracting specific sequences from a big fasta file using ids of the sequences to be excluded

My personal favorite is a program from Jim Kent/UCSC:

faSomeRecords

Download the Linux version linked above (a macOS build is available elsewhere on that site), then add execute permission with chmod a+x faSomeRecords.

 ./faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

NOTE: Remove the > from each line of your header list file before using it as listFile.
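
The header file in the question keeps the leading > on every line, so it needs a small preprocessing step before it can serve as listFile. A minimal sketch, assuming the header list is saved as headers.txt (a hypothetical file name) and that sed is available; the helper name strip_gt is also made up for illustration:

```shell
# Strip the leading ">" from every line of a header list.
# Usage: strip_gt headers.txt > listFile
strip_gt() {
    sed 's/^>//' "$1"
}

# Then run faSomeRecords as in the usage text above (file names are placeholders):
#   strip_gt headers.txt > listFile
#   ./faSomeRecords in.fa listFile out.fa
```

The thread does not say whether faSomeRecords compares the full header line or only its first word, so it is worth checking out.fa against a few known IDs.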

Rashedul Islam (Canada) wrote, 8 weeks ago:

You can use this:

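while read -r line; do grep -F -A 5 "$line" fastaFile; done < listFile

-A 5 tells grep to print the 5 lines that follow each match; -F treats each header as a fixed string rather than a regular expression.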

jrj.healey commented, 8 weeks ago:

This will only work if every sequence is exactly 5 lines long, which is not a safe assumption in general.
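
A length-independent alternative to grep -A can be sketched with awk: load the header list first, then switch printing on or off each time a new > line is reached. This is a sketch rather than something proposed in the thread. The function name extract_by_header is hypothetical, and it assumes the list holds full header lines (including the >) exactly as they appear in the FASTA file, as in the question.

```shell
# Print every FASTA record whose full header line appears in the list file.
# Usage: extract_by_header listFile fastaFile > subset.fa
extract_by_header() {
    awk '
        NR == FNR { wanted[$0] = 1; next }   # first file: remember each header line
        /^>/      { keep = ($0 in wanted) }  # header line: decide whether to keep this record
        keep                                 # print header and sequence lines of kept records
    ' "$1" "$2"
}
```

The match is exact, so trailing whitespace or Windows line endings in either file will cause records to be missed. The NR == FNR idiom also assumes the list file is not empty.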