Question: How do I extract Fasta Sequences based on a list of IDs?
0
gravatar for a.rex
2.1 years ago by
a.rex230
a.rex230 wrote:

I have a fasta sequence file that looks like this:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ 
 .........

I have another file that has a subset of headers for this file:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
.......

How can I use the header file to extract out my fasta sequences only for these headers?

Thank you in advance!

sequence grep • 2.5k views
ADD COMMENTlink modified 2.1 years ago by Rashedul Islam370 • written 2.1 years ago by a.rex230
2
gravatar for genomax
2.1 years ago by
genomax89k
United States
genomax89k wrote:

This is one of FAQ's on biostars. You should search for additional threads but here are some to get you started.

Extracting specific IDs + sequence from multifasta
extract sequences based on ids file
Extracting specific sequences from a big fasta file using ids of the sequences to be excluded

My personal favorite is a program from Jim Kent/UCSC:

faSomeRecords

Download the linux version linked (macOS available elsewhere on that site). Add execute permissions chmod a+x faSomeRecords.

 ./faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

NOTE: You should remove > from your header list file when using it as an input in place of listFile.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by genomax89k
1
gravatar for Rashedul Islam
2.1 years ago by
Canada
Rashedul Islam370 wrote:

You can use this:

while read line; do grep -A 5 "$line" fastaFile; done <listFile

-A in grep means lines-after the match.

ADD COMMENTlink written 2.1 years ago by Rashedul Islam370
1

This will only work if the sequences are all 5 lines long which is not a generally safe assumption.

ADD REPLYlink written 2.1 years ago by Joe18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1641 users visited in the last hour