Question: How do I extract FASTA sequences based on a list of IDs?
a.rex170 wrote:

I have a fasta sequence file that looks like this:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ 
 .........

I have another file that has a subset of headers for this file:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
.......

How can I use the header file to extract only the FASTA sequences for these headers?

Thank you in advance!

genomax58k wrote:

This is one of the frequently asked questions on Biostars. You should search for additional threads, but here are some to get you started:

Extracting specific IDs + sequence from multifasta
extract sequences based on ids file
Extracting specific sequences from a big fasta file using ids of the sequences to be excluded

My personal favorite is a program from Jim Kent/UCSC:

faSomeRecords

Download the Linux version from the link above (a macOS version is available elsewhere on that site), then add execute permissions with chmod a+x faSomeRecords.

 ./faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

NOTE: You should remove > from your header list file when using it as an input in place of listFile.
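
As a rough sketch of the note above, the leading > could be stripped with sed before the list is handed to faSomeRecords. The file names demo_headers.txt and demo_list.txt are hypothetical stand-ins, and the faSomeRecords call is left as a comment because it assumes the binary sits in the current directory.

```shell
# Toy header list standing in for the asker's header file (hypothetical name).
cat > demo_headers.txt <<'EOF'
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
EOF

# Drop a leading ">" from every line; lines without one pass through unchanged.
sed 's/^>//' demo_headers.txt > demo_list.txt

# Show the cleaned list.
cat demo_list.txt

# The cleaned list would then be passed as listFile, per the usage text above:
#   ./faSomeRecords in.fa demo_list.txt out.fa
```

The thread does not say whether the tool compares the whole header line or only the ID before the first space, so it is worth checking that the number of records in out.fa matches the number of lines in the list.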

Rashedul Islam290 wrote:

You can use this:

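while IFS= read -r line; do grep -F -A 5 "$line" fastaFile; done < listFile

-A 5 tells grep to print the 5 lines after each match, and -F makes it treat each header as a literal string rather than a regular expression.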


This will only work if every sequence is exactly 5 lines long, which is not a safe assumption in general: longer records are truncated and shorter ones pull in lines from the next record.

Reply written 3 months ago by jrj.healey
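
One way around the fixed line count, sketched here with awk rather than grep (a different technique from the one proposed in the answer), is to switch printing on at each wanted header and off at the next unwanted one, so a record can span any number of lines. The demo file names and toy records are made up for illustration, and the list is assumed to contain full header lines including the > exactly as they appear in the FASTA file.

```shell
# Toy FASTA file with records of different lengths (hypothetical data).
cat > demo.fa <<'EOF'
>seq1 first record
ACGT
ACGT
ACGT
>seq2 second record
TTTT
>seq3 third record
GGGG
CCCC
EOF

# Header list: full header lines, ">" included, as in the question.
cat > demo_wanted.txt <<'EOF'
>seq1 first record
>seq3 third record
EOF

awk '
    NR == FNR { keep[$0] = 1; next }      # first file: remember each wanted header
    /^>/      { printing = ($0 in keep) } # at each header, decide whether to print
    printing                              # print header and sequence lines while on
' demo_wanted.txt demo.fa > demo_subset.fa

cat demo_subset.fa
```

Matching is exact, so trailing whitespace or Windows line endings in either file would prevent a header from matching, and the NR == FNR idiom assumes the list file is not empty.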