I have downloaded the local Pfam database from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release. There are two files: 1) Pfam-A.fasta.gz and 2) Pfam-A.full.gz. The first consists of sequences in FASTA format with the Pfam accession ID in each header, but it does not contain all the sequences of a family. The second consists of alignments in Stockholm format. I have converted these to FASTA; the result contains all the sequences of each family, but the sequence headers do not contain the Pfam accession ID. I have more than 1500 Pfam IDs and want to extract all the sequences that belong to each family (accession ID). Every Stockholm alignment in (2) has the Pfam ID at the top, for example "#=GF AC PF00406". How can I do this? Any help will be greatly appreciated.
Pretty sure esl-afetch would work here (it ships with HMMER). Or perhaps esl-sfetch. Or maybe, if you convert with esl-reformat, you will not lose the info. So many options.
# esl-afetch :: retrieve multiple sequence alignment(s) from a file
# Easel h3.1b1 (May 2013)
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the Janelia Farm Software License.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: esl-afetch [options] <msafile> <name> (retrieves one alignment named <name>)
Usage: esl-afetch [options] -f <msafile> <namefile> (retrieves all alignments named in <namefile>)
Usage: esl-afetch [options] --index <msafile> (indexes <msafile>)
where options are:
-h : help; show brief info on version and usage
-f : second cmdline arg is a file of names to retrieve
-o <f> : output alignments to file <f> instead of stdout
-O : output alignment to file named <key>
--informat <s> : specify that <msafile> is in format <s>
--outformat <s> : output fetched alignment(s) in format <s> [Stockholm]
--index : index the <msafile>, creating <msafile>.ssi
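If you would rather not depend on Easel, here is a minimal Python sketch of the same idea: stream through the Stockholm file, read the "#=GF AC" line of each alignment, and write ungapped FASTA only for the families you asked for, with the accession added to each header. The function names (open_text, extract_families) are made up for illustration. It assumes one line per sequence or interleaved blocks, that the "#=GF AC" line comes before the sequences, that "." and "-" are the only gap characters, and that any version suffix on the accession (e.g. "PF00406.N") can be ignored when matching.

```python
import gzip
import re


def open_text(path):
    """Open a plain or gzip-compressed file as text."""
    # latin-1 decodes any byte sequence, so stray non-ASCII bytes in
    # annotation lines do not abort the run (an assumption about the input).
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")


def extract_families(stockholm_path, wanted, fasta_out):
    """Write ungapped FASTA for alignments whose #=GF AC is in `wanted`.

    Returns a dict mapping accession -> number of sequences written.
    """
    # Compare accessions without their version suffix.
    wanted = {w.strip().split(".")[0] for w in wanted if w.strip()}
    counts = {}
    acc = None
    seqs = {}  # sequence name -> list of aligned chunks, in file order
    with open_text(stockholm_path) as fin, open(fasta_out, "w") as out:
        for line in fin:
            line = line.rstrip("\n")
            if line.startswith("#=GF AC"):
                acc = line.split()[2].split(".")[0]
            elif line.startswith("//"):
                # End of one alignment: flush it if it was requested.
                if acc in wanted:
                    for name, chunks in seqs.items():
                        residues = re.sub(r"[.\-]", "", "".join(chunks))
                        out.write(f">{name} {acc}\n{residues}\n")
                    counts[acc] = len(seqs)
                acc, seqs = None, {}
            elif acc in wanted and line and not line.startswith("#"):
                # Sequence line: "<name> <aligned residues>".
                name, aligned = line.split(None, 1)
                seqs.setdefault(name, []).append(aligned.strip())
    return counts
```

With the 1500 IDs in a text file, one per line (say a hypothetical pfam_ids.txt), something like extract_families("Pfam-A.full.gz", open("pfam_ids.txt"), "selected.fasta") should do it. Only the requested families are held in memory, one at a time, so the rest of the file is skipped cheaply.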