Sequence extraction from PFAM db with accession ID
2
1
Entering edit mode
9.1 years ago
venu 7.1k

Hello,

I have downloaded the local pfam data base from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release. There are two databases 1) Pfam-A.fasta.gz 2) Pfam-A.full.gz . The first one consist of sequences in fasta format with PFAM accession id in its header, but it does not contain all the sequences of a family. The second one consist of sequences in stockholm format. I have converted them to fasta, these contain all the sequences related to a family but the sequence header does not contain the PFAM accession id. I've more than 1500 pfam id's, I want to extract all the sequences that fall under a family (or accession id). Every stockholm alignment in (2) is having the pfam id at the top as, for example "#=GF AC PF00406". How can I get over this..any help will be greatly appreciated.

pfam proteins • 6.3k views
ADD COMMENT
0
Entering edit mode
9.1 years ago
5heikki 11k

Pretty sure esl-afetch would work here (ships with HMMER). Or perhaps esl-sfetch. Or maybe if you convert with esl-reformat you will not lose info.. so many options.

# esl-afetch :: retrieve multiple sequence alignment(s) from a file
# Easel h3.1b1 (May 2013)
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the Janelia Farm Software License.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: esl-afetch [options] <msafile> <name>         (retrieves one alignment named <name>)
Usage: esl-afetch [options] -f <msafile> <namefile>  (retrieves all alignments named in <namefile>)
Usage: esl-afetch [options] --index <msafile>        (indexes <msafile>)
 where options are:
  -h              : help; show brief info on version and usage
  -f              : second cmdline arg is a file of names to retrieve
  -o <f>          : output alignments to file <f> instead of stdout
  -O              : output alignment to file named <key>
  --informat <s>  : specify that <msafile> is in format <s>
  --outformat <s> : output fetched alignment(s) in format <s>  [Stockholm]
  --index         : index the <msafile>, creating <msafile>.ssi
ADD COMMENT
0
Entering edit mode

I tried it already, when I used it with the id (for ex: PF00013), it is saying Every alignment in file must have a name to be retrievable. Failed to find name of alignment #1. Hope here name means --> "#=GF ID name" in stockholm format. But I only have ID's I don't have the names, how can I find all the names in 30GB file.

ADD REPLY
0
Entering edit mode

Are the seqs in Pfam-A.full aligned or not? Maybe you should have used esl-sfetch. What happens when you just e.g. grep "PF00406" Pfam-A.full

ADD REPLY
0
Entering edit mode

Pfam-A.full contains alignments in stockholm format. I guess esl-sfetch will not work on stockholm format. I can retrieve the accession names with ids but the problem here is, after retrieving with esl-afetch the file name of a family of alignments will be the accession name, that kicks me into more complex situation, If I get ids as file names, then further analysis will become easier. Even more when I used esl-afetch with multiple acc names in a file it is showing Error in configuration: Option -O is incompatible with option(s) -o,-f,--index. How to overcome this?

ADD REPLY
0
Entering edit mode

How about looping through your accession list with a simple

while read -r accessionNumber; do something with $accessionNumber; done <accessionList
ADD REPLY
0
Entering edit mode

of-course all the process is through looping, but when I extract HMM profiles from database it asks for pfam id not the name..I believe there is some way to rename those names to acc. ids.

ADD REPLY
1
Entering edit mode
while read -r accessionNumber pfamID; do esl-afetch -o $accessionNumber Pfam-A.full $pfamID; done <accessionNumber-pfamID.map

and accessionNumber-pfamID.map has to be in format:

accessionNumber<TAB>pfamId

Also, just for clarification, I didn't really understand your ID conundrum. You have more than 1,500 pfam ids and you want to extract all seqs that match those but you're not happy with pfam ids.. confusing..

ADD REPLY
0
Entering edit mode
9.1 years ago

Use BioPython AlignIO. loop over each record. Each record will have a .id attribute.

ADD COMMENT

Login before adding your answer.

Traffic: 3123 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6