Question: Sequence extraction from PFAM db with accession ID
1
gravatar for venu
4.3 years ago by
venu6.2k
Germany
venu6.2k wrote:

Hello, 

I have downloaded the local pfam data base from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release . There are two databases 1) Pfam-A.fasta.gz  2) Pfam-A.full.gz . The first one consist of sequences in fasta format with PFAM accession id in its header, but it does not contain all the sequences of a family. The second one consist of sequences in stockholm format. I have converted them to fasta, these contain all the sequences related to a family but the sequence header does not contain the PFAM accession id. I've more than 1500 pfam id's, I want to extract all the sequences that fall under a family (or accession id). Every stockholm alignment in (2) is having the pfam id at the top as, for example "#=GF AC PF00406" . How can I get over this..any help will be greatly appreciated.

tool proteins pfam bioinformatics • 2.9k views
ADD COMMENTlink modified 4.3 years ago by geek_y9.7k • written 4.3 years ago by venu6.2k
0
gravatar for 5heikki
4.3 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

Pretty sure esl-afetch would work here (ships with HMMER). Or perhaps esl-sfetch. Or maybe if you convert with esl-reformat you will not lose info.. so many options.

# esl-afetch :: retrieve multiple sequence alignment(s) from a file
# Easel h3.1b1 (May 2013)
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the Janelia Farm Software License.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: esl-afetch [options] <msafile> <name>         (retrieves one alignment named <name>)
Usage: esl-afetch [options] -f <msafile> <namefile>  (retrieves all alignments named in <namefile>)
Usage: esl-afetch [options] --index <msafile>        (indexes <msafile>)
 where options are:
  -h              : help; show brief info on version and usage
  -f              : second cmdline arg is a file of names to retrieve
  -o <f>          : output alignments to file <f> instead of stdout
  -O              : output alignment to file named <key>
  --informat <s>  : specify that <msafile> is in format <s>
  --outformat <s> : output fetched alignment(s) in format <s>  [Stockholm]
  --index         : index the <msafile>, creating <msafile>.ssi

 

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by 5heikki8.4k

I tried it already, when I used it with the id (for ex: PF00013), it is saying Every alignment in file must have a name to be retrievable. Failed to find name of alignment #1 . Hope here name means --> "#=GF ID name" in stockholm format. But I only have ID's I dont have the names, how can I find all the names in 30GB file. 

 

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by venu6.2k

Are the seqs in Pfam-A.full aligned or not? Maybe you should have used esl-sfetch. What happens when you just e.g. grep "PF00406" Pfam-A.full

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by 5heikki8.4k

Pfam-A.full contains alignments in stockholm format. I guess esl-sfetch will not work on stockholm format. I can retrieve the accession names with ids but the problem here is, after retrieving with esl-afetch the file name of a family of alignments will be the accession name, that kicks me into more complex situation, If I get ids as file names, then further analysis will become easier. Even more when I used esl-afetch with multiple acc names in a file it is showing Error in configuration: Option -O is incompatible with option(s) -o,-f,--index . How to overcome this.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by venu6.2k
How about looping through your accession list with a simple

while read -r accessionNumber; do something with $accessionNumber; done <accessionList
ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by 5heikki8.4k

of-course all the process is through looping, but when I extract HMM profiles from database it asks for pfam id not the name..I believe there is some way to rename those names to acc. ids.

ADD REPLYlink written 4.3 years ago by venu6.2k
1
while read -r accessionNumber pfamID; do esl-afetch -o $accessionNumber Pfam-A.full $pfamID; done <accessionNumber-pfamID.map

and accessionNumber-pfamID.map has to be in format:

accessionNumber<TAB>pfamId

 

Also, just for clarification, I didn't really understand your ID conundrum. You have more than 1,500 pfam ids and you want to extract all seqs that match those but you're not happy with pfam ids.. confusing..

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by 5heikki8.4k
0
gravatar for geek_y
4.3 years ago by
geek_y9.7k
Barcelona/CRG/London/Imperial
geek_y9.7k wrote:

Use BioPython AlignIO. loop over each record. Each record will have a .id attribute. 

ADD COMMENTlink written 4.3 years ago by geek_y9.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1123 users visited in the last hour