Embl File Corrections
10.8 years ago
CS ▴ 10

Dear All,

I am trying to convert a batch of old EMBL files to the newly agreed format.

The currently agreed format for the Pfam domains is:

/inference="protein motif:PFAM:PF03466", as an example. We can have multiple inferences per entry, but no repeated domains; i.e. if a domain occurs more than once, we should keep just one /inference="..." for that domain per entry.

/note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"

so this would become /Inference="protein motif:Pfam:PF09339"

and duplicates per entry should be removed.

I did the conversion to /inference with a Perl regex, but some of the original notes continued onto a second line that the script did not convert, so we are left with something like

FT “495-678”

There are several other leftover pieces of notes like this, and their content varies so much that I think it will be very difficult to pick them out with a regex. These leftovers prevent the EMBL file from being valid.

Any help would be appreciated. I am also attaching a small file below:

ID   Lsalivarius_cp400_4_358425-448092; SV 1; linear; unassigned DNA; STD; UNC; 89668 BP.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..89668
FT                   /note="scaffold4|size89668"
FT   CDS             complement(671..1594)
FT                   /note="*GO: aspect=; GOid=GO:; term=; evidence=IEA;
FT                   date=20121112"
FT                   /note="*GO: aspect=Component; GOid=GO:0016020;
FT                   term=membrane; evidence=IEA; date=20121112"
FT                   /note="*db_xref: 07-11-2012"
FT                   /note="*db_xref: Membrane insertion protein, OxaA/YidC"
FT                   /note="*domain: PANTHER:PTHR12428;IPR001708;6.4E-35;codon
FT                   35-247"
FT                   /note="*domain: PANTHER:PTHR12428:SF11;T;6.4E-35;codon
FT                   35-247"
FT                   /note="*domain: Pfam:PF02096;60Kd inner membrane
FT                   protein;9.8E-45;codon 57-247"
FT                   /note="*domain: PRINTS:PR00701;60kDa inner membrane protein
FT                   signature;8.6E-6;codon 131-154"
FT                   /note="*domain: PRINTS:PR00701;60kDa inner membrane protein
FT                   signature;8.6E-6;codon 214-237"
FT                   /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT                   membrane-bound protein predicted to be embedded in the
FT                   membrane.;-;codon 230-249"
FT                   /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT                   membrane-bound protein predicted to be embedded in the
FT                   membrane.;-;codon 208-224"
FT                   /note="*domain: Phobius:NON_CYTOPLASMIC_DOMAIN;Region of a
FT                   membrane-bound protein predicted to be outside the
FT                   membrane, in the extracellular region.;-;codon 157-175"
FT                   /note="*domain: Phobius:NON_CYTOPLASMIC_DOMAIN;Region of a
FT                   membrane-bound protein predicted to be outside the
FT                   membrane, in the extracellular region.;-;codon 27-49"
FT                   /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT                   membrane-bound protein predicted to be embedded in the
FT                   membrane.;-;codon 131-156"
FT                   /note="*domain: Phobius:CYTOPLASMIC_DOMAIN;Region of a
FT                   membrane-bound protein predicted to be outside the
FT                   membrane, in the cytoplasm.;-;codon 197-207"
FT                   /note="*domain: Phobius:SIGNAL_PEPTIDE_H_REGION;Hydrophobic
FT                   region of a signal peptide.;-;codon 8-20"
FT                   /note="*domain: Phobius:NON_CYTOPLASMIC_DOMAIN;Region of a
FT                   membrane-bound protein predicted to be outside the
FT                   membrane, in the extracellular region.;-;codon 225-229"
FT                   /note="*domain: Phobius:SIGNAL_PEPTIDE_C_REGION;C-terminal
FT                   region of a signal peptide.;-;codon 21-26"
FT                   /note="*domain: Phobius:SIGNAL_PEPTIDE;Signal peptide
FT                   region;-;codon 1-26"
FT                   /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT                   membrane-bound protein predicted to be embedded in the
FT                   membrane.;-;codon 50-74"
FT                   /note="*domain: Phobius:CYTOPLASMIC_DOMAIN;Region of a
FT                   membrane-bound protein predicted to be outside the
FT                   membrane, in the cytoplasm.;-;codon 250-307"
FT                   /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT                   membrane-bound protein predicted to be embedded in the
FT                   membrane.;-;codon 176-196"
FT                   /note="*domain: Phobius:CYTOPLASMIC_DOMAIN;Region of a
FT

I can send you the entire file.

Thanks CS
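The orphaned second lines described above come from qualifiers that wrap across several FT lines, as most of the notes in the sample do. One way to avoid them is to merge each wrapped qualifier into a single string before running any substitution. The following is only a minimal sketch in Python (not the Perl script mentioned in the post); the function name `join_wrapped_qualifiers` is hypothetical, and it assumes that a continuation line is an FT line with at least six blanks after "FT" whose text does not start with "/".

```python
import re

# A qualifier starts with "FT", blanks, then "/name".
QUALIFIER_START = re.compile(r"^FT\s+/\w+")
# Assumption: a continuation line is indented well past the feature-key
# column and its text does not begin with "/".
CONTINUATION = re.compile(r"^FT\s{6,}[^/\s]")


def join_wrapped_qualifiers(lines):
    """Merge wrapped FT qualifier lines into one string per qualifier.

    Lines that are not part of a qualifier (ID, XX, FH, feature keys,
    a bare "FT") are passed through unchanged.
    """
    merged = []
    in_qualifier = False
    for raw in lines:
        line = raw.rstrip("\n")
        if QUALIFIER_START.match(line):
            merged.append(line)
            in_qualifier = True
        elif in_qualifier and CONTINUATION.match(line):
            # Drop the "FT" prefix and the indentation, keep the text.
            merged[-1] += " " + line[2:].strip()
        else:
            merged.append(line)
            in_qualifier = False
    return merged
```

After this step every note sits on one line, so a substitution sees the whole qualifier at once. The sketch does not re-wrap long lines, so the output would still need to be wrapped again if the target format requires a fixed line width.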


Wrap the file in the code format; it would be easier to read. There is an icon with 101010 on it.


Hi Bharat, I finally found the way to display it clearly. Can you please help me from here?

Thanks CS


Be sure to specify exactly what you have now and what you want the change to be. You say "/inference" and "/Inference" above but I'm guessing it should be the former. Also, there is no "/domain" tag in the example you show, but there is a "note" tag specifying the domain (i.e., "/note="*domain:..."), is that what you want to change?


Hi SES,

Thanks for the heads up. I have: /note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"

So this would become /Inference="protein motif:Pfam:PF09339". I only want Pfam domains to be part of the final files.
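The conversion described here could be sketched roughly as follows, working on the qualifiers of one feature at a time with each qualifier already on a single line. This is an illustration in Python, not a tested converter: the function name `convert_feature_qualifiers` is hypothetical, the lowercase `/inference` spelling is assumed, and dropping the non-Pfam `*domain` notes is only one reading of "I only want Pfam domains".

```python
import re

# Captures the Pfam accession from either note style seen in the thread:
#   /note="*domain: Pfam:PF02096;60Kd inner membrane protein;...;codon 57-247"
#   /note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"
PFAM_NOTE = re.compile(
    r'/note="?\*domain:?\s*"?(?:HMM)?Pfam:(PF\d+)', re.IGNORECASE
)
# Any "*domain" note, Pfam or not.
DOMAIN_NOTE = re.compile(r'/note="?\*domain')


def convert_feature_qualifiers(qualifiers, indent="FT" + " " * 19):
    """Rewrite the qualifiers of ONE feature.

    Pfam domain notes become /inference qualifiers, one per accession;
    other *domain notes are dropped (assumption); everything else is kept.
    """
    seen = set()
    converted = []
    for qualifier in qualifiers:
        match = PFAM_NOTE.search(qualifier)
        if match:
            accession = match.group(1).upper()
            if accession not in seen:  # skip repeated domains
                seen.add(accession)
                converted.append(
                    f'{indent}/inference="protein motif:Pfam:{accession}"'
                )
        elif DOMAIN_NOTE.search(qualifier):
            continue  # non-Pfam domain note, dropped by assumption
        else:
            converted.append(qualifier)
    return converted
```

Because the duplicate check uses a set that is created per call, the same accession can still appear in different features, which matches the "one per entry" rule only if "entry" means a single feature; that interpretation would need confirming.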

Thanks


Have you tried contacting ENA about this directly (via datasubs@ebi.ac.uk)? (To be honest, I think removing the ability to store positions of the matches and the number of matches seems a bit strange/silly!)


Thanks Sarah,

I have written to ENA about this issue. Let's see what I get from them. CS
