Question: Parse A Embl Like Format To Fasta Format
1
gravatar for Empyrean
8.4 years ago by
Empyrean160
Empyrean160 wrote:

Helo all, I wanted to parse aEMBL format like file to fasta. i cannot use bioperl because this is not complete EMBL format. so please suggest me how to get this done..

ID  US74811111-0005    
OO  giensis    
OS  giensis    
SN  US74811111    
PT  I-003, a gene and methods for its use
PA  NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI  Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR  20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE  US200304985AN  20070765494
P1  Compositions and methods and seeds are provided. 
    MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//

The output should be in fasta format which consists of lines starting with ID, PT, PA and Sequence. "//" the two slashes are dividing lines between two EMBL genes.

>US74811111-0005 ;  I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE

Like this i have 50,000 sequences in a single file which should be converted to fasta format

perl format conversion awk • 3.9k views
ADD COMMENTlink modified 8.4 years ago by Alex Richter210 • written 8.4 years ago by Empyrean160
5
gravatar for Pierre Lindenbaum
8.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

the following awk script should do the job:

/^ID/   {printf(">%s;",$0); next;}
/^(PT|PA)/  {printf(" %s;",$0); next;}
/^\/\// {printf("\n"); next;}
/^    / {printf("\n%s",substr($0,5)); next;}
    {
    /* ignore default */
    }
END   {
    printf("\n");
    }

> awk -f file.awk file.txt
>ID  US74811111-0005    ; PT  I-003, a gene and methods for its use; PA  NIX CORPORATION RESEARCH TRIANGLE PARK, NC;
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Pierre Lindenbaum123k

this is not printing the sequence at the end.. thanks for the reply

ADD REPLYlink written 8.4 years ago by Empyrean160

I've added a 'END' close in the script.

ADD REPLYlink written 8.4 years ago by Pierre Lindenbaum123k

I've added a 'END' statement in the script.

ADD REPLYlink written 8.4 years ago by Pierre Lindenbaum123k

works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example..

"LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"

ADD REPLYlink written 8.4 years ago by Empyrean160

works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example.. " LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"

ADD REPLYlink written 8.4 years ago by Empyrean160

Can you use this script to process multiple input files and output to multiple files such as awk -f file.awk *.txt >> *.fasta

ADD REPLYlink written 7.3 years ago by biomed.lu20
2
gravatar for Steve Moss
8.4 years ago by
Steve Moss2.3k
United Kingdom
Steve Moss2.3k wrote:

This Python script should work? Takes input file as an argument e.g. embllike2fasta.py infile.txt:

import sys
dict = {}
infile = open(sys.argv[1], "r")
outfile = open("outfile.fas", "w")
while 1:
    line = infile.readline()
    if not line:
        break
    parts = line.split(None, 1)
    if len(parts) == 1 and parts[0] == "//":
        outfile.write(">" + dict["ID"] + " ; " + dict["PT"] + " ; " + dict["PA"] + "\n")
        outfile.write(dict["seq"] + "\n")
    elif len(parts) == 1:
        dict["seq"] = parts[0].strip()
    else:
        dict[parts[0]] = parts[1].strip()
outfile.close()
infile.close()

I haven't tested it, and just hacked it up in the browser on my iPad, so it might not work? Plus it's late :-S lol!

Update: Tested it on my machine this morning, made a few quick edits! Very hacky, but it works :D

ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Steve Moss2.3k
1
gravatar for Alex Richter
8.4 years ago by
Alex Richter210
San Diego
Alex Richter210 wrote:

How about this? It's remarkably hacky, but should do the trick.

#!/usr/bin/env perl
use strict;
use warnings;
$/ = "//\n";
while (<>) {
    chomp;
    my %record;
    foreach my $line (split /\n/) {
        my ($key, $val) = unpack("A4A*", $line);
        $record{$key} .= $val;
    }
    print ">$record{ID} $record{PT} $record{PA}\n$record{''}\n";
}

The important bits are lines 4:($/ = "//n"), which sets the system to read one record at a time instead of a line, and 9:(unpack("A4A*", $line)), which splits the string into the EMBL key and the value.

ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Alex Richter210

I am getting this error when i execute this

Can't modify single ref constructor in scalar assignment at embl2fasta.pl line 4, near ""//n";"

ADD REPLYlink written 8.4 years ago by Empyrean160

I am getting this error when i execute this "Can't modify single ref constructor in scalar assignment at eml2fasa.pl line 4, near ""//n";" "

ADD REPLYlink written 8.4 years ago by Empyrean160

Hmm. That error normally only occurs when you attempt to assign a value to a reference (e.g. $/ = "//n"; ) If you can show me lines 3-5, I might have a better idea.

ADD REPLYlink written 8.4 years ago by Alex Richter210
0
gravatar for Ketil
8.4 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

Or just:

egrep -e '^(ID|  )' | sed -e 's/^ID  />/g' -e 's/^    //g'

Oh, I see you wanted the contents of PT there too. I leave it as an excercise for the reader! :-)

ADD COMMENTlink written 8.4 years ago by Ketil4.0k

i dont know how to use sed :(

ADD REPLYlink written 8.4 years ago by Empyrean160

i ran it but that was not even close to what i need, its just adding ">" symbol in place of ID and rest of input file stays as is

ADD REPLYlink written 8.4 years ago by Empyrean160

Strange. The egrep in front should pick out only ID and sequence lines from the file, so I don't see how it could retain the rest of the file. The sed stuff replaces ID, but it should also remove initial spaces on the sequence data. But it's just a quick hack anyway, you're probably better off with some other solution.

ADD REPLYlink written 8.4 years ago by Ketil4.0k

I guess I should add that I only tested it on the example in the question, so if that isn't representative of the actual data, all bets are off, of course.

ADD REPLYlink written 8.4 years ago by Ketil4.0k
0
gravatar for Gaffa
8.4 years ago by
Gaffa490
Gaffa490 wrote:

This one-liner should do it:

perl -ne 'if ($_ =~ /^ID  (.+)/) { print ">$1" } elsif ($_ =~ /^P[TA]  (.+)/) { print "; $1 " } elsif ($_ =~ /^\s{4}(.+)/) { print "\n$1\n" }'

Output:

>US74811111-0005    ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
ADD COMMENTlink written 8.4 years ago by Gaffa490

not working for my dataset.. :(

ADD REPLYlink written 8.4 years ago by Empyrean160
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1241 users visited in the last hour