Question: Extracting Certain Info From Flat Annotation Text File
0
gravatar for bioinfo
6.7 years ago by
bioinfo740
bioinfo740 wrote:

I have FLAT annotation text file of 20 genes. I just want to extract certain lines (e.g DT lines) for each gene. I am planning to match my uniprot ID (e.g. P15519) with the AC line with each FLAT annotation info in between the // signs then matching the lines Start with DTs to extract the lines with DT info with regular perl script but it looks like not that much straightforward to me. I heard Emsembl perl API modules can also extract certain info from FLAT annotation files. I was wondering whats the best way to get these lines out from the FLAT files for every single gene?

Here is a part of the FLAT annotation text file.

........
......... 

//

ID   SPG1_DICDI              Reviewed;         127 AA.
AC   P15519; Q54KF8;
DT   01-APR-1990, integrated into UniProtKB/Swiss-Prot.
DT   01-APR-1990, sequence version 1.
DT   01-MAY-2013, entry version 59.
DE   RecName: Full=Spore germination protein 
GN   Name=gerA; ORFNames=DDB_G0287293;
DR   RefSeq; XP_637265.1; XM_632173.1.
DR   KEGG; ddi:DDB_G0287293; -.
DR   dictyBase; DDB_G0287293; 
FT   CHAIN        26    127       Spore germination protein 1.
FT                                /FTId=
FT   CARBOHYD    118    118       N-linked (GlcNAc...) (Potential).
SQ   SEQUENCE   127 AA;  14049 MW;  8D6FEAF74A06C742 CRC64;
     MNIKNSLILI ISTIFVLSMI NGGLTSDPTC VGAPDGQVYL FSSWDFQGER YVYNISQGYL
     SLPDGFRNNI QSFVSGADVC FVKWYPSEQY QITAGESHRN YAALTNFGQR MDAIIPGNCS
     NIVCSPK
//

ID   CO1A1_MOUSE             Reviewed;        1453 AA.
AC   P11087; Q53WT0; Q60635; Q61367; Q61427; Q63919; Q6PCL3; Q810J9;
DT   01-JUL-1989, integrated into UniProtKB/Swiss-Prot.
DT   06-DEC-2005, sequence version 4.
DT   01-MAY-2013, entry version 136.
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORMS 1 AND 2).
DR   EMBL; U08020; AAA88912.1; -; mRNA.
DR   EMBL; AL662790; CAI25880.1; -; Genomic_DNA.
FT   SIGNAL        1     22
SQ   SEQUENCE   1453 AA;  138032 MW;  0B7F06BBB9A1D5EA CRC64;
     MFSFVDLRLL LLLGATALLT HGQEDIPEVS CIHNGLRVPN GETWKPEVCL ICICHNGTAV
     CDDVQCNEEL DCPNPQRREG ECCAFCPEEY VSPNSEDVGV EGPKGDPGPQ GPRGPVGPPG
     RDGIPGQPGL PGPPGPPGPP GPPGLGGNFA SQMSYGYDEK SAGVSVPGPM GPSGPRGLPG
     PPGAPGPQGF QGPPGEPGEP GGSGPMGPRG PPGPPGKNGD DGEAGKPGRP GERGPPGPQG
     ARGLPGTAGL PGMKGHRGFS GLDGAKGDAG PAGPKGEPGS PGENGAPGQM 

//

.......
......
perl annotation api uniprot • 2.9k views
ADD COMMENTlink modified 6.7 years ago by chbelhumeur200040 • written 6.7 years ago by bioinfo740
2
gravatar for Hamish
6.7 years ago by
Hamish3.1k
UK
Hamish3.1k wrote:

Since you are using Perl and these are UniProtKB flat-file entries, you should have a look at the Swissknife Perl modules (see http://swissknife.sourceforge.net/). While there is some support for dates and versions in BioPerl's Bio::SeqIO::swiss it may be a bit more limited than you require.

To extract just the relevant lines from the file, a little 'grep' will work:

$ grep -E '^((AC)|(DT)   )' prot_uniprotkb_pair.dat 
AC   P11087; Q53WT0; Q60635; Q61367; Q61427; Q63919; Q6PCL3; Q810J9;
DT   01-JUL-1989, integrated into UniProtKB/Swiss-Prot.
DT   06-DEC-2005, sequence version 4.
DT   01-MAY-2013, entry version 136.
AC   P15519; Q54KF8;
DT   01-APR-1990, integrated into UniProtKB/Swiss-Prot.
DT   01-APR-1990, sequence version 1.
DT   01-MAY-2013, entry version 59.

If you want the end-of-entry marker too, then just add it to the groupings (tweaking to allow for it not having padding):

$ grep -E '^((AC   )|(DT   )|(//))' prot_uniprotkb_pair.dat 
AC   P11087; Q53WT0; Q60635; Q61367; Q61427; Q63919; Q6PCL3; Q810J9;
DT   01-JUL-1989, integrated into UniProtKB/Swiss-Prot.
DT   06-DEC-2005, sequence version 4.
DT   01-MAY-2013, entry version 136.
//
AC   P15519; Q54KF8;
DT   01-APR-1990, integrated into UniProtKB/Swiss-Prot.
DT   01-APR-1990, sequence version 1.
DT   01-MAY-2013, entry version 59.
//

For a more dynamic look-up approach a little bit of Perl will work:

#!/usr/bin/env perl
#
# Usage: ./scriptName.pl <uniprotkbAccession> <uniprotkbDataFile>
#
my $queryAcc = $ARGV[0];
my $uniprotData = $ARGV[1];
open(my $DATA_FILE, "<$uniprotData") or die "Unable to open $uniprotData ($!)";
my $foundAcc = 0;
while(<$DATA_FILE>) {
    if(m/^AC   / && m/$queryAcc/) {
        $foundAcc = 1;
    }
    elsif(m/^DT   / && $foundAcc > 0) {
        print $_;
    }
    elsif(m/^\/\//) {
        $foundAcc = 0;
    }
}
close $DATA_FILE;

Note: the Perl script above will work on the cut-down version of the data which includes end-of-entry markers (i.e. '//' lines) as well as full UniProtKB entries.

ADD COMMENTlink written 6.7 years ago by Hamish3.1k

That is what I was exactly trying to get out of the FLAT file using the uniprot accession number as the primary input. Fantastic it's works perfectly. Thanks.

ADD REPLYlink written 6.7 years ago by bioinfo740
1
gravatar for csiu
6.7 years ago by
csiu40
csiu40 wrote:

if you want to just print lines starting with "DT" from your mentioned input-file.txt, you can enter the following in the command line:

awk '{if ($1 ~ /^DT/) print}' input-file.txt

ADD COMMENTlink written 6.7 years ago by csiu40

In awk, the default action after an expression is to print. You can simplify the command as such: awk '/^DT/' input-file.txt

ADD REPLYlink modified 6.6 years ago • written 6.7 years ago by Frédéric Mahé3.0k
1
gravatar for csiu
6.7 years ago by
csiu40
csiu40 wrote:

This perl script also works when you run the following command:

$ perl below-script.pl input-file.txt DT; where the last argument (e.g. DT) is the certain line you want to extract

#!/usr/bin/perl

open (INPUT, $ARGV[0]) or die $1;
open (OUTPUT, ">Output.txt");

my $string = $ARGV[1];

while (<INPUT>){
    if (<> =~ m/^$string/){
    print OUTPUT;
    }
}

close (OUTPUT);
close (INPUT);

The resulting output (below) will be saved into a Output.txt file in your current working directory.

DT   01-APR-1990, integrated into UniProtKB/Swiss-Prot.
DT   01-APR-1990, sequence version 1.
DT   01-MAY-2013, entry version 59.
DT   01-JUL-1989, integrated into UniProtKB/Swiss-Prot.
DT   06-DEC-2005, sequence version 4.
DT   01-MAY-2013, entry version 136.
ADD COMMENTlink written 6.7 years ago by csiu40

If you just open the text file and read line by line and match the DT string, you will end up with all the DT lines together for entire 20 genes in the output file but you will never know which DT belong to which gene. The expected output will be something like that:

P15519:

01-APR-1990, integrated into UniProtKB/Swiss-Prot.
01-APR-1990, sequence version 1.
01-MAY-2013, entry version 59.

P11087:

01-JUL-1989, integrated into UniProtKB/Swiss-Prot.
06-DEC-2005, sequence version 4.
01-MAY-2013, entry version 136.

And so on.

You will have just single uniprot ID (P11087) as an input at a time and the script should get these 3 lines out from the FLAT file. Therefore you have to match the AC lines first to get into the corresponding gene annotation section ( e.g. P11087) then grab the DT lines from there.

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by bioinfo740

[UPDATE to perl script]

Now when you run the below perl script, the output is separated by AC

$ perl below-script.pl input-file.txt DT

#!/usr/bin/perl

open (INPUT, $ARGV[0]) or die $1;
open (OUTPUT, ">Output.txt");

my $string = $ARGV[1];

while (<INPUT>){

    if ($_ =~ m/^AC/){
    print "\n$_";
    } elsif ($_ =~ m/^$string/){
    print;
    }

}

close (OUTPUT);
close (INPUT);

Output:
(Note: there will be a blank line at the start of this output file)

AC   P15519; Q54KF8;
DT   01-APR-1990, integrated into UniProtKB/Swiss-Prot.
DT   01-APR-1990, sequence version 1.
DT   01-MAY-2013, entry version 59.

AC   P11087; Q53WT0; Q60635; Q61367; Q61427; Q63919; Q6PCL3; Q810J9;
DT   01-JUL-1989, integrated into UniProtKB/Swiss-Prot.
DT   06-DEC-2005, sequence version 4.
DT   01-MAY-2013, entry version 136.
ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by csiu40
0
gravatar for Elisabeth Gasteiger
6.7 years ago by
Geneva
Elisabeth Gasteiger1.7k wrote:

You can use the UniProt web site, http://www.uniprot.org, to extract these dates, either interactively or programmatically, by using text search or batch retrieval.

For text search, transform your list of accession numbers into id:P11087 or id:P15519 or id:P12345 or id:P99999 and search. Then use the "Customize" link to hide all columns you are not interested in, and add columns for Date of creation, Date of last modification, Date of last sequence modification.

This result can be downloaded in tab-delimited format, e.g. http://www.uniprot.org/uniprot/?query=id%3aP11087+or+id%3aP15519+or+id%3aP12345+or+id%3aP99999&format=tab&columns=id,entry%20name,genes,created,last-modified,sequence-modified

Entry    Entry name    Gene names    Date of creation    Date of last modification    Date of last sequence modification
P12345    AATM_RABIT    GOT2    1989-10-01    2013-04-03    1989-10-01
P11087    CO1A1_MOUSE    Col1a1 Cola1    1989-07-01    2013-05-01    2005-12-06
P99999    CYC_HUMAN    CYCS CYC    1986-07-21    2013-05-01    2007-01-23
P15519    SPG1_DICDI    gerA DDB_G0287293    1990-04-01    2013-05-01    1990-04-01

You can use this link programmatically if you wish (cf http://www.uniprot.org/faq/28). This FAQ also contains a section about batch retrieval.

Alternatively, you could investigate Swissknife http://swissknife.sourceforge.net/docs/, an object-oriented Perl library for parsing the UniProtKB text format.

ADD COMMENTlink written 6.7 years ago by Elisabeth Gasteiger1.7k

I guess that is too specific for the DT section but if I want to grab other part (e.g. DR lines) from the FLAT file then it's not suitable. I will have a look at the Swissknife.

ADD REPLYlink written 6.7 years ago by bioinfo740

You can use the "Customize" feature to include almost any information: Cross-references > Show > select the database(s) you want to include > Show Use "Hide" for the columns you do not wish to see in your result.

The resulting URL can be used programmatically: e.g. to extract HGNC cross-references from all human entries: http://www.uniprot.org/uniprot/?query=organism%3a9606+database%3ahgnc&sort=score&format=tab&columns=id,entry%20name,database%28HGNC%29

The result looks like this:

 Entry    Entry name    Cross-reference (HGNC)
 K9M1U5    IFNL4_HUMAN    44480;
 Q69383    REC2_HUMAN    13915;
 P0DM35    M1BL1_HUMAN     31864;
 Q9GZL8    BPEC1_HUMAN    13228;
 Q3SY05    CA157_HUMAN    26865;
 Q69384    ENK2_HUMAN    13915;
 B3EWF7    EP2A2_HUMAN     3413;
 O76042    ERIT1_HUMAN    1229;
 P0CJ76    HMN9_HUMAN    37166;
 Q8IVG9    HUNIN_HUMAN     7471;
ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by Elisabeth Gasteiger1.7k
0
gravatar for chbelhumeur2000
6.7 years ago by
chbelhumeur200040 wrote:

I'm not sure why bioinformatics has chosen certain tools like Perl???? I prefer any flavor BASIC for hacking flat text files. BASIC is the premiere text string language and always has been. MID, LEN, READLINE, etc. are great string functions for the work. You can add, compare, etc. text strings with the simple math operators +, =. Newer flavors of BASIC on Windows platforms are more complicated than the older ones like QBASIC (one of my faves, can still be found on the net, still runs on Windows 7). KBasic and KDE BASIC for LINUX also offer compatibility with classic Berkeley BASIC.

Compare this code for opening a file, reading a line, counting the number of characters in the line and closing a file to what you have to do in PERL.

DIM STRING$ AS STRING

DIM COUNT AS INTEGER

OPEN "filename.ext" FOR INPUT AS #1

STRING$ = READLINE, #1

COUNT = LEN(STRING$)

CLOSE #1

The rest of coding for text string operations compares likewise tween PERL and BASIC. Its just some kind of intellectual snobbery that bioinformatics ignores BASIC. Who's making you people learn these cumbersome languages for hacking text files? There's better tools available. For some of the questions I've seen at forums a simple text editor will suffice to do the hack instead of reams of hard to read arcane PERL script.

There's people all over science and tech who think making things complicated and arcane makes them look like geniuses. H*LL NO! The real geniuses make it all simple easy for everyone to understand. I point out good old Albert's E = Mc2. Its your job as IT professional, professor, instructor, what have you, to remove as much of the arcana and difficulty from your client's, customer's, student's, etc.plate as possible. Its why they pay YOU as the expert! Doing anything less means you're not up to the job and shouldn't be doing it!

The BASIC code for what you want to do would go something like (actual syntax would vary a little depending on flavor, the example below is fairly generic)

 DIM STRING$ AS STRING

 OPEN "filename.txt" FOR INPUT AS #1

 OPEN "filename2.txt" FOR OUTPUT AS #2

 WHILE NOT (EOF) #1 REMARK EOF END OF FILE FUNCTION

      STRING$ = READLINE, #1

      IF MID(STRING$, 1, 2) = "DT" THEN WRITE(STRING$, #2)

 END WHILE

 CLOSE #1,#2
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by chbelhumeur200040
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 948 users visited in the last hour