Question: I need a Perl script to extract data from an unstructured text file.
chaoticblue5000 wrote (2.6 years ago):

I'm trying to extract just the number that comes after om4:rightcontent=". I'm trying to figure out how to do that using perl but I can't quite wrap my head around it since I'm pretty much a noob when it comes to perl. As you can see it's just all on a single line with no breaks, and there are hundreds of pages of this... If anyone can figure it out, that would be awesome!

Here is what the data looks like:

<area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.43468||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7066" coords="614,248.75,617,300.2643"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.35186||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7038" coords="617,248.75,620,294.25985000000003"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.28862||Invasive Lobular Breast Carcinoma||Estrogen Receptor Negative||MB-7270" coords="620,248.75,623,289.67495"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.24524||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-4758" coords="623,248.75,626,286.5299"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.21535||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-4548" coords="626,248.75,629,284.362875"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.20532||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-5231" coords="629,248.75,632,283.6357"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.19883||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-5441" coords="632,248.75,635,283.165175"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.18788||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-4945" coords="635,248.75,638,282.3713"><area class="pPop" shape="rect" 
om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.15737||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7155" coords="638,248.75,641,280.159325">

Alex Reynolds wrote (2.6 years ago):

If you're parsing XML, use BeautifulSoup.

mastal511 wrote (2.6 years ago):

I would try something like:

my @expr = $line =~ m{<area\ [^>]*\ om4:rightcontent="(-?[\d]+\.[\d]+)\|\|}g;

But I haven't tested it, and it's been a while since I did any complicated regular expressions.
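
To make that one-liner runnable end to end, here is a minimal sketch. It assumes the whole input is held in one string, so a line break that lands inside a tag (as happens in the pasted sample) does not hide a value. The helper name extract_values and the shortened $sample are made up for illustration, and \s is used in place of the escaped spaces so that a newline between attributes also matches.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): in list context, returns every
# expression value that follows om4:rightcontent=" inside an <area ...> tag.
sub extract_values {
    my ($text) = @_;
    return $text =~ m{<area\s[^>]*\som4:rightcontent="(-?\d+\.\d+)\|\|}g;
}

# Two records taken from the question, with a line break inside the second
# tag to show that matching against the whole string still finds the value.
my $sample =
    '<area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.43468||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7066" coords="614,248.75,617,300.2643">'
  . '<area class="pPop" shape="rect" ' . "\n"
  . 'om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.35186||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7038" coords="617,248.75,620,294.25985000000003">';

# One value per line.
print "$_\n" for extract_values($sample);
```

For the sample above this prints -0.43468 and -0.35186. For a real file, read everything into one string first, for example with my $text = do { local $/; <> }; and pass $text to the helper.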

mastal511 added in a comment:

OK, that seems to work for me, but your text looks like XML, so finding an XML parser to extract the bits of data that you want might be a more robust approach.

h.mon wrote (2.6 years ago):

This will do (in an ugly way) what you asked, but I don't know if what you are asking for is the sensible thing to do. Save as om4.pl, make it executable, and run ./om4.pl < structured.txt

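#!/usr/bin/env perl
use warnings;
use strict;
# Use the attribute prefix as the input record separator, so every record
# after the first one starts with the expression value.
$/ = 'om4:rightcontent="';
while (<>) {
    # The value is everything before the first pipe.
    my ($nr) = split( /\|/, $_, 2 );
    # Skip the first record (the text before the first om4:rightcontent=")
    # and anything else that does not start with a number.
    next unless defined $nr && $nr =~ /^-?\d/;
    print "$nr\n";
}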

P.S.: your data is structured and, as others pointed out, looks like XML.
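
Because the data is structured, each om4:rightcontent value can be split on || into the four fields named in om4:leftcontent (expression value, cancer type, legend value, sample name). The following is only a sketch of that idea, still regex-based rather than a real parser; the helper name extract_records and the shortened $sample are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns one array ref per om4:rightcontent attribute,
# holding its "||"-separated fields in their original order.
sub extract_records {
    my ($text) = @_;
    my @records;
    while ( $text =~ m{om4:rightcontent="([^"]*)"}g ) {
        my $content = $1;    # copy the capture before running split
        push @records, [ split /\|\|/, $content ];
    }
    return @records;
}

# Two records taken from the question.
my $sample =
    '<area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.43468||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7066" coords="614,248.75,617,300.2643">'
  . '<area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.28862||Invasive Lobular Breast Carcinoma||Estrogen Receptor Negative||MB-7270" coords="620,248.75,623,289.67495">';

# Print the expression value and the sample name as tab-separated columns.
for my $rec ( extract_records($sample) ) {
    print join( "\t", @{$rec}[ 0, 3 ] ), "\n";
}
```

Keeping the sample name next to each value makes it easier to trace a number back to its record later.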
