I need a Perl script to extract data from an unstructured text file.
3
0
7.0 years ago

I'm trying to extract just the number that comes right after om4:rightcontent=" in each tag. I'd like to do this in Perl, but I can't quite wrap my head around it since I'm pretty much a noob when it comes to Perl. As you can see, the data is all on a single line with no breaks, and there are hundreds of pages of it... If anyone can figure it out, that would be awesome!

Here is what the data looks like:

<area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.43468||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7066" coords="614,248.75,617,300.2643"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.35186||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7038" coords="617,248.75,620,294.25985000000003"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.28862||Invasive Lobular Breast Carcinoma||Estrogen Receptor Negative||MB-7270" coords="620,248.75,623,289.67495"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.24524||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-4758" coords="623,248.75,626,286.5299"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.21535||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-4548" coords="626,248.75,629,284.362875"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.20532||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-5231" coords="629,248.75,632,283.6357"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.19883||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-5441" coords="632,248.75,635,283.165175"><area class="pPop" shape="rect" om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.18788||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-4945" coords="635,248.75,638,282.3713"><area class="pPop" shape="rect" 
om4:leftcontent="Expression value:||Cancer Type:||Legend Value:||Sample Name:" om4:rightcontent="-0.15737||Invasive Ductal Breast Carcinoma||Estrogen Receptor Negative||MB-7155" coords="638,248.75,641,280.159325">

gene • 1.4k views
3
7.0 years ago

If you're parsing XML, use BeautifulSoup.

2
7.0 years ago
mastal511 ★ 2.1k

I would try something like:

my @expr = $line =~ m{<area\ [^>]*\ om4:rightcontent="(-?[\d]+\.[\d]+)\|\|}g;

But I haven't tested it, and it's been a while since I did any complicated regular expressions.
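For anyone who wants to try the one-liner above, here is a minimal sketch of a complete script built around it. The helper name extract_expr and the choice to slurp each input file whole are my assumptions, not part of the original answer. Slurping matters if a tag happens to wrap across a line break, as the last one in the pasted sample does, because a line-by-line loop would then never see `<area` and `om4:rightcontent` in the same string.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return every expression value that opens an om4:rightcontent attribute.
# extract_expr is a hypothetical helper name, not something from the thread.
sub extract_expr {
    my ($text) = @_;
    # In list context, a /g match returns all captured groups.
    return $text =~ m{<area\ [^>]*\ om4:rightcontent="(-?\d+\.\d+)\|\|}g;
}

# With file names on the command line, slurp each file and print one value per line.
if (@ARGV) {
    local $/;    # undefined record separator: each read returns a whole file
    while (my $text = <>) {
        print "$_\n" for extract_expr($text);
    }
}
```

Saved under a name of your choosing (say extract_expr.pl), it would be run as perl extract_expr.pl yourdata.txt. Note that the pattern only accepts values with a decimal point, so an integer such as 2 would be skipped.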

1

OK, that seems to work for me, but your text looks like XML, so finding an XML parser to extract the bits of data that you want might be a more effective way.

1
7.0 years ago
h.mon 35k

This will do (in an ugly way) what you asked, but I don't know whether doing it this way is sensible. Save it as om4.pl, make it executable (chmod +x om4.pl), and run ./om4.pl < structured.txt

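#!/usr/bin/env perl
use warnings;
use strict;

# Use the attribute prefix as the input record separator, so every
# record after the first one starts with the expression value.
$/ = 'om4:rightcontent="';
while (<>) {
    my ($nr) = split /\|/, $_, 2;
    # The first record is only the text before the first attribute; skip it.
    next unless $nr =~ /^-?\d/;
    print "$nr\n";
}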

P.S.: your data is structured and, as others pointed out, looks like XML.
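To illustrate the point about structure: in the sample, om4:leftcontent holds the field labels and om4:rightcontent holds the matching values, both separated by "||". Below is a rough sketch that pairs them up using only core Perl. The helper name parse_areas is hypothetical, the attribute layout is assumed to match the sample in the question, and the /r substitution flag needs Perl 5.14 or newer. It is still regex-based, so it is a halfway house rather than a real parser.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Turn each <area> tag into a hash keyed by the om4:leftcontent labels.
# parse_areas is a hypothetical helper name; the attribute names and the
# "||" field separator are taken from the sample data in the question.
sub parse_areas {
    my ($text) = @_;
    my @records;
    while ($text =~ m{<area\b([^>]*)>}g) {
        my $attrs = $1;
        # Skip tags that lack either attribute.
        my ($left)  = $attrs =~ /om4:leftcontent="([^"]*)"/  or next;
        my ($right) = $attrs =~ /om4:rightcontent="([^"]*)"/ or next;
        # Strip the trailing colon from each label (needs Perl 5.14+ for /r).
        my @labels = map { s/:\s*$//r } split /\|\|/, $left;
        my @values = split /\|\|/, $right;
        my %rec;
        @rec{@labels} = @values;
        push @records, \%rec;
    }
    return @records;
}

# With file names on the command line, print sample name and value, tab-separated.
if (@ARGV) {
    local $/;    # undefined record separator: each read returns a whole file
    while (my $text = <>) {
        for my $rec (parse_areas($text)) {
            print join("\t", map { $_ // '' }
                @{$rec}{'Sample Name', 'Expression value'}), "\n";
        }
    }
}
```

Keeping the sample name next to each value makes it easier to check the output against the original plot, and the other two fields are available in the same hash if you need them later.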
