Question

CSV File Parsing With Perl

1

11.8 years ago

annamantsoki ▴ 40

Hello there, I have a .CSV file in this format:

"Sample" "Variant" "Haplogroup"

"KAsq0001"    "146, 152, 195, 247, 249d, 309+CC, 315+C, 769, 825T, 1005, 1018, 1824, 2758, 2885, 3594, 3970, 4104, 4312, 6216, 6392, 7146, 7256, 7521, 7828, 8468, 8655, 8701, 9540, 10310, 10398, 10535, 10586, 10664, 10688, 10810, 10873, 10915, 11914, 12338, 12705, 13105, 13276, 13506, 13650, 13708, 13928C, 16129, 16187, 16189, 16203, 16223, 16230, 16278, 16291, 16304, 16311"    "F2a"

"Kasq0002"    "146, 152, 153, 195, 200, 247, 309+CC, 315+C, 489, 709, 769, 825T, 1018, 2758, 2885, 3594, 4104, 4312, 5108, 7146, 7220, 7256, 7521, 7867, 8200, 8468, 8655, 9527, 10400, 10664, 10688, 10810, 10915, 11914, 13105, 13276, 13506, 13650, 14569, 14783, 15043, 15301, 15323, 15497, 16129, 16184, 16187, 16189, 16214, 16230, 16278, 16311, 16362"    "G1a3"

As you can see, I have three fields. The first is the name of the sequence sample, the second is the list of variants detected in this sample, and the third is the haplogroup that the sample belongs to. I have to parse the file and produce output like this:

KAsq0001 146 F2a

KAsq0001 152 F2a

.

Kasq0002 146 G1a3

Kasq0002 152 G1a3

so that each variant is linked with its sample and haplogroup.

I am trying to do this with Perl, but as you can see, the second field holds multiple values and can span more than one line. The field separator is a tab and the text delimiter is a double quote ("). Should I use a hash to get the output that I want?

perl variant • 6.2k views

Updated 10.5 years ago by Biostar 20 • written 11.8 years ago by annamantsoki ▴ 40

0

And just for the future: this seems to be a programming (Perl) question, and stackoverflow.com is a very appropriate website for such questions.

Written 11.8 years ago by Arun 2.4k

4

...and this is an inappropriate question here at the BioStars forum? I think bioinformatic programming questions are completely valid here in this context.

I would have preferred to use Python over Perl, but knowing a little Perl doesn't hurt.

Written 11.8 years ago by Josh Herr 5.8k

0

I am sorry if I didn't quite get that right. I didn't say anything about it being inappropriate here. I merely mentioned that it is more appropriate at stackoverflow than here, meaning that you are more likely to get better, nicer answers. For example, while it's nice to learn Perl and write the code to read a CSV file yourself, most people would advise against reinventing the wheel and recommend using a package when one is available. But that's just what I think.

Written 11.8 years ago by Arun 2.4k

Ram · Answer 1 · 2012-06-23

7

11.8 years ago

Sean Davis 26k

My Perl is a bit rusty, but here is something close:

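#!/usr/bin/perl
use strict;
use warnings;

# Strip the first and last character (the surrounding double quotes).
sub removeQuotes {
    my $string = shift;
    return substr $string, 1, length($string) - 2;
}

open(FILE, '<', 'data.txt') or die "Cannot open data.txt: $!";

while (<FILE>) {
    chomp;
    next unless /\S/;    # skip blank lines
    # Fields are tab-separated: sample ID, variant list, haplogroup.
    my ($id, $variants, $haplo) = map { removeQuotes($_) } split /\t/, $_;
    # Print one row per variant, keeping the sample ID and haplogroup.
    for my $variant (split /, /, $variants) {
        print "$id\t$variant\t$haplo\n";
    }
}

close(FILE);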

Sample output looks like:

KAsq0001    146    F2a
KAsq0001    152    F2a
KAsq0001    195    F2a
KAsq0001    247    F2a
KAsq0001    249d    F2a
KAsq0001    309+CC    F2a
KAsq0001    315+C    F2a
KAsq0001    769    F2a
KAsq0001    825T    F2a
KAsq0001    1005    F2a
KAsq0001    1018    F2a
KAsq0001    1824    F2a
KAsq0001    2758    F2a
KAsq0001    2885    F2a
KAsq0001    3594    F2a

Updated 4.4 years ago by Ram 43k • written 11.8 years ago by Sean Davis 26k

0

Thank you so much. It works perfectly. The only thing that I replaced is the (<FILE>) with (<FILE>). Thanks again!

Updated 4.4 years ago by Ram 43k • written 11.8 years ago by annamantsoki ▴ 40

0

Looks good to me! However, just to be on the paranoid side, I tend to write that function slightly differently, so that it still returns something even if there are no quotes found (on the off-chance bits of the dataset are inconsistent):

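# Return the text between the first and last double quote,
# or the input unchanged if it is not quoted.
sub extractQuote {
    my $sentence = $_[0];
    if ($sentence =~ /"(.+)"/) {
        $sentence = $1;
    }
    return $sentence;
}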

Updated 4.4 years ago by Ram 43k • written 10.5 years ago by Jelena Aleksic ▴ 920

Ram · Answer 2 · 2012-06-23

3

11.8 years ago

Arun 2.4k

Or you could use Text::CSV; you can find a nicer usage example here.
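As a rough illustration of that suggestion (not the linked example, which is not reproduced here), the sketch below reads tab-separated, double-quoted records with Text::CSV and emits one row per variant. It assumes the Text::CSV module is installed from CPAN; the helper name `expand_variants`, the header check on the literal `Sample`, and the in-memory sample data are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Hypothetical helper: read records from a filehandle and return one
# [sample, variant, haplogroup] array reference per variant.
sub expand_variants {
    my ($fh) = @_;
    my $csv = Text::CSV->new({
        binary     => 1,       # allow embedded newlines in quoted fields
        sep_char   => "\t",    # fields are tab-separated
        quote_char => '"',     # text delimiter
        auto_diag  => 1,       # report parse errors
    });
    my @rows;
    while (my $fields = $csv->getline($fh)) {
        my ($id, $variants, $haplo) = @$fields;
        # Skip blank or incomplete lines.
        next unless defined $variants && defined $haplo;
        # Skip the header row (assumes it starts with "Sample").
        next if $id eq 'Sample';
        push @rows, [$id, $_, $haplo] for split /\s*,\s*/, $variants;
    }
    return @rows;
}

# Made-up in-memory sample standing in for the real file.
my $sample = qq{"Sample"\t"Variant"\t"Haplogroup"\n}
           . qq{"KAsq0001"\t"146, 152, 249d"\t"F2a"\n}
           . qq{"Kasq0002"\t"146, 309+CC"\t"G1a3"\n};

open(my $fh, '<', \$sample) or die "Cannot open sample data: $!";
my @rows = expand_variants($fh);
close($fh);

print join("\t", @$_), "\n" for @rows;
```

To run this against a real file, open that file instead of the in-memory string. Letting the module handle quoting means a variant list that wraps across lines inside its quotes should still be read as a single field, which a plain split on tabs would not handle.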

Updated 4.4 years ago by Ram 43k • written 11.8 years ago by Arun 2.4k