Question: Csv File Parsing With Perl
annamantsoki (Barcelona) wrote, 6.7 years ago:

Hello there, I have a .CSV file in this format:

"Sample" "Variant" "Haplogroup"

"KAsq0001"    "146, 152, 195, 247, 249d, 309+CC, 315+C, 769, 825T, 1005, 1018, 1824, 2758, 2885, 3594, 3970, 4104, 4312, 6216, 6392, 7146, 7256, 7521, 7828, 8468, 8655, 8701, 9540, 10310, 10398, 10535, 10586, 10664, 10688, 10810, 10873, 10915, 11914, 12338, 12705, 13105, 13276, 13506, 13650, 13708, 13928C, 16129, 16187, 16189, 16203, 16223, 16230, 16278, 16291, 16304, 16311"    "F2a"

"Kasq0002"    "146, 152, 153, 195, 200, 247, 309+CC, 315+C, 489, 709, 769, 825T, 1018, 2758, 2885, 3594, 4104, 4312, 5108, 7146, 7220, 7256, 7521, 7867, 8200, 8468, 8655, 9527, 10400, 10664, 10688, 10810, 10915, 11914, 13105, 13276, 13506, 13650, 14569, 14783, 15043, 15301, 15323, 15497, 16129, 16184, 16187, 16189, 16214, 16230, 16278, 16311, 16362"    "G1a3"

As you can see, I have three fields. The first is the name of the sequence sample, the second lists the variants detected in that sample, and the third is the haplogroup the sample belongs to. I have to parse that file and produce output like this:

KAsq0001 146 F2a
KAsq0001 152 F2a
...
Kasq0002 146 G1a3
Kasq0002 152 G1a3

so that each variant is linked with its sample and haplogroup.

I am trying to do that with Perl, but as you can see, the second field holds multiple values and spans more than one line. The field separator is a tab and the text delimiter is ". Should I use a hash to get the output that I want?

perl variant • 3.7k views
modified 5.3 years ago by Biostar • written 6.7 years ago by annamantsoki

And just for the future, this seems to be a programming (perl) question. stackoverflow.com is a very appropriate website for such questions.

written 6.7 years ago by Arun

...and this is an inappropriate question here at the BioStars forum? I think bioinformatic programming questions are completely valid here in this context.

I would have preferred to use Python over Perl, but knowing a little Perl doesn't hurt.

written 6.7 years ago by Josh Herr

I am sorry if I didn't quite get that right. I didn't say anything about it being inappropriate here. I merely mentioned that it is more appropriate at stackoverflow than here, meaning that you are more likely to get better, nicer answers there. For example, while it's nice to learn Perl and write the code to read a CSV file yourself, most people would advise against reinventing the wheel and recommend using a package when one is available. But that's just what I think.

written 6.7 years ago by Arun
Sean Davis (National Institutes of Health, Bethesda, MD) wrote, 6.7 years ago:

My perl is a bit rusty, but here is something close:

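#!/usr/bin/perl
use strict;
use warnings;

# Open the tab-delimited input file; stop with a message if it cannot be read.
open(my $fh, '<', 'data.txt') or die "Cannot open data.txt: $!";

# Strip the first and last character (the surrounding double quotes).
sub removeQuotes {
    my $string = shift;
    return substr $string, 1, length($string) - 2;
}

while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /\S/;    # skip blank lines
    my ($id, $variants, $haplo) = split /\t/, $line;
    $id       = removeQuotes($id);
    $haplo    = removeQuotes($haplo);
    $variants = removeQuotes($variants);
    # Print one output row per variant in the comma-separated list.
    for my $variant (split /, /, $variants) {
        print "$id\t$variant\t$haplo\n";
    }
}
close $fh;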

Sample output looks like:

KAsq0001    146    F2a
KAsq0001    152    F2a
KAsq0001    195    F2a
KAsq0001    247    F2a
KAsq0001    249d    F2a
KAsq0001    309+CC    F2a
KAsq0001    315+C    F2a
KAsq0001    769    F2a
KAsq0001    825T    F2a
KAsq0001    1005    F2a
KAsq0001    1018    F2a
KAsq0001    1824    F2a
KAsq0001    2758    F2a
KAsq0001    2885    F2a
KAsq0001    3594    F2a
modified 5.3 years ago by Alex Reynolds • written 6.7 years ago by Sean Davis

Thank you so much...It works perfectly...The only thing that I replaced is the (&ltFILE&gt) with (<FILE>). Thanks, again!!!

written 6.7 years ago by annamantsoki

Looks good to me! However, just to be on the paranoid side, I tend to write that function slightly differently, so that it still returns something even if there are no quotes found (on the off-chance bits of the dataset are inconsistent):

sub extractQuote {
    my $sentence = $_[0];
    if ($sentence =~ /"(.+)"/) {
        $sentence = $1;
    }
    return $sentence;
}
modified 5.3 years ago • written 5.3 years ago by Jelena Aleksic
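
To make the difference between the two helpers concrete, here is a minimal sketch that runs both on a quoted and an unquoted field. The function bodies are copied from the posts above; the two test strings are made up for illustration and do not come from the original data.

```perl
use strict;
use warnings;

# Position-based version from the answer: always drops the first and last character.
sub removeQuotes {
    my $string = shift;
    return substr $string, 1, length($string) - 2;
}

# Regex-based version from the comment: strips only when a quoted part is found.
sub extractQuote {
    my $sentence = $_[0];
    if ($sentence =~ /"(.+)"/) {
        $sentence = $1;
    }
    return $sentence;
}

# Illustrative inputs (assumed): one properly quoted field, one without quotes.
for my $field ('"F2a"', 'F2a') {
    print "input=$field removeQuotes=", removeQuotes($field),
          " extractQuote=", extractQuote($field), "\n";
}
```

Both functions return F2a for the quoted input. For the unquoted input, removeQuotes cuts off real characters and returns 2, while extractQuote leaves the value as F2a.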
Arun (Germany) wrote, 6.7 years ago:

Or you could use Text::CSV; you can find a nicer usage example here.

written 6.7 years ago by Arun
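
Since the linked usage example is not reproduced here, the following is a rough sketch of how the Text::CSV suggestion might look for this file layout. It is not the example the answer pointed to. It assumes Text::CSV is installed from CPAN, and it reads from a small made-up in-memory sample (the `$data` string) rather than the real file so that it runs on its own.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Made-up sample in the question's layout: tab-separated, double-quoted fields.
my $data = join "\n",
    qq{"Sample"\t"Variant"\t"Haplogroup"},
    qq{"KAsq0001"\t"146, 152, 309+CC"\t"F2a"},
    qq{"Kasq0002"\t"146, 825T"\t"G1a3"},
    '';

# Tell the parser about the tab separator and the quote character.
my $csv = Text::CSV->new({
    sep_char   => "\t",
    quote_char => '"',
    binary     => 1,    # tolerate unusual characters inside quoted fields
    auto_diag  => 1,    # report parse errors instead of failing silently
});

# Read from the in-memory string; for a real file, pass its name instead of \$data.
open(my $fh, '<', \$data) or die "Cannot open in-memory data: $!";
$csv->getline($fh);    # read and discard the header row

my @rows;
while (my $fields = $csv->getline($fh)) {
    my ($id, $variants, $haplo) = @$fields;    # quotes are already removed
    # One output row per variant in the comma-separated list.
    push @rows, "$id\t$_\t$haplo" for split /,\s*/, $variants;
}
close $fh;

print "$_\n" for @rows;
```

The parser takes care of the quoting, so no removeQuotes helper is needed, and the variant list still has to be split by hand because it is a single quoted field.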