Question

Any Modules Available To Parse This File ?

0

Entering edit mode

11.7 years ago

Tarun Mohite • 0

I have a file which is in the following format : (PS : This is RNA Seq data)

chrID start stop Gene Sample1_ExonCount Sample2_ExonCount Sample3_ExonCount

chr1  300   350  ABC       200                 187                167
chr1  400   512  ABC       112                 110                105
chr1  534   587  ABC       87                  76                  55
chr1  612   687  PQR       12                  15                  24
chr1  812   898  PQR       10                  13                  12     
chr2   ....     ..... .....   ..............................................................

and so on.

Basically, I have to parse this file and calculate some statistics. For example, the first 3 records in this file are the exons of the gene ABC. (i.e ABC has 3 exons). And the counts are (I guess) the number of reads mapping to these exons in each of the samples (1, 2 and 3).

What I have to do is to initially create a "Size" column and calculate the size of each exon. For eg. the first record (i.e the first exon of gene ABC has size 350 - 300 = 50). I have to iterate through every record and calculate the size in this way.

Now comes the tricky part. FOr each gene, I have to add up the sizes. So, for gene ABC, I have 50 (1st record) + 112 (2nd record) + 53 (3rd record) = 215. And then divide this number by total exon counts in each sample. i.e for gene ABC and sample 1 it wil be 215/(200+112+87) = 215/399.

Are there any fast computational techniques by which I can do this. Are there any modules available in Python specifically with which I can parse these files fast ? I am just a beginner with programming and I would really like to know. The last time I parsed such a file, it had serious performance issues.

Thanks.

python exon counts • 2.2k views

ADD COMMENT • link updated 11.7 years ago by Neilfws 49k • written 11.7 years ago by Tarun Mohite • 0

0

Entering edit mode

Can you show the code you used last time to parse the file that had the performance issues? Maybe we can optimize it.

ADD REPLY • link 11.7 years ago by Damian Kao 16k

score 6 · Answer 1 · 2012-07-30

Perl solution:

#!/usr/bin/perl
use strict;
use warnings;

my %data = ();
my ($chr, $ini, $end, $gen, $s1, $s2, $s3, $size);

while (<>) {
     next if (m/chrID/ or m/^\n/); # skip header and empty lines
     chomp;
     ($chr, $ini, $end, $gen, $s1, $s2, $s3) = split (/\s+/, $_);
     $data{$gen}{'size'} += $end - $ini;
     $data{$gen}{'S1'} += $s1;
     $data{$gen}{'S2'} += $s2;
     $data{$gen}{'S3'} += $s3;
}

print "GENE\tLENGTH\tSAMPLE1\tSAMPLE2\tSAMPLE3\n";
foreach $gen (sort keys %data) { # if you don't care to spend a little time sorting, otherwise delete 'sort'
    next unless ($data{$gen}{'size'} > 1); # sanity check for gene size
    $size = $data{$gen}{'size'};
    $s1 = $data{$gen}{'S1'} / $size;
    $s2 = $data{$gen}{'S2'} / $size;
    $s3 = $data{$gen}{'S3'} / $size;
    print "$gen\t$size\t$s1\t$s2\t$s3\n";
}

you can run it with "perl scriptname.pl < yourtable > your_output"

But if you are doing serious data analysis, maybe you can better learning some tools to deal with this type of data, I suggest you the wonderful R:Bioconductor http://www.bioconductor.org/

score 6 · Answer 2 · 2012-07-30

JC is absolutely correct when he says "maybe you can better learning some tools to deal with this type of data."

Your data are in a simple format: delimited text (either space- or tab-delimited). Sure, you could invest time writing parsers and optimizing performance. Or you could just get the job done, using an existing tool. In this case, I'd suggest R (or similar) is the most appropriate.

I'm guessing you don't know R so these code examples may not make much sense, but just to demonstrate how easy it is to perform the tasks you describe - each can be performed using 1 line of R code:

1. Read in the file exons.txt

# assume tab-delimited with header and one record per line, no blank lines
e <- read.table("exons.txt", header = T, stringsAsFactors = F, sep = "\t")

2. Add "Size" column with exon size (assuming 0-based coordinates)

e$Size <- e$stop - e$start

3. Sum exon sizes for each gene

e.sum <- aggregate(Size~Gene, data = e, sum)

4. Sum exon counts per gene for Sample 1

e.sum$s1 <- aggregate(Sample1_ExonCount ~ Gene, data = e, sum)[, 2]

5. Divide summed exon size by summed exon counts for Sample 1

e.sum$div <- e.sum$Size / e.sum$s1

Using your example data, e.sum now looks like this:

  Gene Size  s1       div
1  ABC  215 399 0.5388471
2  PQR  161  22 7.3181818