Question

Crawling Over 1648 Million Line For Phastcons Scores

4

Entering edit mode

13.7 years ago

User 0968 ▴ 40

I got a file which contain phastCons for every position in cow. It's more than 1648 million lines. What's the best approach to the phastCons for the positions i want ?

file start with header and each line is like 0.043 which is the phastCons for positon (Start mention in header + steps*lineNumber)

• 2.5k views

ADD COMMENT • link updated 13.3 years ago by biobot 0.0.77.a.1099 6.2k • written 13.7 years ago by User 0968 ▴ 40

0

Entering edit mode

is it this kind of file ? http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/phastCons5way/

ADD REPLY • link 13.7 years ago by Pierre Lindenbaum 161k

Ram · Answer 1 · 2010-08-08

If this file is something like http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/phastCons5way/, then to find the data on chr1 between 13772 and 23872 I would use something like the following awk script:

/^fixedStep/    {
    split($0,a,"[ \t]+");
    split(a[2],s2,"=");
    split(a[3],s3,"=");
    split(a[4],s4,"=");
    chrom=s2[2];
    start=int(s3[2]);
    step=int(s4[2]);
    pos=start-step;

    next;
    }

    {
    pos+=step;
    if(chrom=="chr1" && pos>=13772 && pos<=23872)
        {
        printf("%s %d\t%s\n",chrom,pos,$1);
        }
    }

Command line:

curl -s  http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/phastCons5way/file_0.data.gz |\
gunzip -c |\
awk -f ~/wig.awk
chr1 13902      0.007
chr1 13903      0.013
chr1 13904      0.017
chr1 13905      0.020
chr1 13906      0.022
chr1 13907      0.022
chr1 13908      0.022
chr1 13909      0.025
chr1 13910      0.027
chr1 13911      0.028
chr1 13912      0.027
chr1 13913      0.025
chr1 13914      0.023
chr1 13915      0.024
(...)

If speed is an issue, you can store this kind of data in a sorted binary file(or a key/value file) and access this data with a (e.g.) C program.

score 5 · Answer 2 · 2010-08-08

These files look like UCSC fixedstep wiggle format files. "Crawling" suggests to me a stream-based approach, reading one line at a time in a scripting language and processing each data point, before discarding it. This is fine if you want to do this one time only. Your data are large enough that you don't want to make multiple passes through it.

If you need to make multiple passes, or to extract sub-regions quickly or repeatedly, then convert to an indexed binary format. Tools exist already to do this for wiggle files. Convert the file to bigwig format using the wigToBigWig program described in the bigwig paper. You will also need the chromosome and/or contig sizes for your data to do this. Then you can access intervals via the C API described in the paper, or there's a Perl wrapper, if Perl is your thing.