I have a big file (~50M lines) containing paired genomic positions, one pair per line, in this format:
chrA posA chrB posB
I want to reduce this list by grouping together paired positions that are close to each other. For example:
chr1 1000 chr8 5000
chr1 990 chr8 5030
chr1 1010 chr8 5010
chr5 500 chr10 1000
After processing it becomes (the last column is the number of lines supporting the paired position):
chr1 1000 chr8 5000 3
chr5 500 chr10 1000 1
Any ideas? My first thought was to use a Perl script with a hash table, but I'm a little concerned about the size of the list.
FYI: the file is sorted by chromosome and position.
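
To make the intended grouping concrete, here is a rough streaming sketch in Python rather than Perl. It rests on assumptions not stated above: "close" is taken to mean both positions lie within a tolerance `tol` (50 here, a made-up value) of the first pair seen in a group, and that first pair is kept as the representative. The function name `merge_close_pairs` is hypothetical. Because the file is sorted, only groups near the current position need to stay in memory, so the 50M lines are never loaded at once.

```python
from typing import Iterable, Iterator, List, Tuple

Cluster = Tuple[str, int, str, int, int]  # chrA, posA, chrB, posB, count


def merge_close_pairs(lines: Iterable[str], tol: int = 50) -> Iterator[Cluster]:
    """Greedily group paired positions, streaming over sorted input.

    Assumption: a pair joins a group when chrA and chrB match and both
    positions are within `tol` of the group's first (representative) pair.
    """
    active: List[list] = []  # open groups: [chrA, posA, chrB, posB, count]
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chr_a, pos_a = fields[0], int(fields[1])
        chr_b, pos_b = fields[2], int(fields[3])

        # Emit groups that later lines can no longer reach
        # (relies on the input being sorted by chrA, then posA).
        still_active = []
        for c in active:
            if c[0] != chr_a or pos_a - c[1] > tol:
                yield (c[0], c[1], c[2], c[3], c[4])
            else:
                still_active.append(c)
        active = still_active

        # Join the first matching open group, or start a new one.
        for c in active:
            if (c[2] == chr_b
                    and abs(pos_a - c[1]) <= tol
                    and abs(pos_b - c[3]) <= tol):
                c[4] += 1
                break
        else:
            active.append([chr_a, pos_a, chr_b, pos_b, 1])

    # Emit whatever is still open at end of input.
    for c in active:
        yield (c[0], c[1], c[2], c[3], c[4])


if __name__ == "__main__":
    example = [
        "chr1 1000 chr8 5000",
        "chr1 990 chr8 5030",
        "chr1 1010 chr8 5010",
        "chr5 500 chr10 1000",
    ]
    for cluster in merge_close_pairs(example):
        print(*cluster)
```

On the real file one would pass an open file handle instead of the list. The result depends on the chosen tolerance and on which pair is used as the representative, so this is only one possible reading of "close".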