How to group/merge genomic positions ?
1
0
Entering edit mode
9.7 years ago

Hi,

I've a big file ( ~ 50M lines ) containing paired genomic positions like this (each line a paired position ):

chrA    posA    chrB    posB

and I want to reduce this list of paired positions by regrouping paired genomic positions that are closed. For example

chr1    1000    chr8     5000
chr1    990     chr8     5030
chr1    1010    chr8     5010
chr5    500     chr10    1000

and after processing it becomes: (the last column represent the number of lines supporting the paired position)

chr1    1000    chr8    5000    3
chr5    500    chr10    1000    1

Any ideas? My first idea was to use a perl script with hash table but I'm a little concern about the size of the list.

FYI : the file is sorted by chrom and positions.

thanks

group position genomic • 2.4k views
ADD COMMENT
0
Entering edit mode

I don't quite follow are you considering chr1 990, 1000 and 1010 as the same position in some sense?

ADD REPLY
0
Entering edit mode

I know it's not the same position but there are quite close to each other. I might rephrase the question by regrouping positions that are in the same region.

ADD REPLY
2
Entering edit mode
9.7 years ago

Use cut, awk and sort-bed to split the matrix into two separate and sorted BED files called A and B.

The second file B has the same genomic coordinates as the first file A, but contains an ID value constructed from the matrix file's second pair of columns.

It is necessary to sort both files, because their output is not in the correct sort order required for the next two steps.

Use bedops to merge 10-bp padded, neighboring regions around A, and then strip the 10 bp padding to create a new file called C. (Adjust this padding based on your criterion for neighboring regions.)

Use bedmap to map B onto C. Apply the --echo-map-id and --count operators to get the element IDs and number of elements from B which overlap elements in C.

$ cut -f1,2 foo.mtx | awk '{print $0"\t"($2+1)}' | sort-bed - > A
$ cut -f1-4 foo.mtx | awk '{print $1"\t"$2"\t"($2+1)"\t"$3":"$4}' | sort-bed - > B
$ bedops --merge --range 10 A | bedops --everything --range -10 - > C
$ bedmap --echo --echo-map-id --count C B > D

The file D contains your answer, though you might need to do a bit more post-processing on mapped elements per reference element, to get it into your exact desired format. You can use awk, Perl or whatever to process each line independently, at that point; no hash tables needed here.

ADD COMMENT
0
Entering edit mode

Thanks Alex. bedops is an awesome and very powerful tool. I have to deep more into it ! But what I need here is that both paired positions has to be close. Your suggestion only group positions from the first position of the pair.

So the ideal solution would be :

chr1    1000    chr8     5000
chr1    990     chr8     5030
chr1    1010    chr8     5010
chr1    1010    chrX     2000
chr5    500     chr10    1000

gives me

chr1    1000    chr8     5000   3
chr1    1010    chrX     2000   1
chr5    500     chr10    1000   1
ADD REPLY
1
Entering edit mode

This is quite a bit different from your original question, so you might ask a new question and others can provide a better answer.

Nonetheless, maybe you can do a crosswise bedmapping: Padded-A bedmapped onto B, and Padded-B bedmapped onto A, where A is made up of pseudo-BED elements from the first two columns of your matrix, and B is made up of pseudo-BED elements from the second two columns of your matrix. Then compare results where mapped and reference pseudo-BED elements overlap in both conditions, and work your way back from that.

This seems doable, but you'll have to do some scripting work. Hopefully, the code snippets I wrote for you will help you get started.

ADD REPLY
0
Entering edit mode

I'll try that. I'll let you know if I figure it out. Thanks

ADD REPLY

Login before adding your answer.

Traffic: 2925 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6