Question: How to group/merge genomic positions ?
0
gravatar for Nicolas Rosewick
5.3 years ago by
Belgium, Brussels
Nicolas Rosewick8.3k wrote:

Hi,

I've a big file ( ~ 50M lines ) containing paired genomic positions like this (each line a paired position ):

chrA    posA    chrB    posB

and I want to reduce this list of paired positions by regrouping paired genomic positions that are closed. For example

chr1    1000    chr8     5000
chr1    990     chr8     5030
chr1    1010    chr8     5010
chr5    500     chr10    1000

and after processing it becomes : (the last colum represent the number of lines supporting the paired position )

chr1    1000    chr8    5000    3
chr5    500    chr10    1000    1

any idea ? my first idea was to use a perl script with hash table but I'm a little concern about the size of the list. 

FYI : the file is sorted by chrom and positions.

thanks

group position genomic • 1.5k views
ADD COMMENTlink modified 5.3 years ago by Alex Reynolds29k • written 5.3 years ago by Nicolas Rosewick8.3k

I don't quite follow are you considering chr1 990, 1000 and 1010 as the same position in some sense?

ADD REPLYlink written 5.3 years ago by russhh4.7k

I know it's not the same position but there are quite close to each other. I might rephrase the question by regrouping positions that are in the same region.

ADD REPLYlink written 5.3 years ago by Nicolas Rosewick8.3k
2
gravatar for Alex Reynolds
5.3 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Use cutawk and sort-bed to split the matrix into two separate and sorted BED files called A and B.

The second file B has the same genomic coordinates as the first file A, but contains an ID value constructed from the matrix file's second pair of columns.

It is necessary to sort both files, because their output is not in the correct sort order required for the next two steps.

Use bedops to merge 10-bp padded, neighboring regions around A, and then strip the 10 bp padding to create a new file called C. (Adjust this padding based on your criterion for neighboring regions.)

Use bedmap to map B onto C. Apply the --echo-map-id and --count operators to get the element IDs and number of elements from B which overlap elements in C.

$ cut -f1,2 foo.mtx | awk '{print $0"\t"($2+1)}' | sort-bed - > A

$ cut -f1-4 foo.mtx | awk '{print $1"\t"$2"\t"($2+1)"\t"$3":"$4}' | sort-bed - > B

$ bedops --merge --range 10 A | bedops --everything --range -10 - > C

$ bedmap --echo --echo-map-id --count C B > D

The file D contains your answer, though you might need to do a bit more post-processing on mapped elements per reference element, to get it into your exact desired format. You can use awk, Perl or whatever to process each line independently, at that point; no hash tables needed here.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Alex Reynolds29k

Thanks Alex. bedops is an awesome and very powerfull tool. I have to deep more into it ! But what I need here is that both paired positions has to be close. Your suggestion only group positions from the first position of the pair.

So the ideal solution would be :

chr1    1000    chr8     5000
chr1    990     chr8     5030
chr1    1010    chr8     5010
chr1    1010    chrX     2000
chr5    500     chr10    1000

gives me 

chr1    1000    chr8     5000   3
chr1    1010    chrX     2000   1
chr5    500     chr10    1000   1

 

ADD REPLYlink written 5.3 years ago by Nicolas Rosewick8.3k
1

This is quite a bit different from your original question, so you might ask a new question and others can provide a better answer.

Nonetheless, maybe you can do a crosswise bedmapping: Padded-A bedmapped onto B, and Padded-B bedmapped onto A, where A is made up of pseudo-BED elements from the first two columns of your matrix, and B is made up of pseudo-BED elements from the second two columns of your matrix. Then compare results where mapped and reference pseudo-BED elements overlap in both conditions, and work your way back from that.

This seems doable, but you'll have to do some scripting work. Hopefully, the code snippets I wrote for you will help you get started.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Alex Reynolds29k

I'll try that. I'll let you know if I figure it out. Thanks

ADD REPLYlink written 5.3 years ago by Nicolas Rosewick8.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1078 users visited in the last hour