I'm working with copy number variation (CNV) data, if that helps to visualize the problem. There are two file types. In the first file we have CNV data, which (for simplicity's sake) is formatted like so:
chr1:100000-149000 numsnp=10 length=49000 sample1 startsnp=rs100 endsnp=rs149
chr1:150000-200000 numsnp=10 length=50000 sample1 startsnp=rs150 endsnp=rs200
In the above, each CNV in sample1 spans about 50 kb, but they appear to be a single CNV that was split in two. This sometimes happens if some intermediate probes didn't detect a copy number change.
There is another file which contains info on which probes/snps are in the file and looks like this:
Name Chr Position
rs100 1 100000
rs101 1 101000
...
rs200 1 200000
My goal is to merge CNVs that are separated either A) by some distance in the same sample or B) by some number of probes, as defined by the second file type. B is the better choice. Any tools or resources any of you might use to do this on a regular basis? Links or detailed instructions most appreciated.
Thanks, Rx
Could you define more precisely what is meant by "merge CNVs"? Perhaps give an indication of what the final output should look like.
As Neil suggested, try to reformulate your question; your current description appears to have insufficient details.
OK, the final output here should look like: chr1:100000-200000 numsnp=21 length=100000 sample1 startsnp=rs100 endsnp=rs200
This assumes a gap of one SNP. I think I'm close to a Perl solution, but are there any bioinformatics tools that can merge adjacent segments separated by some number (or percentage) of markers designated by a third file type?
Rx
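Not a dedicated tool, but here is a minimal Python sketch of option B (merging by probe count) that may help as a starting point. It makes several assumptions that are not in the original post: the function names are made up for illustration, the probe map is small enough to hold in memory, segments within a sample do not overlap, and the merged numsnp is the sum of both segments plus the probes in the gap. The probe rsGAP is a hypothetical marker added so that the example has a gap of exactly one SNP.

```python
import re

REGION = re.compile(r"chr(\w+):(\d+)-(\d+)$")


def load_probe_ranks(map_lines):
    """Return {probe name: (chromosome, rank)}; rank is the probe's order on its chromosome."""
    by_chrom = {}
    for line in map_lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Name":
            continue  # skip the header, blank lines and anything malformed
        name, chrom, pos = fields
        by_chrom.setdefault(chrom, []).append((int(pos), name))
    ranks = {}
    for chrom, probes in by_chrom.items():
        for rank, (_, name) in enumerate(sorted(probes)):
            ranks[name] = (chrom, rank)
    return ranks


def parse_cnv(line):
    """Parse one CNV line in the format shown in the question."""
    fields = line.split()
    chrom, start, end = REGION.match(fields[0]).groups()
    tags = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    sample = next(f for f in fields[1:] if "=" not in f)  # the bare token, e.g. sample1
    return {
        "chrom": chrom, "start": int(start), "end": int(end), "sample": sample,
        "numsnp": int(tags["numsnp"]),
        "startsnp": tags["startsnp"], "endsnp": tags["endsnp"],
    }


def merge_cnvs(cnv_lines, ranks, max_gap_probes=1):
    """Merge neighbouring CNVs of one sample separated by at most max_gap_probes probes."""
    cnvs = sorted(
        (parse_cnv(l) for l in cnv_lines if l.strip()),
        key=lambda c: (c["sample"], c["chrom"], c["start"]),
    )
    merged = []
    for cur in cnvs:
        prev = merged[-1] if merged else None
        if prev and (prev["sample"], prev["chrom"]) == (cur["sample"], cur["chrom"]):
            # number of probes lying strictly between the two segments
            gap = ranks[cur["startsnp"]][1] - ranks[prev["endsnp"]][1] - 1
            if 0 <= gap <= max_gap_probes:  # overlapping segments (gap < 0) are left alone
                prev["end"], prev["endsnp"] = cur["end"], cur["endsnp"]
                prev["numsnp"] += gap + cur["numsnp"]  # assumption: gap probes are counted
                continue
        merged.append(cur)
    return merged


def format_cnv(c):
    """Write a CNV back out in the input format, recomputing length as end - start."""
    return (f"chr{c['chrom']}:{c['start']}-{c['end']} numsnp={c['numsnp']} "
            f"length={c['end'] - c['start']} {c['sample']} "
            f"startsnp={c['startsnp']} endsnp={c['endsnp']}")


# Toy data modelled on the question; rsGAP is a hypothetical probe in the gap.
probe_map = """Name Chr Position
rs100 1 100000
rs149 1 149000
rsGAP 1 149500
rs150 1 150000
rs200 1 200000""".splitlines()

cnv_lines = [
    "chr1:100000-149000 numsnp=10 length=49000 sample1 startsnp=rs100 endsnp=rs149",
    "chr1:150000-200000 numsnp=10 length=50000 sample1 startsnp=rs150 endsnp=rs200",
]

ranks = load_probe_ranks(probe_map)
for cnv in merge_cnvs(cnv_lines, ranks, max_gap_probes=1):
    print(format_cnv(cnv))
```

For option A (merging by distance), the same loop should work if the gap is computed as cur["start"] - prev["end"] and compared against a base-pair threshold instead of a probe count.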