Merging Genomic Segments Separated By Some Distance / Number Of Markers
11.5 years ago
Ryan D ★ 3.4k

I'm working with copy number variation (CNV) data, if that helps to visualize the problem. There are two file types. In the first file we have CNV data, which (for simplicity's sake) is formatted like so:

chr1:100000-149000  numsnp=10  length=49000 sample1 startsnp=rs100 endsnp=rs149

chr1:150000-200000  numsnp=10  length=50000 sample1 startsnp=rs150 endsnp=rs200


In the above, each CNV in sample1 spans about 50 kb, but what is probably one CNV has been split in two. This sometimes happens if some intermediate probes didn't detect a copy number change.

The second file lists the probes/SNPs on the array and their positions, and looks like this:

Name    Chr     Position
rs100   1       100000
rs101   1       101000

...

rs200   1       200000


My goal is to merge CNVs in the same sample that are separated either A) by some distance or B) by some number of probes, as defined by the second file type. B is the better choice. Are there any tools or resources you use to do this on a regular basis? Links or detailed instructions would be most appreciated.

Thanks, Rx

cnv merge perl • 3.9k views

Could you define more precisely what is meant by "merge CNVs"? Perhaps give an indication of what the final output should look like.


As Neil suggested, try to reformulate your question; your current description does not have enough detail.


OK, the final output here should look like: chr1:100000-200000 numsnp=21 length=100000 sample1 startsnp=rs100 endsnp=rs200

That assumes a gap of one SNP. I think I'm close to a perl solution, but are there any bioinformatic tools that can merge adjacent segments separated by some number (or percentage) of markers designated by a third file type?

Rx

10.5 years ago

Merging CNVs that are separated by some distance is a task that could be handled by BEDTools. In particular, refer to the mergeBed utility (invoked as "bedtools merge" in current releases).

"mergeBed combines overlapping or “book-ended” (that is, one base pair away) features in a feature file into a single feature which spans all of the combined features."

By default only features that are already overlapping will be merged. If you want to merge features that may be separated by some distance, it seems like using the -d option should work.

"-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged."
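To make the effect of a distance threshold concrete, here is a minimal Python sketch of the same idea. It is not BEDTools code and does not try to reproduce its exact coordinate conventions; the function name merge_by_distance and the max_gap parameter are made up for illustration. It groups segments by sample and chromosome first, since the question asks for merging within a sample.

```python
from collections import defaultdict


def merge_by_distance(segments, max_gap=0):
    """Merge segments of one sample/chromosome whose gap is <= max_gap bp.

    segments: iterable of (sample, chrom, start, end) tuples.
    max_gap:  hypothetical parameter playing the role of a distance threshold.
    """
    # Group by sample and chromosome so segments from different samples
    # are never joined together.
    groups = defaultdict(list)
    for sample, chrom, start, end in segments:
        groups[(sample, chrom)].append((start, end))

    merged = []
    for (sample, chrom), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:
                # Close enough (or overlapping): extend the current segment.
                cur_end = max(cur_end, end)
            else:
                merged.append((sample, chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((sample, chrom, cur_start, cur_end))
    return merged


# The two segments from the question, 1000 bp apart.
segments = [
    ("sample1", "chr1", 100000, 149000),
    ("sample1", "chr1", 150000, 200000),
]
print(merge_by_distance(segments, max_gap=1000))
# [('sample1', 'chr1', 100000, 200000)]
```

With max_gap=500 the two segments stay separate, because the gap between them (150000 - 149000) is 1000 bp.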

You will need to convert your current file format into BED format (chromosome, start, end, plus any extra columns such as the sample name), but that is trivial. Your second scenario is not as obvious, but there are many features of BEDTools that allow for a variety of comparisons between two files containing coordinates.
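For the second scenario (merging by number of probes), one possible approach outside BEDTools is to rank the probes by position and compare the rank of one segment's endsnp with the rank of the next segment's startsnp. The Python below is only a sketch under assumptions: the function names and the max_gap_probes parameter are invented for illustration, the line format is taken from the simplified example in the question, and the probe map is toy data (probe names other than rs100, rs149, rs150 and rs200 are made up) built so that exactly one probe lies between the two segments.

```python
import re

# Pattern for the simplified CNV line format shown in the question.
CNV_RE = re.compile(
    r"chr(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s+numsnp=\d+\s+"
    r"length=\S+\s+(?P<sample>\S+)\s+"
    r"startsnp=(?P<startsnp>\S+)\s+endsnp=(?P<endsnp>\S+)"
)


def parse_cnv(line):
    match = CNV_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised CNV line: {line!r}")
    cnv = match.groupdict()
    cnv["start"], cnv["end"] = int(cnv["start"]), int(cnv["end"])
    return cnv


def load_probe_ranks(map_lines):
    """Return {probe name: rank among probes on its chromosome}."""
    by_chrom = {}
    for line in map_lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Name":
            continue  # skip the header, blank lines and malformed lines
        name, chrom, pos = fields
        by_chrom.setdefault(chrom, []).append((int(pos), name))
    ranks = {}
    for probes in by_chrom.values():
        for rank, (_pos, name) in enumerate(sorted(probes)):
            ranks[name] = rank
    return ranks


def merge_by_probe_gap(cnvs, ranks, max_gap_probes=1):
    """Merge neighbours with at most max_gap_probes probes between them."""
    ordered = sorted(cnvs, key=lambda c: (c["sample"], c["chrom"], c["start"]))
    merged = []
    for cnv in ordered:
        prev = merged[-1] if merged else None
        same_group = (prev is not None
                      and prev["sample"] == cnv["sample"]
                      and prev["chrom"] == cnv["chrom"])
        # Probes strictly between the previous end probe and this start probe.
        if same_group and ranks[cnv["startsnp"]] - ranks[prev["endsnp"]] - 1 <= max_gap_probes:
            if cnv["end"] > prev["end"]:
                prev["end"], prev["endsnp"] = cnv["end"], cnv["endsnp"]
        else:
            merged.append(dict(cnv))
    return merged


def format_cnv(cnv, ranks):
    # numsnp is recomputed from the probe ranks, so gap probes are counted.
    numsnp = ranks[cnv["endsnp"]] - ranks[cnv["startsnp"]] + 1
    length = cnv["end"] - cnv["start"]
    return (f"chr{cnv['chrom']}:{cnv['start']}-{cnv['end']} numsnp={numsnp} "
            f"length={length} {cnv['sample']} "
            f"startsnp={cnv['startsnp']} endsnp={cnv['endsnp']}")


# Toy probe map: 10 probes per segment plus one probe in the gap.
positions = ([100000 + 5000 * i for i in range(9)] + [149000, 149500]
             + [150000 + 5000 * i for i in range(9)] + [200000])
known = {100000: "rs100", 149000: "rs149", 150000: "rs150", 200000: "rs200"}
map_lines = ["Name Chr Position"] + [
    f"{known.get(pos, 'probe' + str(pos))} 1 {pos}" for pos in positions
]
cnv_lines = [
    "chr1:100000-149000  numsnp=10  length=49000 sample1 startsnp=rs100 endsnp=rs149",
    "chr1:150000-200000  numsnp=10  length=50000 sample1 startsnp=rs150 endsnp=rs200",
]

ranks = load_probe_ranks(map_lines)
merged = merge_by_probe_gap([parse_cnv(l) for l in cnv_lines], ranks, max_gap_probes=1)
for cnv in merged:
    print(format_cnv(cnv, ranks))
# chr1:100000-200000 numsnp=21 length=100000 sample1 startsnp=rs100 endsnp=rs200
```

Working on probe ranks rather than base-pair distance means the threshold behaves the same in probe-dense and probe-sparse regions, which is presumably why option B was preferred in the question. Setting max_gap_probes=0 leaves the two example segments unmerged.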

BEDTools Manual