Entering edit mode
9.0 years ago
SOHAIL
▴
410
Hi Everyone,
I have BED genome interval files (format: chr startsite endsite) of masking regions for human genome. I want to determine like how much ( amount in percentage) of the human genome is masked by those intervals in BED file?? Is there any quick way to check this??
Please share your useful suggestions...!
Thank you!
Say you are working with
hg19data and you have two files:hg19.extents.bed.hg19.masked.bedThe file
hg19.extents.bedmight look like:Whatever these files look like will depend on your genomic build and your data sources.
Given those two inputs, you can run the following one-liner with BEDOPS bedmap:
The file
answer.bedwill contain the extent interval for each chromosome, the number of bases of overlap by masked regions over that extent, and the fraction of extent covered by the masked region (some floating point value between 0 and 1).If you want to turn fractions into percentages, multiply by 100:
If you want the entire genome, not by chromosome, then you could extend the code in the
awkblock just a little:The file
answer.txtwill contain the fraction calculation for the entire genome. Multiply by 100 to get a percentage value.Note: A "sorted" BED file is a BED file sorted with BEDOPS sort-bed.
Note: Masked regions can be disjoint or overlapping, if all you want is the full extent to which masked intervals overlap the parent chromosome. They just need to be sorted. The
--bases-uniqoperator ensures counts are unique.Your post is more an answer than a comment ;-)
Although your approach will result in detailed information, wouldn't it be sufficient to merge intervals (in the case overlaps exist), sum up the sizes and divide that by the known size of the genome?
Say you want percentage of some subset of the genome, instead of the entire genome. This approach is general enough to be used for different reference sets, by changing
hg19.extents.bedto some other set of interest.Generally, you can also merge the masked elements (or whatever) to create a disjoint map set that is piped into the
bedmapstep, using the non-unique--basesto do counting:For the purposes of writing a pipeline, the performance of this second approach is perhaps a little better, but the first approach is perhaps more readable as a one-liner.
Thanks for clarifying!
Thank you very much @Alex Reynolds, and @ WouterDeCoster for your kind suggestions.. :) worked pretty well for me..