Problems dividing a genome into non-overlapping bins and counting genes
7.7 years ago
milk841103 ▴ 10

Hello,

I am analyzing the distribution of CNV-affected genes in a genome, so I divided the genome into non-overlapping bins of equal length and counted the number of CNV-affected genes within each bin. The method I used is from this post: C: Finding gene density from reference genome using R. As you can see there, the GRanges table containing the locations of all the bins has an extra column showing the number of genes (in my case, CNV-affected genes) in each bin. I then found the 10 bins that contain the most CNV-affected genes, extracted their coordinates, and tried to identify the individual genes (from a list of CNV-affected genes that I generated) located in those bins for further analysis. The code I used to extract these genes is a Unix command from this post of mine: http://unix.stackexchange.com/questions/303809/select-multi-column-rows-based-on-ranges-specified-in-a-separate-file

However, while the first method in R identified 247 CNV-affected genes in these 10 bins, my Unix command found only 175 CNV-affected genes in the same 10 bins. My co-worker wrote a Perl script for the same purpose, and it also found 175 genes.

While using the R method, I found that in some bins the number of genes reported by the countOverlaps function was off by 1 when verified with the subset function (both are mentioned in the first post). I assume this is because some genes cross two bins. However, I don't understand why the results from R and the Unix command differ so much. Can anyone help explain this? Thank you very much!

genome R gene unix • 2.2k views
7.7 years ago

BED or tab-delimited text files you work with via Unix likely use a half-open 0-based index. GRanges objects use a closed 1-based index. So this could create off-by-one errors if you're not adjusting coordinates before trying to do an apples-to-apples set operation.
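As a rough illustration of the coordinate adjustment, here is a minimal Python sketch. The function names and the example intervals are hypothetical and not taken from either of the linked posts; it only assumes that one file stores 0-based half-open intervals and the other tool expects 1-based closed ones.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start + 1, end]."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start - 1, end)."""
    return start - 1, end


def overlaps_closed(a, b):
    """True if two 1-based closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical bin covering bases 1..1000 (1-based closed).
bin_1based = (1, 1000)

# Hypothetical gene stored BED-style as [1000, 1500), i.e. bases 1001..1500.
gene_bed = (1000, 1500)

# Mixing the two systems without conversion reports a spurious overlap.
naive = overlaps_closed(bin_1based, gene_bed)

# Converting first shows that the gene actually starts just past the bin.
correct = overlaps_closed(bin_1based, bed_to_one_based(*gene_bed))

print(naive, correct)
```

Here the unconverted comparison reports an overlap and the converted one does not, which is the kind of off-by-one difference that shows up for features sitting exactly on a bin boundary.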


Thank you for your answer! Do you mind telling me how I should adjust the coordinates? Also, could this off-by-one error make such a huge difference? The sums of genes from the 10 bins calculated using R and the Unix/Perl scripts were off by 50...


It will depend on how you count things, but generally, you might start with the following overview and decide whether this is an issue: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems. You could also test overlaps on a handful of ranges via R/GRanges and Unix/Perl, to see whether you get the same or different answers with whatever procedures you're using.
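To make "how you count things" concrete, here is a small Python sketch of such a test on a handful of made-up ranges. The bins, gene names, and coordinates are hypothetical, and the two counting rules are assumptions chosen for illustration; the sketch does not claim to reproduce what the R or Unix/Perl procedures in question actually do.

```python
# Hypothetical bins and genes, all in 1-based closed coordinates.
bins = [(1, 100), (101, 200), (201, 300)]
genes = {
    "geneA": (10, 50),    # entirely inside the first bin
    "geneB": (90, 120),   # crosses the boundary between bins 1 and 2
    "geneC": (150, 210),  # crosses the boundary between bins 2 and 3
    "geneD": (250, 260),  # entirely inside the third bin
}


def any_overlap(gene, b):
    """Gene shares at least one base with the bin."""
    return gene[0] <= b[1] and b[0] <= gene[1]


def within(gene, b):
    """Gene lies entirely inside the bin."""
    return b[0] <= gene[0] and gene[1] <= b[1]


def count_per_bin(rule):
    """Number of genes satisfying the given rule, for each bin."""
    return [sum(rule(g, b) for g in genes.values()) for b in bins]


overlap_counts = count_per_bin(any_overlap)
within_counts = count_per_bin(within)

# Genes touching at least one bin, each counted once.
unique_genes = {
    name for name, g in genes.items() if any(any_overlap(g, b) for b in bins)
}

print(sum(overlap_counts), sum(within_counts), len(unique_genes))
```

With these toy ranges the three totals differ, because a gene that spans a bin boundary is counted once per bin under the any-overlap rule, not at all under the within rule, and once in the unique set. Running both procedures on a small set like this is one way to see which convention each of them follows.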
