Question: Comparing methylation data - Data cleaning and efficient code question
4.2 years ago by
startup_biostar10 wrote:

I have a dataset, data_1, in plain-text (txt) format with the columns chr (chromosome number), stable_id, start, end and methylation. Its coordinates are in the mm9 assembly.

I have a second dataset, data_2, in bigWig format with the columns seqnames, ranges, strand and methylation score. Its coordinates are in the mm10 assembly, and it has over 10 million rows.

I want to compare data_1$start and data_1$end with data_2$ranges and compute the average methylation score and the number of CpG islands.

These are the steps I followed, which I believe are a long route:

  1. Converted data_1 to the 'chrN:start-end' format and exported it as a CSV file.
  2. Uploaded this CSV file to the UCSC Genome Browser LiftOver tool and converted it from mm9 to mm10. The output was a BED file.
  3. Replaced the start and end columns of data_1 with the new start and end coordinates from the lifted-over BED file.
  4. Compared the start and end of data_1 with data_2. This is where I am stuck: it takes a lot of time to process in R. Is there a simpler way than the one I followed?

I am new to the field, so please explain in steps.

sequencing R genome • 1.2k views
4.2 years ago by
PoGibas4.8k wrote:

Welcome to Biostars.
Please see my answer: A: findOverlaps function in R
In that answer I use foverlaps from the data.table package. It is fast and should give you what you want. If there are still problems, please edit your question and we will help.
Basically you want to:

library(data.table)
# Both objects must be data.tables that share chr, start and end columns
setkey(data_1, chr, start, end)
setkey(data_2, chr, start, end)
foverlaps(data_1, data_2)
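To make the idea concrete, here is a minimal sketch on toy data that goes one step further and summarises the overlaps per region. It assumes both datasets have already been brought to the same assembly and converted to data.tables with chr, start and end columns. The object names (regions, cpgs), the score column and all values are made up for illustration and are not taken from your files.

```r
library(data.table)

# Toy regions standing in for data_1 after liftOver (hypothetical values)
regions <- data.table(
  chr       = c("chr1", "chr1", "chr2"),
  stable_id = c("r1", "r2", "r3"),
  start     = c(100L, 500L, 100L),
  end       = c(200L, 600L, 200L)
)

# Toy per-position scores standing in for data_2 (hypothetical values)
cpgs <- data.table(
  chr   = c("chr1", "chr1", "chr1", "chr2"),
  start = c(120L, 180L, 550L, 900L),
  end   = c(120L, 180L, 550L, 900L),
  score = c(0.2, 0.4, 0.9, 0.5)
)

# foverlaps() needs the second table keyed, with the interval
# start and end as the last two key columns
setkey(cpgs, chr, start, end)

# One row per (region, overlapping position) pair;
# nomatch = NULL drops regions that overlap nothing
hits <- foverlaps(regions, cpgs,
                  by.x = c("chr", "start", "end"),
                  nomatch = NULL)

# Per region: mean score and number of overlapping positions
summary_dt <- hits[, .(mean_score = mean(score), n_cpg = .N),
                   by = .(chr, stable_id)]
print(summary_dt)
```

In the joined table the start and end columns of the first table are renamed i.start and i.end, which is why the summary groups by chr and stable_id. Whether each row of data_2 is a single CpG or a wider bin depends on how the bigWig was produced, so treat n_cpg as a count of overlapping rows.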

Just a friendly suggestion: don't name objects like data_1; use data1 instead. See Google's R Style Guide.
