Question: Read genomic ranges from a very big file in R
0
gravatar for bisansamara
27 days ago by
bisansamara10
bisansamara10 wrote:

I need to read point mutations from the 1000Genome data set (total of 84,801,880 rows), and check for overlap with another set of chromosome ranges using the GenomicRanges Package in R.

To do so I usually run the following code:

> x0 = read.csv ("NameOfFile1.csv") #read data from the first file (1000Genome in this case)
> x1 = read.csv ("NameOfFile2.csv") #read data from the second file

> library(GenomicRanges)
> gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
> gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

> hits = findOverlaps(gr0, gr1)
> hits

However, the first file is too big, and it's not possible to convert it to CSV. I tried converting it to txt file, but still I cannot read it in R. I used the following command:

  > x0 = read.table ("NameofFile1.txt",header=FALSE,sep=",",stringsAsFactors = FALSE,quote = "")
 Error: cannot allocate vector of size 1000.0 Mb

Is there any other way to check for overlaps between the two files? Your help is highly appreciated!

overlap genomicranges R genome • 277 views
ADD COMMENTlink modified 19 days ago • written 27 days ago by bisansamara10
1

Use BEDOPS bedops or bedmap to retrieve overlaps or associations between genomic intervals. It will scale to whole-genome scale datasets.

ADD REPLYlink modified 27 days ago • written 27 days ago by Alex Reynolds22k

This is very easy to use. However, it requires that both files have genomic intervals. In my case, one of the files is the 1000Genome (point mutations=chr# and position; not an interval), and the other file has genomic intervals. Is there a way around to use bedops in this case?

ADD REPLYlink written 19 days ago by bisansamara10
1

Use awk to preprocess the 1000Genomes data into intervals. For examples, if the first three columns are chromosome, position, and ID, then you can do something like this:

$ awk -vOFS="\t" '{ print $1, $2, ($2+1), $3; }' pointMutations.txt | sort-bed - > pointMutations.bed

Then use pointMutations.bed with your other intervals file with bedops or bedmap.

ADD REPLYlink written 19 days ago by Alex Reynolds22k

Thanks! do u know how I can add a header to the file?

ADD REPLYlink written 17 days ago by bisansamara10

To which file? Headers get ignored (removed) when doing set operations, in any case. Is there a reason you need headers?

ADD REPLYlink written 17 days ago by Alex Reynolds22k

I used the command you suggested to preprocess the 1000Genomes data into intervals. So the file I have now has 3 columns, and I want to add the following header: (chr \t start \t stop) and then use the GenomicRanges package (use the code provided in my original question). This would be the perfect way to achieve what I want.

ADD REPLYlink written 17 days ago by bisansamara10

If your input is VCF, then you can skip awk and use vcf2bed (also in the BEDOPS kit):

$ vcf2bed < pointMutations.vcf > pointMutations.bed

Or if you don't want an intermediate file:

$ bedops --operations <(vcf2bed < pointMutations.vcf) otherIntervals.bed > answer.bed

Or likewise for bedmap:

$ bedmap --operations otherIntervals.bed <(vcf2bed < pointMutations.vcf) > answer.bed

The <(vcf2bed < pointMutations.vcf) part is what is called a process substitution. It creates a stream of BED-formatted intervals on the fly for consumption by bedops or bedmap, as if it was a regular file.

ADD REPLYlink written 19 days ago by Alex Reynolds22k
1

There are better ways than R to do this as others have pointed out. For R solution, you may try fread() from data.table package that is blazing fast for reading large files. It tries to guess the delimiter and header automatically. It will give you an object of class data.table, which is very similar to data.frame though quirky sometimes. You may set the argument data.table = F to get the usual dataframe object after the read is complete.

ADD REPLYlink written 20 days ago by Santosh Anand3.3k

I tried fread( ), it's much faster than read.csv. Thanks for suggesting that. I would like to use it with the same code I wrote above to read my file, but first I need to add headers to a BED file. Do you have an idea how I can do this?

ADD REPLYlink written 15 days ago by bisansamara10

try header=F explicitly in fread, and then add the headers in your code.

ADD REPLYlink written 12 days ago by Santosh Anand3.3k

Do you have a 32 bit machine or a 64 bit machine?

ADD REPLYlink written 27 days ago by Ram13k

I have a 64 bit machine

ADD REPLYlink written 27 days ago by bisansamara10

Have a look at all the import commands of the rtracklayer package.

ADD REPLYlink written 26 days ago by Charles Plessy2.4k
1
gravatar for Pierre Lindenbaum
27 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum103k wrote:

use bedtools intersect.

ADD COMMENTlink written 27 days ago by Pierre Lindenbaum103k

This seems like a good option, especially that it allows VCF and BED files as input. However, I couldn't install it! I used the same command provided here, and I got the following error after running $make: fatal error: zlib.h: No such file or directory

ADD REPLYlink written 19 days ago by bisansamara10

So you should install zlib for your OS.

ADD REPLYlink written 15 days ago by genomax40k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1002 users visited in the last hour