Question

Read genomic ranges from a very big file in R

1

Entering edit mode

6.3 years ago

bisansamara ▴ 20

I need to read point mutations from the 1000Genome data set (total of 84,801,880 rows), and check for overlap with another set of chromosome ranges using the GenomicRanges Package in R.

To do so I usually run the following code:

> x0 = read.csv ("NameOfFile1.csv") #read data from the first file (1000Genome in this case)
> x1 = read.csv ("NameOfFile2.csv") #read data from the second file

> library(GenomicRanges)
> gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
> gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

> hits = findOverlaps(gr0, gr1)
> hits

However, the first file is too big, and it's not possible to convert it to CSV. I tried converting it to txt file, but still I cannot read it in R. I used the following command:

  > x0 = read.table ("NameofFile1.txt",header=FALSE,sep=",",stringsAsFactors = FALSE,quote = "")
 Error: cannot allocate vector of size 1000.0 Mb

Is there any other way to check for overlaps between the two files? Your help is highly appreciated!

genome R GenomicRanges overlap • 4.3k views

ADD COMMENT • link 6.3 years ago by bisansamara ▴ 20

2

Entering edit mode

There are better ways than R to do this as others have pointed out. For R solution, you may try fread() from data.table package that is blazing fast for reading large files. It tries to guess the delimiter and header automatically. It will give you an object of class data.table, which is very similar to data.frame though quirky sometimes. You may set the argument data.table = F to get the usual dataframe object after the read is complete.

ADD REPLY • link 6.3 years ago by Santosh Anand 5.7k

0

Entering edit mode

I tried fread( ), it's much faster than read.csv. Thanks for suggesting that. I would like to use it with the same code I wrote above to read my file, but first I need to add headers to a BED file. Do you have an idea how I can do this?

ADD REPLY • link 6.3 years ago by bisansamara ▴ 20

0

Entering edit mode

try header=F explicitly in fread, and then add the headers in your code.

ADD REPLY • link 6.3 years ago by Santosh Anand 5.7k

1

Entering edit mode

Use BEDOPS bedops or bedmap to retrieve overlaps or associations between genomic intervals. It will scale to whole-genome scale datasets.

ADD REPLY • link 6.3 years ago by Alex Reynolds 35k

0

Entering edit mode

This is very easy to use. However, it requires that both files have genomic intervals. In my case, one of the files is the 1000Genome (point mutations=chr# and position; not an interval), and the other file has genomic intervals. Is there a way around to use bedops in this case?

ADD REPLY • link 6.3 years ago by bisansamara ▴ 20

1

Entering edit mode

Use awk to preprocess the 1000Genomes data into intervals. For examples, if the first three columns are chromosome, position, and ID, then you can do something like this:

$ awk -vOFS="\t" '{ print $1, $2, ($2+1), $3; }' pointMutations.txt | sort-bed - > pointMutations.bed

Then use pointMutations.bed with your other intervals file with bedops or bedmap.

ADD REPLY • link 6.3 years ago by Alex Reynolds 35k

0

Entering edit mode

Thanks! do u know how I can add a header to the file?

ADD REPLY • link 6.3 years ago by bisansamara ▴ 20

0

Entering edit mode

To which file? Headers get ignored (removed) when doing set operations, in any case. Is there a reason you need headers?

ADD REPLY • link 6.3 years ago by Alex Reynolds 35k

0

Entering edit mode

I used the command you suggested to preprocess the 1000Genomes data into intervals. So the file I have now has 3 columns, and I want to add the following header: (chr \t start \t stop) and then use the GenomicRanges package (use the code provided in my original question). This would be the perfect way to achieve what I want.

ADD REPLY • link 6.3 years ago by bisansamara ▴ 20

0

Entering edit mode

If your input is VCF, then you can skip awk and use vcf2bed (also in the BEDOPS kit):

$ vcf2bed < pointMutations.vcf > pointMutations.bed

Or if you don't want an intermediate file:

$ bedops --operations <(vcf2bed < pointMutations.vcf) otherIntervals.bed > answer.bed

Or likewise for bedmap:

$ bedmap --operations otherIntervals.bed <(vcf2bed < pointMutations.vcf) > answer.bed

The <(vcf2bed < pointMutations.vcf) part is what is called a process substitution. It creates a stream of BED-formatted intervals on the fly for consumption by bedops or bedmap, as if it was a regular file.

ADD REPLY • link 6.3 years ago by Alex Reynolds 35k

0

Entering edit mode

Do you have a 32 bit machine or a 64 bit machine?

ADD REPLY • link 6.3 years ago by Ram 43k

0

Entering edit mode

I have a 64 bit machine

ADD REPLY • link 6.3 years ago by bisansamara ▴ 20

0

Entering edit mode

Have a look at all the import commands of the rtracklayer package.

ADD REPLY • link 6.3 years ago by Charles Plessy ★ 2.9k

score 1 · Answer 1 · 2017-12-27

1

Entering edit mode

6.3 years ago

Pierre Lindenbaum 161k

use bedtools intersect.

ADD COMMENT • link 6.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

This seems like a good option, especially that it allows VCF and BED files as input. However, I couldn't install it! I used the same command provided here, and I got the following error after running $make: fatal error: zlib.h: No such file or directory