Read genomic ranges from a very big file in R
1
1
Entering edit mode
4.4 years ago
bisansamara ▴ 20

I need to read point mutations from the 1000Genome data set (total of 84,801,880 rows), and check for overlap with another set of chromosome ranges using the GenomicRanges Package in R.

To do so I usually run the following code:

> x0 = read.csv ("NameOfFile1.csv") #read data from the first file (1000Genome in this case)

> library(GenomicRanges)
> gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
> gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

> hits = findOverlaps(gr0, gr1)
> hits


However, the first file is too big, and it's not possible to convert it to CSV. I tried converting it to txt file, but still I cannot read it in R. I used the following command:

  > x0 = read.table ("NameofFile1.txt",header=FALSE,sep=",",stringsAsFactors = FALSE,quote = "")
Error: cannot allocate vector of size 1000.0 Mb


Is there any other way to check for overlaps between the two files? Your help is highly appreciated!

genome R GenomicRanges overlap • 3.0k views
2
Entering edit mode

There are better ways than R to do this as others have pointed out. For R solution, you may try fread() from data.table package that is blazing fast for reading large files. It tries to guess the delimiter and header automatically. It will give you an object of class data.table, which is very similar to data.frame though quirky sometimes. You may set the argument data.table = F to get the usual dataframe object after the read is complete.

0
Entering edit mode

I tried fread( ), it's much faster than read.csv. Thanks for suggesting that. I would like to use it with the same code I wrote above to read my file, but first I need to add headers to a BED file. Do you have an idea how I can do this?

0
Entering edit mode

try header=F explicitly in fread, and then add the headers in your code.

1
Entering edit mode

Use BEDOPS bedops or bedmap to retrieve overlaps or associations between genomic intervals. It will scale to whole-genome scale datasets.

0
Entering edit mode

This is very easy to use. However, it requires that both files have genomic intervals. In my case, one of the files is the 1000Genome (point mutations=chr# and position; not an interval), and the other file has genomic intervals. Is there a way around to use bedops in this case?

1
Entering edit mode

Use awk to preprocess the 1000Genomes data into intervals. For examples, if the first three columns are chromosome, position, and ID, then you can do something like this:

$awk -vOFS="\t" '{ print$1, $2, ($2+1), $3; }' pointMutations.txt | sort-bed - > pointMutations.bed  Then use pointMutations.bed with your other intervals file with bedops or bedmap. ADD REPLY 0 Entering edit mode Thanks! do u know how I can add a header to the file? ADD REPLY 0 Entering edit mode To which file? Headers get ignored (removed) when doing set operations, in any case. Is there a reason you need headers? ADD REPLY 0 Entering edit mode I used the command you suggested to preprocess the 1000Genomes data into intervals. So the file I have now has 3 columns, and I want to add the following header: (chr \t start \t stop) and then use the GenomicRanges package (use the code provided in my original question). This would be the perfect way to achieve what I want. ADD REPLY 0 Entering edit mode If your input is VCF, then you can skip awk and use vcf2bed (also in the BEDOPS kit): $ vcf2bed < pointMutations.vcf > pointMutations.bed


Or if you don't want an intermediate file:

$bedops --operations <(vcf2bed < pointMutations.vcf) otherIntervals.bed > answer.bed  Or likewise for bedmap: $ bedmap --operations otherIntervals.bed <(vcf2bed < pointMutations.vcf) > answer.bed


The <(vcf2bed < pointMutations.vcf) part is what is called a process substitution. It creates a stream of BED-formatted intervals on the fly for consumption by bedops or bedmap, as if it was a regular file.

0
Entering edit mode

Do you have a 32 bit machine or a 64 bit machine?

0
Entering edit mode

I have a 64 bit machine

0
Entering edit mode

Have a look at all the import commands of the rtracklayer package.

1
Entering edit mode
4.4 years ago

use bedtools intersect.

0
Entering edit mode

This seems like a good option, especially that it allows VCF and BED files as input. However, I couldn't install it! I used the same command provided here, and I got the following error after running \$make: fatal error: zlib.h: No such file or directory

0
Entering edit mode

So you should install zlib for your OS.