I need to read point mutations from the 1000Genome data set (total of 84,801,880 rows), and check for overlap with another set of chromosome ranges using the GenomicRanges Package in R.
To do so I usually run the following code:
> x0 = read.csv ("NameOfFile1.csv") #read data from the first file (1000Genome in this case)
> x1 = read.csv ("NameOfFile2.csv") #read data from the second file
> library(GenomicRanges)
> gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
> gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
> hits = findOverlaps(gr0, gr1)
> hits
However, the first file is too big, and it's not possible to convert it to CSV. I tried converting it to txt file, but still I cannot read it in R. I used the following command:
> x0 = read.table ("NameofFile1.txt",header=FALSE,sep=",",stringsAsFactors = FALSE,quote = "")
Error: cannot allocate vector of size 1000.0 Mb
Is there any other way to check for overlaps between the two files? Your help is highly appreciated!
There are better ways than R to do this as others have pointed out. For R solution, you may try
fread()fromdata.tablepackage that is blazing fast for reading large files. It tries to guess the delimiter and header automatically. It will give you an object of classdata.table, which is very similar todata.framethough quirky sometimes. You may set the argumentdata.table = Fto get the usualdataframeobject after the read is complete.I tried fread( ), it's much faster than read.csv. Thanks for suggesting that. I would like to use it with the same code I wrote above to read my file, but first I need to add headers to a BED file. Do you have an idea how I can do this?
try
header=Fexplicitly in fread, and then add the headers in your code.Use BEDOPS
bedopsorbedmapto retrieve overlaps or associations between genomic intervals. It will scale to whole-genome scale datasets.This is very easy to use. However, it requires that both files have genomic intervals. In my case, one of the files is the 1000Genome (point mutations=chr# and position; not an interval), and the other file has genomic intervals. Is there a way around to use bedops in this case?
Use
awkto preprocess the 1000Genomes data into intervals. For examples, if the first three columns are chromosome, position, and ID, then you can do something like this:Then use
pointMutations.bedwith your other intervals file withbedopsorbedmap.Thanks! do u know how I can add a header to the file?
To which file? Headers get ignored (removed) when doing set operations, in any case. Is there a reason you need headers?
I used the command you suggested to preprocess the 1000Genomes data into intervals. So the file I have now has 3 columns, and I want to add the following header: (chr \t start \t stop) and then use the GenomicRanges package (use the code provided in my original question). This would be the perfect way to achieve what I want.
If your input is VCF, then you can skip
awkand usevcf2bed(also in the BEDOPS kit):Or if you don't want an intermediate file:
Or likewise for
bedmap:The
<(vcf2bed < pointMutations.vcf)part is what is called a process substitution. It creates a stream of BED-formatted intervals on the fly for consumption bybedopsorbedmap, as if it was a regular file.Do you have a 32 bit machine or a 64 bit machine?
I have a 64 bit machine
Have a look at all the
importcommands of the rtracklayer package.