Fastest Way Of Extracting Millions Of Short Sequences From The Human Genome
6
7
Entering edit mode
11.6 years ago
Aaron Statham ★ 1.1k

Hi,

I have ~ 40 million short read mapping locations which I want to retrieve the reference genome sequence of (human) - so I have a table of chromosome, start, end and strand. At the moment I am using the BSgenome getSeq function on 500k chunks at a time, but it is taking a few hours to get through all 40 million - does anyone have a faster solution? (linux command line friendly)

Thanks, Aaron

EDIT - Here's my current R code, which reads in a csv with co-ordinates 500k lines at a time, and writes out the same csv with two extra columns containing the forward and reverse sequence - it took about 6 hrs to finish an 18 million line csv on a 8-core 64 bit linux box with 16gb ram

infile <- commandArgs(TRUE)[1]
outfile <- commandArgs(TRUE)[2]
width <- as.integer(commandArgs(TRUE)[3])
chunkSize <- 500000

require(BSgenome.Hsapiens.UCSC.hg18)
f1 <- file(infile, open="rt")
fOut <- file(outfile, open="wt")
repeat {
temp <- as.data.frame(do.call(rbind, strsplit(readLines(f1, n=chunkSize), split=",")), stringsAsFactors=FALSE)
if (nrow(temp)==0) break
colnames(temp) <- c("id","chr","fw","rv","strand", "MM")
temp$fw <- as.integer(temp$fw)
temp$rv <- as.integer(temp$rv)
temp$fwSeq=getSeq(Hsapiens, temp$chr, temp$fw, width=width, strand=temp$strand)
temp$rvSeq=getSeq(Hsapiens, temp$chr, temp$rv, width=width, strand=ifelse(temp$strand=="+", "-", "+"))

writeLines(gsub(" ","",apply(temp, 1, paste, collapse=",")), fOut)
}

close(f1)
close(fOut)

sequence bioconductor r • 7.4k views
0
Entering edit mode

Hi Aaron, can you give us some code example to optimize, also what hardware are you using? For me, getting 1 million sequences out of Celegans takes about 3 sec, but that might depend on genome size.

0
Entering edit mode

Hmm hardware is definitely not an issue - however this is code I wrote a while ago so I'm not too sure why I only have it sucking in 500k at a time, R should handle an 18M line csv with 16gb RAM, however my awk solution I posted below will probably beat the pants off BSgenome any day I guess

0
Entering edit mode

I just tried reading the whole csv in, running and writing out the csv - It took about half an hour, so not much longer than the awk solution i posted below, but used ~14gb of RAM so I wouldn't have any wiggle room for larger files (which I will need in the future) so I'll stick with awk (even though its ugly and hard to debug)

0
Entering edit mode

Dont have to stick with it, it's just a matter of the io, let's beat awk...

8
Entering edit mode
11.6 years ago

Have you tried bedtools? I've never used that specific feature, but from what I hear, they're all pretty fast. Specifically, I think you're looking for the fastaFromBed command.

3
Entering edit mode

Other suggestions: If you have access to a cluster, this problem is really easy to split, along either chromosomes, or in smaller chunks if you have more nodes. I'll also mention that BSgenome seemed pretty slow last time I tried to use it. Got a machine with a decent amount of memory? If so, you might be better off just rolling your own - read a whole chromosome's sequence into memory, then slice out the chunks you need (using Perl|Python|Ruby)

0
Entering edit mode

Yeah I ended up writing an awk script to grab the sequences (one chromosome at a time) and formatting them exactly how I wanted them - took about 20 mins to do the 40M, major speedup from R, and also uses barely any RAM (few hundred megs)

I haven't tested bedtools though, it may be faster

8
Entering edit mode
11.4 years ago

Hi Aaron, Michael,

FYI, I've made some improvements to the getSeq() function in BSgenome and to the write.XStringSet() function in Biostrings. With the latest devel() versions of Biostrings (2.17.20) and BSgenome (1.17.5), it takes less than 5 min and 2.5GB of RAM to extract 46 millions short sequences from the Human genome and write them to a FASTA file.

First generate 46 million short read mapping locations:

library(BSgenome.Hsapiens.UCSC.hg18)
{
start <- lapply(seqnames(x),
function(name)
{
seqlength <- seqlengths(x)[name]
seqlength * density,
replace=TRUE)
})
names <- rep.int(seqnames(x), elementLengths(start))
strand <- strand(sample(c("+", "-"), length(names), replace=TRUE))
GRanges(seqnames=names, ranges=ranges, strand=strand)
}


Then extract the corresponding sequences by chunks of 2 millions and write them to a FASTA file (don't forget to specify as.character=FALSE when calling getSeq() or you will blow out R!):

breakInChunks <- function(totalsize, chunksize)
{
quot <- totalsize %/% chunksize
ans_width <- rep.int(chunksize, quot)
rem <- totalsize %% chunksize
if (rem > 0L)
ans_width <- c(ans_width, rem)
IRanges(end=cumsum(ans_width), width=ans_width)
}
extractAndWriteSequencesToFASTA <- function(x, locs, file)
{
chunksize <- 2000000
chunks <- breakInChunks(length(locs), chunksize)
for (i in seq_len(length(chunks))) {
chunk_start <- start(chunks)[i]
chunk_end <- end(chunks)[i]
cat("Extracting chunk ", i, " / ", length(chunks),
" (", chunk_start, " to ", chunk_end, ") ... ", sep="")
seqs <- getSeq(x, locs[chunk_start:chunk_end], as.character=FALSE)
cat("OK. Writing to file ... ")
write.XStringSet(seqs, file=file, append=TRUE)
cat("OK.\n")
}
}
extractAndWriteSequencesToFASTA(Hsapiens, locs, "test.fa")


Hope this helps! -- Hervé

3
Entering edit mode
11.6 years ago
Aaron Statham ★ 1.1k

I ended up using awk to get the job done in about 20 mins for the same 18 million line file.

First an awk script to read a chromosomes fasta file into a single line, with no header and no line breaks - this is called readChr.awk

NR==1{next} {
printf("%s",$0) } END {printf("\n")}  Now I sort the reads by chromosome and position, then process the reads through awk which will load each chromosome as it is encountered, then use substr() to grab the 75bp for each forward and reverse read sort --field-separator="," -k2,2 -k3,3n infile.csv | awk 'BEGIN {FS=","} function revComp(temp) { for(i=length(temp);i>=1;i--) { tempChar = substr(temp,i,1) if (tempChar=="A") {printf("T")} else if (tempChar=="C") {printf("G")} else if (tempChar=="G") {printf("C")} else if (tempChar=="T") {printf("A")} else {printf("N")} } } { #entry point if ($2!=chr) { #if hit a new chromosome, read it into chrSeq
chr=$2 "awk -f readChr.awk "chr".fa" | getline chrSeq } FW=toupper(substr(chrSeq,$3,75)) #retrieve forward sequence (75bp)
RV=toupper(substr(chrSeq,$4,75)) #retrieve reverse sequence (75bp) printf("%s,%s,%s,%s,%s,%s,",$1,$2,$3,$4,$5,$6) if ($5=="+") {
printf("%s,",FW)
revComp(RV)
printf("\n")
} else {
revComp(FW)
printf(",%s\n",RV)
}
}' > outfile.csv

0
Entering edit mode

+1 for using gawk

2
Entering edit mode
11.6 years ago

The tools developped by Jim Kent are in http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/utils/

for example see nibFrag or twoBitToFa, both are using the sequences indexed for blat.

for example twoBitToFa as an option -bed

twoBitToFa - Convert all or part of .2bit file to fasta
(...)
-bed=input.bed - grab sequences specified by input.bed. Will exclude introns
(...)


Second option: for each chromosome, put the whole sequence in memory, parse your mapping file and print out the substring(start,end)

0
Entering edit mode

Good idea about the UCSC tools, sadly they're deadly slow (it looks like it would take more than a day to do the 40M)

Hopefully someone can come up with a ready-make second option, otherwise I'll have to make it myself

0
Entering edit mode

twoBitToFa is too slow in it's current version. This command line does one million in a minute:

python -m twobitreader hg19.2bit < temp.bed


2
Entering edit mode
11.6 years ago

Hi Aaron thanks for putting the example code. What can be optimized here has nothing to do with BSgenome, it's the way you handle the IO by using read.lines and then converting into a dataframe and writing out the sequences using write.lines that takes up the time. Try this:

# generate 1M random sequences
start = runif(1000000, min=1, max =length(Hsapiens\$chr1)-200)
width = runif(1000000, min=30, max=200)
system.time(getSeq(Hsapiens, "chr1", start=start, width=width))
user  system elapsed
8.236   0.572   8.750


On a Macbook pro with 2 GB,.

You should have enough memory to read in the whole 40 Mil, at least you can try. For reading very large datasets into R you should better use the scan function. That is much more efficient than read.table or read.lines, and do not convert to a data.frame. These can take up 10 times a much memory.

Here is a code-example of reading in your data efficiently using scan:

name.col = 1;
chrom.col = 2;
# .....
# given you have the column number n.col and the column numbers
what = as.list(rep(NULL, n.col))
what[[name.col]] = character(0) # if your match has a name
what[[chrom.col]] = character(0) #
what[[start.col]] = integer(0)
what[[end.col]] = integer(0)
what[[strand.col]] = character(0)
# if you want to chunk this, use the skip and nmax parameters in a loop
reads = scan (filename, what=what, sep=",", skip=0, nmax=-1)



In fact, this should be even faster than all other solutions presented here.

1
Entering edit mode

A small tip: with the latest R's version, you can read compressed files directly, with scan or read. This has been proven to be -faster- than reading a non-compressed file. ref: http://blog.revolution-computing.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html

0
Entering edit mode

Very nice improvement! If you can chose how to save the result, e.g. to save for further analysis in R, it might be best to save in R's binary format using save and load.

0
Entering edit mode

Thanks for the detailed reply! However, I don't see any time difference between using read.csv() or scan() on my actual file (both take ~6.5 mins to read in the 18 million lines). And also using R to grab the 18x2 million reads gobbles up 12gb of RAM, and takes 12 mins, then writing out the results takes another ~5 mins.

So timewise, R is as good as awk, however my awk script doesn't hit the RAM so hard (I think it peaks at ~4gb, though I haven't measured it properly yet) which means I don't have to worry about what other scripts I am running on this box at the same time. Cheers though!

0
Entering edit mode
5.3 years ago

You can use the twoBitReader python module:

python -m twobitreader hg19.2bit < temp.bed