I have a sgRNA library which contain 74000 sgRNA and 7 fq.gz files each containing about 30 million sequences of 150bp length. I want to count the number of occurrences of 74,000 short sequences in each fq.gz files and get the result table.
At first, I used grep in the shell to do this, but I didn't know how to loop through the sgRNA in the library and output the stats to the file, so I moved to R to do this.
I use ShortRead package and as.data.frame get the sequences in fq.gz file.
library(ShortRead)
BH0_1 <- readFastq("BH-0_1.clean.fq.gz")
head(sread(BH0_1), 3)
# DNAStringSet object of length 3:
# width seq
# [1] 150 NTGTGGAAAGGACGAAACACCGACAGGTTGCCCCACGCGTTGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGGTTTTTTAATCTTTGCGTAACTAGATCTTGAGA
# [2] 150 NTGTGGAAAGGACGAAACACCGGTCAGGCAACGGAATCCCAGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGCTTTTTTAAGCTTGGCGTAACGAGATCTTGAGA
# [3] 150 NTGTGGAAAGGACGAAACACCGAACACACATATACACTAAGGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGCTTTTTTAATCTTTGCGTAAATAGATCTTGAGA
length(sread(BH0_1))
# [1] 26350083
BH0_1 <- as.data.frame(sread(BH0_1))
library(readxl)
lib73178 <- read_excel("library 73178 target gene.xlsx")
At first I wrote a for loop to loop through
for(i in 1:length(lib73178$sgRNA_Target_Sequence))
{
lib73178$BH0_1[i] <- length(grep(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence[i],'.*TTTTTT'),BH0_1$x))
}
I found that this was too slow, so I replaced the for loop with the apply function
U6_perfect <- as.data.frame(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence,'.*TTTTTT'))
count <- function(x,y){
return(length(grep(x,y)))
}
lib73178$BH01 <- apply(U6_perfect,1, count,y = BH0_1$x)
But this was still slow, so I used multi-threaded parallelism to speed up the task
library(parallel)
detectCores()
# 12
cores=12
lib73178$BH01c = mclapply(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence,'.*TTTTTT'), count <- function(x){
return(length(grep(x,BH0_1$x)))
}, mc.cores = cores)
But that's still slow, I've been running for over 24 hours now and haven't finished counting the first file. Is there a better way to go about this task? Any helpful suggestions are greatly appreciated!
Thank you very much! I'll try it