Question

How to count the number of occurrences of 70,000 short sequences in fq.gz files containing 30 million sequences

0

Entering edit mode

3.1 years ago

mujiangxielu ▴ 10

I have a sgRNA library which contain 74000 sgRNA and 7 fq.gz files each containing about 30 million sequences of 150bp length. I want to count the number of occurrences of 74,000 short sequences in each fq.gz files and get the result table.

At first, I used grep in the shell to do this, but I didn't know how to loop through the sgRNA in the library and output the stats to the file, so I moved to R to do this.

I use ShortRead package and as.data.frame get the sequences in fq.gz file.

library(ShortRead)
BH0_1 <- readFastq("BH-0_1.clean.fq.gz")
head(sread(BH0_1), 3)
# DNAStringSet object of length 3:
#     width seq
# [1]   150 NTGTGGAAAGGACGAAACACCGACAGGTTGCCCCACGCGTTGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGGTTTTTTAATCTTTGCGTAACTAGATCTTGAGA
# [2]   150 NTGTGGAAAGGACGAAACACCGGTCAGGCAACGGAATCCCAGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGCTTTTTTAAGCTTGGCGTAACGAGATCTTGAGA
# [3]   150 NTGTGGAAAGGACGAAACACCGAACACACATATACACTAAGGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGCTTTTTTAATCTTTGCGTAAATAGATCTTGAGA
length(sread(BH0_1))
# [1] 26350083

BH0_1 <- as.data.frame(sread(BH0_1))

library(readxl)
lib73178 <- read_excel("library 73178 target gene.xlsx")

At first I wrote a for loop to loop through

for(i in 1:length(lib73178$sgRNA_Target_Sequence))
{
lib73178$BH0_1[i] <- length(grep(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence[i],'.*TTTTTT'),BH0_1$x))
}

I found that this was too slow, so I replaced the for loop with the apply function

U6_perfect <- as.data.frame(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence,'.*TTTTTT'))
count <- function(x,y){
return(length(grep(x,y)))
}
lib73178$BH01 <- apply(U6_perfect,1, count,y = BH0_1$x)

But this was still slow, so I used multi-threaded parallelism to speed up the task

library(parallel)
detectCores()
# 12


cores=12
lib73178$BH01c = mclapply(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence,'.*TTTTTT'), count <- function(x){
return(length(grep(x,BH0_1$x)))
}, mc.cores = cores)

But that's still slow, I've been running for over 24 hours now and haven't finished counting the first file. Is there a better way to go about this task? Any helpful suggestions are greatly appreciated!

analysis sgRNA CRISPR • 1.1k views

ADD COMMENT • link updated 3.1 years ago by GenoMax 154k • written 3.1 years ago by mujiangxielu ▴ 10

score 3 · Answer 1 · 2022-10-23

3

Entering edit mode

3.1 years ago

GenoMax 154k

Is there a better way to go about this task?

Use 2FAST2Q (LINK) to do this. You can use it on the command line or via a GUI (the program is fast, should take < 10 min).

Another program called MAGeCK (LINK) also does this counting efficiently.

ADD COMMENT • link 3.1 years ago by GenoMax 154k

0

Entering edit mode

Thank you very much! I'll try it

ADD REPLY • link 3.1 years ago by mujiangxielu ▴ 10