How to count the number of occurrences of 70,000 short sequences in fq.gz files containing 30 million sequences
1
0
Entering edit mode
3.0 years ago
mujiangxielu ▴ 10

I have a sgRNA library which contain 74000 sgRNA and 7 fq.gz files each containing about 30 million sequences of 150bp length. I want to count the number of occurrences of 74,000 short sequences in each fq.gz files and get the result table.

At first, I used grep in the shell to do this, but I didn't know how to loop through the sgRNA in the library and output the stats to the file, so I moved to R to do this.

I use ShortRead package and as.data.frame get the sequences in fq.gz file.

library(ShortRead)
BH0_1 <- readFastq("BH-0_1.clean.fq.gz")
head(sread(BH0_1), 3)
# DNAStringSet object of length 3:
#     width seq
# [1]   150 NTGTGGAAAGGACGAAACACCGACAGGTTGCCCCACGCGTTGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGGTTTTTTAATCTTTGCGTAACTAGATCTTGAGA
# [2]   150 NTGTGGAAAGGACGAAACACCGGTCAGGCAACGGAATCCCAGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGCTTTTTTAAGCTTGGCGTAACGAGATCTTGAGA
# [3]   150 NTGTGGAAAGGACGAAACACCGAACACACATATACACTAAGGGTTTTAGAGC...AAGTGGCACCGAGTCGGTGCTTTTTTAATCTTTGCGTAAATAGATCTTGAGA
length(sread(BH0_1))
# [1] 26350083

BH0_1 <- as.data.frame(sread(BH0_1))

library(readxl)
lib73178 <- read_excel("library 73178 target gene.xlsx")

At first I wrote a for loop to loop through

for(i in 1:length(lib73178$sgRNA_Target_Sequence))
{
lib73178$BH0_1[i] <- length(grep(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence[i],'.*TTTTTT'),BH0_1$x))
}

I found that this was too slow, so I replaced the for loop with the apply function

U6_perfect <- as.data.frame(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence,'.*TTTTTT'))
count <- function(x,y){
return(length(grep(x,y)))
}
lib73178$BH01 <- apply(U6_perfect,1, count,y = BH0_1$x)

But this was still slow, so I used multi-threaded parallelism to speed up the task

library(parallel)
detectCores()
# 12


cores=12
lib73178$BH01c = mclapply(paste0('CGAAACACCG',lib73178$sgRNA_Target_Sequence,'.*TTTTTT'), count <- function(x){
return(length(grep(x,BH0_1$x)))
}, mc.cores = cores)

But that's still slow, I've been running for over 24 hours now and haven't finished counting the first file. Is there a better way to go about this task? Any helpful suggestions are greatly appreciated!

analysis sgRNA CRISPR • 1.0k views
ADD COMMENT
3
Entering edit mode
3.0 years ago
GenoMax 154k

Is there a better way to go about this task?

Use 2FAST2Q (LINK) to do this. You can use it on the command line or via a GUI (the program is fast, should take < 10 min).

Another program called MAGeCK (LINK) also does this counting efficiently.

ADD COMMENT
0
Entering edit mode

Thank you very much! I'll try it

ADD REPLY

Login before adding your answer.

Traffic: 3345 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6