Is it possible to count the lines of a fastq.gz file in R?
7 weeks ago
br0104 • 0

Hi, I'm struggling with basic things, please help me out.

I want to count the lines of fastq.gz files from RNA-seq results in RStudio.

If possible, please let me know the function...

fastq.gz count fastq R
6 weeks ago
benformatics ★ 2.6k

If you are limited by memory, this should work:

library(ShortRead)

file <- 'your_file.fastq.gz'

## Set the stream chunk size: with more memory, raise n and
## readerBlockSize; with less, lower them
f <- FastqStreamer(file, n = 100, readerBlockSize = 1000)

## Initialize the read counter
totalLength <- 0

## Process the file n reads at a time so it never sits fully in memory
while (length(fq <- yield(f))) {
  totalLength <- totalLength + length(fq)
}
close(f)

## totalLength is the number of reads; multiply by 4 for the line count
print(totalLength)
7 weeks ago

I don't know if this could be considered a solution "in R", since it effectively relies on system commands:

fastq <- 'reads.fastq.gz'
n_lines <- as.integer(system(sprintf('gzip -cd %s | wc -l', fastq), intern = TRUE))
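For reference, this is the same pipeline run directly in the shell. A minimal sketch using a hypothetical toy file, so the line/read arithmetic is visible (a FASTQ record is always exactly 4 lines):

```shell
# Build a toy two-read FASTQ and compress it (hypothetical file name)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | gzip > toy.fastq.gz

# Count lines without writing the decompressed data to disk
n_lines=$(gzip -cd toy.fastq.gz | wc -l | tr -d ' ')

# Each FASTQ record is 4 lines, so reads = lines / 4
n_reads=$((n_lines / 4))

echo "$n_lines lines, $n_reads reads"   # prints "8 lines, 2 reads"
```

On systems where `gzip` is not on the PATH (see the replies below), `gunzip -c` or `zcat` do the same job.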

The advantage of this solution is that you're not actually reading the fastq file into R, which could leave R choking if the file is very large.


Thank you @dariober and @Friederike. However, it seems 'gzip' doesn't work in R:

n_lines <- as.integer(system(sprintf("gzip -cd %s | wc -l", fastq), intern = TRUE))
Error in system(sprintf("gzip -cd %s | wc -l", fastq), intern = TRUE) : 'gzip' not found

It shows this error...


Try gunzip -c instead:

n_lines <- as.integer(system(sprintf('gunzip -c %s | wc -l', fastq), intern = TRUE))

7 weeks ago
benformatics ★ 2.6k
library(Biostrings)

## Reads the whole file into memory; length(fq) is the number of reads
fq <- readDNAStringSet('your_file.fastq.gz', format = 'FASTQ')
length(fq)

Thank you @benformatics, but I found it takes a lot of memory, so my computer stops responding when I run that code. I should find another way...


Try this to decrease memory usage:

library(ShortRead)

file <- 'your_file.fastq.gz'

## Set the stream chunk size: with more memory, raise n; with less, lower it
f <- FastqStreamer(file, n = 100)

## Initialize the read counter
totalLength <- 0

## Process the file n reads at a time
while (length(fq <- yield(f))) {
  totalLength <- totalLength + length(fq)
}
close(f)

print(totalLength)
7 weeks ago
Hood ▴ 10

You can use the function readFastq from the microseq package. It reads a gzipped FASTQ file and returns a tibble. The number of rows in the tibble is the number of reads in the FASTQ file. If you need to count lines:

(n * k) + n

where n is the number of rows and k is the number of columns (readFastq returns a tibble with 3 columns).
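The arithmetic above works out to 4 lines per read (k data lines plus the '+' separator line). A quick sketch with hypothetical numbers:

```shell
# Hypothetical counts: n rows (reads) and k = 3 tibble columns
n=250000
k=3

# (n * k) + n: k data lines per read plus the '+' separator line
lines=$(( n * k + n ))

echo "$lines"   # prints 1000000, i.e. 4 lines per read
```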


Thank you @Hood. Could you please tell me how to return the tibble using that function? I tried this (I'm not so sure if I did it right):

fdta <- readFastq(fq.file)
Error in fread(in.file, header = F, sep = "\t", data.table = F, quote = "") :
  Opened 25.67GB (27558532840 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.

It showed this error... Am I doing it right?
