Is it possible to count the lines of a fastq.gz file in R?
7 weeks ago
br0104 • 0

Hi, I'm struggling with basic things, please help me out.

I want to count the lines of fastq.gz files from RNA-seq results in RStudio.

If possible, please let me know the function...

fastq.gz count fastq R
6 weeks ago
benformatics ★ 2.6k

If you are limited by memory, this should work:

library(ShortRead)

file <- 'your_file.fastq.gz'

## Set the stream chunk size: with more memory, raise n and
## readerBlockSize; with less, lower them
f <- FastqStreamer(file, n = 100, readerBlockSize = 1000)

## Initialize the read counter
totalLength <- 0

## Process the file n reads at a time so it never sits fully in memory
while (length(fq <- yield(f))) {
  totalLength <- totalLength + length(fq)
}
close(f)

## totalLength is the number of reads; multiply by 4 for the line count
print(totalLength)
7 weeks ago

I don't know if this could be considered a solution "in R", since it effectively relies on system commands:

fastq <- 'reads.fastq.gz'
n_lines <- as.integer(system(sprintf('gzip -cd %s | wc -l', fastq), intern = TRUE))
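For reference, this is the same pipeline run directly in the shell. A minimal sketch using a hypothetical toy file, so the line/read arithmetic is visible (a FASTQ record is always exactly 4 lines):

```shell
# Build a toy two-read FASTQ and compress it (hypothetical file name)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | gzip > toy.fastq.gz

# Count lines without writing the decompressed data to disk
n_lines=$(gzip -cd toy.fastq.gz | wc -l | tr -d ' ')

# Each FASTQ record is 4 lines, so reads = lines / 4
n_reads=$((n_lines / 4))

echo "$n_lines lines, $n_reads reads"   # prints "8 lines, 2 reads"
```

On systems where `gzip` is not on the PATH (see the replies below), `gunzip -c` or `zcat` do the same job.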

The advantage of this solution is that you're not actually reading the fastq file into R, which could leave R choking if the file is very large.


Thank you @dariober and @Friederike. However, it seems 'gzip' doesn't work in R:

n_lines <- as.integer(system(sprintf("gzip -cd %s | wc -l", fastq), intern = TRUE))
Error in system(sprintf("gzip -cd %s | wc -l", fastq), intern = TRUE) : 'gzip' not found

It shows this error...


Try gunzip -c instead:

n_lines <- as.integer(system(sprintf('gunzip -c %s | wc -l', fastq), intern = TRUE))

7 weeks ago
benformatics ★ 2.6k
library(Biostrings)

## Reads the whole file into memory; length(fq) is the number of reads
fq <- readDNAStringSet('your_file.fastq.gz', format = 'FASTQ')
length(fq)

Thank you @benformatics, but I found it takes a lot of memory, so my computer stops responding when I run that code. I should find another way...


Try this to decrease memory usage:

library(ShortRead)

file <- 'your_file.fastq.gz'

## Set the stream chunk size: with more memory, raise n; with less, lower it
f <- FastqStreamer(file, n = 100)

## Initialize the read counter
totalLength <- 0

## Process the file n reads at a time
while (length(fq <- yield(f))) {
  totalLength <- totalLength + length(fq)
}
close(f)

print(totalLength)
7 weeks ago
Hood ▴ 10

You can use the function readFastq from the microseq package. It reads a gzipped FASTQ file and returns a tibble. The number of rows in the tibble is the number of reads in the FASTQ file. If you need to count lines:

(n * k) + n

where n is the number of rows and k is the number of columns (readFastq returns a tibble with 3 columns).
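The arithmetic above works out to 4 lines per read (k data lines plus the '+' separator line). A quick sketch with hypothetical numbers:

```shell
# Hypothetical counts: n rows (reads) and k = 3 tibble columns
n=250000
k=3

# (n * k) + n: k data lines per read plus the '+' separator line
lines=$(( n * k + n ))

echo "$lines"   # prints 1000000, i.e. 4 lines per read
```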


Thank you @Hood. Could you please tell me how to return the tibble using that function? I tried this (I'm not so sure if I did it right):

fdta <- readFastq(fq.file)
Error in fread(in.file, header = F, sep = "\t", data.table = F, quote = "") :
  Opened 25.67GB (27558532840 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.

It showed this error... Am I doing it right?
