Very large fastq files from RNA-seq
3.4 years ago
ramyak1912 • 0

Hi all, I am trying to optimize an RNA-seq pipeline, and I want to be able to estimate the RAM requirements for fastq files of different sizes. So far I have tested on files from typical RNA-seq experiments of ~30 to 40 million reads. I now want to test on much larger data, where the file is close to 50 GB in size.

I was wondering where I can obtain such files for testing. Can anyone point me to some publicly available datasets with more sequences than what I have already tested? Anything with >= 150 million reads would also be okay.

Thanks, RK

RNA-Seq sequencing • 1.6k views

You can try merging multiple samples into one big fastq file, using the cat or zcat command.
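For example, a minimal sketch (the file names are placeholders). Since the gzip format allows streams to be concatenated, compressed fastq files can be joined with plain cat; zcat works too if you want to decompress and recompress:

    # gzip streams concatenate cleanly, so this yields a valid fastq.gz
    cat sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz > merged.fastq.gz

    # Or decompress and recompress via zcat (slower, slightly smaller output)
    zcat sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz | gzip > merged.fastq.gz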

Hi, thanks for your reply. I used some samples from ENCODE that were ~30 GB (~200 million reads) to run jobs on AWS Batch, and I got out-of-memory errors. I am aligning to the human genome. I used the same pipeline before for a basic RNA-seq experiment with 25 to 30 million reads and didn't have any problems then.

Do you have any idea what could be causing the problem?

Without details on the pipeline this is impossible to answer. Please add comments via ADD COMMENT/REPLY to keep things organized.

3.4 years ago

ENCODE has some samples with more than 100M reads in them, e.g. ENCSR000COU. When you talk about fastq size, are you talking about compressed or uncompressed? Because compressed, even this 100M-read sample is only ~8 GB.
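If you want to verify that a candidate file is big enough, a quick check, assuming a gzipped fastq (the file name is a placeholder): a fastq record is exactly four lines, so the read count is the line count divided by 4.

    # Count reads in a gzipped fastq (4 lines per record)
    echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))

    # Compare compressed vs. uncompressed size
    ls -lh reads.fastq.gz
    zcat reads.fastq.gz | wc -c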

BTW, with most of the standard pipelines (STAR -> featureCounts -> DESeq2, or salmon -> tximport -> DESeq2, or kallisto -> tximport -> DESeq2), memory usage scales with the size of the genome, and the number of reads doesn't make a difference.
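If you want to confirm that empirically and size your AWS Batch jobs, one option on Linux is GNU time, whose -v flag reports a process's peak memory (maximum resident set size). A sketch below; the STAR invocation is a generic example, not necessarily your pipeline, and the paths and file names are placeholders:

    # Report peak memory of an alignment run.
    # Note: /usr/bin/time is GNU time, not the shell builtin 'time'.
    /usr/bin/time -v STAR \
        --runThreadN 8 \
        --genomeDir /path/to/star_index \
        --readFilesIn reads_1.fastq.gz reads_2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix sample1_ \
        2>&1 | grep "Maximum resident set size"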
