A question about NGS storage requirements
10 days ago
zizigolu ★ 4.4k

Hi

Please, could somebody give me a hand with some infrastructure information?

storage RNAseq
10 days ago
GenoMax 154k

Surprisingly, it may not be that large in terms of size for prokaryotic samples.

Here is a table of storage size estimates from GPT for different types of prokaryotic transcriptomes, per sample, assuming typical 50–150 bp paired-end reads (e.g., 2 × 75 bp or 2 × 100 bp).

| Data type                             | Approx. reads        | Approx. file size (FASTQ, gzip-compressed) | Notes                                                                      |
| ------------------------------------- | -------------------- | ------------------------------------------ | -------------------------------------------------------------------------- |
| **Basic differential expression**     | 5–10 million reads   | ~0.5–2 GB                                  | Sufficient for most bacteria with small genomes (~3–6 Mb).                 |
| **High-depth transcriptome**          | 20–50 million reads  | ~2–10 GB                                   | Used for lowly expressed genes, isoform detection, or complex communities. |
| **Metatranscriptome / mixed culture** | 50–150 million reads | ~10–30 GB                                  | Needed to capture transcripts from many species.                           |

Multiply that by the number of samples, then double the total to account for all downstream analysis files so you can manage everything comfortably.
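If it helps to turn the table into project-level numbers, here is a minimal back-of-the-envelope sketch in Python; the ~150 bytes per gzipped read pair and the 2x overhead factor are assumptions consistent with the figures above, not measured values:

```python
# Back-of-the-envelope storage estimator: raw FASTQ plus downstream files.
# bytes_per_read_pair (~150 B for a gzipped 2 x 100 bp pair) and the 2x
# overhead factor are assumptions consistent with the table above.

GB = 1e9

def project_storage_gb(read_pairs_per_sample, n_samples,
                       bytes_per_read_pair=150, overhead_factor=2.0):
    """Estimated total storage in GB for the whole project."""
    raw_gb = read_pairs_per_sample * bytes_per_read_pair * n_samples / GB
    return raw_gb * overhead_factor  # doubled to cover BAMs, counts, temps

# Example: 250 samples at 10 million read pairs each
print(f"~{project_storage_gb(10e6, 250):.0f} GB")  # ~750 GB
```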


Thanks all


What kind of sequencing are you thinking of doing? Direct RNA or normal cDNA?

Assuming cDNA, you may only need 10 flow cells (with barcoding and very deep sequencing), so roughly ~2 TB of sequence data in total. Less with lower coverage.

| Sequencing depth per sample | 50 Gb flowcell (moderate) | 100 Gb flowcell (good) | 200 Gb flowcell (excellent) |
| --------------------------- | -------------- | --------------- | --------------- |
| 0.75 Gb (basic RNA-seq)     | ~65 samples    | ~130 samples    | ~260 samples    |
| 3 Gb (deep coverage)        | ~16 samples    | ~33 samples     | ~66 samples     |
| 7.5 Gb (very deep)          | ~6 samples     | ~13 samples     | ~26 samples     |

Direct RNA yields would be significantly smaller, and barcoding support is lacking, so you can apparently only do 4-5 samples per flow cell.

With polycistronic transcripts, analysis may be trickier if you choose full-length direct RNA sequencing.
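As a rough cross-check of the table above, a minimal sketch; the flow cell yields (50/100/200 Gb) and per-sample depths are taken straight from the table, and simple integer division reproduces its sample counts to within rounding:

```python
# Samples per flow cell at a given per-sample depth; yields and depths
# are the assumptions from the table above. The table's figures are
# slightly more conservative, presumably allowing for run overhead.

FLOWCELL_YIELDS_GB = {"moderate": 50, "good": 100, "excellent": 200}

def samples_per_flowcell(depth_gb, yield_gb):
    """Whole samples that fit on one flow cell at the requested depth."""
    return int(yield_gb // depth_gb)

for depth_gb in (0.75, 3.0, 7.5):  # basic, deep, very deep
    counts = {name: samples_per_flowcell(depth_gb, y)
              for name, y in FLOWCELL_YIELDS_GB.items()}
    print(f"{depth_gb:>5} Gb/sample -> {counts}")
```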


So if we also keep the raw signal files (FAST5/POD5) for possible re-basecalling later, does 60–70 TB over the full project make sense, or is it an overestimate?


One can never have enough storage, but that said, PromethION flow cells were more in the ~1.5 TB range in my experience. So the amount you mention should be significantly more than strictly needed.


Thanks a million.


You have not said if you are planning to sequence direct RNA or cDNA.

36 PromethION flow cells sounds like you intend to do direct RNA, since that would be overkill for cDNA data from 250 samples. If you are sure about there being 36 PromethION flow cells total, then count on about 1.5 TB of raw data per cell (as of now, for cDNA; less for RNA). Projecting something 6 years out is difficult since the technology keeps moving rapidly, so adding a 1.5-2x margin to the estimate should be adequate for future improvements.
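For what it's worth, the arithmetic behind that margin is simple enough to sanity-check; the 1.5 TB per flow cell and the 1.5-2x buffer are the figures from this reply:

```python
# Sanity check on the 60-70 TB figure discussed above; 1.5 TB per
# PromethION flow cell and the 1.5-2x margin are this reply's numbers.

N_FLOWCELLS = 36
TB_PER_FLOWCELL = 1.5

raw_tb = N_FLOWCELLS * TB_PER_FLOWCELL  # 54 TB of raw signal data
print(f"raw: {raw_tb:.0f} TB")
for margin in (1.5, 2.0):
    print(f"{margin}x margin -> {raw_tb * margin:.0f} TB")
# raw: 54 TB; 1.5x -> 81 TB; 2.0x -> 108 TB. The 60-70 TB estimate sits
# just above the raw footprint, before the suggested safety margin.
```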


Thanks a lot, very helpful comment.


Sequencing could be done locally or by an external provider. If an external provider does the sequencing, you will need to ask them to provide the entire data folder if you intend to re-basecall at a later date.

Locally, one could use a "P2 Solo" accessory to run PromethION flow cells (up to 2) with a beefy enough workstation (the MinKNOW software works on Windows as well as Linux), or even a GridION sequencer.

While basecalling can be run on the local computer during the run (assuming you have the right GPU and enough hardware), it may be simpler to let the run complete with fast basecalling locally and then do high/super-accuracy calling on external GPU hardware using the raw data folder at any later time.
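As an illustration of that later re-basecalling step, a minimal sketch, assuming the ONT dorado basecaller is installed and that your version accepts the "sup" model speed alias (recent versions do); the paths below are placeholders:

```python
# A minimal sketch of the "re-basecall later" step, assuming dorado is
# installed and accepts the "sup" speed alias; paths are placeholders.
import subprocess

POD5_DIR = "run_folder/pod5/"       # raw signal files kept from the run
OUT_BAM = "rebasecalled_sup.bam"    # super-accuracy calls, unaligned BAM

with open(OUT_BAM, "wb") as out:
    # dorado writes the basecalled reads to stdout as unaligned BAM
    subprocess.run(["dorado", "basecaller", "sup", POD5_DIR],
                   stdout=out, check=True)
```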


I can just say that the AI models I have tried could not give such a comprehensive answer. Thanks a lot once more.

10 days ago

It depends on how much you do with the data, how disciplined you are with storage, what size of files are initially generated, whether you want to keep all the aligned BAMs, how many reference genomes you use, whether you also do de novo assembly, and so on.

Constructive points:

  • pigz - parallel gzip - is your friend
  • write Nextflow pipelines and delete temporary files in the work directory all the time
  • use a tool like ncdu or dust to monitor disk usage very frequently (a minimal stand-in sketch follows this list)
  • be disciplined
  • check NCBI etc. for the range of file sizes per sample you might generate, and multiply by 250.
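For the disk-usage point, here is a minimal stand-in (ncdu and dust themselves are interactive and far more capable); the project root path is a placeholder:

```python
# Report the largest subdirectories under a project root, a minimal
# stand-in for the ncdu/dust-style check mentioned above.
import os

def dir_size_bytes(path):
    """Total size of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

root = "."  # e.g. your Nextflow work/ or project directory
sizes = {d.name: dir_size_bytes(d.path)
         for d in os.scandir(root) if d.is_dir()}
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{size / 1e9:8.2f} GB  {name}")
```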

Also consider offline storage as backup, e.g. Amazon Glacier; even multiple copies on local external hard disks can be very cheap.

10 days ago

It depends on the sequencing depth, transcriptome size, and how many intermediate files are stored/created during analysis.

Example:

Mycobacterium tuberculosis; RNA-Seq (SRR7444071)

Genome size: 4.5 Mb

Number of reads: 2,393,077

File size: 129.1 MB

A conservative estimate would be ~700 GB to 1 TB at most, assuming none of the files are stored in an uncompressed format.
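To make that extrapolation explicit, a minimal sketch; the 250-sample count comes from the thread, and the 5x downstream-overhead multiplier is an illustrative assumption, not a measured figure:

```python
# Rough project-level extrapolation from the single-sample example above.
# The 250-sample count comes from the thread; the 5x allowance for BAMs,
# counts, and other intermediates is an illustrative assumption.

fastq_gb_per_sample = 0.1291   # SRR7444071, gzip-compressed FASTQ (129.1 MB)
n_samples = 250
downstream_factor = 5.0        # assumed size of downstream files vs. raw

raw_gb = fastq_gb_per_sample * n_samples     # ~32 GB of raw FASTQ
total_gb = raw_gb * (1 + downstream_factor)  # raw plus downstream files
print(f"~{total_gb:.0f} GB")   # ~194 GB at this depth; deeper sequencing
# pushes the total toward the ~700 GB-1 TB ceiling quoted above.
```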

