Hi
Please, could somebody give me a hand with some infrastructure information?
Surprisingly, it may not be that large in terms of size for prokaryotic samples.
Here is a table of storage size estimates from GPT for different types of prokaryotic transcriptomes, for one sample (typically 50–150 bp paired-end, e.g. 2 × 75 bp or 2 × 100 bp).
| Data type | Approx. reads | Approx. file size (FASTQ, gzip-compressed) | Notes |
| ------------------------------------- | -------------------- | ------------------------------------------ | -------------------------------------------------------------------------- |
| **Basic differential expression** | 5–10 million reads | ~0.5–2 GB | Sufficient for most bacteria with small genomes (~3–6 Mb). |
| **High-depth transcriptome** | 20–50 million reads | ~2–10 GB | Used for lowly expressed genes, isoform detection, or complex communities. |
| **Metatranscriptome / mixed culture** | 50–150 million reads | ~10–30 GB | Needed to capture transcripts from many species. |
You can multiply that by the number of samples. You can double the storage to account for all downstream analysis files to manage everything comfortably.
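As a rough sketch of that arithmetic (not a definitive calculator): the per-sample ranges below are copied from the table, the factor of 2 is the "double the storage" rule of thumb, and the function name and the 250-sample count are only illustrative assumptions.

```python
# Per-sample gzip-compressed FASTQ size ranges in GB, copied from the table above.
SIZE_RANGES_GB = {
    "basic": (0.5, 2.0),
    "high_depth": (2.0, 10.0),
    "metatranscriptome": (10.0, 30.0),
}


def project_storage_gb(data_type, n_samples, downstream_factor=2.0):
    """Return (low, high) total storage in GB.

    downstream_factor=2.0 is the 'double it for downstream files' rule of thumb;
    adjust it if you keep more (or fewer) intermediate files.
    """
    low, high = SIZE_RANGES_GB[data_type]
    return (low * n_samples * downstream_factor,
            high * n_samples * downstream_factor)


# Hypothetical project: 250 samples sequenced at high depth.
low, high = project_storage_gb("high_depth", n_samples=250)
print(f"{low:.0f}-{high:.0f} GB")
```

For 250 high-depth samples this gives a range of 1,000 to 5,000 GB, i.e. roughly 1 to 5 TB.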
It depends on how much you do with them, how disciplined you are with storage, what size of files are initially generated, whether you want to keep all the aligned BAMs, how many reference genomes you use, whether you do de novo assembly too, etc.
Constructive points
Also consider offline storage as a backup, e.g. Amazon Glacier; even multiple copies on local external hard disks can be very cheap.
It depends on the sequencing depth, transcriptome size, and how many intermediate files are stored/created during analysis.
Example:
Mycobacterium tuberculosis; RNA-Seq (SRR7444071)
Genome size: 4.5Mb
Number of reads: 2,393,077
File size: 129.1MB
A conservative estimate would be ~700 GB to 1 TB at most, assuming none of the files are stored in an uncompressed format.
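To turn the example run into a per-read figure you can scale, here is a minimal sketch. It assumes "MB" means 10^6 bytes and that the compression ratio of your own data is similar to this run, which may not hold for other read lengths or file formats; the 10 million read target is just an illustration.

```python
# Figures from the SRR7444071 example above.
READS = 2_393_077
FILE_SIZE_MB = 129.1  # assumed to be decimal megabytes (10^6 bytes)

# Average compressed bytes per read for this particular run.
bytes_per_read = FILE_SIZE_MB * 1e6 / READS


def estimate_size_gb(n_reads):
    """Scale the observed bytes-per-read to a different read count (result in GB)."""
    return n_reads * bytes_per_read / 1e9


print(f"{bytes_per_read:.1f} bytes per read")
print(f"{estimate_size_gb(10_000_000):.2f} GB for 10 million reads")
```

This works out to about 54 bytes per read, or a bit over 0.5 GB for 10 million reads.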
Thanks all
What kind of sequencing are you thinking of doing? Direct RNA or normal cDNA?
Assuming cDNA, you may only need 10 cells (with barcoding and very deep sequencing), so roughly ~2 TB of sequence data in total. Less with lower coverage.
Direct RNA yields would be significantly smaller, and barcoding support is lacking, so apparently you can only do 4-5 samples per cell.
With polycistronic transcripts, analysis may be trickier if you choose full-length direct RNA sequencing.
So if we also keep the raw signal files (FAST5/POD5) for possible re-basecalling later, does 60–70 TB over the full project make sense, or is it an exaggeration?
One can never have enough storage, but that said, the PromethION cells were more in the ~1.5 TB range in my experience. So the amount you mention should be significantly more than strictly needed.
Thanks a million.
You have not said if you are planning to sequence direct RNA or cDNA.
36 PromethION flow cells sounds like you intend to do direct RNA, since that would be overkill for cDNA data for 250 samples. If you are sure about there being 36 PromethION flow cells in total, then count about 1.5 TB of raw data per cell (as of now for cDNA, less for RNA). Projecting something 6 years out is difficult since the technology keeps moving rapidly, so adding 1.5-2x more to the estimate should be adequate for future improvements.
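Spelled out as arithmetic, that estimate looks like the sketch below. It reads "1.5-2x" as a multiplier on the raw total, which is my interpretation; the function name is made up for illustration.

```python
def raw_data_budget_tb(n_flowcells, tb_per_flowcell=1.5, headroom=(1.5, 2.0)):
    """Return (base, low, high) raw-data storage in TB.

    base is flow cells x TB per flow cell; low/high apply the headroom
    multipliers (assumed here to scale the base total).
    """
    base = n_flowcells * tb_per_flowcell
    return base, base * headroom[0], base * headroom[1]


# 36 PromethION flow cells at ~1.5 TB of raw data each (cDNA figure).
base, low, high = raw_data_budget_tb(36)
print(f"base: {base:.0f} TB, with headroom: {low:.0f}-{high:.0f} TB")
```

With 36 flow cells this gives 54 TB of raw data, or 81 to 108 TB with headroom. Direct RNA runs would come in lower, as noted above.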
Thanks a lot, very helpful comment.
Sequencing could be done locally or by an external provider. If an external provider does the sequencing, you will need to ask them to provide the entire data folder if you intend to re-basecall at a later date.
Locally, one could use a "P2 Solo" accessory to run PromethION flow cells (up to 2) with a beefy enough workstation (MinKNOW software works on Windows as well as Linux) or even a GridION sequencer.
While basecalling can be run on a local computer during the run (assuming you have the right GPU and enough hardware), it may be simpler to let the run complete with FAST basecalling locally and then do high/super accuracy calling on external hardware (GPU(s)) using the raw data folder at any later time.
I can only say that the AI models I have tried could not give such a comprehensive answer. Thanks a lot once more.