Hi
Please, could somebody give me a hand about some infrustructure information?
Hi
Please, could somebody give me a hand about some infrustructure information?
It surprisingly may not be that large in terms of size for prokaryotic samples.
Here is a table of storage size estimates from GPT for different types of prokaryptic transcriptomes for one sample (Typically 50–150 bp paired-end (e.g., 2 × 75 bp or 2 × 100 bp).
| Data type | Approx. reads | Approx. file size (FASTQ, gzip-compressed) | Notes |
| ------------------------------------- | -------------------- | ------------------------------------------ | -------------------------------------------------------------------------- |
| **Basic differential expression** | 5–10 million reads | ~0.5–2 GB | Sufficient for most bacteria with small genomes (~3–6 Mb). |
| **High-depth transcriptome** | 20–50 million reads | ~2–10 GB | Used for lowly expressed genes, isoform detection, or complex communities. |
| **Metatranscriptome / mixed culture** | 50–150 million reads | ~10–30 GB | Needed to capture transcripts from many species.
You can multiply that by the number of samples. You can double the storage to account for all downstream analysis files to manage everything comfortably.
Depends how much you do with them, how good you are at being disciplined with storage, what size of file are initially generated, whether you want to keep all the aligned bams, how many reference genomes you use, if you do de novo assembly too, etc etc etc.
Constructive points
Consider also offline storage as backup, eg amazon glacier, even multiple copies on local external hard disks can be very cheap.
It depends on the sequencing depth, transcriptome size, and how many intermediate files are stored/created during analysis.
Example:
Mycobacterium tuberculosis; RNA-Seq (SRR7444071)
Genome size: 4.5Mb
Number of reads: 2,393,077
File size: 129.1MB
A conservative estimate will be ~700GB to 1 TB at max, assuming none of the files are stored in an uncompressed format.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks all
What kind of sequencing are you thinking of doing? Direct RNA or normal cDNA?
Assuming cDNA you may only need 10 cells (with barcoding and very deep sequencing) so roughly a total of ~ 2TB of sequence data. Less with lesser coverage.
Direct RNA yields would be significantly smaller and barcoding support is lacking so you can only do apparently 4-5 samples per cell.
With polycistronic transctipts analysis may be trickier if you choose full length direct RNA sequencing.
so If we also keep the raw signal files (FAST5/POD5) for possible re-basecalling later, the 60–70 TB over the full project makes sense or exaggerative?
One can never have enough storage but that said the PromethION cells were more in the ~ 1.5 TB size in my experience. So the amount you mention should be significantly more than strictly needed.