Hello All, I'm looking for a large genome reads dataset that is over 1 TB in size. I need a list of read datasets larger than 1 TB. Please share any resources. Thanks in advance.
Perhaps someone will know this off the top of their head, but 1 terabase for a single sample is a lot of sequencing.
If you have access to BaseSpace (I think you can create a free account), Illumina has NovaSeq WGS data for NA12878 available. You could download those lanes and cat them together; that data should add up to somewhere in the neighborhood of 1 terabase of sequence.
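For the joining step, a minimal sketch (the lane filenames here are hypothetical; use whatever names the BaseSpace download actually gives you). Concatenating gzip files with cat yields a valid gzip stream, so there is no need to decompress first:

# combine per-lane FASTQ files into one file per read direction
# (filenames are assumptions based on the usual Illumina naming scheme)
cat NA12878_S1_L00?_R1_001.fastq.gz > NA12878_all_R1.fastq.gz
cat NA12878_S1_L00?_R2_001.fastq.gz > NA12878_all_R2.fastq.gz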
Look at the file called sra-runinfo-2016-09.tar.gz in http://data.biostarhandbook.com/sra/. It is a collection of the metadata for all runs in SRA (collected last September, though). According to it, as far as SRA goes, the largest datasets are around 0.5 TB (the sizes below are in MB):
# download and unpack the run metadata (one CSV per month)
curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv
# column 1 is the run accession, column 8 the size in MB; sort by size, largest first
cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head
will print:
SRR923764 533571
SRR444347 404420
SRR2072019 402279
SRR2070592 401535
SRR445893 401316
SRR3472879 401125
SRR2049165 399636
SRR3468433 399398
SRR3485180 396152
SRR3498250 396074
These are compressed sizes; if you unpack those runs I think they will come out to over 1 TB.
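If you want to actually pull one of those runs, here is a sketch with sra-tools (assuming you have it installed; double-check the flags against your version). Note that prefetch refuses downloads above 20 GB by default, so the size cap has to be raised:

# fetch the largest run from the list above; --max-size raises the
# default 20 GB download cap (600G here is an arbitrary generous limit)
prefetch --max-size 600G SRR923764
# convert to FASTQ; the uncompressed output will be much larger than
# the size_MB figure in the runinfo table
fasterq-dump --split-files SRR923764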
Dear Istvan,
Thanks for your answer.
When I ran the command you gave in my Ubuntu terminal, I got the following error:
curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  197M  100  197M    0     0   125k      0  0:26:46  0:26:46 --:--:--  160k
tar: cat: Not found in archive
tar: Exiting with failure status due to previous errors
sra/2016-12.csv
sra/2016-11.csv
sra/2016-10.csv
sra/2016-09.csv
sra/2016-08.csv
sra/2016-07.csv
sra/2016-06.csv
sra/2016-05.csv
sra/2016-04.csv
sra/2016-03.csv
Could you tell me whether the command you gave is correct? Actually, I already have a reads dataset, "H Sapiens 2", which is 339.5 GB; I'm looking for a dataset bigger than that one. Thanks in advance.
There is a ; missing in @Istvan's example. He gave two separate commands; pasted on one line without the ;, tar treats "cat" and "sra/*" as names of archive members to extract, which is why you see "tar: cat: Not found in archive". Use this:
curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv ; cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head
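Or simply run them as two separate commands, as in the original answer, and then no ; is needed:

curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv
cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head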
Check my comment above. The first two or three datasets are controlled access, so you can't download them directly.
The Genome in a Bottle (GIAB) FTP site has the most comprehensive range of sequencing data for a set of human samples that I am aware of. The full sequencing data for HG002 is a bit over 11 TB and includes data from 10x Chromium, BioNano, Oxford Nanopore, Complete Genomics, Illumina HiSeq and MiSeq, mate pair libraries, Moleculo, Ion Torrent, and PacBio, as well as other platforms.
NA12878 (aka HG001) is the de facto benchmarking sample and is likely the most heavily sequenced human individual. You can find sequencing data for this individual from a wide range of sources (e.g. Illumina data from GIAB, Illumina Platinum Genomes, HiSeq X Ten data from the Garvan Institute, and the list goes on).
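If you want to mirror one of the GIAB datasets in bulk, here is a wget sketch (the FTP path below is from memory, so verify it by browsing the site first):

# recursively mirror the HG002 data directory; -np stops wget from
# ascending to parent directories, -nH drops the hostname from local paths
wget -r -np -nH ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/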
This request is very vague; you need to be more specific. Do you want 1 TB (or 1 terabase?) of data from a single sample or from multiple samples? And is that compressed or uncompressed file size?
Thanks for replying. I'm looking for data from a single sample. Either a compressed or an uncompressed file is OK for me; if it is compressed I can download it faster.