Question

Large reads data set which crosses 1TB size?

0

6.7 years ago

saranpons3 ▴ 70

Hello all, I'm looking for a large genome reads data set that exceeds 1 TB in size. Ideally I need a list of reads data sets larger than 1 TB. Please point me to resources. Thanks in advance.

Genome reads • 1.9k views

ADD COMMENT • link updated 6.7 years ago by d-cameron ★ 2.9k • written 6.7 years ago by saranpons3 ▴ 70

1

This request is very vague. You need to be more specific. Do you mean 1 TB (or 1 terabase) of data from a single sample, or from multiple samples? Is that the compressed or the uncompressed file size?

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Thanks for replying. I'm looking for data from a single sample. Either a compressed or an uncompressed file is OK for me. If the file is compressed, I can download it faster.

ADD REPLY • link 6.7 years ago by saranpons3 ▴ 70

score 2 · Answer 1 · 2017-07-31

Perhaps someone will know this off the top of their head, but 1 terabase for a single sample is a lot of sequencing.

If you have access to BaseSpace (I think you can create a free account), Illumina has WGS data from NovaSeq for NA12878 available there. You could download the lanes and cat them together. That data should add up to somewhere in the neighborhood of 1 terabase of sequence.
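To illustrate the "cat those lanes together" step, here is a minimal sketch. It assumes the lanes arrive as gzipped FASTQ files; the function name `merge_lanes` and the file names in the usage line are hypothetical, not BaseSpace's actual naming. It relies on the fact that gzip files can be concatenated directly without decompressing them first.

```shell
# Merge per-lane gzipped FASTQ files for one read direction into one file.
# Usage: merge_lanes OUTPUT.fastq.gz LANE1.fastq.gz LANE2.fastq.gz ...
merge_lanes() {
    out=$1
    shift
    # gzip allows several compressed members in one file, so a plain cat
    # yields a valid .gz that decompresses to the lanes in the given order.
    cat "$@" > "$out"
}
```

For example, `merge_lanes NA12878_R1.fastq.gz lane1_R1.fastq.gz lane2_R1.fastq.gz`. For paired-end data, pass the lanes in the same order for R1 and R2 so the read pairs stay in sync.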

Ram · Answer 2 · 2017-07-31

1

6.7 years ago

Istvan Albert 100k

Look at the file called sra-runinfo-2016-09.tar.gz in

http://data.biostarhandbook.com/sra/

It is a collection of the metadata for all runs in SRA (compiled last September, though). According to it, the largest data sets in SRA are around 0.5 TB (sizes are in MB):

curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv
cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head

will print:

SRR923764   533571
SRR444347   404420
SRR2072019  402279
SRR2070592  401535
SRR445893   401316
SRR3472879  401125
SRR2049165  399636
SRR3468433  399398
SRR3485180  396152
SRR3498250  396074

These are compressed files. If you unpack those I think they will run over 1TB.
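A slightly more defensive variant of the ranking step, as a sketch only: it assumes the unpacked CSVs start with a header row whose first field is `Run` and that the eighth column is the size in MB, as in the command above. The function name `largest_runs` is made up for illustration. It skips header and blank lines and adds the size converted to decimal TB.

```shell
# Print the N largest runs as "run<TAB>size_MB<TAB>size_TB".
# Usage: largest_runs DIR [N]   (N defaults to 10)
largest_runs() {
    dir=$1
    n=${2:-10}
    cat "$dir"/*.csv |
        # Keep only rows whose 8th field is a plain integer; this drops the
        # header line of each CSV as well as empty lines.
        awk -F ',' '$1 != "Run" && $8 ~ /^[0-9]+$/ {
            printf "%s\t%d\t%.3f\n", $1, $8, $8 / 1e6
        }' |
        # Sort numerically on the size_MB column, largest first.
        sort -t "$(printf '\t')" -k2,2nr |
        head -n "$n"
}
```

After unpacking the archive, `largest_runs sra 10` should list the same runs as above, with a third column showing the size in TB.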

ADD COMMENT • link 6.7 years ago by Istvan Albert 100k

0

Many of these samples are Complete Genomics data (some of the top ones are also controlled access).

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Dear Istvan,

Thanks for your answer.

When I ran the command you gave in my Ubuntu terminal, I got the following error.

curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100  197M  100  197M    0     0   125k      0  0:26:46  0:26:46 --:--:--  160k
tar: cat: Not found in archive
tar: Exiting with failure status due to previous errors
sra/2016-12.csv
sra/2016-11.csv
sra/2016-10.csv
sra/2016-09.csv
sra/2016-08.csv
sra/2016-07.csv
sra/2016-06.csv
sra/2016-05.csv
sra/2016-04.csv
sra/2016-03.csv

Could you tell me whether the command you gave is correct? I currently have the reads data set "H Sapiens 2", which is 339.5 GB in size. I'm looking for a data set that is bigger than this one. Thanks in advance.

ADD REPLY • link updated 6.7 years ago by Ram 43k • written 6.7 years ago by saranpons3 ▴ 70

1

The two commands should go on different lines.

Execute the first then the second.

ADD REPLY • link 6.7 years ago by Istvan Albert 100k

0

There is a ; missing in @Istvan's example. Use this:

curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | tar zxv ; cat sra/* | awk -F ',' ' { print $1 "\t" $8 } ' | sort -rn -k2,2 | head

Check my comment above. The first two or three datasets are controlled access, so you can't download them directly.
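Why the one-line paste failed: without a separator, the shell hands `cat` and `sra/*` to tar as extra arguments, hence "tar: cat: Not found in archive". A newline or `;` ends the first command. As a sketch (the function name `unpack_and_rank` is hypothetical), `&&` goes one step further and runs the ranking only if tar succeeded:

```shell
# Read a .tar.gz of run-info CSVs on stdin, unpack it into the current
# directory, then rank the runs by size (8th CSV column).
unpack_and_rank() {
    # '&&' runs the ranking pipeline only if tar exited successfully;
    # ';' or a newline would run it regardless of the tar result.
    tar zxf - &&
        cat sra/* | awk -F ',' '{ print $1 "\t" $8 }' | sort -rn -k2,2 | head
}
```

It would be used as `curl http://data.biostarhandbook.com/sra/sra-runinfo-2016-09.tar.gz | unpack_and_rank`.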

ADD REPLY • link 6.7 years ago by GenoMax 141k

score 1 · Answer 3 · 2017-08-03

The Genome in a Bottle FTP site has the most comprehensive range of sequencing data for a set of human samples that I am aware of. The full sequencing data for HG002 is a bit over 11 TB and includes data from Chromium, BioNano, nanopore, Complete Genomics, HiSeq, MiSeq, mate pair, Moleculo, Ion Torrent, and PacBio, as well as other platforms.

NA12878 (aka HG001) is the de facto benchmarking sample and is likely to be the most heavily sequenced human individual. You can find sequencing data for this individual from a wide range of sources (e.g. Illumina data from GIAB, Illumina Platinum Genomes, X10 data from Garvan, and the list goes on).