How to access specifically 30x NA12878 sequencing runs
4.0 years ago
Fungsten • 0

I see many papers that mention 30x WGS data from sample NA12878, for example in the following supplementary material:

https://www.biorxiv.org/content/biorxiv/suppl/2018/01/09/092890.DC5/092890-1.pdf

What I cannot find are instructions on how to access or generate the same FASTQ files. These datasets seem essential for benchmarking, but I am not sure what the best way to obtain them is.

Many thanks

giab benchmark fastq wgs data-access • 4.3k views
4.0 years ago

We are providing deep whole genome sequence data for the CEPH 1463 family in order to create a "platinum" standard comprehensive set of variant calls. These genomes include a trio (NA12877 NA12878 and NA12882) sequenced to greater than 200x depth of coverage, as well as a technical replicate (separate library and sequencing, but same DNA sample) of NA12882 also sequenced to greater than 200x. Additional information and analyses will be provided at www.platinumgenomes.org.


Already tried those. Try to get the high coverage and will see the downloaded file doesn't make any sense for high coverage...


Try to get the high coverage and will see the downloaded file doesn't make any sense for high coverage...

What does this mean?

4.0 years ago
husensofteng ▴ 380

SRA Explorer returns several SRA projects that have WGS raw data for NA12878. Type NA12878 in the search box and add the results you want to the collection; you can then get direct links to the FASTQ files from the save datasets button at the top of the page.

However, to identify which one provides the 30x dataset, you may have to check the number of reads in the last column or go to the project home page on NCBI (just click the accession in the second column of the results page).
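
To turn a read count into an approximate depth, a rough rule of thumb is reads × read length / genome size. Below is a minimal sketch of that arithmetic. The function name `estimate_coverage` and the default genome size of 3.1e9 bases are my assumptions, and if the listed count is read pairs rather than individual reads you would need to double it first.

```shell
# Rough depth estimate: reads * read_length / genome_size.
# Assumption: genome size defaults to 3.1e9 bases (approximate human genome).
# estimate_coverage is a hypothetical helper name, not part of any tool.
estimate_coverage() {
    awk -v reads="$1" -v len="$2" -v genome="${3:-3100000000}" \
        'BEGIN { printf "%.1f\n", reads * len / genome }'
}
```

For example, `estimate_coverage 620000000 150` prints 30.0, so a run with about 620 million 150 bp reads would be in the 30x range.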

4.0 years ago

The paper you shared links directly to this FTP directory for sample NA12878: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/

I'm not sure how the folders in that directory are named or ordered. All the folders I checked contained at least alignments (.bam files); you could always convert those back to FASTQ if you want to re-align them. Some of the folders do contain actual FASTQ/FASTA files, such as Garvan_NA12878_HG001_HiSeq_Exome. Also, the paper you linked specifically mentions analyzing the BAM files (which makes sense), not FASTQs.
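
If you do want FASTQs from one of those BAMs, something like the following should work. This is only a sketch, assuming samtools is installed and the BAM contains paired-end reads; the function name `bam_to_fastq` and the output file naming are made up for illustration.

```shell
# Convert an alignment file back to paired, gzipped FASTQ files.
# Assumptions: samtools is installed and the BAM holds paired-end reads.
# bam_to_fastq and the <prefix>_1/_2 naming are hypothetical choices.
bam_to_fastq() {
    if [ "$#" -ne 2 ]; then
        echo "usage: bam_to_fastq <input.bam> <output_prefix>" >&2
        return 1
    fi
    # collate groups mates together so fastq can write R1 and R2 in sync;
    # singletons and unflagged reads are discarded via /dev/null.
    samtools collate -u -O "$1" |
        samtools fastq -1 "${2}_1.fastq.gz" -2 "${2}_2.fastq.gz" \
            -0 /dev/null -s /dev/null -n -
}
```

Usage would be `bam_to_fastq input.bam NA12878`, which writes NA12878_1.fastq.gz and NA12878_2.fastq.gz in the current directory.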

4 months ago
geocarvalho ▴ 310

https://www.internationalgenome.org/data-portal/data-collection/30x-grch38

The New York Genome Center (NYGC), funded by NHGRI, has sequenced 3202 samples from the 1000 Genomes Project sample collection to 30x coverage. Initially, the 2504 unrelated samples from the phase three panel from the 1000 Genomes Project were sequenced. Thereafter, an additional 698 samples, related to samples in the 2504 panel, were also sequenced. NYGC aligned the data to GRCh38 and those alignments are publicly available along with a data reuse statement. Details, including URLs for the data in ENA, are in our data portal (below) and are listed on our FTP site.

I downloaded the TSV file report from https://www.ebi.ac.uk/ena/data/view/PRJEB31736; inside it you will find the FTP links. I then downloaded the files I wanted using wget.

$ grep NA12878 filereport_read_run_PRJEB31736_tsv.txt | grep cram
PRJEB31736 SAMN00801888 ERX3266709 ERR3239334 9606 Homo sapiens ftp.sra.ebi.ac.uk/vol1/fastq/ERR323/004/ERR3239334/ERR3239334_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR323/004/ERR3239334/ERR3239334_2.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram
$ wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR323/004/ERR3239334/ERR3239334_1.fastq.gz
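
The grep-and-copy step above can probably be scripted. Here is a minimal sketch that pulls the URLs for one sample out of the file report. It assumes the TSV has a header row and that you pass a column name that exists in it (`fastq_ftp` and `submitted_ftp` are the names I would expect in an ENA report, but check your own file); `ena_urls` is just a hypothetical helper name.

```shell
# Print the download URLs for one sample from an ENA file report (TSV).
# Assumptions: the report has a header row, and the column name passed as
# the third argument (e.g. fastq_ftp or submitted_ftp) is in that header.
# ena_urls is a hypothetical helper name.
ena_urls() {
    awk -F '\t' -v sample="$2" -v column="$3" '
        NR == 1 {                       # locate the requested column by name
            for (i = 1; i <= NF; i++) if ($i == column) col = i
            next
        }
        col && index($0, sample) {      # keep rows mentioning the sample
            n = split($col, urls, ";")  # several files are joined with ";"
            for (j = 1; j <= n; j++) if (urls[j] != "") print "ftp://" urls[j]
        }
    ' "$1"
}
```

You could then pipe the result into wget, for example `ena_urls filereport_read_run_PRJEB31736_tsv.txt NA12878 submitted_ftp | xargs -n 1 wget`. Like the grep above, it matches the sample name anywhere in the row.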