News: Free HiSeq X Ten human genome fastq test data
1
7
8.5 years ago
rbagnall ★ 1.8k

Test data from the Illumina HiSeq X Ten at the Garvan Institute, Australia, can be downloaded from the AllSeq website.

FASTQ files are available for NA12878D and NA12878J. BAM files, FastQC reports and Picard MarkDuplicates metrics files are available too.

The files are available without registering until September 30, 2014.

http://allseq.com/x-ten-test-data

HiSeq-X-Ten Human-genome Fastq next-gen • 5.1k views
5
8.5 years ago

I am looking at the plot of kmer content. It looks a bit ... crazy ... as if most of the data were made of just a few patterns.

https://dnanexus-rnd.s3.amazonaws.com/NA12878-xten/fastqc-statistics/NA12878D_HiSeqX_R1.stats-fastqc.html#M9

0

That is weird. Maybe those are the only over-represented sequences. Even CCCCC only appears at 15X the expected rate.

Could also be something about fastqc's sampling (I assume they don't count everything).
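
To make an "X times the expected rate" figure concrete, here is a minimal Python sketch of an observed/expected k-mer ratio. It assumes the expected count comes from the overall base composition of the reads, with bases treated as independent; that illustrates the idea and is not necessarily FastQC's exact calculation. The function name `kmer_enrichment` and the toy reads are made up for illustration.

```python
from collections import Counter

def kmer_enrichment(reads, k=5):
    """Return {kmer: observed / expected} for every k-mer seen in `reads`.

    Assumption: the expected count is derived from the overall base
    frequencies, treating each base as independent of its neighbours.
    """
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        base_counts.update(read)
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())
    base_freq = {b: n / total_bases for b, n in base_counts.items()}

    ratios = {}
    for kmer, observed in kmer_counts.items():
        prob = 1.0
        for base in kmer:
            prob *= base_freq[base]
        ratios[kmer] = observed / (prob * total_kmers)
    return ratios

# Toy reads (hypothetical, not taken from the NA12878 data).
toy_reads = ["ACGTACGTAC", "CCCCCCCCCC", "GATTACAGAT"]
toy_ratios = kmer_enrichment(toy_reads, k=5)
for kmer, ratio in sorted(toy_ratios.items(), key=lambda kv: -kv[1])[:3]:
    print(kmer, round(ratio, 1))
```

A ratio well above 1 means the k-mer shows up more often than base composition alone would predict. If only a subset of reads is counted, the ratio describes that subset rather than the whole library.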

0

Something is definitely wrong there. Their FastQC report indicates useless data.

I got intrigued and generated my own FastQC report. As it turns out, mine is quite different from theirs, and not just because I used a newer version of FastQC.

http://apollo.huck.psu.edu/data/NA12878D_HiSeqX_R1_fastqc.html

In my report the kmer content does actually make sense.

But then of course I can't help but wonder: if we can't even get the same FastQC report out of the data, how are we going to reconcile more complicated information?

0

Their FastQC report was run on the BAM file NA12878D_HiSeqX_R1.bam, rather than on the FASTQ file.

1

If both mapped and unmapped reads are included then using a BAM file should not make any difference.

What will make a difference (apparently a huge one in this case) is something that I have only realized at this very moment. When someone runs FastQC on a sorted BAM file, the results may end up biased towards the properties of the reads that map to the start of the genome, whatever those may be. The Kmer and sequence duplication analyses only use the first 200,000 reads or 2% of the data. Normally raw data is not ordered in any predictable way relative to the genome.
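
A toy simulation can illustrate the effect. The sketch below is not FastQC and does not use the real data: the genome, the repeat unit, the read length and the subset size are all made-up assumptions. It only compares how much the first N reads are dominated by a few k-mers when the reads are in coordinate order versus their original order.

```python
import random
from collections import Counter

def top_kmer_fraction(reads, k=5, top=3):
    """Fraction of all k-mer occurrences taken by the `top` most common k-mers."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    return sum(n for _, n in counts.most_common(top)) / total

rng = random.Random(0)  # fixed seed so the toy run is repeatable

# Hypothetical genome: a repetitive stretch at the start, random bases after it.
genome = "TTAGGG" * 500 + "".join(rng.choice("ACGT") for _ in range(27000))

read_len, n_reads = 50, 5000
starts = [rng.randrange(len(genome) - read_len) for _ in range(n_reads)]
reads_unsorted = [genome[s:s + read_len] for s in starts]          # FASTQ-like order
reads_sorted = [genome[s:s + read_len] for s in sorted(starts)]    # coordinate-sorted order

subset = 200  # stand-in for "only the first N reads are examined"
sorted_frac = top_kmer_fraction(reads_sorted[:subset])
unsorted_frac = top_kmer_fraction(reads_unsorted[:subset])
print(f"first {subset} reads, coordinate-sorted: {sorted_frac:.2f}")
print(f"first {subset} reads, original order:    {unsorted_frac:.2f}")
```

In this toy setup the coordinate-sorted subset comes entirely from the repetitive start of the genome, so a handful of k-mers dominate it, while the same number of reads taken in the original order is far more diverse.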

Also, the BAM file contains both reads, not just read1. I'll run the report for read2 by tomorrow.
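
One way to check what a BAM file actually contains is to tally the FLAG field of its records. The sketch below works on SAM-formatted text lines and uses the standard FLAG bits from the SAM specification; the function name `tally_flags` and the example records are hypothetical, and skipping secondary and supplementary alignments is my own choice here.

```python
from collections import Counter

# Standard SAM FLAG bits (SAM specification).
UNMAPPED = 0x4
READ1 = 0x40
READ2 = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

def tally_flags(sam_lines):
    """Count read1/read2 and mapped/unmapped records in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once (assumption of this sketch)
        if flag & READ1:
            counts["read1"] += 1
        elif flag & READ2:
            counts["read2"] += 1
        else:
            counts["unpaired"] += 1
        counts["unmapped" if flag & UNMAPPED else "mapped"] += 1
    return counts

# Made-up records: only the first two columns matter for this tally.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
example_counts = tally_flags(example)
print(dict(example_counts))
```

On a real file the lines could come from `samtools view -h` piped into a script that passes `sys.stdin` to `tally_flags`.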

0

I have now rerun the FastQC reports on each read file as well as on the BAM file. My plots do not match the reports they have produced.

0

Weird stuff going on around 50 bases.