8.8 years ago
rbagnall ★ 1.8k
Download test data from the Illumina HiSeq X Ten at the Garvan Institute, Australia, from the AllSeq website.
FASTQ files are available for NA12878D and NA12878J. BAM files, FastQC reports and Picard MarkDuplicates metrics files are available too.
Files are available without registering until September 30, 2014.
That is weird. Maybe those are the only over-represented sequences. Even the CCCCC is only appearing at 15X the expected rate.
Could also be something about fastqc's sampling (I assume they don't count everything).
Something is definitely wrong there. Their FastQC report indicates useless data.
I got intrigued and generated my own fastqc report. As it turns out mine is quite different than theirs, beyond just using a newer version of FastQC.
In my report the kmer content does actually make sense.
But then of course I can't help but wonder: if we can't even get the same FastQC report out of the data, how are we going to reconcile more complicated information?
Their FastQC report is on the BAM file NA12878D_HiSeqX_R1.bam, rather than the FASTQ file.
If both mapped and unmapped reads are included then using a BAM file should not make any difference.
What will make a difference (apparently a huge one in this case) is something that I have only just realized. When someone runs FastQC on a sorted BAM file, the results may end up biased towards the properties of the reads that map at the start of the genome, whatever those may be. The Kmer and sequence duplication modules only use the first 200,000 reads or 2% of the data. Normally raw data is not ordered in any predictable way relative to the genome.
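To make the point about sorted input concrete, here is a toy simulation. It is not FastQC's code, and everything in it is an assumption for illustration: the made-up "genome" whose first 10% is GC-rich, the read length, the sample size and the function names. It only shows that taking the head of a position-sorted file can give a very different picture than taking the head of the same reads in raw order.

```python
import random


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def make_read(rng, gc_bias, length=50):
    """Build a random read where each base is G/C with probability gc_bias."""
    return "".join(
        rng.choice("GC") if rng.random() < gc_bias else rng.choice("AT")
        for _ in range(length)
    )


def simulate(n_reads=20000, sample_size=2000, seed=0):
    """Compare mean GC of all reads, the raw-order head and the sorted head."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        pos = rng.random()  # hypothetical mapping position, scaled to 0..1
        # Assumption: the first 10% of this toy genome is GC-rich.
        gc_bias = 0.7 if pos < 0.1 else 0.4
        reads.append((pos, make_read(rng, gc_bias)))

    def mean_gc(subset):
        return sum(gc_fraction(seq) for _, seq in subset) / len(subset)

    raw_head = reads[:sample_size]  # sequencer order: no relation to position
    sorted_head = sorted(reads, key=lambda r: r[0])[:sample_size]  # sorted BAM
    return mean_gc(reads), mean_gc(raw_head), mean_gc(sorted_head)


if __name__ == "__main__":
    overall, head_raw, head_sorted = simulate()
    print(f"all reads:   {overall:.3f}")
    print(f"raw head:    {head_raw:.3f}")
    print(f"sorted head: {head_sorted:.3f}")
```

In this toy setup the raw-order head should track the overall GC closely, while the sorted head mostly reflects the GC-rich start of the genome.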
Also, the BAM file contains both reads, not just read 1. I'll run the report for read 2 by tomorrow.
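For anyone wanting to separate the two mates themselves: in the SAM format the FLAG field marks the first read of a pair with bit 0x40 and the second with bit 0x80. Below is a rough sketch that works on SAM text lines (for example, what `samtools view -h` prints). The function name `split_mates` is made up for illustration, and the sketch does not restore the original orientation of reads that were aligned to the reverse strand.

```python
READ1 = 0x40          # first read in the pair
READ2 = 0x80          # second read in the pair
SECONDARY = 0x100     # secondary alignment
SUPPLEMENTARY = 0x800 # supplementary alignment


def split_mates(sam_lines):
    """Split SAM records into (name, sequence) lists for read 1 and read 2.

    Header lines are skipped, and secondary/supplementary records are
    dropped so that each read is counted only once.
    """
    read1, read2 = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        # Note: SEQ is stored reverse-complemented for reverse-strand
        # alignments (flag 0x10); this sketch leaves it as stored.
        if flag & READ1:
            read1.append((name, seq))
        elif flag & READ2:
            read2.append((name, seq))
    return read1, read2
```

Usage would be something like `r1, r2 = split_mates(open("reads.sam"))`, where `reads.sam` is a placeholder file name.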
I have now rerun the FastQC reports on each read file as well as on the BAM file. My plots do not match the reports they have produced.
Weird stuff going on around 50 bases.