Question: What Are Some Sanity Checks That Should Be Performed On NGS Data?
8.3 years ago by
Philadelphia, PA
Jeremy Leipzig (18k) wrote:

I am compiling a wish list of analyses that should be run for every data set coming out of a sequencing facility, regardless of whether the sequence is for RNA-Seq, SNP calling, ChIP-Seq, or possibly de novo sequencing. The goal is to scan for red flags that could indicate something has gone awry either in the lab or downstream. I want a list of "sanity checks" that covers both sequence quality analysis and what can be gleaned from alignments.

For example,

  • Sequence QA - basecalling bias, read quality, yield, throughput, GC bias, 5'/3' motifs?, restriction enzyme bias

  • Barcode distribution (if barcoded)

  • Alignment QA

    • chromosome bias,
    • annotational biases (whether experimentally induced or not)
      • genes, repeats, CpG islands, epigenetic markers, expression

I am sure this has already been implemented at a lot of the bigger sequencing cores - I just need a definitive list. Of course, many of these sanity checks will be triggered by the experiments themselves - the point is to develop a comprehensive checklist of analyses that will encompass both what we expect to see as well as what we don't.

modified 16 days ago by Biostar ♦♦ (20) • written 8.3 years ago by Jeremy Leipzig (18k)
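A few of the "Sequence QA" items in the question (yield, GC content, read quality by position, base composition by position) can be computed directly from a FASTQ file. The following is only a minimal sketch: it assumes four-line FASTQ records and Phred+33 quality encoding, and the function names `read_fastq` and `sequence_qa` are made up for illustration. FastQC, mentioned in the answers, covers these checks far more thoroughly.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (sequence, quality) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def sequence_qa(handle, phred_offset=33):
    """Summarise yield, GC fraction, and per-position quality and base counts."""
    n_reads = 0
    n_bases = 0
    gc_bases = 0
    qual_sums = []    # qual_sums[i] = summed Phred scores at read position i
    depth = []        # depth[i] = number of reads that reach position i
    base_counts = []  # base_counts[i] = Counter of bases seen at position i
    for seq, qual in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        gc_bases += sum(base in "GCgc" for base in seq)
        for i, (base, q) in enumerate(zip(seq, qual)):
            if i == len(depth):
                qual_sums.append(0)
                depth.append(0)
                base_counts.append(Counter())
            qual_sums[i] += ord(q) - phred_offset  # assumes Phred+33 by default
            depth[i] += 1
            base_counts[i][base.upper()] += 1
    return {
        "reads": n_reads,
        "bases": n_bases,
        "gc_fraction": gc_bases / n_bases if n_bases else 0.0,
        "mean_quality_by_position": [s / d for s, d in zip(qual_sums, depth)],
        "base_counts_by_position": base_counts,
    }


# Two toy reads, invented only to show the call.
demo = StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCC\n+\nII55\n"
)
summary = sequence_qa(demo)
print(summary["reads"], summary["gc_fraction"])
print(summary["mean_quality_by_position"])
```

A strongly skewed base composition at particular positions, or a quality profile that collapses early in the read, would be the kind of red flag the question asks about.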

Any final words on the final definitive list? I have been working on a way to show "positional diversity" in FASTQ reads: basically an analysis of the diversity of k-mers.

written 7.7 years ago by Justin Brown (40)

great program, although I think raw tabular output would be welcome to developers

written 6.9 years ago by Jeremy Leipzig (18k)

great question! also interested to see strand bias relative to annotations for RNA-Seq.

written 8.3 years ago by brentp (22k)

@Bio_X2Y, that hexamer bias is visible in the FastQC output.

written 8.3 years ago by brentp (22k)

I suppose a k-mer analysis for every dataset would not be unreasonable.

written 8.3 years ago by Jeremy Leipzig (18k)

I'm not sure if there's value in checking for this, but different platforms can introduce different biases. E.g. Illumina's random priming isn't really random:

written 8.3 years ago by Bio_X2Y (3.6k)

@brentp - thanks! I was aiming to highlight that platform-specific biases exist, I just used this as an example because I don't know of any others :)

written 8.3 years ago by Bio_X2Y (3.6k)
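The k-mer ideas in the comments above (positional diversity, hexamer bias near the start of reads) could be explored with something like the sketch below. It is not the program discussed in the comments: the function name `positional_kmer_diversity` and the choice of Shannon entropy as the diversity measure are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def positional_kmer_diversity(reads, k=6):
    """Return the Shannon entropy (bits) of the k-mer distribution at each read offset."""
    counts = []  # counts[i] = Counter of k-mers that start at offset i
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            if i == len(counts):
                counts.append(Counter())
            counts[i][read[i:i + k]] += 1
    entropies = []
    for counter in counts:
        total = sum(counter.values())
        # Shannon entropy: sum over k-mers of p * log2(1/p)
        entropies.append(
            sum((n / total) * log2(total / n) for n in counter.values())
        )
    return entropies


# Toy reads, invented for the example: identical at offset 0, varied at offset 1.
print(positional_kmer_diversity(["AAC", "AAG", "AAT", "AAA"], k=2))
```

Under this measure, offsets where the entropy is clearly lower than in the rest of the read would be candidates for priming or adapter-related bias. Entropy also depends on how many reads reach each offset, so offsets near the end of variable-length reads should be compared with care.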
8.3 years ago by
Bio_X2Y (3.6k) wrote:

We use FastQC to perform a barrage of quality checks - you might get some useful ideas there.

We also quantify the amount of rRNA reads in our Illumina GA datasets - we hope to see around 4-6%.

written 8.3 years ago by Bio_X2Y (3.6k)
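The rRNA check in the answer above comes down to a simple proportion. As a rough sketch, assuming the number of reads aligning to rRNA has already been obtained elsewhere (the function name `rrna_check` and the counts below are invented; the 4-6% window is the range quoted in the answer and will differ between protocols):

```python
def rrna_check(rrna_reads, total_reads, expected=(0.04, 0.06)):
    """Return the rRNA read fraction and whether it falls inside the expected range."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    fraction = rrna_reads / total_reads
    low, high = expected
    return fraction, low <= fraction <= high


# Made-up counts, purely to show the call.
fraction, ok = rrna_check(rrna_reads=50_000, total_reads=1_000_000)
print(f"rRNA fraction: {fraction:.1%}, within expected range: {ok}")
```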

+1 for FastQC, it's the starting point for all our analyses.

written 8.3 years ago by brentp (22k)

People landing here, check out MultiQC - it works with FastQC to make a nice combined report for all reads in a directory.

written 6 weeks ago by chris86 (240)

Yes, I am seeing a few checks I hadn't listed, as well as some great ideas for visualization:

  • sequence length distribution
  • sequence duplication levels
  • overrepresented sequences

written 8.3 years ago by Jeremy Leipzig (18k)
8.3 years ago by
Aaron Statham (1.1k) wrote:

In addition to running FastQC on every lane of sequencing, in my mapping pipeline I record the number of:

  • Raw (purity filtered) reads
  • Unmappable reads
  • Multimapping reads
  • Uniquely mapping reads
  • Final number of reads after removing duplicates with Picard

These metrics tell us a few things. For example, certain types of experiments are expected to have more multimapping reads (DNA methylation pull-downs), and the percentage of reads removed as duplicates really goes up when we're scraping the bottom of the tube in terms of how much template we manage to get into library prep. Of course, interpretation of these numbers really depends on the biological experiment going on.

written 8.3 years ago by Aaron Statham (1.1k)
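The read categories listed in the answer above could be tallied from SAM text roughly as follows. This is a sketch under stated assumptions, not the answerer's pipeline: it assumes duplicates have already been flagged (for example by Picard MarkDuplicates, which sets the 0x400 flag bit), that the aligner writes an `NH:i` tag giving the number of reported hits (many aligners do not, and some people use MAPQ instead), and that "after removing duplicates" refers to uniquely mapping reads. Each primary record is counted once, so the two mates of a pair count as two reads. The names `mapping_summary` and `_record` are hypothetical.

```python
from collections import Counter

# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def mapping_summary(sam_lines):
    """Tally reads into unmapped / multimapping / unique categories from SAM lines."""
    tally = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        tally["total"] += 1
        if flag & UNMAPPED:
            tally["unmapped"] += 1
            continue
        # Assumption: the aligner reports the hit count in an NH:i tag.
        hits = 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                hits = int(tag[5:])
        if hits > 1:
            tally["multimapping"] += 1
        else:
            tally["unique"] += 1
            if not flag & DUPLICATE:
                tally["unique_non_duplicate"] += 1
    return tally


def _record(name, flag, *tags):
    # Build a toy SAM record: 11 mandatory fields plus optional tags.
    return "\t".join([name, str(flag), "chr1", "100", "60", "4M", "*", "0", "0",
                      "ACGT", "IIII", *tags])


demo = [
    "@HD\tVN:1.6",
    _record("r1", 0, "NH:i:1"),     # uniquely mapped
    _record("r2", 4),               # unmapped
    _record("r3", 0, "NH:i:3"),     # multimapping
    _record("r4", 1024, "NH:i:1"),  # uniquely mapped, flagged as duplicate
    _record("r3", 256, "NH:i:3"),   # secondary record, skipped
]
print(dict(mapping_summary(demo)))
```

Tracking these counts per lane over time makes it easier to spot a library whose unmapped or duplicate share is out of line with similar experiments.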

I'm working with viral samples and I've been running up against this duplicate read problem. Do you know of a reference where this low concentration of DNA and high duplicate read count is discussed? Many thanks.

written 8.3 years ago by Jake (150)
7.8 years ago by
Paige (40) wrote:

I definitely agree with the above post. Another very useful metric is the library complexity; this can be generated by running Picard's MarkDuplicates tool.

written 7.8 years ago by Paige (40)

I agree that library complexity is an important metric. However, simply counting duplicates is perhaps an overly simplistic assessment of library complexity. At the very least the number of duplicates should be relative to the total number of reads in the library...

written 7.3 years ago by Malachi Griffith (17k)
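To make the point above concrete, a duplicate count can be expressed relative to the total, and the two numbers together can be turned into a rough library-size estimate. The sketch below solves the Lander-Waterman relation C/X = 1 - exp(-N/X) for X, where N is the number of read pairs examined, C the number of distinct pairs observed, and X the number of distinct molecules in the library. As far as I know this is the relation behind the estimated library size that Picard reports, but the code is an independent illustration, not Picard's implementation, and the function names are made up.

```python
from math import expm1


def duplicate_fraction(total, unique):
    """Fraction of observations that duplicate an already-seen molecule."""
    return (total - unique) / total


def estimate_library_size(total, unique, iterations=200):
    """Solve unique/X = 1 - exp(-total/X) for X (distinct molecules) by bisection."""
    if not 0 < unique < total:
        raise ValueError("need 0 < unique < total (at least one duplicate)")

    def expected_unique(size):
        # Expected number of distinct molecules seen after `total` draws
        # from a library of `size` molecules: size * (1 - exp(-total/size)).
        return -size * expm1(-total / size)

    # expected_unique grows with size, so bracket the root and bisect.
    low = high = float(unique)
    while expected_unique(high) < unique:
        high *= 2
    for _ in range(iterations):
        mid = (low + high) / 2
        if expected_unique(mid) < unique:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Invented counts, only to show the calls.
print(duplicate_fraction(total=1_000_000, unique=800_000))
print(round(estimate_library_size(total=1_000_000, unique=800_000)))
```

The model assumes every molecule is equally likely to be sequenced, which real libraries violate (PCR and GC bias, for instance), so the estimate is best read as a relative indicator across comparable libraries rather than an absolute count.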