Question: Pre-processing/QC of sequencing data Q's
3.1 years ago, lzass18 wrote:

Hi all,

Let me start off by saying that I have very limited knowledge of NGS and am truly the definition of a beginner. I have a couple of questions about certain pre-processing steps for sequencing data from ANY platform; sorry if this has been asked before.

1) Given a fastq file (from any platform), how does one identify the location and orientation of an adaptor sequence?

2) How does one define the minimum length (in bp) needed to infer an adaptor? Put another way, if I grep a given fastq file for the adaptor sequence, the full adaptor sequence obviously doesn't appear in all the reads, so how few bases of the adaptor sequence can I grep for?

Hopefully these questions make sense.

sequencing next-gen • 812 views
modified 3.1 years ago by Devon Ryan • written 3.1 years ago by lzass18
3.1 years ago, Devon Ryan (Freiburg, Germany) wrote:
  1. Run it through FastQC, which will mention this in one of its plots. The orientation and location will always be the same (it'll be at the 3' end and probably reverse complemented from the sequence you were given).
  2. 5 bases is probably an OK threshold for most applications. If you're doing bisulfite sequencing then you should be more aggressive (1-2 bases), but for most things you can use local alignment and call it done.
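
As a rough illustration of the two points above, here is a minimal Python sketch that looks for an adaptor at the 3' end of a read and only accepts a truncated adaptor if at least `min_overlap` bases of it are present (5 by default, per the answer). The function names, the exact-match rule and the example adaptor in the usage note are assumptions made for illustration; dedicated trimming tools also tolerate mismatches, which this sketch does not.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_3prime_adaptor(read, adaptor, min_overlap=5):
    """Return the index where the adaptor starts in the read, or None.

    A full-length adaptor may sit anywhere in the read; a truncated one
    can only sit at the very 3' end, where the read ran out before the
    adaptor did. Exact matching only (an assumption of this sketch).
    """
    read, adaptor = read.upper(), adaptor.upper()
    # Case 1: the complete adaptor is inside the read.
    full = read.find(adaptor)
    if full != -1:
        return full
    # Case 2: the read ends with a prefix of the adaptor; try longest first.
    longest = min(len(adaptor) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return len(read) - overlap
    return None


def trim_adaptor(read, adaptor, min_overlap=5):
    """Cut the read at the adaptor start; return it unchanged if none found."""
    start = find_3prime_adaptor(read, adaptor, min_overlap)
    return read if start is None else read[:start]
```

For example, `trim_adaptor("ACGTACGTACAGATCGG", "AGATCGGAAGAGC")` cuts off the trailing 7-base adaptor prefix, while a read ending in only 4 adaptor bases is left alone unless `min_overlap` is lowered. If the adaptor you were given is in the opposite orientation, pass `reverse_complement(adaptor)` instead.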
written 3.1 years ago by Devon Ryan

Thank you so much! This helps a lot!

If you have the time, can you perhaps help me with the following as well:

i) Are any QC measures taken within the lab to measure sequencing success? How do sequencing service providers determine whether samples need to be re-sequenced?

ii) Does adaptor trimming influence the quality of the rest of the data?

iii) Is it better to remove duplicate reads before or after alignment? I've seen SOPs doing it both ways; does it matter?

iv) Finally, skipping ahead a bit, when one detects variants, how does one determine whether they are novel?

Thank you!

written 3.1 years ago by lzass18
  1. It varies. We check off-species contamination rates, duplication rates and everything that FastQC produces. On the machine one checks the % bases >Q30 and the % undetermined indices.
  2. No
  3. After; it's a pain to do before. Normally one doesn't even need to remove duplicates, just mark them, and only if that's important (it isn't always, e.g., in RNAseq).
  4. Compare to large databases.
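
To show why marking duplicates is easier after alignment, here is a minimal Python sketch of the idea: reads that align to the same chromosome, position and strand are grouped, the best-quality read in each group is kept as the representative, and the rest are flagged rather than deleted. The `AlignedRead` record and `mark_duplicates` function are hypothetical names for illustration; real duplicate-marking tools also account for paired-end mates and clipped alignments, which this sketch ignores.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical, simplified alignment record for this sketch."""
    name: str
    chrom: str
    pos: int          # leftmost aligned position
    strand: str       # "+" or "-"
    mean_qual: float  # mean base quality of the read


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Reads sharing (chrom, pos, strand) are treated as one duplicate
    group; the highest-quality read in each group is left unflagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the best-quality read; ties go to the first one seen.
        best = max(group, key=lambda r: r.mean_qual)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates
```

Because the reads are only flagged, downstream steps can decide whether to ignore them, which matches the "mark, don't remove" advice above.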
written 3.1 years ago by Devon Ryan

Regarding questions 1 & 4:

  1. Is there a particular percentage that helps one determine sequencing success/failure? Similarly, are there specific off-species contamination and duplication rates that indicate the sequencing is bad and requires re-sequencing?

  2. What are currently the largest/most widely used databases for human variants?

This is an immense help, can't thank you enough!!

written 3.0 years ago by lzass18
  1. For Q30 bases we're normally expecting >90%, but this can vary a bit by type of machine (talk to your Illumina rep and they can tell you the specs to expect). For off-species contamination, I flag anything above 0.5% (this percent only includes "uniquely aligned reads" as defined by fastq_screen).
  2. I imagine that you can get something large from ExAC.
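
The two thresholds mentioned above can be checked with a few lines of Python. This is only a sketch under stated assumptions: quality strings are taken to be Phred+33 encoded, "Q30 bases" is interpreted as quality 30 or higher, and the function names and flag labels are made up for illustration. The 90% and 0.5% cutoffs are the ones quoted in the answer and may differ for your machine or lab.

```python
def fraction_q30(quality_strings, offset=33, threshold=30):
    """Fraction of bases with Phred quality >= threshold.

    quality_strings: iterable of FASTQ quality lines (4th line of each
    record), assumed to be Phred+33 encoded.
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


def qc_flags(q30_fraction, contamination_fraction,
             min_q30=0.90, max_contamination=0.005):
    """Return a list of QC warnings; an empty list means nothing was flagged."""
    flags = []
    if q30_fraction <= min_q30:          # expecting more than 90% Q30 bases
        flags.append("low_q30")
    if contamination_fraction > max_contamination:  # anything above 0.5%
        flags.append("off_species_contamination")
    return flags
```

Here `contamination_fraction` would be the share of reads uniquely aligned to an unexpected species, as reported by a screening tool; this sketch does not compute it.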
written 3.0 years ago by Devon Ryan