Question

Pre-processing/QC of sequencing data Q's

6.9 years ago

lzass18 ▴ 10

Hi all,

Let me start off by saying that I have very limited knowledge of NGS and am truly a beginner. I have a couple of questions about certain steps in the pre-processing of sequencing data from any platform. Sorry if this has been asked before.

1) Given a fastq file (from any platform), how does one identify the location and orientation of an adaptor sequence?

2) How does one define the minimum length (in bp) needed to infer an adaptor? Put another way, if I grep a given fastq file for the adaptor sequence, the full adaptor obviously doesn't appear in all the reads, so how few bases of the adaptor sequence can I grep for?

Hopefully these questions make sense.

sequencing next-gen • 1.5k views

ADD COMMENT • link updated 6.9 years ago by Devon Ryan 104k • written 6.9 years ago by lzass18 ▴ 10

score 2 · Accepted Answer · 2017-06-22

6.9 years ago

Devon Ryan 104k

1) Run it through FastQC, which reports adapter content in one of its plots. The orientation and location will always be the same (it'll be at the 3' end and probably reverse complemented from the sequence you were given).
2) 5 bases is probably an OK threshold for most applications. If you're doing bisulfite sequencing then you should be more aggressive (1-2 bases), but for most things you can use local alignment and call it done.
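
To make the minimum-overlap idea concrete, here is a minimal Python sketch, not what any real trimmer does internally. It assumes exact matching only (real tools use alignment and tolerate mismatches), and the function names, the `min_overlap` parameter, and the example adapter string are all chosen for illustration. It looks for the full adapter anywhere in the read, and otherwise for the longest adapter prefix that ends the read at its 3' end.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_3prime_adapter(read, adapter, min_overlap=5):
    """Return the index where the adapter starts in the read, or None.

    Exact matching only; this is an illustrative assumption, not how
    production trimmers score matches.
    """
    # Case 1: the whole adapter is present somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return pos
    # Case 2: only the first k bases of the adapter fit before the read ends.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return len(read) - k
    return None


def trim_read(read, adapter, min_overlap=5):
    """Cut the read at the adapter start, if an adapter was found."""
    pos = find_3prime_adapter(read, adapter, min_overlap)
    return read if pos is None else read[:pos]


# Example adapter string, used here purely for illustration.
ADAPTER = "AGATCGGAAGAGC"
print(trim_read("ACGTACGTTTGACCA" + "AGATCGG", ADAPTER))  # 7-base overlap is trimmed
print(trim_read("ACGTACGTTTGACCC" + "AGA", ADAPTER))      # 3-base overlap is kept at min_overlap=5
```

Lowering `min_overlap` removes more true adapter remnants but also clips more reads that merely happen to end in the same few bases, which is the trade-off behind the thresholds above. If the sequence you were given does not match, trying `reverse_complement(adapter)` is the obvious next step.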

ADD COMMENT • link 6.9 years ago by Devon Ryan 104k

Thank you so much! This helps a lot!

If you have the time, can you perhaps help me with the following as well:

i) Are any QC measures taken within the lab to measure sequencing success? How do sequencing service providers determine whether samples need to be re-sequenced?

ii) Does adaptor trimming influence the quality of the rest of the data?

iii) Is it better to remove duplicate reads before or after alignment? I've seen SOPs doing it both ways. Does it matter?

iv) Finally, skipping ahead a bit, when one detects variants, how does one determine whether they are novel?

Thank you!

ADD REPLY • link 6.9 years ago by lzass18 ▴ 10

i) It varies. We check off-species contamination rates, duplication rates and everything that FastQC produces. On the machine one checks the % bases >Q30 and the % undetermined indices.
ii) No.
iii) After; it's a pain to do before alignment. Normally one doesn't even need to remove duplicates, just mark them if that's important (it's not always, e.g., in RNAseq).
iv) Compare them to large databases.
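
For iv), a minimal Python sketch of what "compare to large databases" can boil down to, assuming both your calls and the database entries have already been reduced to `(chrom, pos, ref, alt)` tuples. The function name and the coordinates below are made up for illustration; a real comparison also has to deal with variant normalisation and matching genome builds.

```python
def split_known_and_novel(called, known):
    """Split called variants into those present in the known set and the rest.

    Each variant is a (chrom, pos, ref, alt) tuple; this key format is an
    assumption made for the sketch.
    """
    known_keys = set(known)
    seen = [v for v in called if v in known_keys]
    novel = [v for v in called if v not in known_keys]
    return seen, novel


# Made-up example variants, not real database entries.
called_variants = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T")]
database_variants = [("chr1", 1000, "A", "G"), ("chr3", 3000, "G", "A")]

seen, novel = split_known_and_novel(called_variants, database_variants)
print("known:", seen)
print("candidate novel:", novel)
```

Anything left in `novel` is only "not in the databases you checked", so it is a candidate rather than a confirmed novel variant.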

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

Regarding questions i) and iv):

i) Is there a particular percentage that helps one determine sequencing success or failure? Similarly, are there specific off-species contamination and duplication rates that indicate a run is bad and requires re-sequencing?
iv) What are currently the largest or most widely used databases for human variants?

This is an immense help, can't thank you enough!!

ADD REPLY • link 6.9 years ago by lzass18 ▴ 10

i) For Q30 bases we're normally expecting >90%, but this can vary a bit by type of machine (talk to your Illumina rep and they can tell you the specs to expect). For off-species contamination, I flag anything above 0.5% (this percentage counts only "uniquely aligned reads" as defined by fastq_screen).
iv) I imagine that you can get something large from ExAC.
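
If you want to check the Q30 figure yourself from a fastq file rather than from the machine's report, a rough Python sketch follows. It assumes Phred+33 quality encoding and plain four-line fastq records, and it counts bases with quality of 30 or higher; the function names are illustrative, not from any library.

```python
def fastq_qualities(lines):
    """Yield the quality string (every 4th line) from four-line fastq records."""
    for i, line in enumerate(lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def q30_fraction(quality_strings, offset=33, threshold=30):
    """Fraction of bases whose Phred score is >= threshold (Phred+33 assumed)."""
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Two toy records: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "II##"]
print(q30_fraction(fastq_qualities(records)))  # 0.75
```

For a real file you would pass an open file handle instead of the `records` list, for example `q30_fraction(fastq_qualities(open(path)))`, where `path` is your own fastq file.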

ADD REPLY • link 6.9 years ago by Devon Ryan 104k