Hi all,
Let me start off by saying that I have very limited knowledge of NGS and I'm truly the definition of a beginner. I have a couple of questions about certain pre-processing steps for sequencing data from ANY platform. Sorry if this has been asked before.
1) Given a fastq file (from any platform), how does one identify the location and orientation of an adaptor sequence?
2) How does one decide the minimum number of bases (bp) needed to infer an adaptor? Put another way: if I grep a given fastq file for the adaptor sequence, the full adaptor obviously doesn't appear in every read, so how short a piece of the adaptor can I grep for?
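To make question 2 concrete, here is a minimal sketch in Python of the kind of check I mean. Everything in it is an assumption for illustration: the adaptor string is just a placeholder (swap in whatever your kit uses), the four reads are made up, and the function names are my own. It counts how many reads contain the first k bases of the adaptor for a few values of k.

```python
import io

def read_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def prefix_hit_counts(sequences, adaptor, lengths):
    """For each k, count reads containing the first k bases of the adaptor."""
    sequences = list(sequences)
    return {k: sum(adaptor[:k] in seq for seq in sequences) for k in lengths}

# Placeholder adaptor, assumed for illustration only.
ADAPTOR = "AGATCGGAAGAGC"

# Made-up reads: full adaptor, partial adaptor at the 3' end,
# a 4-base match that is not adaptor, and no match at all.
toy_reads = [
    "ACGTACGTAGATCGGAAGAGC",
    "TTTTGGGGCCCCAGATCGGA",
    "AGATTTTTCCCCGGGGAAAA",
    "CCCCCCCCGGGGGGGGTTTT",
]
toy_fastq = "".join(
    f"@read{n}\n{seq}\n+\n{'I' * len(seq)}\n"
    for n, seq in enumerate(toy_reads, start=1)
)

counts = prefix_hit_counts(read_sequences(io.StringIO(toy_fastq)), ADAPTOR, [4, 8, 13])
for k, hits in counts.items():
    print(f"first {k:2d} bases of adaptor: {hits} of {len(toy_reads)} reads")
```

With a short prefix such as 4 bases, the third read matches even though it contains no adaptor, which is exactly the trade-off I'm asking about.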
Hopefully these questions make sense.
Thank you so much! This helps a lot!
If you have the time, can you perhaps help me with the following as well:
i) Are any QC measures taken within the lab to measure sequencing success? How do sequencing service providers determine whether samples need to be re-sequenced?
ii) Does adaptor trimming influence the quality of the rest of the data?
iii) Is it better to remove duplicate reads before or after alignment? I've seen SOPs doing it both ways. Does it matter?
iv) Finally, skipping ahead a bit, when one detects variants, how does one determine whether they are novel?
Thank you!
Regarding questions i) and iv):
Is there a particular percentage or ratio that helps one determine sequencing success or failure? Similarly, are there specific off-contamination and duplication rates above which a run is considered bad and needs re-sequencing?
What are currently the largest and most widely used databases for human variants?
This is an immense help, can't thank you enough!!