Question: Snp Indel Discovery - Critique A Proposed Workflow
gravatar for Travis
8.6 years ago by
Travis2.8k wrote:

Hi all,

I am very new to this kind of work so I would appreciate some opinions on the early stages of a SNP/Indel discovery workflow.

  1. Formulate list of genes of interest.
  2. Design custom capture for all exons from these genes.
  3. Sequence (Illumina, Paired End, 100bp reads, 50x coverage minimum)
  4. Remove duplicate paired end reads.
  5. Align to genome (Bfast to allow gaps and find indels as well as SNPs)
  6. Use SAMTools mpileup with varfilt to remove SNPs with a quality score less than 20 or indels with a score less than 50.
  7. Remove SNPs with low coverage (less than 30x?)
  8. Proceed to association type study (details to be worried about later - my major concern is the upstream, NGS stuff at the moment as I have never done it before).

All comments welcome. Thanks in advance!

association samtools snp • 3.1k views
ADD COMMENTlink modified 8.6 years ago by Doctoroots780 • written 8.6 years ago by Travis2.8k

Hi Travis. This all looks well thought of and solid to me. I don't know what I would add other than discussing assembly details and such. Would you have more specific questions about some parts of this workflow?

ADD REPLYlink written 8.6 years ago by Eric Normandeau10k

In case you are working with tumors, I would add that you should sequence the germline matched DNA to substract the germline snps and be able to distinguish the somatic events.

ADD REPLYlink written 8.6 years ago by toni2.1k

Thanks a lot for the responses guys!

Tony: This study won't be cancer-related. Perhaps a good thing as it sounds like that complicates things.

Eric: More specifics are welcome. I am a complete newbie - I have never run any of these tools before!

ADD REPLYlink written 8.6 years ago by Travis2.8k

Minor detail: it's easier to remove duplicates after aligning.

ADD REPLYlink written 8.6 years ago by Mikael Huss4.7k

Come to think of it, I don't even to remove duplicate pairs yet! Any pointers? :)

ADD REPLYlink written 8.6 years ago by Travis2.8k

I am not sure how removing exact paired-end duplicates is going to change anything. Moreover, if you do exon capture, you may end up with too short sequences to do paired-end, at-least if you use an array with oligos representing your exons on it. Since you have no specific question in there, it is a bit hard to discuss specific details.

ADD REPLYlink written 8.6 years ago by Eric Normandeau10k

Dont for get:

7.5. What to do when we got too many / no sensible variants.

ADD REPLYlink written 8.6 years ago by brentp23k
gravatar for David Quigley
8.6 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

You may find the Broad's Best Practices for Variant Detection of some use. You can do duplicate detection with Picard's MarkDuplicates.

One step you're missing is local realignment around InDels. Quoting the Broad:

The Multiple Sequence Alignment (MSA) realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual’s genome with respect to the reference genome (see Figure). Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such at mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect the true indel, also requiring realignment.

ADD COMMENTlink written 8.6 years ago by David Quigley11k

Excellent pointers and I'll definitely look in that direction! A great help :)

ADD REPLYlink written 8.6 years ago by Travis2.8k
gravatar for Doctoroots
8.6 years ago by
Doctoroots780 wrote:

I agree that your workflow is indeed comprehensive and well thought. some additional considerations to add to those stated by David:

i would consider setting a lower limit for SNP inclusion (less than X30). you should be mindful that setting a higher lower limit was shown to decrease sensitivity without a substantial gain in specificity and should be considered in your specific experimental context.

comparing your called SNPs against the dbSNP data base is also considered good practice, if the proportion of novel SNPs in your data is suspiciously high (>10% in coding regions), you should consider setting a higher quality bar.

SNP calling quality could also be assessed by checking the Transition/Transversion ratio which is expected to be around 3.3 in whole exome sequencing.

ADD COMMENTlink modified 8.6 years ago • written 8.6 years ago by Doctoroots780

These are changes and additions I'll definitely take into account. The sequence provider was offering x150 coverage on our selected genes! Do you think I should perhaps increase the number of genes/decrease the coverage?

ADD REPLYlink written 8.6 years ago by Travis2.8k

Hi Travis, i believe x150 is more than enough, and if you have additional genes you are interested in, you should consider adding them at the cost of lowering the coverage (financially speaking). These 2 papers: claim coverage of x20 is sufficient for SNP calling. i would say that if you extend this to x50 you obtain sufficient coverage.

ADD REPLYlink written 8.6 years ago by Doctoroots780
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1228 users visited in the last hour