Question

Snp Indel Discovery - Critique A Proposed Workflow

13

Entering edit mode

13.0 years ago

Travis ★ 2.8k

Hi all,

I am very new to this kind of work so I would appreciate some opinions on the early stages of a SNP/Indel discovery workflow.

Formulate list of genes of interest.
Design custom capture for all exons from these genes.
Sequence (Illumina, Paired End, 100bp reads, 50x coverage minimum)
Remove duplicate paired end reads.
Align to genome (Bfast to allow gaps and find indels as well as SNPs)
Use SAMTools mpileup with varfilt to remove SNPs with a quality score less than 20 or indels with a score less than 50.
Remove SNPs with low coverage (less than 30x?)
Proceed to association type study (details to be worried about later - my major concern is the upstream, NGS stuff at the moment as I have never done it before).

All comments welcome. Thanks in advance!

snp association samtools • 4.3k views

ADD COMMENT • link updated 13.0 years ago by Doctoroots ▴ 800 • written 13.0 years ago by Travis ★ 2.8k

0

Entering edit mode

Hi Travis. This all looks well thought of and solid to me. I don't know what I would add other than discussing assembly details and such. Would you have more specific questions about some parts of this workflow?

ADD REPLY • link 13.0 years ago by Eric Normandeau 11k

0

Entering edit mode

In case you are working with tumors, I would add that you should sequence the germline matched DNA to substract the germline snps and be able to distinguish the somatic events.

ADD REPLY • link 13.0 years ago by toni ★ 2.2k

0

Entering edit mode

Thanks a lot for the responses guys!

Tony: This study won't be cancer-related. Perhaps a good thing as it sounds like that complicates things.

Eric: More specifics are welcome. I am a complete newbie - I have never run any of these tools before!

ADD REPLY • link 13.0 years ago by Travis ★ 2.8k

0

Entering edit mode

Minor detail: it's easier to remove duplicates after aligning.

ADD REPLY • link 13.0 years ago by Mikael Huss 4.8k

0

Entering edit mode

Come to think of it, I don't even know.how to remove duplicate pairs yet! Any pointers? :)

ADD REPLY • link 13.0 years ago by Travis ★ 2.8k

0

Entering edit mode

I am not sure how removing exact paired-end duplicates is going to change anything. Moreover, if you do exon capture, you may end up with too short sequences to do paired-end, at-least if you use an array with oligos representing your exons on it. Since you have no specific question in there, it is a bit hard to discuss specific details.

ADD REPLY • link 13.0 years ago by Eric Normandeau 11k

0

Entering edit mode

Dont for get:

7.5. What to do when we got too many / no sensible variants.

ADD REPLY • link 13.0 years ago by brentp 24k

score 6 · Answer 1 · 2011-05-03

You may find the Broad's Best Practices for Variant Detection of some use. You can do duplicate detection with Picard's MarkDuplicates.

One step you're missing is local realignment around InDels. Quoting the Broad:

The Multiple Sequence Alignment (MSA) realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual’s genome with respect to the reference genome (see Figure). Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such at mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect the true indel, also requiring realignment.

score 4 · Answer 2 · 2011-05-03

4

Entering edit mode

13.0 years ago

Doctoroots ▴ 800

I agree that your workflow is indeed comprehensive and well thought. some additional considerations to add to those stated by David:

i would consider setting a lower limit for SNP inclusion (less than X30). you should be mindful that setting a higher lower limit was shown to decrease sensitivity without a substantial gain in specificity and should be considered in your specific experimental context.

comparing your called SNPs against the dbSNP data base is also considered good practice, if the proportion of novel SNPs in your data is suspiciously high (>10% in coding regions), you should consider setting a higher quality bar.

SNP calling quality could also be assessed by checking the Transition/Transversion ratio which is expected to be around 3.3 in whole exome sequencing.

ADD COMMENT • link 13.0 years ago by Doctoroots ▴ 800

0

Entering edit mode

These are changes and additions I'll definitely take into account. The sequence provider was offering x150 coverage on our selected genes! Do you think I should perhaps increase the number of genes/decrease the coverage?

ADD REPLY • link 13.0 years ago by Travis ★ 2.8k

0

Entering edit mode

Hi Travis, i believe x150 is more than enough, and if you have additional genes you are interested in, you should consider adding them at the cost of lowering the coverage (financially speaking). These 2 papers: http://nar.oxfordjournals.org/content/36/16/e105.short http://nar.oxfordjournals.org/content/35/15/e97.short claim coverage of x20 is sufficient for SNP calling. i would say that if you extend this to x50 you obtain sufficient coverage.

ADD REPLY • link 13.0 years ago by Doctoroots ▴ 800