Pre-Processing Of SOLiD Data For GATK SNP Calling
11.3 years ago
William ★ 5.3k

When you look at the GATK best-practices workflow, there are 3 pre-processing steps (a rough sketch of the corresponding commands follows the list):

1) Duplicate marking

2) Local realignment

3) Base quality recalibration
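
For reference, here is a minimal sketch of what those three steps typically look like as a pipeline, using GATK 2/3-era tool names via Picard and GenomeAnalysisTK; the jar locations, file names, and the known-sites file are placeholders, and the exact flags depend on the versions you actually run:

```python
# Sketch of the three GATK best-practices pre-processing steps.
# Jar paths, file names and the known-sites VCF are placeholders;
# check the documentation of the Picard/GATK versions you use.
import subprocess

REF = "reference.fasta"   # placeholder reference
DBSNP = "dbsnp.vcf"       # placeholder known-sites file for recalibration

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

# 1) Duplicate marking (Picard)
run(["java", "-jar", "picard.jar", "MarkDuplicates",
     "INPUT=mapped.bam", "OUTPUT=dedup.bam", "METRICS_FILE=dup_metrics.txt"])

# 2) Local realignment around indels (two GATK walkers)
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "RealignerTargetCreator",
     "-R", REF, "-I", "dedup.bam", "-o", "realign.intervals"])
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "IndelRealigner",
     "-R", REF, "-I", "dedup.bam",
     "-targetIntervals", "realign.intervals", "-o", "realigned.bam"])

# 3) Base quality score recalibration
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
     "-R", REF, "-I", "realigned.bam",
     "-knownSites", DBSNP, "-o", "recal.table"])
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads",
     "-R", REF, "-I", "realigned.bam",
     "-BQSR", "recal.table", "-o", "recalibrated.bam"])
```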

Now, I get conflicting information from different people on whether or not to use these pre-processing steps for sequencing data produced on the SOLiD platform. What do other SOLiD and GATK users do?

Of course, it already makes a difference whether BWA or LifeScope was used as the mapper, since LifeScope clips low-quality read tails while BWA tries to map whole reads, including the low-quality tails.

Are there SOLiD-specific characteristics that warrant skipping certain pre-processing steps, or using different ones?

GATK workflow

gatk solid bam

Do SOLiD systems produce FASTQ files for forward and reverse reads? I have only worked with Illumina data in the GATK SNP-calling pipeline. I believe preprocessing is also required for SOLiD.


Native SOLiD output is XSQ, which LifeScope can use for mapping. There are tools to convert the XSQ file to csfasta and FASTQ so you can use other mappers.

11.3 years ago

In our experience, we have found these particular pre-processing steps very helpful in terms of the final results obtained. Although LifeScope is able to detect a few small things that GATK doesn't, the GATK-recommended steps seem to help in two ways: first, indel realignment significantly improves the detection of small indels; second, removing duplicates and recalibrating base qualities should not only improve the variant calling itself but also improves the variant quality scores produced by GATK. For the variant prioritization you will have to go through later on, this definitely helps.

As a side note, we were expecting GATK to be much more sensitive, so that LifeScope's variants would be a very large subset of GATK's; instead, both programs have their own set of private variants on top of the large number of variants they share. We have sometimes seen that taking the union of these sets, rather than the intersection one would naturally take to reduce the false-positive rate (at the price of more false negatives, which in a clinical environment cannot be accepted so easily), does not really increase the false-positive rate and does reduce the false-negative rate. This is not a widespread view, and I just wanted to share it here with you. Of course, you should check those rates against genotyping controls if you have them (as we did in some experiments).
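
To make that set logic concrete, here is a toy sketch (hypothetical file names, sites reduced to chrom/pos/ref/alt, no genotype matching; a real comparison should use proper VCF parsing) of checking the intersection and the union of a LifeScope call set and a GATK call set against a set of genotyping controls:

```python
# Toy comparison of two call sets against genotyping controls.
# File names are hypothetical placeholders.

def load_sites(vcf_path):
    """Return the set of (chrom, pos, ref, alt) tuples in a simple VCF."""
    sites = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            sites.add((chrom, int(pos), ref, alt))
    return sites

lifescope = load_sites("lifescope_calls.vcf")
gatk = load_sites("gatk_calls.vcf")
controls = load_sites("genotyping_controls.vcf")  # trusted genotyped sites

for label, calls in [("intersection", lifescope & gatk),
                     ("union", lifescope | gatk)]:
    true_pos = len(calls & controls)
    sensitivity = true_pos / len(controls) if controls else float("nan")
    print(f"{label}: {len(calls)} calls, "
          f"{true_pos}/{len(controls)} control sites recovered "
          f"(sensitivity {sensitivity:.2%})")
```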


Thanks for the help. I don't understand your first sentence; I think you may have missed a word.


You are completely right. I meant to say that "we have found those pre-processing steps very helpful".

11.3 years ago
brentp 24k

I used those steps, with or without duplicate marking, depending on the depth of coverage.

BWA and SHRiMP seem to perform best, both in terms of setting reasonable mapping and base qualities and in terms of the actual variants called.

I have a script that I use to trim the reads (input is .csfasta and .qual); the output is either a FASTQ for BWA or a FASTQ for SHRiMP. The script is here: https://github.com/brentp/bio-playground/blob/master/solidstuff/solid-trimmer.py

The trimming seems to improve calls, even when using BWA's own trimming option.
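
For illustration, here is a much-simplified sketch of the same idea, not the linked script itself: trim each colour-space read at the first low-quality colour call and write FASTQ. The quality cutoff, minimum length, and the colour-to-base "double encoding" used for BWA here are assumptions made for this example.

```python
# Simplified illustration of quality-trimming SOLiD colour-space reads
# (.csfasta + .qual) into FASTQ -- NOT the linked solid-trimmer.py.
QUAL_CUTOFF = 10        # trim at the first colour call below this quality
MIN_LENGTH = 25         # discard reads that end up shorter than this
ENCODE = {"0": "A", "1": "C", "2": "G", "3": "T", ".": "N"}

def records(path):
    """Yield (name, data) pairs from a csfasta or qual file
    (assumes one data line per record, as both formats normally have)."""
    name = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                name = line[1:]
            else:
                yield name, line

# Assumes the .csfasta and .qual files list the reads in the same order.
with open("reads.fastq", "w") as out:
    for (name, cs), (_, quals) in zip(records("reads.csfasta"),
                                      records("reads.qual")):
        colors = cs[1:]                              # drop the primer base
        q = [int(x) for x in quals.split()]
        keep = next((i for i, qv in enumerate(q) if qv < QUAL_CUTOFF), len(q))
        if keep < MIN_LENGTH:
            continue
        seq = "".join(ENCODE.get(c, "N") for c in colors[:keep])
        qual_str = "".join(chr(33 + max(0, min(qv, 40))) for qv in q[:keep])
        out.write("@%s\n%s\n+\n%s\n" % (name, seq, qual_str))
```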


Be aware that whether duplicates go into the variant-calling algorithm shouldn't depend on the coverage. If a read is a duplicate it should generally be removed from the variant-calling analysis; only in very particular cases (RNA-seq, for instance, where reads flagged as duplicates by the software can be explained by other means) should you let them into your pipeline. If your coverage is very low, you can increase your calling power by merging multiple samples and running GATK on them together. In fact, that is exactly what the 1000 Genomes Project does with its very low-coverage whole-genome samples: the variant-calling power comes from sequencing lots of samples and then running GATK on many samples at a time.
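
As a sketch of what such a joint call looked like in the UnifiedGenotyper era (placeholder file names; later GATK versions replaced this with HaplotypeCaller/GVCF joint genotyping):

```python
# Sketch of a joint (multi-sample) call with the GATK 2/3-era
# UnifiedGenotyper; BAM and reference names are placeholders.
import subprocess

bams = ["sample1.bam", "sample2.bam", "sample3.bam"]   # low-coverage samples
cmd = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "UnifiedGenotyper",
       "-R", "reference.fasta", "-o", "joint_calls.vcf"]
for bam in bams:
    cmd += ["-I", bam]        # every sample contributes to the same call
subprocess.check_call(cmd)
```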


Your first sentence is not universally true. What if I'm doing targeted resequencing with very deep coverage, so that we expect to see duplicates by chance?


Then you would have to remove them, wouldn't you? Or are you saying that, since the coverage is so high, the influence of those duplicates on the final results would be so small that you would rather save the computing time? Sorry if I'm not getting your point; I just haven't come across any targeted resequencing experiment that would demand not removing duplicates. Why would you not want to do so? What would be the underlying reason to use duplicates in a variant-calling analysis?


Because what you would be removing then is signal, not noise. With very high-coverage targeted sequencing there is a much higher chance that the apparent duplicates are actually independent reads, and SNP callers should treat them as such when computing a score for the SNP.


aye, well said.


OK, I think I get the message. But to follow up on the discussion: if very deep coverage does indeed drastically increase the chance that reads sharing a position are independent (reads that would otherwise be considered duplicates), shouldn't the duplicate-removal algorithm take this into account? Or is it so hard to distinguish them that you simply decide not to remove any duplicates at all? What coverage threshold would make you feel confident about not removing duplicates? I can understand the underlying reasoning, but unfortunately I haven't come across any written statement of this issue I could point to. From what I see around samtools and Picard MarkDuplicates, "very deep coverage" seems to refer to something above 200x, but could you recommend any further reading on this matter?


I haven't seen much written on it, but we have hit that limit on some of our deep-coverage data, which is much, much deeper than 200x, by 10- to 100-fold.


So, just to wrap up the discussion: is it a performance decision, knowing that the benefit of removing duplicates would be diluted by such huge coverage, or is there a more fundamental reason, namely that the removal algorithms do not behave well when fed that many reads? From the previous comments I get the feeling that you are blaming the algorithms, which seem to underperform at very deep coverage, but I just want to make sure I'm getting the message right.


Maybe I'm underestimating the sophistication of the algorithms. Say my target region is 100 bases and I have 100 single-end reads of 100 bp covering that region; even then I expect non-PCR duplicates mapping to the same location. Now, what if I have 10,000 reads? Then I know the reads will be mostly "duplicates". Are you suggesting that current duplicate-marking software takes this into account and only flags bases with a depth greater than expected from the observed distribution of coverage?
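
A quick back-of-the-envelope sketch of that scenario, assuming uniform start positions and taking 10 plausible start positions on the ~100 bp target as an illustrative guess, shows how many independent single-end reads would be flagged as duplicates purely because they share a start coordinate:

```python
# Expected fraction of independent reads flagged as duplicates purely by
# chance, assuming uniform start positions over a small target.
def chance_duplicate_fraction(n_reads, n_start_positions):
    """1 - E[# distinct start positions] / n_reads."""
    expected_distinct = n_start_positions * (
        1 - (1 - 1 / n_start_positions) ** n_reads)
    return 1 - expected_distinct / n_reads

# 100 bp single-end reads on a ~100 bp target: only a handful of plausible
# start positions; 10 is an illustrative guess.
for n_reads in (100, 10_000):
    frac = chance_duplicate_fraction(n_reads, n_start_positions=10)
    print(f"{n_reads:>6} reads: ~{frac:.0%} would be marked as duplicates")
```

With these numbers roughly 90% of 100 reads, and essentially all of 10,000 reads, would be marked as duplicates even though every molecule is independent, which is the point being made above.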


Oh, not at all. I was just very curious about this issue, since I have never had to consider whether to remove duplicates or not: I simply choose to remove them (our experiments never go as deep as you describe anyway). I assumed these algorithms were coverage-independent, simply because I had never read anything stating that coverage is used to decide whether a read is considered a duplicate. I can understand that if you power an experiment to give you coverage in the tens of thousands you may expect duplicate reads to have very little effect, and removing them would probably require larger computational resources, but I was trying to get at the basis of your argument. Although I can't quite make up my mind about this, thanks for the discussion anyway; it has forced me to do some interesting reading on the matter.
