Question

Workflow For Using Unified Genotyper For Bioscope Aligned Solid Targeted Capture Sequence

0

Entering edit mode

10.9 years ago

Occam ▴ 410

My group has generated Solid 4 targeted capture sequence with initial alignment using Solid's in-house Bioscope tool. We are trying to use GATK for variant calling (for consensus comparisons) but keep running into issues. After troubleshooting several issues, it seems that the fundamental problem is difference in sorting and labeling of chromosomes between Bioscope's BAMs and GATK's reference (although there may be other issues upstream of this one too).

The GATK developer's response to an earlier query was to realign the reads using GATK's reference and thus avoid all downstream problems; however, preliminary efforts twd this don't look promising as Bioscope doesn't like the GATK reference. This series of difficulties has raised the question, "Surely, someone else has performed GATK variant calling on Solid sequenced and aligned data?" So, we are curious if any one could point us a workflow for this process?

In the event that you do not know of a comprehensive workflow for the process, any ad hoc advice would also be helpful. Right now we considering two kludgey solutions to workaround. First, deleting all reads in our BAMs mapping to chrM, chrX and chrY, as our capture library included no sequence from these chr's and they are the ones ordered differently btwn our BAMs and the GATK reference. Second, reordering the bams using the Picard tool reordersam. Any thoughts on the viability of either of these approaches? Are you seeing anything we are missing?

gatk solid bam • 3.0k views

ADD COMMENT • link updated 10.9 years ago by Jorge Amigo 14k • written 10.9 years ago by Occam ▴ 410

0

Entering edit mode

cross-posted on http://gatkforums.broadinstitute.org/discussion/2689

ADD REPLY • link 10.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

although I think that posting this question on the GATK forums would obtain the best or at least the most proper answer, posting it here or on SeqAnswers allows sharing experiences and finding out how other SOLiD users solved (or tried to solve) this type of common issues.

ADD REPLY • link 10.9 years ago by Jorge Amigo 14k

score 1 · Answer 1 · 2013-05-21

1

Entering edit mode

10.9 years ago

Ashutosh Pandey 12k

I have performed analysis where I aligned SOLiD reads using lifescope and then fed the BAM files as input to GATK. Make sure you modify your header as BAM files produced by Lifescope have meaningless header. They don't contain the information as it is required to be but this is not that important if you are not dealing with multiple libraries. I think reordering the bam files using Picard is the most preferred option. Check this link if you haven't: http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept . The chromosomes in BAM should have "chr" prefix attached and sorted in the official reference canonical order and your reference genome fasta file should have chromosomes sorted in the same way. Even I struggled for a bit doing what you are trying to do now but it was not that hard either.

ADD COMMENT • link 10.9 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Many thanks for the response. Ideally, we would like stick with the Picard BAM reordering approach. To be clear, are you suggesting that we need to clean up the BAM headers after reordering w/ Picard? If so, could you offer suggestions as to what specific header changes need to be made? Sorry for the naive questions. Any help is appreciated.

ADD REPLY • link 10.9 years ago by Occam ▴ 410

0

Entering edit mode

If you have a single bam file that contains reads from a single library (which automatically means that you have a single sample) then I dont think GATK will create any problem as long as you follow the guidelines for the input file for the GATK. If you are sure that the error is due to difference in the order of the chromosome then I dont think focusing on header will help in anyway. Read this page carefully and it will tell you how to modify the header of a bam file that contains reads from different sample, libraries, or readgroups Link:http://www.broadinstitute.org/gatk/guide/topic?name=best-practices. But I think you should focus on resorting. Also, you will have to tell me more about your data in order to get a specific answers.

ADD REPLY • link 10.9 years ago by Ashutosh Pandey 12k

score 1 · Answer 2 · 2013-05-21

We also have SOLID's and use GATK. I would also use human reference from GATK to avoid any of the wrong reference order or wrong reference name errors during downstream processing. This works here.

I use lifescope shell jobs to start mapping on lifescope and I set the GATK human reference like this:

set reference /data_disk/referenceData/external/human_GRCh37_gatk/GRCh37_gatk.fasta

Another issue is that single fragment targeted sequencing on a solid generates a lot of duplicates (often 80% here ). These are pcr artifact and duplicates by chance because of the short read length of the solid. You can remove these, but the coverage and probably the no genotype area of your design will seriously drop. Or you can just leave the duplicates unmarked, coverage will stay high but GATK will do a slightly less good job at calling SNP's and reference because of the duplicates.

Indel recalibration works. I couldn't get quality score recalibration to work here on solid data, so I just skip it. You might need to fix the bam headers (sample and library name) depending on what to sequencing operators set on the sequencing run and got set in the XSQ. Or this might something wrong with lifesocpe. I never got to the bottom of it.

score 0 · Answer 3 · 2013-05-22

we wanted to make sure that the reference we use would be the most appropriate one, and for that reason we built it ourselves from each chromosome's fasta. once we knew how this reference was exactly build, we only had to reorder the GATK's bundle files. of course the easiest way would be to use GATK suggested reference (GRCh37) when mapping with bioscope/lifescope, and then use all the GATK's bundle files straight away. if that is not your case, and you are using a particular reference that you want to keep, you'll definitely have to edit the bundle files in some similar way as we did. for instance, if you've used bioscope's default hg18 ("only" chr1-22, X, Y and MT) to generate your bam files you'll have to either remap them against any of the GATK references or either edit the bundle files by placing chrMT's entries at the end as we did.

regarding bam headers editing, I have to agree that all the bam files we've generated with bioscope and lifescope have always lacked from the strict compatibility that GATK and many other tools require. for that reason we always have to edit their headers to record the sample and group name appropriately, specially when merging different lane's bam files into a single one. apart from this issue, the bam files that bioscope/lifescope generates are perfectly compatible with the analysis pipeline described at GATK best practice, and that has been detailed in the SeqAnswers' wiki under "exome analysis" for a long time. in fact that's precisely the one we use right after bioscope/lifescope ends, with that mentioned header editing and some initial filtering (duplicates removal and mapping quality = 0) to reduce file sizes and to improve the final variant calling.