Workflow For Using Unified Genotyper For Bioscope Aligned Solid Targeted Capture Sequence
3
0
Entering edit mode
8.4 years ago
Occam ▴ 390

My group has generated Solid 4 targeted capture sequence with initial alignment using Solid's in-house Bioscope tool. We are trying to use GATK for variant calling (for consensus comparisons) but keep running into issues. After troubleshooting several issues, it seems that the fundamental problem is difference in sorting and labeling of chromosomes between Bioscope's BAMs and GATK's reference (although there may be other issues upstream of this one too).

The GATK developer's response to an earlier query was to realign the reads using GATK's reference and thus avoid all downstream problems; however, preliminary efforts twd this don't look promising as Bioscope doesn't like the GATK reference. This series of difficulties has raised the question, "Surely, someone else has performed GATK variant calling on Solid sequenced and aligned data?" So, we are curious if any one could point us a workflow for this process?

In the event that you do not know of a comprehensive workflow for the process, any ad hoc advice would also be helpful. Right now we considering two kludgey solutions to workaround. First, deleting all reads in our BAMs mapping to chrM, chrX and chrY, as our capture library included no sequence from these chr's and they are the ones ordered differently btwn our BAMs and the GATK reference. Second, reordering the bams using the Picard tool reordersam. Any thoughts on the viability of either of these approaches? Are you seeing anything we are missing?

gatk solid bam • 2.3k views
0
Entering edit mode
0
Entering edit mode

although I think that posting this question on the GATK forums would obtain the best or at least the most proper answer, posting it here or on SeqAnswers allows sharing experiences and finding out how other SOLiD users solved (or tried to solve) this type of common issues.

1
Entering edit mode
8.4 years ago

I have performed analysis where I aligned SOLiD reads using lifescope and then fed the BAM files as input to GATK. Make sure you modify your header as BAM files produced by Lifescope have meaningless header. They don't contain the information as it is required to be but this is not that important if you are not dealing with multiple libraries. I think reordering the bam files using Picard is the most preferred option. Check this link if you haven't: http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept . The chromosomes in BAM should have "chr" prefix attached and sorted in the official reference canonical order and your reference genome fasta file should have chromosomes sorted in the same way. Even I struggled for a bit doing what you are trying to do now but it was not that hard either.

0
Entering edit mode

Many thanks for the response. Ideally, we would like stick with the Picard BAM reordering approach. To be clear, are you suggesting that we need to clean up the BAM headers after reordering w/ Picard? If so, could you offer suggestions as to what specific header changes need to be made? Sorry for the naive questions. Any help is appreciated.

0
Entering edit mode

1
Entering edit mode
8.4 years ago
William ★ 4.9k

We also have SOLID's and use GATK. I would also use human reference from GATK to avoid any of the wrong reference order or wrong reference name errors during downstream processing. This works here.

I use lifescope shell jobs to start mapping on lifescope and I set the GATK human reference like this:

set reference /data_disk/referenceData/external/human_GRCh37_gatk/GRCh37_gatk.fasta

Another issue is that single fragment targeted sequencing on a solid generates a lot of duplicates (often 80% here ). These are pcr artifact and duplicates by chance because of the short read length of the solid. You can remove these, but the coverage and probably the no genotype area of your design will seriously drop. Or you can just leave the duplicates unmarked, coverage will stay high but GATK will do a slightly less good job at calling SNP's and reference because of the duplicates.

Indel recalibration works. I couldn't get quality score recalibration to work here on solid data, so I just skip it. You might need to fix the bam headers (sample and library name) depending on what to sequencing operators set on the sequencing run and got set in the XSQ. Or this might something wrong with lifesocpe. I never got to the bottom of it.

0
Entering edit mode

Very helpful. Do you know if this works with Bioscope as well as Lifescope?

0
Entering edit mode

I don't know. I used the latest version of lifescope. 2.5 something. And we use data from the 5500. An alternative route that is possible is to use an old version of bwa (0.5.9) that still supports colorspace. It's almost as good as lifescope but you might want to trim the reads for bad quality in the tails. Since bwa doesn't do as much clipping as lifescope. BWA can't do pairing of solid read's so this is only an option for single fragment data, or you need to implement a tool yourself for pairing the reads.

0
Entering edit mode
8.4 years ago

we wanted to make sure that the reference we use would be the most appropriate one, and for that reason we built it ourselves from each chromosome's fasta. once we knew how this reference was exactly build, we only had to reorder the GATK's bundle files. of course the easiest way would be to use GATK suggested reference (GRCh37) when mapping with bioscope/lifescope, and then use all the GATK's bundle files straight away. if that is not your case, and you are using a particular reference that you want to keep, you'll definitely have to edit the bundle files in some similar way as we did. for instance, if you've used bioscope's default hg18 ("only" chr1-22, X, Y and MT) to generate your bam files you'll have to either remap them against any of the GATK references or either edit the bundle files by placing chrMT's entries at the end as we did.

regarding bam headers editing, I have to agree that all the bam files we've generated with bioscope and lifescope have always lacked from the strict compatibility that GATK and many other tools require. for that reason we always have to edit their headers to record the sample and group name appropriately, specially when merging different lane's bam files into a single one. apart from this issue, the bam files that bioscope/lifescope generates are perfectly compatible with the analysis pipeline described at GATK best practice, and that has been detailed in the SeqAnswers' wiki under "exome analysis" for a long time. in fact that's precisely the one we use right after bioscope/lifescope ends, with that mentioned header editing and some initial filtering (duplicates removal and mapping quality = 0) to reduce file sizes and to improve the final variant calling.