Question: Estimating cross-contamination in a set of BAMs
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

Hi all,

I've received a set of BAM files; the variants were called with bcftools:

    ${bcftools_exe} mpileup -Ou -f "${REF}" \
            --bam-list "${bam_list}" \
            --regions-file "${bedfile}" \
            --annotate 'FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR'  \
            --redo-BAQ --adjust-MQ 50  --min-MQ 30  |\
    ${bcftools_exe} call \
            --ploidy GRCh37 \
            --multiallelic-caller \
            --variants-only -O z -o "output.vcf.gz"

but I suspect cross-contamination between the samples, because many of the genotypes called as HOM_REF contain a few ALT reads:

 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | Sample  | Type    | AD     | ADF   | ADR   | DP  | GT  | PL        | SP |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | 28D0609 | HOM_REF | 206,15 | 97,9  | 109,6 | 221 | 0/0 | 0,255,255 | 4  |
 | 37D1676 | HOM_REF | 154,10 | 89,5  | 65,5  | 164 | 0/0 | 0,229,255 | 1  |
 | 13D0720 | HET     | 170,59 | 92,27 | 78,32 | 229 | 0/1 | 134,0,255 | 5  |
 | 37D1631 | HOM_REF | 155,16 | 73,8  | 82,8  | 171 | 0/0 | 0,76,255  | 0  |
 | 57D1188 | HOM_REF | 85,0   | 39,0  | 46,0  | 85  | 0/0 | 0,255,255 | 0  |
 | 14D2313 | HOM_REF | 101,0  | 50,0  | 51,0  | 101 | 0/0 | 0,255,255 | 0  |
 | 24D2314 | HOM_REF | 48,0   | 18,0  | 30,0  | 48  | 0/0 | 0,144,255 | 0  |
 | 24D0430 | HOM_REF | 64,0   | 31,0  | 33,0  | 64  | 0/0 | 0,193,255 | 0  |
 | 18D0610 | HOM_REF | 55,0   | 29,0  | 26,0  | 55  | 0/0 | 0,166,255 | 0  |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
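
To put a number on "a few ALT", the ALT read fraction can be computed from the AD column. This is only a rough sketch: the function name `alt_fraction` and the two-column input layout (sample, AD) are assumptions made for illustration, and the values are copied from the table above.

```shell
# Hypothetical helper: reads "SAMPLE AD" lines (AD = ref,alt[,alt2...])
# and prints the sample name and the fraction of reads supporting ALT.
alt_fraction() {
    awk '{
        n = split($2, ad, ",")
        ref = ad[1]; alt = 0
        for (i = 2; i <= n; i++) alt += ad[i]   # sum all ALT depths
        total = ref + alt
        printf "%s\t%.4f\n", $1, (total > 0 ? alt / total : 0)
    }'
}

# HOM_REF calls from the table that carry ALT reads, plus one clean sample.
alt_fraction <<'EOF'
28D0609 206,15
37D1676 154,10
37D1631 155,16
57D1188 85,0
EOF
```

For the three suspicious HOM_REF calls this gives ALT fractions of roughly 6% to 9%, well above what sequencing error alone would usually explain, while the clean sample stays at 0.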

Some samples were sequenced in the same flowcell/lane.

How can I validate the hypothesis of cross-contamination?

It was suggested that I use verifyBamID, but as far as I understand, it needs another VCF called with another method (?)

I also tried to use GATK ContEst, but I have no idea what I'm doing...

    java -jar GenomeAnalysisTK.jar -T ContEst \
            -I bam.list \
            -R human_g1k_v37.fasta \
            -o out.metrics \
            --genotypes my.vcf.gz \
            -pf 1000G_phase1.snps.high_confidence.b37.vcf \
            --min_genotype_depth 20 -L 22


INFO  10:17:00,850 ContEst - Total sites:  31803838 
INFO  10:17:00,860 ContEst - Population informed sites:  310728 
INFO  10:17:00,861 ContEst - Non homozygous variant sites: 310728 
INFO  10:17:00,861 ContEst - Homozygous variant sites: 0 
INFO  10:17:00,861 ContEst - Passed coverage: 0 
INFO  10:17:00,861 ContEst - Results: 0

Any suggestions?

modified 5 months ago by igor • written 5 months ago by Pierre Lindenbaum

It was also suggested that I look for rare variants: they should not be shared between unrelated samples.
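
One possible way to try this, as a sketch only: treat a variant carried by exactly one sample in the cohort as a stand-in for "rare", then list the HOM_REF samples that still show ALT reads at that site. The function name `private_variant_leakage` and the four-column input layout (site, sample, GT, AD) are assumptions for illustration.

```shell
# Hypothetical helper: input is tab-separated "SITE SAMPLE GT AD" lines.
# For sites where exactly one sample has a non-reference genotype, print the
# HOM_REF samples that nevertheless have ALT reads, with their ALT depth.
private_variant_leakage() {
    awk -F'\t' '
    {
        split($4, ad, ",")                 # AD = ref,alt[,alt2...]
        alt = ad[2] + 0
        if ($3 == "0/0") {
            if (alt > 0) leak[$1] = leak[$1] " " $2 "(" alt ")"
        } else if ($3 != "./." && $3 != ".") {
            carriers[$1]++                 # sample carries the variant
        }
    }
    END {
        for (s in carriers)
            if (carriers[s] == 1 && (s in leak))
                print s "\t" substr(leak[s], 2)
    }' | sort
}
```

The input could be produced with something like `bcftools query -f '[%CHROM:%POS\t%SAMPLE\t%GT\t%AD\n]' output.vcf.gz | private_variant_leakage`. If the same pairs of samples keep appearing together (carrier and "leaky" HOM_REF), that would point to a specific contamination source.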

written 5 months ago by Pierre Lindenbaum

Ideally, if the original samples are available, independent SNP genotyping would be the way to verify the identity of the samples.

modified 5 months ago • written 5 months ago by genomax

verifyBamID does need a VCF, but it is a population reference VCF (1000 Genomes), not one called from your own samples.

I've used it for detecting contamination in a targeted panel with alright results. See my question on their user group page.

    reference_vcf=/media/sf_BigShare/SCID/180213_TSCA_r1_sop_test/work/reference/180124-1000G_phase1.snps.high_confidence.hg19.intersected_w_scid.vcf
    ./verifyBamID --vcf "$reference_vcf" --bam "$bam" --out "$out" --maxDepth 1000 --precise --ignoreRG
written 5 months ago by Robert Sicko

from twitter:

written 5 months ago by Pierre Lindenbaum
5 months ago by
United States
igor wrote:

A really nice method is GATK CalculateContamination, which gives you a numerical contamination estimate. It works if you have WGS/WES data (which provide sufficient coverage at enough SNPs). They provide a reference VCF for the human genome. The VCF needs to be in a specific format, so it can be tricky to generate for other species, especially since population frequencies may not be known.
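
As a rough sketch of what that looks like in GATK4: pileups are first summarized at common population SNPs with GetPileupSummaries, then passed to CalculateContamination. The wrapper name `estimate_contamination`, the `GATK` variable and all file names are placeholders, and the common-variants VCF is assumed to be biallelic with population allele frequencies (AF) in the INFO field.

```shell
# Hypothetical wrapper around the two GATK4 steps.
# $1 = BAM, $2 = common biallelic SNP VCF with AF, $3 = output prefix.
# GATK can be overridden, e.g. GATK=/path/to/gatk (default: gatk on PATH).
estimate_contamination() {
    bam="$1"; common_vcf="$2"; prefix="$3"

    # Step 1: tabulate REF/ALT read counts at the common SNP sites.
    "${GATK:-gatk}" GetPileupSummaries \
        -I "${bam}" \
        -V "${common_vcf}" \
        -L "${common_vcf}" \
        -O "${prefix}.pileups.table" || return 1

    # Step 2: estimate the contaminating fraction from those counts.
    "${GATK:-gatk}" CalculateContamination \
        -I "${prefix}.pileups.table" \
        -O "${prefix}.contamination.table"
}
```

A call such as `estimate_contamination sample.bam common_snps.vcf.gz sample` would then be run once per BAM; the resulting contamination table holds the estimated fraction and its error for that sample.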

I've been using Bamkin, which is fairly simple and crude, but seems to work sufficiently well to detect sample mixups. I processed hundreds of samples and it's always been clear when some of them are problematic, at least when you have multiple samples from supposedly the same individual. The nice thing is it will work with smaller targeted panels or ChIP-seq or RNA-seq. You can also tell if the contamination is coming from other samples in the same batch of samples.

modified 5 months ago • written 5 months ago by igor

I was going to suggest CalculateContamination. For more detail about the method see also section VI of mutect.pdf.

modified 5 months ago • written 5 months ago by dariober

I didn't realize there is a manual for CalculateContamination. That's helpful.

written 5 months ago by igor

I ran into an error when getting my ExAC variant VCF ready for GATK4 CalculateContamination: "Key AC_Adj0_Filter found in VariantContext field FILTER at chr"

If anyone is using ExAC VCFs for common variants, check Sheila's last post at: https://gatkforums.broadinstitute.org/gatk/discussion/8181/gatk-selectvariants-on-vcf

The program works fine. I spiked 2.6% of reads from another sample into my FASTQ and GATK detected 3.5% contamination. Thanks for mentioning Bamkin, it looks really straightforward!

written 7 weeks ago by jan.rehker
5 months ago by
Belgium
WouterDeCoster wrote:

In my limited experience, an easy way to investigate cross-contamination is to look for unexpected deviations in the fraction of reads aligning to the sex chromosomes. This only works if the samples are contaminated by samples of the opposite sex.
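
A minimal sketch of that check, assuming the per-chromosome counts come from `samtools idxstats` (columns: reference name, length, mapped reads, unmapped reads). The function name `sex_chrom_fractions` is made up for illustration, and both `X`/`Y` and `chrX`/`chrY` naming are accepted.

```shell
# Hypothetical helper: reads samtools idxstats output on stdin and prints
# the fraction of mapped reads that fall on chrX and on chrY.
sex_chrom_fractions() {
    awk -F'\t' '
    { total += $3 }                               # all mapped reads
    $1 == "X" || $1 == "chrX" { x += $3 }
    $1 == "Y" || $1 == "chrY" { y += $3 }
    END {
        if (total > 0)
            printf "X=%.4f\tY=%.4f\n", x / total, y / total
    }'
}
```

Usage would be along the lines of `samtools idxstats sample.bam | sex_chrom_fractions` for each indexed BAM. A sample recorded as female with a clearly non-zero Y fraction compared with the other females in the batch would be a candidate for contamination by a male sample. For targeted panels the result depends on whether the capture includes any X/Y regions.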

written 5 months ago by WouterDeCoster