Question

Estimating cross-contamination in a set of BAMs

5.9 years ago

Pierre Lindenbaum 161k

Hi all,

I've received a set of BAM files; the variants were called with bcftools:

    ${bcftools_exe} mpileup -Ou -f "${REF}" \
            --bam-list "${bam_list}" \
            --regions-file "${bedfile}" \
            --annotate 'FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR'  \
            --redo-BAQ --adjust-MQ 50  --min-MQ 30  |\
    ${bcftools_exe} call \
            --ploidy GRCh37 \
            --multiallelic-caller \
            --variants-only -O z -o "output.vcf.gz"

However, I suspect cross-contamination between the samples, because many of the genotypes called HOM_REF contain a few ALT reads:

 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | Sample  | Type    | AD     | ADF   | ADR   | DP  | GT  | PL        | SP |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | 28D0609 | HOM_REF | 206,15 | 97,9  | 109,6 | 221 | 0/0 | 0,255,255 | 4  |
 | 37D1676 | HOM_REF | 154,10 | 89,5  | 65,5  | 164 | 0/0 | 0,229,255 | 1  |
 | 13D0720 | HET     | 170,59 | 92,27 | 78,32 | 229 | 0/1 | 134,0,255 | 5  |
 | 37D1631 | HOM_REF | 155,16 | 73,8  | 82,8  | 171 | 0/0 | 0,76,255  | 0  |
 | 57D1188 | HOM_REF | 85,0   | 39,0  | 46,0  | 85  | 0/0 | 0,255,255 | 0  |
 | 14D2313 | HOM_REF | 101,0  | 50,0  | 51,0  | 101 | 0/0 | 0,255,255 | 0  |
 | 24D2314 | HOM_REF | 48,0   | 18,0  | 30,0  | 48  | 0/0 | 0,144,255 | 0  |
 | 24D0430 | HOM_REF | 64,0   | 31,0  | 33,0  | 64  | 0/0 | 0,193,255 | 0  |
 | 18D0610 | HOM_REF | 55,0   | 29,0  | 26,0  | 55  | 0/0 | 0,166,255 | 0  |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
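
As a rough way to quantify the pattern in the table, here is a minimal Python sketch that computes the ALT read fraction from an AD value ("ref,alt", as in the table). The function name alt_fraction and the hard-coded rows are illustrative only and not part of any tool. Under a simple mixing assumption, a HOM_REF sample contaminated at fraction c by a donor that is heterozygous at the site would show an ALT fraction near c/2 (near c for a homozygous-ALT donor), ignoring sequencing error and coverage differences.

```python
def alt_fraction(ad: str) -> float:
    """Return ALT reads / total reads for a biallelic AD string such as '206,15'."""
    ref, alt = (int(x) for x in ad.split(","))
    total = ref + alt
    return alt / total if total else 0.0


# AD values copied from some of the HOM_REF rows of the table above
hom_ref_ad = {
    "28D0609": "206,15",
    "37D1676": "154,10",
    "37D1631": "155,16",
    "57D1188": "85,0",
}

for sample, ad in hom_ref_ad.items():
    print(f"{sample}\t{ad}\t{alt_fraction(ad):.3f}")
```

For the three HOM_REF rows that carry ALT reads, this gives fractions of roughly 6 to 9%, which is well above what sequencing error alone would usually explain but is still only one site.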

Some samples were sequenced in the same flowcell/lane.

How can I test the hypothesis of cross-contamination?

It was suggested that I use verifyBamID, but as far as I understand, it needs another VCF called with another method (?)

I also tried to use GATK ContEst, but I have no idea what I'm doing...

    java -jar GenomeAnalysisTK.jar -T ContEst \
            -I bam.list \
            -R human_g1k_v37.fasta \
            -o out.metrics \
            --genotypes my.vcf.gz \
            -pf 1000G_phase1.snps.high_confidence.b37.vcf \
            --min_genotype_depth 20 -L 22

    INFO  10:17:00,850 ContEst - Total sites:  31803838
    INFO  10:17:00,860 ContEst - Population informed sites:  310728
    INFO  10:17:00,861 ContEst - Non homozygous variant sites: 310728
    INFO  10:17:00,861 ContEst - Homozygous variant sites: 0
    INFO  10:17:00,861 ContEst - Passed coverage: 0
    INFO  10:17:00,861 ContEst - Results: 0

Any suggestions?

contamination bam • 4.3k views

It was also suggested that I look for rare variants: they should not be shared by unrelated samples.
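
A minimal Python sketch of that idea, assuming genotypes and AD values have already been extracted per site (for example from the VCF above). The data layout, the function name suspicious_hom_ref, and the min_alt threshold are all hypothetical; the example rows are taken from the table in the question purely for illustration.

```python
def suspicious_hom_ref(site: dict, min_alt: int = 3) -> list:
    """At one site, list HOM_REF samples with >= min_alt ALT reads,
    but only if another sample in the batch is called as a carrier."""
    carriers = [s for s, (gt, _ad) in site.items() if gt in ("0/1", "1/1")]
    if not carriers:
        return []
    flagged = []
    for sample, (gt, ad) in site.items():
        _ref, alt = (int(x) for x in ad.split(","))
        if gt == "0/0" and alt >= min_alt:
            flagged.append(sample)
    return sorted(flagged)


# sample -> (GT, AD) at a single site
site = {
    "13D0720": ("0/1", "170,59"),
    "28D0609": ("0/0", "206,15"),
    "37D1676": ("0/0", "154,10"),
    "57D1188": ("0/0", "85,0"),
}
print(suspicious_hom_ref(site))
```

Repeating this over many rare sites and counting how often the same carrier/HOM_REF pairs co-occur would hint at which samples leak into which, although it cannot by itself distinguish contamination from relatedness or mapping artifacts.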

Reply • 5.9 years ago by Pierre Lindenbaum 161k

Ideally, if the original samples are available, independent SNP genotyping would be the way to verify the identity of the samples.

Reply • 5.9 years ago by GenoMax 141k

verifyBamID does need a VCF, but it is a population reference VCF (1000 Genomes).

I've used it for detecting contamination in a targeted panel with alright results. See my question on their user group page.

    reference_vcf=/media/sf_BigShare/SCID/180213_TSCA_r1_sop_test/work/reference/180124-1000G_phase1.snps.high_confidence.hg19.intersected_w_scid.vcf
    ./verifyBamID --vcf "$reference_vcf" --bam "$bam" --out "$out" --maxDepth 1000 --precise --ignoreRG

Reply • 5.9 years ago by Robert Sicko ▴ 630

From Twitter:

Conpair is great: https://t.co/5t8vUf1jLe It won't tell you who is contaminating who, but estimates contamination rate even if you haven't sequenced the guilty one contaminating the others.
— Matthieu Foll (@m_foll) June 19, 2018

Reply • 5.9 years ago by Pierre Lindenbaum 161k

5.9 years ago

WouterDeCoster 47k

In my limited experience, an easy way to investigate cross-contamination is to look for unexpected read counts on the sex chromosomes. This only works if the contaminating sample is of the opposite sex.
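
A minimal Python sketch of this check, assuming per-chromosome mapped read counts in the four-column, tab-separated layout printed by samtools idxstats (name, length, mapped, unmapped). The function name y_fraction, the contig names "X" and "Y", and the example counts are assumptions for illustration. A female sample is not expected to give exactly zero because of pseudoautosomal and mis-mapped reads, so the value is best compared across samples of the same reported sex rather than against a fixed cutoff.

```python
def y_fraction(idxstats_text: str, x_name: str = "X", y_name: str = "Y") -> float:
    """Fraction of sex-chromosome mapped reads that fall on Y.

    Expects tab-separated lines: name, length, mapped, unmapped.
    """
    mapped = {}
    for line in idxstats_text.strip().splitlines():
        name, _length, n_mapped, _n_unmapped = line.split("\t")
        mapped[name] = int(n_mapped)
    x = mapped.get(x_name, 0)
    y = mapped.get(y_name, 0)
    return y / (x + y) if (x + y) else 0.0


# Made-up read counts for illustration only
example = (
    "1\t249250621\t5000000\t0\n"
    "X\t155270560\t2000000\t0\n"
    "Y\t59373566\t4000\t0\n"
    "*\t0\t0\t12345\n"
)
print(f"{y_fraction(example):.4f}")
```

With a targeted panel, this is only informative if the captured regions include enough of the X and Y chromosomes.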

score 5 · Accepted Answer · 2018-06-19

5.9 years ago

igor 13k

A really nice method is GATK CalculateContamination, which gives you a quantitative contamination estimate. It works if you have WGS/WES data (to provide sufficient coverage for enough SNPs). They provide a reference VCF for the human genome. It needs to be in a specific format, so it can be tricky to generate for other species, especially since population frequencies may not be known.

I've been using Bamkin, which is fairly simple and crude, but seems to work sufficiently well to detect sample mixups. I've processed hundreds of samples and it has always been clear when some of them are problematic, at least when you have multiple samples from supposedly the same individual. The nice thing is that it also works with smaller targeted panels, ChIP-seq, or RNA-seq. You can also tell whether the contamination comes from other samples in the same batch.

I was going to suggest CalculateContamination. For more detail about the method see also section VI of mutect.pdf.

Reply • 5.9 years ago by dariober 14k

I didn't realize there is a manual for CalculateContamination. That's helpful.

Reply • 5.9 years ago by igor 13k

I ran into an error when getting my ExAC variant VCF ready for GATK4 CalculateContamination: "Key AC_Adj0_Filter found in VariantContext field FILTER at chr"

If someone is using ExAC VCFs for common variants, check Sheila's last post on: https://gatkforums.broadinstitute.org/gatk/discussion/8181/gatk-selectvariants-on-vcf

The program works fine. I spiked 2.6% of reads from another sample into my FASTQ and GATK detected 3.5% contamination. Thanks for mentioning BamKin, it looks really straightforward!

Reply • 5.5 years ago by jan.rehker ▴ 10