Question: Estimating cross contamination in a set of BAMS
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum (112k) wrote:

Hi all,

I've received a set of BAM files; the variants were called with bcftools:

    ${bcftools_exe} mpileup -Ou -f "${REF}" \
            --bam-list "${bam_list}" \
            --regions-file "${bedfile}" \
            --annotate 'FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR'  \
            --redo-BAQ --adjust-MQ 50  --min-MQ 30  |\
    ${bcftools_exe} call \
            --ploidy GRCh37 \
            --multiallelic-caller \
            --variants-only -O z -o "output.vcf.gz"

However, I suspect cross-contamination between the samples, because many of the HOM_REF genotypes carry a few ALT reads.

For example, some genotypes called as HOM_REF contain a few ALT reads:

 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | Sample  | Type    | AD     | ADF   | ADR   | DP  | GT  | PL        | SP |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | 28D0609 | HOM_REF | 206,15 | 97,9  | 109,6 | 221 | 0/0 | 0,255,255 | 4  |
 | 37D1676 | HOM_REF | 154,10 | 89,5  | 65,5  | 164 | 0/0 | 0,229,255 | 1  |
 | 13D0720 | HET     | 170,59 | 92,27 | 78,32 | 229 | 0/1 | 134,0,255 | 5  |
 | 37D1631 | HOM_REF | 155,16 | 73,8  | 82,8  | 171 | 0/0 | 0,76,255  | 0  |
 | 57D1188 | HOM_REF | 85,0   | 39,0  | 46,0  | 85  | 0/0 | 0,255,255 | 0  |
 | 14D2313 | HOM_REF | 101,0  | 50,0  | 51,0  | 101 | 0/0 | 0,255,255 | 0  |
 | 24D2314 | HOM_REF | 48,0   | 18,0  | 30,0  | 48  | 0/0 | 0,144,255 | 0  |
 | 24D0430 | HOM_REF | 64,0   | 31,0  | 33,0  | 64  | 0/0 | 0,193,255 | 0  |
 | 18D0610 | HOM_REF | 55,0   | 29,0  | 26,0  | 55  | 0/0 | 0,166,255 | 0  |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
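
As a quick check on the table above, the ALT read fraction of each HOM_REF genotype can be computed from the AD field. This is only a rough sketch: the function name `alt_fraction` is made up for illustration, it looks at the first ALT allele only, and it assumes tab-separated SAMPLE, GT, AD input such as `bcftools query -f '[%SAMPLE\t%GT\t%AD\n]' output.vcf.gz` would produce.

```shell
# Sketch: ALT read fraction for genotypes called HOM_REF (0/0).
# Input (stdin): tab-separated SAMPLE, GT, AD ("ref,alt").
# Only the first ALT allele is considered (assumption for simplicity).
alt_fraction() {
    awk -F'\t' '
        $2 == "0/0" {
            split($3, ad, ",")            # ad[1] = REF reads, ad[2] = ALT reads
            total = ad[1] + ad[2]
            if (total > 0 && ad[2] + 0 > 0)
                printf "%s\t%d\t%d\t%.3f\n", $1, ad[1], ad[2], ad[2] / total
        }'
}

# Rows copied from the table above (SAMPLE, GT, AD).
printf '%s\t%s\t%s\n' \
    28D0609 0/0 206,15 \
    37D1676 0/0 154,10 \
    13D0720 0/1 170,59 \
    37D1631 0/0 155,16 \
    57D1188 0/0 85,0 | alt_fraction
```

For these rows the three affected HOM_REF samples come out at roughly 6 to 9% ALT reads. A single site like this is suggestive rather than conclusive, so the same summary over many sites would be more informative.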

Some samples were sequenced in the same flowcell/lane.

How can I test the cross-contamination hypothesis?

I was advised to use verifyBamID but, as far as I understand, it needs another VCF called with another method (?)

I also tried GATK ContEst, but I have no idea what I'm doing...

    java -jar GenomeAnalysisTK.jar -T ContEst -I bam.list -R human_g1k_v37.fasta -o out.metrics \
            --genotypes my.vcf.gz -pf 1000G_phase1.snps.high_confidence.b37.vcf \
            --min_genotype_depth 20 -L 22


    INFO  10:17:00,850 ContEst - Total sites:  31803838
    INFO  10:17:00,860 ContEst - Population informed sites:  310728
    INFO  10:17:00,861 ContEst - Non homozygous variant sites: 310728
    INFO  10:17:00,861 ContEst - Homozygous variant sites: 0
    INFO  10:17:00,861 ContEst - Passed coverage: 0
    INFO  10:17:00,861 ContEst - Results: 0

Any suggestions?


It was also suggested that I look for rare variants: they should not be found in unrelated samples.
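
One possible way to sketch this idea within the cohort itself: treat a site as "private" when exactly one sample has a called ALT genotype, then list the HOM_REF samples that still show ALT reads there. Using "private in this cohort" as a stand-in for "rare" is an assumption, the function name `private_alt_leak` and the site label `site1` are made up for illustration, and the input is assumed to be tab-separated SITE, SAMPLE, GT, AD lines such as `bcftools query -f '[%CHROM:%POS\t%SAMPLE\t%GT\t%AD\n]' output.vcf.gz` would produce.

```shell
# Sketch: at sites with exactly one called ALT carrier, report HOM_REF
# samples that nevertheless have ALT reads (first ALT allele only).
# Input (stdin): tab-separated SITE, SAMPLE, GT, AD ("ref,alt").
private_alt_leak() {
    awk -F'\t' '
        {
            split($4, ad, ",")
            if ($3 ~ /[1-9]/) {                          # genotype carries an ALT allele
                ncarrier[$1]++
                carrier[$1] = $2
            } else if ($3 !~ /[1-9.]/ && ad[2] + 0 > 0) { # all-REF genotype with ALT reads
                leak[$1] = leak[$1] " " $2 "(" ad[2] ")"
            }
        }
        END {
            for (site in leak)
                if (ncarrier[site] == 1)
                    printf "%s\tcarrier=%s\tHOM_REF_with_ALT:%s\n", site, carrier[site], leak[site]
        }'
}

# Rows taken from the table in the question; "site1" is a placeholder label.
printf 'site1\t%s\t%s\t%s\n' \
    13D0720 0/1 170,59 \
    28D0609 0/0 206,15 \
    37D1631 0/0 155,16 \
    57D1188 0/0 85,0 | private_alt_leak
```

If the same carrier keeps showing up as the apparent source for the same HOM_REF samples across many such sites, that would point towards contamination between those particular samples rather than random sequencing error.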

written 3 months ago by Pierre Lindenbaum (112k)

Ideally, if the original samples are available, independent SNP genotyping would be the way to verify the identity of the samples.

modified 3 months ago • written 3 months ago by genomax (56k)

verifyBamID does need a VCF, but it is a population reference VCF (1000 Genomes).

I've used it to detect contamination in a targeted panel with acceptable results. See my question on their user group page.

    reference_vcf=/media/sf_BigShare/SCID/180213_TSCA_r1_sop_test/work/reference/180124-1000G_phase1.snps.high_confidence.hg19.intersected_w_scid.vcf
    ./verifyBamID --vcf "$reference_vcf" --bam "$bam" --out "$out" --maxDepth 1000 --precise --ignoreRG
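
To read the result, a small filter over the `.selfSM` output can flag samples by their FREEMIX value. This is a sketch under assumptions: the function name `flag_freemix` is made up, the column is located by its header name, the 0.02 default follows the level the verifyBamID documentation suggests for a closer look, and the rows below are invented numbers, not real results.

```shell
# Sketch: list samples whose FREEMIX exceeds a threshold.
# Input (stdin): a verifyBamID .selfSM table (tab-separated, one header line).
# $1 = threshold (default 0.02).
flag_freemix() {
    awk -F'\t' -v thr="${1:-0.02}" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FREEMIX") col = i; next }
        col && ($col + 0) > (thr + 0) { printf "%s\tFREEMIX=%s\n", $1, $col }'
}

# Hypothetical example rows (header truncated to the first seven columns).
{
    printf '#SEQ_ID\tRG\tCHIP_ID\t#SNPS\t#READS\tAVG_DP\tFREEMIX\n'
    printf 'sampleA\tALL\tNA\t5000\t600000\t120.0\t0.06500\n'
    printf 'sampleB\tALL\tNA\t5000\t580000\t116.0\t0.00100\n'
} | flag_freemix 0.02
```

With a real run this would be used as `flag_freemix < "$out.selfSM"`. On a small targeted panel the estimate rests on few SNPs, so borderline values probably deserve some caution.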
written 3 months ago by Robert Sicko (540)

3 months ago by
United States
igor (6.6k) wrote:

A really nice method is GATK CalculateContamination, which gives you a quantitative contamination estimate. It works if you have WGS/WES data (to provide sufficient coverage for enough SNPs). They provide a reference VCF for the human genome. The VCF needs to be in a specific format, so it can be tricky to generate for other species, especially since population frequencies may not be known.

I've been using Bamkin, which is fairly simple and crude, but seems to work well enough to detect sample mix-ups. I have processed hundreds of samples and it has always been clear when some of them are problematic, at least when you have multiple samples from supposedly the same individual. The nice thing is that it also works with smaller targeted panels, ChIP-seq or RNA-seq. You can also tell whether the contamination is coming from other samples in the same batch.
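
For reference, the GATK4 route is usually two steps per BAM: GetPileupSummaries at common germline sites, then CalculateContamination on the resulting table. The sketch below is a dry run that only prints the commands. It assumes GATK4 is available as `gatk`; `contamination_cmds`, `sample1.bam` and `common_snps.vcf.gz` are placeholder names, the latter standing for a population VCF with allele frequencies in INFO.

```shell
# Sketch (dry run): print the two GATK4 commands for one BAM.
# $1 = BAM file, $2 = population VCF of common SNPs (hypothetical file names).
contamination_cmds() {
    bam=$1
    common_vcf=$2
    prefix=$(basename "$bam" .bam)
    # Step 1: summarise read support at common germline sites.
    echo gatk GetPileupSummaries -I "$bam" -V "$common_vcf" -L "$common_vcf" \
        -O "${prefix}.pileups.table"
    # Step 2: estimate the contaminating fraction from those pileups.
    echo gatk CalculateContamination -I "${prefix}.pileups.table" \
        -O "${prefix}.contamination.table"
}

contamination_cmds sample1.bam common_snps.vcf.gz
```

Removing the two `echo` words would run the tools for real. To cover a whole list, something like `while read -r bam; do contamination_cmds "$bam" common_snps.vcf.gz; done < bam.list` should work, assuming one BAM path per line.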


I was going to suggest CalculateContamination. For more detail about the method see also section VI of mutect.pdf.

modified 3 months ago • written 3 months ago by dariober (9.4k)

I didn't realize there is a manual for CalculateContamination. That's helpful.

written 3 months ago by igor (6.6k)
3 months ago by
Belgium
WouterDeCoster (32k) wrote:

In my limited experience, an easy way to investigate cross-contamination is to look for unexpected deviations in alignment to the sex chromosomes. This only works if the contaminating sample is of the opposite sex.
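
A rough way to put a number on this is the ratio of mapped-read density on Y versus X, taken from `samtools idxstats sample.bam` (columns: contig, length, mapped reads, unmapped reads). The function name `xy_ratio` and the counts below are made up for illustration, and the sketch assumes the sex chromosomes are named `X`/`Y` or `chrX`/`chrY`.

```shell
# Sketch: Y/X ratio of mapped reads per base.
# Input (stdin): output of `samtools idxstats sample.bam`.
xy_ratio() {
    awk -F'\t' '
        $1 == "X" || $1 == "chrX" { x = $3 / $2 }   # mapped reads per base on X
        $1 == "Y" || $1 == "chrY" { y = $3 / $2 }   # mapped reads per base on Y
        END {
            if (x > 0) printf "Y/X_density_ratio\t%.3f\n", y / x
            else print "Y/X_density_ratio\tNA"
        }'
}

# Hypothetical idxstats lines (contig lengths are GRCh37 values, counts invented).
printf '%s\t%s\t%s\t%s\n' \
    22 51304566 400000 0 \
    X 155270560 1552705 0 \
    Y 59373566 5937 0 | xy_ratio
```

There is no universal cut-off: the ratio is best compared across the cohort, where a nominally female sample sitting well above the other females, or a male sample falling between the two groups, would be the ones to look at. With a targeted capture this only helps if the panel covers regions on X and Y.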
