Question

Removing contamination with SNP tools

2.4 years ago

alebaars_98 ▴ 10

Hello everyone,

I'm currently working on a ChIP-seq dataset in which I will analyze chromatin marks on transposons and genes in a fungus. Unfortunately, my data are contaminated with reads from a closely related species. Because the two genomes are so similar, removing the contamination based on alignment quality is very unlikely to work: the differences mostly consist of one or a few nucleotides. With that in mind, we realized that these positions could be treated as single nucleotide polymorphisms (SNPs). The problem is that, while there are many good tools for finding SNPs, I cannot find one that removes the reads containing them from my BAM file. Does anyone know of a tool that could do this? Alternatively, is there another way to tackle this problem?

Thanks in advance for any help here. It's been nagging me for a while.

filtering SNP BAM ChIPseq contamination • 1.2k views

score 1 · Answer 1 · 2021-12-10

2.4 years ago

Pierre Lindenbaum 161k

I quickly wrote http://lindenb.github.io/jvarkit/Biostar9501110.html

usage:

java -jar dist/biostar9501110.jar --inverse -V index.vcf.gz input.bam

It should work for simple SNVs; I didn't test it on indels. Tell me if you think it's OK.
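
For readers wondering what a filter like this has to decide for each read, here is a minimal Python sketch. It is not the jvarkit implementation; it only illustrates the general idea for simple SNVs: walk the read's CIGAR string to find which read base is aligned to a variant position, then compare that base with the alternate allele. The function names `read_base_at` and `carries_alt` are hypothetical, positions are assumed to be 1-based, and all variants are assumed to lie on the read's contig.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def read_base_at(ref_pos, read_start, cigar, seq):
    """Return the read base aligned to 1-based reference position ref_pos.

    Returns None if the read does not cover that position with an aligned
    base (position outside the read, or inside a deletion/skipped region).
    """
    ref = read_start  # reference coordinate of the next aligned base
    qry = 0           # 0-based index of the next base in seq
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":      # consumes reference and read
            if ref <= ref_pos < ref + length:
                return seq[qry + (ref_pos - ref)]
            ref += length
            qry += length
        elif op in "IS":     # consumes read only
            qry += length
        elif op in "DN":     # consumes reference only
            if ref <= ref_pos < ref + length:
                return None
            ref += length
        # H and P consume neither reference nor read
    return None


def carries_alt(read_start, cigar, seq, snvs):
    """True if the read shows the alternate base at any known SNV.

    snvs is a dict mapping 1-based reference position -> alternate base
    (illustrative input; a real tool would read these from the VCF).
    """
    return any(
        read_base_at(pos, read_start, cigar, seq) == alt
        for pos, alt in snvs.items()
    )
```

A read for which `carries_alt` returns `True` would be the kind of read to drop when the goal is to remove contamination; indels would need extra handling that this sketch does not attempt.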

Hi Pierre,

Thank you for providing the script. I installed it and ran the test command, which works fine, but on my own data I run into an issue:

alejandro@wildtype2:~:java -jar biostar9501110.jar --bamcompression 9 --inverse --samoutputformat BAM -R data/ref/unmasked/Verticillium_longisporum_VlPD589_HiC-improved_chromosomes.fasta -V data/SNPs/PD589/H3K27M3_A.vcf.gz data/alignment/masked/bwa/PD589/PD589_H3K27M3_A.dupl_rm.bam > data/alignment/masked/bwa/SNPs_filtered/PD589/PD589_H3K27M3_A.filtered.bam
[SEVERE][MultiBamLauncher]contig is null
java.lang.IllegalArgumentException: contig is null
        at com.github.lindenb.jvarkit.samtools.util.SimpleInterval.<init>(SimpleInterval.java:76)
        at com.github.lindenb.jvarkit.variant.vcf.BufferedVCFReader.query(BufferedVCFReader.java:122)
        at htsjdk.variant.vcf.VCFReader.query(VCFReader.java:60)
        at com.github.lindenb.jvarkit.tools.biostar.Biostar9501110.findVariants(Biostar9501110.java:171)
        at com.github.lindenb.jvarkit.tools.biostar.Biostar9501110.lambda$createSAMRecordFunction$1(Biostar9501110.java:201)
        at com.github.lindenb.jvarkit.jcommander.OnePassBamLauncher.scanIterator(OnePassBamLauncher.java:142)
        at com.github.lindenb.jvarkit.jcommander.OnePassBamLauncher.processInput(OnePassBamLauncher.java:153)
        at com.github.lindenb.jvarkit.jcommander.MultiBamLauncher.doWork(MultiBamLauncher.java:245)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:796)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:959)
        at com.github.lindenb.jvarkit.tools.biostar.Biostar9501110.main(Biostar9501110.java:207)
[INFO][Launcher]biostar9501110 Exited with failure (-1)

It looks like something is wrong with my input. I'll describe how I got it, and let me know if there's anything I should do differently.

The BAM files were obtained by aligning with BWA-MEM and removing duplicates with Picard MarkDuplicates. They are sorted and indexed. The VCF files were generated with bcftools mpileup for specific regions of interest, then compressed with bgzip and indexed with tabix. No other parameters were tweaked.

Reply 2.4 years ago by alebaars_98 ▴ 10

How _strange_... Could you please post the output of

samtools idxstats data/alignment/masked/bwa/PD589/PD589_H3K27M3_A.dupl_rm.bam

and

tabix -l data/SNPs/PD589/H3K27M3_A.vcf.gz

Please use https://github.com/lindenb/jvarkit/issues for other questions.
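
These two commands list the contig names known to the BAM index and to the VCF index, so one thing their output allows is a check that both files use the same names. A minimal Python sketch of that comparison follows, assuming the text output of both commands has already been captured. The function names are hypothetical; the only format facts relied on are that `samtools idxstats` prints tab-separated lines starting with the contig name (with a final `*` line for unplaced reads) and that `tabix -l` prints one sequence name per line.

```python
def contigs_from_idxstats(text):
    """Contig names from `samtools idxstats` output, skipping the '*' line."""
    names = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        name = line.split("\t")[0]
        if name != "*":  # '*' collects reads without a reference contig
            names.add(name)
    return names


def contigs_from_tabix_list(text):
    """Contig names from `tabix -l` output (one name per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def vcf_contigs_missing_from_bam(idxstats_text, tabix_text):
    """Sorted list of contigs present in the VCF index but not in the BAM."""
    missing = contigs_from_tabix_list(tabix_text) - contigs_from_idxstats(idxstats_text)
    return sorted(missing)
```

An empty list means every contig in the VCF is also present in the BAM; a non-empty list points at a naming mismatch between the two files.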

Reply 2.4 years ago by Pierre Lindenbaum 161k

Ah no, got it! I forgot to check whether the read was unmapped, and all the reads in my test file are mapped. Give me a few minutes...
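
The kind of guard described here can be sketched in a few lines of Python. This is an illustration only, not the actual fix in jvarkit: the function name `variant_query_interval` and its arguments are hypothetical. It relies on the SAM convention that bit 0x4 of the FLAG field marks a read as unmapped; such a read may have `*` as its reference name, or may carry its mate's position, so the flag is checked first.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: the read itself is unmapped


def variant_query_interval(flag, rname, start, end):
    """Return (contig, start, end) to look up variants for a read.

    Returns None for unmapped reads or reads without a contig, so the
    caller never queries the VCF with a missing contig name.
    """
    if flag & UNMAPPED_FLAG:
        return None
    if rname is None or rname == "*":
        return None
    return (rname, start, end)
```

A caller would pass reads for which this returns `None` straight through (or drop them) without touching the VCF, which avoids a lookup with no contig.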

Reply 2.4 years ago by Pierre Lindenbaum 161k

... and done! I fixed the bug. Can you please update the code and tell me if it works?

Reply 2.4 years ago by Pierre Lindenbaum 161k

It works now. To test the result, I quickly generated a new VCF file from the resulting BAM file. It still contains indels, but they are rare and should not affect the results much. All SNPs are gone. Thank you very much for the tool.

Reply 2.4 years ago by alebaars_98 ▴ 10