BCFTOOLS: One-liner to get individuals with more than one heterozygote call in a region
1
0
Entering edit mode
4.6 years ago
tacrolimus ▴ 140

Hello!

I have a regular multi-sample VCF from which I am looking to find individuals that have >1 HET call in the same gene. The vcf is chunked so relatively small in size (with the gene of interest in the chunk)/

Does anyone have a one liner for such a task with bcftools or even R?

Many thanks

bcftools • 1.5k views
ADD COMMENT
0
Entering edit mode

multiple individuals merged into one across a single gene

VCF format does not work that way. How did you get to this from a real valid VCF?

ADD REPLY
0
Entering edit mode

Yup - its a multi sample vcf chunked by region of the genome.

ADD REPLY
0
Entering edit mode

Yup

Yes to what, exactly?

it's a multi-sample VCF

That is not the same thing as saying "multiple individuals merged into one". Please be clear - is it a multi-sample VCF or some sort of mutated file with multiple GTs per sample?

ADD REPLY
0
Entering edit mode

Dear RamRS,

It is a multi-sample vcf. Whilst it may contain mutations I can assure you it has not been mutated.

ADD REPLY
1
Entering edit mode

Thank you for clarifying that. Your initial statement "I have a vcf with multiple individuals merged into one across a single gene" is quite confusing as individuals are not "merged", they are merely part of the VCF, and I'm also assuming loci are not merged (the "across a single gene" part). What you have is a regular VCF from which you are looking to find individuals that have >1 HET call in the same gene.

ADD REPLY
0
Entering edit mode

Correct! Apologies if that was too vague. I have amended by question accordingly.

ADD REPLY
1
Entering edit mode
4.6 years ago

using http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html to create a BED tools + bcftools to get the variants. (you could get some false positives if an indel overlap a position )

bcftools view --targets-file <(java -jar dist/bioalcidaejdk.jar -e 'final Map<String,List<Interval>> sample2ctx = new HashMap<>();while(iter.hasNext()) { final VariantContext v= iter.next(); for(final Genotype g:v.getGenotypes()) { if(!g.isHet()) continue; List<Interval> list = sample2ctx.get(g.getSampleName()); if(list==null) { list = new ArrayList<>(); sample2ctx.put(g.getSampleName(),list); } list.add(new Interval(v)); } }for(final String sn: sample2ctx.keySet()) { List<Interval> list = sample2ctx.get(sn); if(list.size()<2) continue; for(Interval r:list) { print(r.getContig()); print("\t"); print(r.getStart()-1); print("\t"); print(r.getEnd()); print("\t"); println(sn); }}'  src/test/resources/rotavirus_rf.vcf.gz) src/test/resources/rotavirus_rf.vcf.gz
ADD COMMENT
0
Entering edit mode

That's quite the one liner! Thanks I'll give it a go. All the best.

ADD REPLY
0
Entering edit mode

Pierre, I think the code you've written here is complicated enough to warrant an explainer. How is your java code picking regions to use with the --targets-file parameter?

ADD REPLY
0
Entering edit mode

the java code generate a bed file. It's used as a file using a proc substitution https://www.tldp.org/LDP/abs/html/process-sub.html

ADD REPLY
0
Entering edit mode

Right, I'm clear on the process sub part - just wondering how the bed file is created. In other words, can you describe the code in English please?

ADD REPLY
1
Entering edit mode
create an associative array [sample,list[interval]]
for each variant
   for each genotype
       if the genotype is het add the interval for this sample

for each sample
   print each interval as bed if the number is > 1
ADD REPLY

Login before adding your answer.

Traffic: 2034 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6