Question: Best way to find intervals for parallelization of joint variant calling
0
gravatar for William
8 weeks ago by
William4.7k
Europe
William4.7k wrote:

What is the best way to find intervals for parallelization of joint variant calling?

In general I can think of 3 ways:

  1. Split on poly-N regions in the reference genome. e.g. via picard ScatterIntervalsByNs
  2. Split on regions without mapped reads
  3. Split every 1Mbp, book-ended or overlapping

With improvements in sequencing quality it is becoming more difficult to find areas that are obviously safe to split the analysis on.

  • Less and smaller poly-N regions
  • More regions of the genome can have reads mapped (if using e.g. (long) reads of variable length)

This leads to the risk of not much parallelization (i.e. max per chromosome) or splitting right in the middle of a variant.

What is the best way to parallelize joint variant calling?

(e.g. GATK GenomicDBImport and GenotypeGVCFs on many GVCFs)?

Without losing variants or getting double variants?

variants parallel • 138 views
ADD COMMENTlink modified 8 weeks ago by Pierre Lindenbaum133k • written 8 weeks ago by William4.7k
0
gravatar for Pierre Lindenbaum
8 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum133k wrote:

I wrote a tool named: FaidxSplitter http://lindenb.github.io/jvarkit/FaidxSplitter.html , I'm not sure if it fulfill your needs.

it takes as input:

  • the indexed reference genome
  • a BED of gaps where I don't want any calling
  • a BED of genes/regions where I don't want any 'cut'.
  • w the size of each region
  • x the overlap of between to region

the output is a BED file with all the regions to be called.

I then call the variants for each bed and I concatenate the vcf with bcftools, removing the duplicates.

ADD COMMENTlink written 8 weeks ago by Pierre Lindenbaum133k

I am currently stuck because for some reference genomes it is difficult to find the following in a systematic computational way

a BED of gaps where I don't want any calling

ADD REPLYlink written 8 weeks ago by William4.7k

as you said, you could use ScatterIntervalsByNs, use the regions with sharing 0 coverage or sharing a too high coverage. For human, there are some blacklisted regions defined by ENCODE.

ADD REPLYlink written 8 weeks ago by Pierre Lindenbaum133k

I prefer to use the fasta for this. BAM coverage analysis is not fully deterministic, it changes with the set of BAM files under analysis and is computational expensive if looking at many samples/bam files. Some non human model organism now have really high quality reference genomes without many (long) poly-N regions.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by William4.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2171 users visited in the last hour
_