Indel size seemingly capped by software?
0
0
Entering edit mode
21 months ago
mrmrwinter ▴ 30

Hi,

I'm trying to detect indels between pairs of homeologous scaffolds in a genome assembly.

My method so far is as follows:

1, Aligned pairs with nucmer

"nucmer -p " + mapping + row[0] + "." + row[1] + " " + seq1 + " " + seq2

2, Generated variant information with mummer show-snps

"show-snps -T " + delta + " > " + snps

3, Converted this output to vcf

"mumSNPs2vcf.py --input-header --output-header --snps " + snps + " --reference " + ref + " --type INDEL > " + snps.replace(".svs",".vcf").replace("alignments","vcfs")

4, Ran the output into bcftools stats

"bcftools stats " + vcf + " > " + stats + pfx

5, Then generated plots

"plot-vcfstats " + vcf + " -p " + stats + "plots/" + pfx

My issue comes from that bcftools stats and the following plots only show indels up to 60 bp in size, when i can see clearly in synteny alignments that there are indels larger than this between each pair.

An example bcftools stats output looks like this:

# This file was produced by bcftools stats (1.14+htslib-1.14) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats  ../FINAL_ASSEMBLY/pairs/indels/vcfs/Scaffold_7.Scaffold_10.vcf
#
# Definition of sets:
# ID    [2]id   [3]tab-separated file names
ID  0   ../FINAL_ASSEMBLY/pairs/indels/vcfs/Scaffold_7.Scaffold_10.vcf
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# 
#   Note that rows containing multiple types will be counted multiple times, in each
#   counter. For example, a row with a SNP and an indel increments both the SNP and
#   the indel counter.
# 
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  0
SN  0   number of records:  191914
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 141344
SN  0   number of MNPs: 0
SN  0   number of indels:   50570
SN  0   number of others:   0
SN  0   number of multiallelic sites:   1238
SN  0   number of multiallelic SNP sites:   1238
# TSTV, transitions/transversions:
# TSTV  [2]id   [3]ts   [4]tv   [5]ts/tv    [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV    0   75388   67211   1.12    74979   66365   1.13
# SiS, Singleton stats:
# SiS   [2]id   [3]allele count [4]number of SNPs   [5]number of transitions    [6]number of transversions  [7]number of indels [8]repeat-consistent    [9]repeat-inconsistent  [10]not applicable
SiS 0   1   142599  75388   67211   50570   0   0   50570
# AF, Stats by non-reference allele frequency:
# AF    [2]id   [3]allele frequency [4]number of SNPs   [5]number of transitions    [6]number of transversions  [7]number of indels [8]repeat-consistent    [9]repeat-inconsistent  [10]not applicable
AF  0   0.000000    142599  75388   67211   50570   0   0   50570
# QUAL, Stats by quality
# QUAL  [2]id   [3]Quality  [4]number of SNPs   [5]number of transitions (1st ALT)  [6]number of transversions (1st ALT)    [7]number of indels
QUAL    0   .   141344  74979   66365   50570
# IDD, InDel distribution:
# IDD   [2]id   [3]length (deletions negative)  [4]number of sites  [5]number of genotypes  [6]mean VAF
IDD 0   -60 5   0   .
IDD 0   -59 1   0   .
IDD 0   -55 1   0   .
IDD 0   -54 2   0   .
IDD 0   -53 1   0   .
IDD 0   -52 3   0   .
IDD 0   -50 1   0   .
IDD 0   -49 1   0   .
IDD 0   -47 1   0   .
IDD 0   -45 2   0   .
IDD 0   -44 2   0   .
IDD 0   -43 4   0   .
IDD 0   -42 1   0   .
IDD 0   -41 1   0   .
IDD 0   -39 7   0   .
IDD 0   -38 1   0   .
IDD 0   -37 3   0   .
IDD 0   -36 10  0   .
IDD 0   -35 2   0   .
IDD 0   -34 2   0   .
IDD 0   -33 1   0   .
IDD 0   -32 3   0   .
IDD 0   -31 6   0   .
IDD 0   -30 5   0   .
IDD 0   -29 8   0   .
IDD 0   -28 3   0   .
IDD 0   -27 8   0   .
IDD 0   -25 9   0   .
IDD 0   -24 10  0   .
IDD 0   -23 12  0   .
IDD 0   -22 8   0   .
IDD 0   -21 23  0   .
IDD 0   -20 24  0   .
IDD 0   -19 20  0   .
IDD 0   -18 16  0   .
IDD 0   -17 28  0   .
IDD 0   -16 26  0   .
IDD 0   -15 42  0   .
IDD 0   -14 53  0   .
IDD 0   -13 66  0   .
IDD 0   -12 86  0   .
IDD 0   -11 108 0   .
IDD 0   -10 122 0   .
IDD 0   -9  176 0   .
IDD 0   -8  251 0   .
IDD 0   -7  341 0   .
IDD 0   -6  518 0   .
IDD 0   -5  707 0   .
IDD 0   -4  1330    0   .
IDD 0   -3  2005    0   .
IDD 0   -2  3692    0   .
IDD 0   -1  17970   0   .
IDD 0   1   12090   0   .
IDD 0   2   4065    0   .
IDD 0   3   2273    0   .
IDD 0   4   1363    0   .
IDD 0   5   725 0   .
IDD 0   6   592 0   .
IDD 0   7   367 0   .
IDD 0   8   252 0   .
IDD 0   9   216 0   .
IDD 0   10  163 0   .
IDD 0   11  128 0   .
IDD 0   12  105 0   .
IDD 0   13  75  0   .
IDD 0   14  67  0   .
IDD 0   15  67  0   .
IDD 0   16  37  0   .
IDD 0   17  19  0   .
IDD 0   18  28  0   .
IDD 0   19  21  0   .
IDD 0   20  20  0   .
IDD 0   21  18  0   .
IDD 0   22  19  0   .
IDD 0   23  10  0   .
IDD 0   24  9   0   .
IDD 0   25  12  0   .
IDD 0   26  5   0   .
IDD 0   27  5   0   .
IDD 0   28  5   0   .
IDD 0   29  11  0   .
IDD 0   30  3   0   .
IDD 0   31  4   0   .
IDD 0   32  4   0   .
IDD 0   33  5   0   .
IDD 0   34  6   0   .
IDD 0   35  2   0   .
IDD 0   36  2   0   .
IDD 0   37  1   0   .
IDD 0   38  2   0   .
IDD 0   39  1   0   .
IDD 0   40  4   0   .
IDD 0   41  3   0   .
IDD 0   42  2   0   .
IDD 0   44  2   0   .
IDD 0   46  2   0   .
IDD 0   47  3   0   .
IDD 0   48  1   0   .
IDD 0   49  1   0   .
IDD 0   50  1   0   .
IDD 0   51  3   0   .
IDD 0   55  1   0   .
IDD 0   56  1   0   .
IDD 0   57  1   0   .
IDD 0   58  1   0   .
IDD 0   59  1   0   .
IDD 0   60  18  0   .
# ST, Substitution types:
# ST    [2]id   [3]type [4]count
ST  0   A>C 7647
ST  0   A>G 19372
ST  0   A>T 15832
ST  0   C>A 7470
ST  0   C>G 2887
ST  0   C>T 18093
ST  0   G>A 18511
ST  0   G>C 2800
ST  0   G>T 7380
ST  0   T>A 15648
ST  0   T>C 19412
ST  0   T>G 7547
# DP, Depth distribution
# DP    [2]id   [3]bin  [4]number of genotypes  [5]fraction of genotypes (%)    [6]number of sites  [7]fraction of sites (%)

What am i missing here? Are there settings in either nucmer of bcftools to output/detect indels larger than 60bp? I am expecting indels ranging up to and beyond tens of thousands of bp.

please excuse the python formatting; all of the flags are still present.

indels calling nucmer variant genome alignment • 419 views
ADD COMMENT
0
Entering edit mode

I have now repeated this with a few different aligners and the 60 bp indel cap is persisting. Reading bcftools stats documentation and there is mention of it missing larger indels due to a length cap being applied. There has been a flag added, --indel-size, to bcftools stats development version. I will try increasing the threshold and see if this calls larger indels.

ADD REPLY

Login before adding your answer.

Traffic: 2664 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6