vg: Mapping fewer reads to a genome graph than a linear reference
Entering edit mode
12 months ago
John Lloyd ▴ 30

I am seeing a 5-10% reduction in reads mapped to a genome graph compared to a linear reference of the same genome sequence. I have two graphs: one is the reference genome decorated with 5 million SNPs and the other is an empty graph created with vg using the reference genome sequence alone. Regardless of which graph or graph aligner I use (details below), I'm seeing similar results.

1) Is it unexpected to see fewer reads map to a genome graph than the linear reference, or am I misunderstanding? For example, is it known that vg map is less powerful than a linear aligner like bwa-mem for read mapping?

2) Are there useful parameters to adjust in vg's mapping tools to increase the # of mapped reads?

Species: Corn (Zea mays)

Genome: B73

SNPs: 5 million high quality SNPs (i.e. filtered to low false positive) from 100+ individuals

Read datasets: 150 bp, paired end, n=200 read datasets, low coverage Illumina DNA sequencing (0.1X, i.e. skim-seq)

Linear aligner: BWA

bwa index -p index_bwa genome.fa
bwa mem index_bwa reads_pe1.fq reads_pe2.fq > linear_align.sam

Genome graph 1: B73 decorated with 5 million SNPs

vg construct -r genome.fa -v snps_5M.vcf >
vg view -g > graph_snp.gfa #convert to gfa for indexing

Genome graph 2: B73, empty graph

vg construct -r genome.fa >
vg view -g > graph_empty.gfa #convert to gfa for indexing

Graph aligner 1: vg map

vg autoindex --prefix index_map -w map -g graph.gfa 
vg map -x index_map.xg -g index_map.gcsa -f reads_pe1.fq -f reads_pe2.fq --gaf > graph_align_map.gaf

Graph aligner 2: vg giraffe

vg autoindex --prefix index_giraffe -w giraffe -g graph.gfa
vg giraffe -Z index_giraffe.giraffe.gbz -m index_giraffe.min -d index_giraffe.dist \
    -f reads_pe1.fq -f reads_pe2.fq --gaf > graph_align_giraffe.gaf
giraffe mapping pangenome vg • 905 views
Entering edit mode
12 months ago

I learned recently from others in pangenomics who have done mapping with bwa that bwa tends to overattribute reads to a reference sequence. Bowtie2 does not, apparently, and this might be a better comparison

So your results are not unexpected for me.

That said, I find pangenome mappers to be in their infancy (at least in terms of speed in my hands), which has been admitted by pangenome tool devs, with the exception of GraphAligner for long reads.

Entering edit mode

In this case filtering for quality might help correct for this overattributing feature?


Login before adding your answer.

Traffic: 1534 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6