As a user and developer I am grateful for all the efficient, clever and really helpful tools and pipelines put out there by the community. I'd like to use this page to list open questions from my point of view, i.e. gaps in the field where talented developers might want to try their hand and build something new. We have enough short read aligners, for example, and don't really need more.
It would be interesting to see what others think, so please contribute - this is an incomplete list.
Pangenomes
Pangenome short read alignment vs pangenomic reference sequences constructed by PGGB
Problem: the most typical use case for pangenomics is to build a pangenome from finished, high-quality reference sequences with a tool like PGGB, then map reads against it and call SNPs. This is not well developed yet - I struggle to map 2 million reads against a complex GFA file from Arabidopsis over 24+ hours. While vg giraffe (short reads only) and GraphAligner (long and short reads) do exist, vg giraffe in particular is too inefficient against these complex graphs. Read mapping is necessary if I want to genotype short read datasets (or even call new SNPs and SVs) against a pangenome.
Pangenome SNP and indel calling
AFAIK only vg call, Paragraph and GraphTyper exist (please add more). vg call is the most popular to my knowledge, but cannot perform multi-sample SNP calling. There is a big gap for potential new tools here.
Genomes
Short read alignments vs multiple haplotypes/different genomes
AFAIK there are no linear reference mapping tools or concepts that can map to a diploid genome (for human, 6 Gb rather than 3 Gb) with both haplotypes included in the FASTA reference. In future, are we going to have to map everything to pangenomes?
bwa mem is "ALT-aware" for patched regions of the human genome only, where variant calling performance can be improved by using the patched version. However, mapping quality filters cause major problems when these patches/alts are used for other applications, such as ChIP-seq or RNA-seq, because reads from whole regions end up ignored or masked.
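The mapping quality problem can be illustrated with a toy sketch. It is not how bwa or any real aligner computes MAPQ: the exact-match search, the `toy_mapq` rule (unique hit gives 60, anything else gives 0) and the `MIN_MAPQ` threshold are all assumptions made for illustration. It only shows why reads that place equally well on both haplotypes fall below a typical MAPQ filter once both copies are in the reference.

```python
MIN_MAPQ = 20  # assumed downstream filter threshold, for illustration only


def find_hits(read, reference):
    """Return every (contig, position) where the read matches exactly."""
    hits = []
    for name, seq in reference.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            start = seq.find(read, start + 1)
    return hits


def toy_mapq(n_hits):
    """Toy rule (an assumption, not bwa's model): unique hit -> 60, else 0."""
    return 60 if n_hits == 1 else 0


def count_kept(reads, reference):
    """Count reads whose toy MAPQ passes the MIN_MAPQ filter."""
    return sum(toy_mapq(len(find_hits(r, reference))) >= MIN_MAPQ for r in reads)


# Two made-up haplotypes that differ by a single SNP at index 20.
hap1 = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTAC"
hap2 = hap1[:20] + "T" + hap1[21:]

# Four non-overlapping 10 bp "reads" sampled from haplotype 1.
reads = [hap1[i:i + 10] for i in range(0, len(hap1), 10)]

kept_haploid = count_kept(reads, {"hap1": hap1})
kept_diploid = count_kept(reads, {"hap1": hap1, "hap2": hap2})

print(f"reads passing MAPQ >= {MIN_MAPQ}, haploid reference: {kept_haploid}/{len(reads)}")
print(f"reads passing MAPQ >= {MIN_MAPQ}, diploid reference: {kept_diploid}/{len(reads)}")
```

With these toy sequences, all four reads pass the filter against the haploid reference, but only the read overlapping the SNP passes against the diploid one. The other three reads are not lost because they map badly, but because they map equally well twice.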
Most tools/databases are tested/annotated/compiled for model organisms. There are plenty of other organisms, used by only a small number of labs, which are left out in the cold in terms of standard resources, e.g. ways of dealing with data, assumptions about genome composition, variant calls for populations etc. So helping these overlooked organisms/researchers will likely remain an open problem for a long time to come.