Is there a 1000GP pangenome available?
16 months ago
בת אל • 0

Hi, I'm new to variation graphs and want to explore new alignment technologies for low-coverage ancient sequences (in FASTQ).

Usually, ancient sequences are linearly aligned to hg19 using BWA.

Is there a suitable pangenome graph alternative available? Or do I have to create one myself, for example from the 1000G reference panel?

Thanks, B.

vg alignment pangenome low-coverage map • 1.3k views
16 months ago
Randy H ▴ 110

I'm not sure any standard tool can accept a pangenome graph model yet, or that the model is really defined in any final, formal way. The pangenome appears to still be in the early research and formulation stage, with tools and equipment to follow.

See HPRC for more information on a major, funded project, and especially scroll to the bottom of the home page. Their ultimate goal is only 350 samples, with the first 200 being trios from the 1KGP cell lines. They then plan to add more to try to fill in isolated populations that were missed and have not been sampled yet.

HPRC have released year 1 data for 47 cell samples (in various stages of completion). See the HPRC Data and Tool repository on GitHub for the tool pipeline and the year 1 data. Likely the best detailed overview is their paper (the preprint came out last July; I don't recall that the final version has come out yet).

Once a pangenome model is done (or a more final draft is available), it would be great to see whether ancient DNA groups like the Reich lab could use the model, or even the tools, to expand the graph reference model using artifact DNA. But maybe most aDNA is too degraded for enough of a complete genome to be ascertained.

The long-read sequencing technology needed for this work is just now being commercially released. But it appears that the read length and accuracy needed to make this de novo model generation practical and automatic are not there yet. Currently, a lot of manual tuning is done in the labs. UPDATE: A new tool, Verkko, to automate the T2T process was just published today.

The T2T consortium (another funded, multi-organization project), which is interrelated with HPRC, has published and released the first full genome assembly (in linear form). But it used the HPGP HG002 Y, added to their pioneering work on the autosomes and X of the haploid CHM13 cell line. X and Y of HG002 were the furthest along in the HPRC year 1 data. See T2T Consortium to follow further. From what I read in another post, I think the complete T2T of HG002 is in the final stages of QC. That will be the first single human diploid T2T model. Only 349 more to go after that :)

Can you really create a complete genome model (using de novo assembly) from short read sequencing that is available from the 1KGP? Or maybe you did not mean to imply that with your "create one yourself" comment.

There is a sort of graph-based reference in the way DRAGEN has a custom reference with many more alt contigs than the standard reference genomes. See DRAGEN demystifying genomes and the DRAGEN Graph Mapper tool. Illumina and AWS have hosted a rerun of all the 1KGenome datasets through DRAGEN. But it is not clear whether they used their Graph Reference model for that, or whether it is unwound before generating a final BAM or VCF so that only the more standard linear reference with far fewer alt contigs (e.g. hs38DH) is used. See the DRAGEN reanalysis of the 1KGenome Data Set on AWS. Maybe this is what you wanted?

Randy, thank you so much for your thorough explanations!

You definitely referred me to some very interesting sources I was not aware of.

As for your last paragraph - yes - I want to align ancient sequences in a non-linear fashion.

Unfortunately, DRAGEN is only available for hs38DH, and not for 37 (which is what ancient DNA is usually aligned to).

What actually motivated me to write my post is the Martiniano et al. Genome Biology (2020) paper, where they create a variation graph based on the 1000G, and align reads to it.

I want to try and do the same.
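To make the idea concrete for myself, here is a toy sketch of why aligning to a variation graph can help. It is not vg and not the method from the paper: it only handles SNPs, it uses exact matching, and the sequence, the SNP, and the function names (`build_snp_alleles`, `find_hits`) are all made up for illustration. A read carrying a known alternate allele fails to match the linear reference exactly, but matches once that allele is part of the reference model.

```python
from collections import defaultdict


def build_snp_alleles(reference, snps):
    """SNP-only stand-in for a variation graph.

    Returns a mapping from 0-based reference position to the set of
    bases allowed there (the reference base plus any known alternates).
    """
    alleles = defaultdict(set)
    for pos, base in enumerate(reference):
        alleles[pos].add(base)
    for pos, alt in snps:
        alleles[pos].add(alt)
    return alleles


def find_hits(read, reference, alleles=None):
    """Return every start position where the read matches exactly.

    With alleles=None the read is compared to the linear reference only.
    Otherwise each read base may match any allele known at that position.
    """
    hits = []
    for start in range(len(reference) - len(read) + 1):
        if alleles is None:
            ok = reference[start:start + len(read)] == read
        else:
            ok = all(base in alleles[start + i] for i, base in enumerate(read))
        if ok:
            hits.append(start)
    return hits


# Hypothetical data: a 12 bp reference and one known SNP (T>C at position 4).
reference = "ACGTTGCAAGCT"
snps = [(4, "C")]
graph = build_snp_alleles(reference, snps)

# This read covers positions 2..7 and carries the alternate allele.
read = "GTCGCA"
linear_hits = find_hits(read, reference)
graph_hits = find_hits(read, reference, graph)

print("linear:", linear_hits)
print("graph: ", graph_hits)
```

A real graph aligner also has to deal with indels, mismatches from damage and sequencing error, and indexing at genome scale, none of which this sketch attempts.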

See and scroll to the Homo Sapiens section and the DRAGEN Graph section at the end of that. The Build 37 graph models are there in various forms. Note that the graph-aware process is both alignment AND variant calling, so you will have to realign any BAM/CRAM/SAM file to use these new references, or start from FASTQs.

10 months ago

You might be interested in the 1000 GP pangenome graphs that were constructed for the analyses in this paper. The data resources are all available here:


