Question

Several assemblies to reference

1

4.6 years ago

kamel ▴ 70

I have 151 FASTA assemblies (contigs) of a fungal genome with a size of 50 Mb, including the reference sequence, and I would like to build a pangenome in order to obtain a single FASTA sequence combining the 151 strains.

First I mapped each of these assemblies to the reference sequence, then I extracted the BAM file containing the unmapped reads for each strain. I am stuck on the remaining steps.

How can I build a single reference sequence based on the pangenome of these sequences?

Thank you, Kamel

assembly genome sequence alignment • 1.4k views

1

You are reaching into the realm of genome graphs: a single space representing multiple genomes. Look into the vg toolkit, Cactus alignment and pggb. You will need new whole-genome alignments to generate the graph (Cactus and pggb), which vg can then use in pangenome analyses.
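As a rough illustration of the preparation step: pggb, for instance, takes all genomes in one combined FASTA file, and its documentation recommends PanSN-style sequence names (`sample#haplotype#contig`) so that contigs from different strains stay distinguishable. The sketch below only rewrites headers. The function name `pansn_rename`, the strain names, and the choice of haplotype `1` for every strain are assumptions made for the example, not part of any tool.

```python
from typing import Iterable, Iterator


def pansn_rename(lines: Iterable[str], sample: str, haplotype: int = 1) -> Iterator[str]:
    """Yield FASTA lines with headers rewritten as >sample#haplotype#contig."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a sequence name")
            # Keep only the contig ID (first whitespace-delimited token).
            yield f">{sample}#{haplotype}#{tokens[0]}"
        elif line:
            yield line


# Toy input: two hypothetical strains, each given as a list of FASTA lines.
strain_a = [">contig1 length=8", "ACGTACGT"]
strain_b = [">contig1", "ACGTTCGT", ">contig2", "GGGCCC"]

# Concatenate the renamed records into one combined FASTA (as a list of lines).
combined = list(pansn_rename(strain_a, "strainA")) + list(pansn_rename(strain_b, "strainB"))
```

In practice the same function could be applied to open file handles, one assembly at a time, writing everything to a single output file before building the graph.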

4.6 years ago by samuel.a.odonnell ▴ 600

score 1 · Answer 1 · 2020-12-01

1

4.6 years ago

Istvan Albert 102k

The word pangenome is used to describe the superset of all genes across all strains of a species.

It is not some sort of consensus that could be represented as a single sequence (if I understood correctly what you describe as your goal). You can't collapse the sequences into a single one; it would not be biologically meaningful.
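To make the gene-set definition above concrete, here is a minimal sketch using plain Python sets. The strain names and gene family IDs are invented for the example, and the terms "core" and "accessory" are the usual companions of this definition rather than something stated in this thread.

```python
# Hypothetical gene presence per strain (gene family IDs are made up).
genes_by_strain = {
    "ref": {"g1", "g2", "g3"},
    "strainA": {"g1", "g2", "g4"},
    "strainB": {"g1", "g3", "g5"},
}

# Pan-genome: every gene found in at least one strain.
pan_genome = set().union(*genes_by_strain.values())

# Core genome: genes present in every strain.
core_genome = set.intersection(*genes_by_strain.values())

# Accessory genes: in the pan-genome but not shared by all strains.
accessory = pan_genome - core_genome

# Genes seen in some strain but absent from the reference strain.
not_in_reference = pan_genome - genes_by_strain["ref"]
```

The result is a set of genes, not a sequence: nothing here says where `g4` or `g5` would sit relative to the reference, which is why the set cannot simply be written out as one genome.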

0

This is not exactly what I wanted to do.

From the pangenome of these 151 genomes I found that this pangenome is of the open type, so there are sequences that are not present in the reference strain. I want to complete the reference sequence with the new sequences which are not present in the reference strain.
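One way to locate such sequences, sketched under assumptions: if each assembly is aligned to the reference and the alignments are written in PAF format (as minimap2 can produce), the stretches of each contig that no alignment covers are candidates for sequence missing from the reference. The function name `unaligned_segments`, the `min_len` threshold, and the toy records are illustrative only. Contigs with no alignment at all do not appear in a PAF file and would have to be collected separately, and no filtering on mapping quality is attempted here.

```python
from collections import defaultdict


def unaligned_segments(paf_lines, min_len=1):
    """Return {query: [(start, end), ...]} for query regions covered by no alignment.

    Coordinates are 0-based and end-exclusive, as in PAF columns 3 and 4.
    """
    lengths = {}
    aligned = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip empty or malformed lines
        name = fields[0]
        lengths[name] = int(fields[1])
        aligned[name].append((int(fields[2]), int(fields[3])))

    gaps = {}
    for name, intervals in aligned.items():
        cursor = 0  # end of the region covered so far
        found = []
        for start, end in sorted(intervals):
            if start > cursor and start - cursor >= min_len:
                found.append((cursor, start))
            cursor = max(cursor, end)
        # Uncovered tail of the contig, if long enough.
        if lengths[name] - cursor >= min_len:
            found.append((cursor, lengths[name]))
        gaps[name] = found
    return gaps


# Toy PAF records for one hypothetical 1000 bp contig.
paf = [
    "ctg1\t1000\t0\t400\t+\tref_chr1\t5000\t100\t500\t390\t400\t60",
    "ctg1\t1000\t350\t600\t+\tref_chr1\t5000\t450\t700\t240\t250\t60",
    "ctg1\t1000\t900\t1000\t-\tref_chr2\t3000\t0\t100\t95\t100\t60",
]

novel = unaligned_segments(paf, min_len=100)
```

Here positions 600 to 900 of `ctg1` are not covered by any alignment, so that interval is reported as a candidate.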

4.6 years ago by kamel ▴ 70

0

Even so, the words need to be used correctly; otherwise it leads to confusion.

https://en.wikipedia.org/wiki/Pan-genome

A pan-genome is about genes, not about merged, full genomic sequences. In addition, you cannot "complete" a reference by merging more strains into it. That would not be a reference sequence in the normal sense of the word.

What we call the reference is usually also called the "representative" sequence.

It is important that (as much as possible) the reference represents a functional, living organism. It is not so obvious what utility an artificial, non-existing consensus sequence has, especially since it could constantly change as you add more strains.

The whole point of a reference is that it won't just change on a whim and that it is a meaningful biological sequence that is known to be "alive".

4.6 years ago by Istvan Albert 102k

1

I think you are correct in that having an unambiguous definition would be better and avoid unnecessary confusion, and, by being older, the "genes shared by the majority of lineages" pan-genome is the correct definition.

However, the people working on graph genomes also use "pan-genome" for references that incorporate the available genetic variation, and I would argue this application is better suited to "pan-genome" than "common gene set".

4.6 years ago by h.mon 35k