Applying Patches To GRCh Assembly
11.9 years ago
Nikolay Vyahhi ★ 1.3k

There are patches in the H. sapiens GRCh37 assembly. Some of them are "fix patches":

FIX patch: A patch that corrects sequence or reduces an assembly gap in a given major release. FIX patch sequences are meant to be incorporated into the primary or existing alt-loci assembly units at the next major release, and their accessions will then be deprecated.

How can I automatically apply these patches to the assembled genomic sequences (i.e. to the assembled chromosomes from GRCh37) to get the latest known sequences?

assembly human freeze genome • 5.6k views
11.7 years ago
lh3 33k

To me, patches are just a historical record and are not intended for practical use. If you integrate them into the primary assembly, all the coordinates downstream of each patch will be shifted, and your results cannot be easily compared to others. If you treat them as separate contigs, the massive redundancy with the primary assembly will lead to loss of information around patches, which is also problematic.

If you are worried about misassemblies in the reference genome messing up your analysis, you should really use decoy sequences.


The decoy sequences will not help if there is a mis-assembly in the Primary. Rather, the decoy sequences are there to help decrease off-target alignments due to sequences missing from the Primary assembly.


It depends on how we define "help" and "misassembly". I know multiple examples where the reference genome collapses two or more copies of a sequence into one (I call this a misassembly), which leads to spurious variants. The decoy helps to remove many of these false positives. Also, about 90% of decoy sequences are fixed in the entire human population. Missing these sequences is also a type of misassembly. That said, I really appreciate the GRC for its phenomenal work on the human reference genome. I know getting a good genome is really, really hard.


I was not saying the decoy is not useful; I'm merely commenting on the fact that it doesn't fix the misassembly. It only helps by soaking up reads so that you don't get off-target alignments. The FIX patches are actually meant to 'fix' these problems (although, granted, not all of them are fixed).

11.7 years ago
deanna.church ★ 1.1k

Currently, it is challenging to use the patches. Tools are being developed to make better use of the data, but they are not quite ready for prime time. Jeremy and JC are correct: you can integrate the sequences into the assembly using the information the GRC distributes, but you will create chromosome coordinates that don't exist anywhere beyond your computer. However, if this is just an analysis intermediate and you map features back to the native (that is, scaffold) coordinates, then you would be fine.
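To illustrate the "map features back to native coordinates" step, here is a minimal sketch, not a GRC tool. It assumes you kept a record of every region you replaced: the original 1-based interval and the length of the patch sequence spliced in. The `Edit` record, the `to_native` function and the `TOY_FIX_*` names are hypothetical, and the sketch assumes each patch was inserted whole and in forward orientation.

```python
from typing import List, NamedTuple, Tuple


class Edit(NamedTuple):
    name: str        # hypothetical label for the patch scaffold
    orig_start: int  # 1-based, inclusive, on the ORIGINAL chromosome
    orig_stop: int   # 1-based, inclusive, on the ORIGINAL chromosome
    patch_len: int   # length of the sequence that was spliced in


def to_native(pos: int, chrom: str, edits: List[Edit]) -> Tuple[str, int]:
    """Map a 1-based position on the patched chromosome back to either
    the original chromosome or the patch scaffold it falls in.

    Assumes non-overlapping edits, each patch inserted whole and in
    forward orientation.
    """
    shift = 0  # cumulative length change caused by edits to the left of pos
    for e in sorted(edits, key=lambda e: e.orig_start):
        new_start = e.orig_start + shift
        new_stop = new_start + e.patch_len - 1
        if pos < new_start:
            break  # pos lies before this edit; later edits cannot affect it
        if pos <= new_stop:
            # Inside the patch: report scaffold coordinates instead
            return (e.name, pos - new_start + 1)
        shift += e.patch_len - (e.orig_stop - e.orig_start + 1)
    return (chrom, pos - shift)


# Toy example: an 8 bp patch replaced original positions 5-8,
# and a 3 bp patch replaced original positions 15-17.
edits = [Edit("TOY_FIX_1", 5, 8, 8), Edit("TOY_FIX_2", 15, 17, 3)]
for p in (4, 5, 13, 22):
    print(p, to_native(p, "1", edits))
```

Positions inside a patch have no unique coordinate on the original chromosome, which is why the sketch reports them on the scaffold instead.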

Note: there are two types of patches. The FIX patches are regions where there is a misassembly, and the Primary assembly (i.e. the chromosome assembly) will change when GRCh38 is released next year. The NOVEL patches represent places where the underlying chromosome assembly seems to be correct, but an additional allele that adds more sequence has been found.

There is an aligner that can use the assembly structure (that is the placement file that provides the correspondence of the patches/alts to the assembly) and provide alignments that don't get a lower mapping score: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/srprism

This aligner is not published yet, but a manuscript is in preparation.

11.7 years ago

Patches are not normally intended to be applied to the primary sequence, otherwise the GRC would have provided tools to easily do so. You can certainly include the patch sequences in your genomic index for the purposes of alignment, but you will be working off the grid, so to speak.

11.7 years ago
JC 13k

As Jeremy said, there are no tools to patch a genome because those regions are not intended to be integrated, but some "patching" can be done with a little scripting and the information in ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p8/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
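A rough sketch of what that "little scripting" could look like, run on toy data rather than the real files. The column names (`alt_scaf_acc`, `parent_name`, `ori`, `alt_scaf_start`, `alt_scaf_stop`, `parent_start`, `parent_stop`) are my assumption about the placement file header, so check them against the actual file. The `TOY_FIX_*` accessions and all sequences are made up, and the sketch assumes non-overlapping placements with only `+` or `-` orientation.

```python
import csv
import io

# Toy stand-in for alt_scaffold_placement.txt (tab-delimited, 1-based
# inclusive coordinates). Column names are assumed, not verified.
PLACEMENT_TSV = (
    "#alt_asm_name\talt_scaf_acc\tparent_name\tori\t"
    "alt_scaf_start\talt_scaf_stop\tparent_start\tparent_stop\n"
    "PATCHES\tTOY_FIX_1\t1\t+\t1\t8\t5\t8\n"
    "PATCHES\tTOY_FIX_2\t1\t-\t1\t3\t15\t17\n"
)

CHROM = "AAAACCCCGGGGTTAAATTT"                       # toy "chromosome 1"
PATCH_SEQS = {"TOY_FIX_1": "CCTTTTCC", "TOY_FIX_2": "GGC"}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_placements(handle):
    """Parse the placement table into a list of dicts with int coordinates."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("alt_scaf_start", "alt_scaf_stop",
                    "parent_start", "parent_stop"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def apply_patches(chrom_seq, chrom_name, placements, patch_seqs):
    """Splice each placed patch over its parent interval.

    Patches are applied from right to left so that the coordinates of
    intervals not yet processed stay valid.
    """
    mine = [p for p in placements if p["parent_name"] == chrom_name]
    for p in sorted(mine, key=lambda p: p["parent_start"], reverse=True):
        piece = patch_seqs[p["alt_scaf_acc"]][
            p["alt_scaf_start"] - 1:p["alt_scaf_stop"]]
        if p["ori"] == "-":
            piece = revcomp(piece)
        elif p["ori"] != "+":
            raise ValueError("unsupported orientation: %r" % p["ori"])
        chrom_seq = (chrom_seq[:p["parent_start"] - 1] + piece
                     + chrom_seq[p["parent_stop"]:])
    return chrom_seq


placements = read_placements(io.StringIO(PLACEMENT_TSV))
patched = apply_patches(CHROM, "1", placements, PATCH_SEQS)
print(patched, len(patched) - len(CHROM))
```

A patch does not need to be the same length as the region it replaces; the splice still works, but everything downstream shifts by the length difference (4 bp in this toy case), which is exactly the "off the grid" coordinate problem mentioned above.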

8.7 years ago
shuoguo • 0

Do we have a better solution now?

We are struggling to work out whether we can use GRCh38.p4 for mapping and variant calling, since the ABO gene (and a few others) were misassembled and corrected in GRCh38.p1.

When I looked into this problem I found that:

  1. The ABO gene sequences from the assembled chromosomes of GRCh38 and GRCh38.p4 are identical, so the patch has not been applied.
  2. The patch is 149 base pairs longer than the chromosome region that receives it, so I cannot use the patch to replace the misassembled ABO gene.

Any comments? Should I just give up?

Thanks
