Question

How to extract centromeric regions from human assemblies/FASTA

0

3 months ago

Matteo Ungaro ▴ 130

Hi there,

I'm working on some analyses targeting specific chromosomes in human genome assemblies from a publicly available pedigree.

So far, I have put together a script that aligns the assembled sequences to a reference, assigns each contig to a chromosome based on that alignment, and writes the contigs to a separate FASTA file per chromosome for downstream use.
E.g., for each individual I get 22 autosomes plus the sex chromosomes, each stored in its own FASTA file containing the appropriate contigs.

I was wondering, as it might be useful for testing a pipeline on specific regions, is there a way to extract centromeric regions from each one of the chromosome FASTA for those assemblies, and potentially save them as separate FASTA files?
I've seen people mention RepeatMasker for similar tasks, but I have never used the tool and I'm not familiar with what it can actually do... any help is greatly appreciated, thanks in advance!

centromeres FASTA • 757 views

ADD COMMENT • link updated 3 months ago by colindaven 8.1k • written 3 months ago by Matteo Ungaro ▴ 130

score 1 · Answer 1 · 2025-08-02

1

3 months ago

GenoMax 154k

is there a way to extract centromeric regions from each one of the chromosome FASTA for those assemblies, and potentially save them as separate FASTA files?

The best bet may be the T2T assembly, for which the centromeric coordinates are provided in a table here: https://figshare.com/articles/journal_contribution/Genomic_coordinates_on_the_T2T-CHM13v1_0_reference_genome_that_define_the_boundaries_of_the_centromeric_regions_/20340511

This was for v1.0 of the assembly, so the coordinates may have changed somewhat with the current v2.0. You will no doubt need to use the T2T genome for your analysis.

As for detecting the centromeric regions in your assemblies, it looks like you would use RepeatMasker (and other tools) and look for alpha satellite DNA in the repeat-annotated regions. Many short-read assemblies likely won't contain these regions, or they will be poorly represented/incomplete.
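To make the RepeatMasker idea a bit more concrete, here is a minimal sketch of how alpha satellite hits could be pulled out of a RepeatMasker `.out` table and merged into BED-style intervals. It assumes the standard whitespace-separated `.out` layout (contig in column 5, begin/end in columns 6-7, repeat name in column 10) and that alpha satellite hits are named with an `ALR` prefix (e.g. `ALR/Alpha`), which is worth checking against your own output. The function names, the `max_gap` value and the sample rows are made up for illustration.

```python
def parse_rm_out(lines, name_prefix="ALR"):
    """Yield (contig, start0, end) for hits whose repeat name starts with name_prefix."""
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; this skips header and blank lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        contig, begin, end, repeat = fields[4], int(fields[5]), int(fields[6]), fields[9]
        if repeat.startswith(name_prefix):
            # .out coordinates are 1-based inclusive; BED is 0-based half-open.
            yield contig, begin - 1, end


def merge_intervals(hits, max_gap=10_000):
    """Merge hits on the same contig that are at most max_gap bases apart."""
    merged = []
    for contig, start, end in sorted(hits):
        if merged and merged[-1][0] == contig and start - merged[-1][2] <= max_gap:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([contig, start, end])
    return [tuple(m) for m in merged]


# Toy rows (hypothetical contigs and coordinates, not real data).
sample = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    " 1504  1.3 0.4 1.3 contig_1 1001 1171 (5000) + ALR/Alpha Satellite/centr 1 171 (0) 1",
    " 1490  2.0 0.0 0.0 contig_1 1172 1342 (4829) + ALR/Alpha Satellite/centr 1 171 (0) 2",
    "  300 10.0 0.0 0.0 contig_1 3000 3300 (2871) + AluY SINE/Alu 1 300 (11) 3",
    " 1400  3.0 0.0 0.0 contig_2 50000 50171 (100) C ALR/Alpha Satellite/centr (0) 171 1 4",
]

merged = merge_intervals(parse_rm_out(sample))
for contig, start, end in merged:
    print(f"{contig}\t{start}\t{end}")
```

The merged intervals are only candidate alpha satellite arrays; whether a given array is the actual centromere still needs to be checked, for example against where the contig aligns on the reference.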

ADD COMMENT • link 3 months ago by GenoMax 154k

0

@GenoMax Indeed, I agree with your last point; fingers crossed I can get something out of these assemblies since they are quite fragmented.

The good thing is that I aligned to CHM13v2.0, so whatever is identified as centromere in those assemblies should be reported. The problem is that, looking into RepeatMasker, it isn't clear to me how to combine the BED information it might produce with the chromosome FASTA files (made of contigs) to extract the centromere sequence.

Many thanks for confirming that I'm at least on the right track; if you happen to have experience with the tool, let me know. Thanks again!

P.S. Looking at the RepeatMasker help, I couldn't find any specific option for centromeres... should I simply supply the alpha satellite sequence somehow for the engine to search for?

ADD REPLY • link 3 months ago by Matteo Ungaro ▴ 130

1

I can get something out of these assemblies since they are quite fragmented.

Are these your own assemblies?

I aligned to CHM13v2.0

Which program did you use? That alignment should give you some idea of whether these regions are present in your assemblies. Since you know the coordinates of the centromeres, are there alignments close to those regions?
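As a rough illustration of that check, the sketch below scans alignment records for contigs whose aligned reference span overlaps a known centromere interval. It assumes the alignments are available in PAF format (the aligner is not stated in the thread, so this is a guess) and that the centromere coordinates come from a table such as the one linked above. The function name, the `chrT` reference name and all coordinates are hypothetical placeholders.

```python
def contigs_overlapping(paf_lines, regions, min_mapq=0):
    """Return (contig, ref_name, ref_start, ref_end, overlap_bp) for alignments
    whose reference span overlaps the interval given for that reference sequence.

    regions: dict mapping reference name -> (start, end), 0-based half-open.
    min_mapq defaults to 0 because repetitive regions often get low mapping quality.
    """
    found = []
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:  # PAF has 12 mandatory columns
            continue
        qname, tname = f[0], f[5]
        tstart, tend, mapq = int(f[7]), int(f[8]), int(f[11])
        if tname not in regions or mapq < min_mapq:
            continue
        cstart, cend = regions[tname]
        overlap = min(tend, cend) - max(tstart, cstart)
        if overlap > 0:
            found.append((qname, tname, tstart, tend, overlap))
    return found


# Toy PAF records and a made-up centromere interval (not real coordinates).
paf = [
    "ctgA\t5000\t0\t5000\t+\tchrT\t100000\t38000\t43000\t4800\t5000\t60",
    "ctgB\t8000\t0\t8000\t-\tchrT\t100000\t60000\t68000\t7900\t8000\t60",
    "ctgC\t3000\t0\t3000\t+\tchrT\t100000\t44000\t47000\t2500\t3000\t3",
]
regions = {"chrT": (40000, 50000)}

hits = contigs_overlapping(paf, regions)
for hit in hits:
    print(*hit, sep="\t")
```

Contigs reported here are only candidates: an overlap on the reference says the contig maps near the centromere, not that it contains a complete centromeric array.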

RepeatMasker is going to mark repeat regions. You will then need to check whether any of them are regions of tandem repeats. I have not done this myself.

Chances are that these regions are not represented in your data, if your assemblies were made from only short reads.

ADD REPLY • link 3 months ago by GenoMax 154k

score 0 · Answer 2 · 2025-08-04

@GenoMax No, they are from the platinum genome pedigree, assembled with HiFi + Hi-C + ONT UL, so in theory they should also represent centromeric sequences to some extent.

While digging further I found the tool dna-nn, which seems a clearer alternative to RepeatMasker, whose docs and --help output I did not find very intuitive. It is tailored to identifying centromeric and AATTC repeats.

It produces a BED file that can be edited and then used with bedtools to extract the centromeric regions from a FASTA.
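For anyone who wants to see what that last step amounts to, here is a minimal pure-Python sketch of extracting BED intervals from a FASTA and writing them out as separate records. It is not the bedtools implementation, just an illustration under the assumption that the BED has the usual three leading columns (contig, 0-based start, end) and that contig names match the first word of each FASTA header. The function names and the toy sequences are hypothetical.

```python
import io


def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}, using the first word of each header."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_regions(seqs, bed_lines):
    """Return (header, subsequence) pairs for each BED interval found in seqs."""
    records = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if contig in seqs:  # intervals on unknown contigs are skipped silently
            records.append((f"{contig}:{start}-{end}", seqs[contig][start:end]))
    return records


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Toy input (made-up contigs and intervals).
fasta = [">ctg1 some description", "ACGTACGTAC", "GGGGCCCC", ">ctg2", "TTTTAAAA"]
bed = ["ctg1\t4\t12", "ctg2\t0\t4", "ctg9\t0\t5"]

records = extract_regions(read_fasta(fasta), bed)
out = io.StringIO()
write_fasta(records, out)
print(out.getvalue(), end="")
```

With real data, the lists would be replaced by open file handles, and one output file could be written per chromosome FASTA. For whole-genome assemblies an indexed approach (such as bedtools itself) avoids loading every contig into memory.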