Hi there,
I'm working on some analyses targeting specific chromosomes in human genome assemblies from a publicly available pedigree.
So far I have put together a script that aligns the sequences to a reference, assigns each contig to a specific chromosome (based on the reference) and, lastly, stores the contigs in separate FASTA files for downstream use.
E.g., for each individual I get 22 autosomes plus the sex chromosomes, each stored in its own FASTA file containing the appropriate contigs.
I was wondering, as it might be useful for testing a pipeline on specific regions: is there a way to extract the centromeric regions from each of the chromosome FASTA files for those assemblies, and potentially save them as separate FASTA files?
I've seen people mentioning RepeatMasker for similar tasks, but I have never used the tool and I'm not familiar with what it can actually do... any help is greatly appreciated, thanks in advance!
@GenoMax Indeed, I agree with your last point; fingers crossed I can get something out of these assemblies, since they are quite fragmented.
The good thing is that I aligned to CHM13v2.0, so whatever happens to be identified as centromere in those assemblies should be reported; the thing is, looking into RepeatMasker, it isn't intuitive to me how to combine potential BED information with the chromosome FASTA files (made of contigs) to extract the centromere sequence.
Many thanks for confirming that at least I'm on the right track; if you happen to have experience with the tool, let me know. Thanks again!
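For reference, the BED-plus-FASTA step asked about above can be sketched in plain Python. This is only a minimal illustration, assuming the BED file uses the contig names found in the FASTA headers and standard 0-based, half-open coordinates; the function names and the `contig:start-end` header style are made up for this example. `bedtools getfasta -fi in.fa -bed regions.bed -fo out.fa` does the same job and is probably the more practical choice for large assemblies.

```python
def read_fasta(path):
    """Return {contig_name: sequence} for a (multi-)FASTA file."""
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                # Keep only the first word of the header as the contig name.
                name = line[1:].split()[0]
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def read_bed(path):
    """Yield (contig, start, end) from a BED file (0-based, half-open)."""
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            yield fields[0], int(fields[1]), int(fields[2])


def extract_regions(fasta_path, bed_path, out_path, width=60):
    """Write one FASTA record per BED interval; return the number written."""
    seqs = read_fasta(fasta_path)
    written = 0
    with open(out_path, "w") as out:
        for contig, start, end in read_bed(bed_path):
            if contig not in seqs:
                continue  # interval on a contig that is not in this FASTA
            sub = seqs[contig][start:end]
            if not sub:
                continue  # interval lies outside the contig
            out.write(f">{contig}:{start}-{end}\n")
            for i in range(0, len(sub), width):
                out.write(sub[i:i + width] + "\n")
            written += 1
    return written
```

Calling `extract_regions("chr1.fa", "chr1_cen.bed", "chr1_cen.fa")` (hypothetical file names) once per chromosome FASTA would give one centromere FASTA per chromosome. The whole FASTA is loaded into memory, which should be fine for a single chromosome but is an assumption worth checking.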
P.S. Looking at the help of RepeatMasker, I couldn't find any specific config for centromeres... should I simply plug in the alpha satellite sequence somehow for the engine to search for?
Are these your own assemblies?
Which program did you use? That alignment should give you some idea of whether these regions are present in your assemblies. Since you know the coordinates of the centromeres, are there alignments close to those regions?
RepeatMasker is going to mark repeat regions. You will need to see if there are regions of tandem repeats. Not done this myself.
Chances are that these regions are not represented in your data, if your assemblies were made from only short reads.
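As a rough sketch of the "look for regions of tandem repeats" idea, the snippet below filters a RepeatMasker `.out` table for centromeric satellite hits and merges neighbouring hits into candidate blocks. It assumes the standard whitespace-separated `.out` layout (query name in column 5, 1-based begin and end in columns 6 and 7, class/family in column 11) and that human alpha satellite is labelled `Satellite/centr` (repeat name `ALR/Alpha`), which may differ between repeat library versions. The `max_gap` value is arbitrary, and the merged blocks are only candidates, not a definition of the centromere.

```python
def parse_rm_out(lines, family="Satellite/centr"):
    """Return (contig, start, end) BED-style tuples for hits of one class/family."""
    hits = []
    for line in lines:
        f = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(f) < 11 or not f[0].isdigit():
            continue
        if f[10] == family:
            # .out coordinates are 1-based inclusive; BED is 0-based half-open.
            hits.append((f[4], int(f[5]) - 1, int(f[6])))
    return hits


def merge_intervals(hits, max_gap=10000):
    """Merge hits on the same contig that are at most max_gap bases apart."""
    merged = []
    for contig, start, end in sorted(hits):
        last = merged[-1] if merged else None
        if last and last[0] == contig and start - last[2] <= max_gap:
            last[2] = max(last[2], end)
        else:
            merged.append([contig, start, end])
    return [tuple(m) for m in merged]
```

With a real file this could be used as `merge_intervals(parse_rm_out(open("chr1.fa.out")))` (hypothetical file name), and the resulting tuples written out as a BED file. On fragmented assemblies the satellite arrays may be split across many short contigs or missing altogether, so an empty result would not be surprising.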