I am looking for a repeat-masked version of a reference genome from Ensembl. I found information posted here that suggests accessing it via FTP, but I can't figure out what subdirectory it might be hiding in. Searching via the command line didn't turn anything up.
I could run RepeatMasker myself, but as I remember, setup is pretty involved (you need to download a repeat library, etc.).
I can post in some Ensembl forum if that's more appropriate.
thanks,
Do you actually want a GFF file with the regions of the genome that are repeat masked, or a repeat-masked genome sequence (answer provided by @h.mon below)?
Hi, yes the gff (or gtf) file is what I'm looking for. Something analogous to this file from UCSC.
Do you mean a GTF or GFF containing the repeats, or just the genes as annotated on repeat-masked sequence?
Thanks for helping me clarify. I'm looking for a gtf file containing the locations of all masked regions in the ensembl reference genome. It might look like this:
All the gene annotation in Ensembl is done after repeat masking. There are folders of GFF3 and GTF files on the Ensembl FTP site.
Sorry, it appears to me that these gtf files are for the gene annotation in ensembl rather than an annotation of the masked regions (types of transposable elements, repeats, etc.)
Am I misunderstanding?
Yes, they are. I was still quite confused by your comment as it said you were after the repeats, but then the line you gave was an exon line, which would suggest you want the genes.
Ah, I see. I think the 'exon' annotation may be a confusing shortcut; the actual annotated feature is a LINE-1 element.
Did you ever find a solution? I'm looking to do the same now.
I didn't find a solution. I think I eventually ran repeatmasker myself. The best reference in ensembl that I can find states: "Repeats can be viewed in the browser and extracted using our APIs. You can also download repeat-masked sequence from our FTP site, either hard-masked (rm) where repeats are replaced with Ns, or soft-masked (sm) where repeats are in lower-case text."
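Since the soft-masked (sm) files mark repeats as lower-case text, the masked coordinates can in principle be recovered from the FASTA itself. Below is a minimal sketch of that idea, not an Ensembl-provided tool: the function name `softmasked_intervals` is made up for illustration, and the output is BED-style (0-based, half-open) intervals. It only recovers positions, not the repeat class (LINE-1, SINE, etc.), so it is not a full substitute for a RepeatMasker annotation.

```python
import io
import re
from typing import Iterator, TextIO, Tuple

LOWER_RUN = re.compile(r"[a-z]+")  # a soft-masked stretch is a run of lower-case bases


def softmasked_intervals(handle: TextIO) -> Iterator[Tuple[str, int, int]]:
    """Yield (seq_name, start, end) for each lower-case run, 0-based half-open."""
    name = ""
    offset = 0       # bases of the current sequence consumed so far
    pending = None   # [start, end] of the latest run; it may continue on the next line
    for raw in handle:
        line = raw.strip()
        if line.startswith(">"):
            if pending is not None:
                yield (name, pending[0], pending[1])
            name = (line[1:].split() or [""])[0]  # sequence ID is the first token
            offset, pending = 0, None
            continue
        for m in LOWER_RUN.finditer(line):
            start, end = offset + m.start(), offset + m.end()
            if pending is not None and pending[1] == start:
                pending[1] = end  # the run continues across a line break
            else:
                if pending is not None:
                    yield (name, pending[0], pending[1])
                pending = [start, end]
        offset += len(line)
    if pending is not None:
        yield (name, pending[0], pending[1])


# Toy input (made up) to show the behaviour on wrapped sequence lines.
demo = io.StringIO(">chr1 toy\nACGTacgt\nacGTAC\n>chr2\nttAA\n")
for seq, start, end in softmasked_intervals(demo):
    print(seq, start, end, sep="\t")
# prints:
# chr1    4    10
# chr2    0    2
```

For a real genome you would pass a handle from `gzip.open(path, "rt")` on one of the `dna_sm` FASTA files instead of the toy string.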
Turns out this information can be fetched via the Ensembl APIs. See my answer here.
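For anyone landing here later, one way to do this is the Ensembl REST API, whose overlap/region endpoint accepts `feature=repeat`. The sketch below is my own illustration and not necessarily what the linked answer does. The helper names are hypothetical, and the JSON field names (`seq_region_name`, `start`, `end`, `strand`, `description`) are what I would expect in the response, so check them against real output before relying on this.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"


def repeat_overlap_url(species: str, region: str) -> str:
    """Build the overlap/region URL asking for repeat features only."""
    return f"{SERVER}/overlap/region/{species}/{region}?feature=repeat"


def fetch_repeats(species: str, region: str) -> list:
    """Fetch repeat features for one region (needs network access)."""
    req = urllib.request.Request(
        repeat_overlap_url(species, region),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def to_bed_line(feature: dict) -> str:
    """Convert one feature dict to a BED6 line.

    Assumes 1-based inclusive coordinates in the response,
    so start is shifted by one for BED (0-based, half-open).
    """
    strand_value = feature.get("strand", 0)
    strand = "+" if strand_value > 0 else "-" if strand_value < 0 else "."
    return "\t".join([
        str(feature["seq_region_name"]),
        str(feature["start"] - 1),
        str(feature["end"]),
        str(feature.get("description", "repeat")),  # assumed to hold the repeat type
        "0",
        strand,
    ])
```

Usage would be something like `for f in fetch_repeats("human", "7:140424943-140624564"): print(to_bed_line(f))`. As far as I know the endpoint limits how large a region you can request at once, so a whole genome would have to be fetched in chunks.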
So you took one of the fasta files for masked sequences (like Homo_sapiens.GRCh38.dna_rm.toplevel.fa.gz) and ran it through repeatmasker? Did it spew out a gtf file as a result? I was looking to run velocyto RNA velocity analysis and thus was also looking for a gtf file for Ensembl-based annotation.