Repeat masked gtf files from ensembl
4.3 years ago

I am looking for a repeat-masked version of a reference genome from Ensembl. I found information posted here that suggests accessing it via FTP, but I can't figure out which subdirectory it might be hiding in. Searching via the command line didn't turn anything up.

I could run RepeatMasker myself, but as I remember, setup is pretty involved (you need to download a repeat library, etc.).

I can post in some Ensembl forum if that's more appropriate.

thanks,

ensembl RNA-Seq genome • 6.3k views

Repeat masked gtf files from ensembl

Do you actually want a GFF file with the regions of the genome that are repeat masked, or a repeat-masked genome sequence (answer provided by @h.mon below)?


Hi, yes the gff (or gtf) file is what I'm looking for. Something analogous to this file from UCSC.


Do you mean a GTF or GFF containing the repeats, or just the genes as annotated on repeat-masked sequence?


Thanks for helping me clarify. I'm looking for a gtf file containing the locations of all masked regions in the ensembl reference genome. It might look like this:

chr1    hg38_rmsk    exon    67108754    67109046    1892.000000    +    .    gene_id "L1P5"; transcript_id "L1P5";

All the gene annotation in Ensembl is done after repeat masking. There are folders of GFF3 and GTF files on the Ensembl FTP site.


Sorry, it appears to me that these gtf files are for the gene annotation in ensembl rather than an annotation of the masked regions (types of transposable elements, repeats, etc.)

Am I misunderstanding?


Yes, they are. I was still quite confused by your comment as it said you were after the repeats, but then the line you gave was an exon line, which would suggest you want the genes.


Ah, I see. I think the 'exon' annotation may be a confusing shortcut; the actual annotated feature is a LINE-1 element.


Did you ever find a solution? I'm looking to do the same now.


I didn't find a solution. I think I eventually ran RepeatMasker myself. The best reference from Ensembl that I can find states: "Repeats can be viewed in the browser and extracted using our APIs. You can also download repeat-masked sequence from our FTP site, either hard-masked (rm) where repeats are replaced with Ns, or soft-masked (sm) where repeats are in lower-case text."
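Since the soft-masked FASTA marks repeats in lower case, one workaround is to recover the masked coordinates directly from the sequence. Below is a minimal Python sketch of that idea, assuming each sequence fits in memory. The function names, the `softmask` source label, and the coordinate-based IDs are made up for illustration. It only gives positions, not repeat names or classes such as L1P5, and the `exon` feature type simply mirrors the UCSC-style line quoted earlier in the thread.

```python
import re
from typing import Iterable, Iterator, Tuple

# A run of lowercase letters is treated as one soft-masked region.
LOWERCASE_RUN = re.compile(r"[a-z]+")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def soft_masked_intervals(seq: str) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (start, end) for each lowercase run."""
    for match in LOWERCASE_RUN.finditer(seq):
        yield match.start() + 1, match.end()


def to_gtf_line(chrom: str, start: int, end: int, source: str = "softmask") -> str:
    """Format one masked interval as a GTF line (IDs are hypothetical)."""
    label = f"{chrom}_{start}_{end}"
    attributes = f'gene_id "{label}"; transcript_id "{label}";'
    return "\t".join(
        [chrom, source, "exon", str(start), str(end), ".", "+", ".", attributes]
    )
```

To run it on a downloaded genome, open the `.fa.gz` file with `gzip.open(path, "rt")`, pass the handle to `read_fasta`, and write out `to_gtf_line(name, start, end)` for every interval from `soft_masked_intervals(seq)`.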


Turns out this information can be fetched via the Ensembl APIs. See my answer here.


So you took one of the FASTA files for masked sequences (like Homo_sapiens.GRCh38.dna_rm.toplevel.fa.gz) and ran it through RepeatMasker? Did it spew out a GTF file as a result? I want to run a velocyto RNA velocity analysis, so I was also looking for a GTF file for Ensembl-based annotation.


Did you figure out how to generate the file, e.r.zakiev? I'm trying to do the same.


Hello cat.lou, and thank you for reaching out! I'm still troubleshooting the whole scVelo pipeline (I get very low splicing rates, ~6%; here is the issue on Biostars as well), so the end result of my approach might be wrong. But I did, in fact, run RepeatMasker on the whole genome (it ran for 3 weeks straight on our cluster). It spewed out an .out file, which I converted to .gff3 format using RepeatMasker's rmOutToGFF3.pl. I then converted the .gff3 file to a GTF file using the AGAT suite (not forgetting to rename all the 'masked repeats' entries to 'exon' in the feature type column of the gff3 file), and that's where I currently stand. I haven't tried running scVelo with this masked-repeats file yet, but I will try today. If you want, I can keep you updated!
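For anyone going the RepeatMasker route, the .out table can also be turned into GTF lines in a single step. This is only a sketch of that conversion, not what rmOutToGFF3.pl or AGAT do internally. It assumes the usual whitespace-separated .out layout (score, three percentages, query name, begin, end, "(left)", strand as `+` or `C`, repeat name, class/family). The function name and the `repeat_class` attribute are invented for illustration; the `exon` feature type and the use of the repeat name for both IDs follow the UCSC-style line quoted earlier in the thread.

```python
from typing import Iterable, Iterator


def rmsk_out_to_gtf(lines: Iterable[str], source: str = "RepeatMasker") -> Iterator[str]:
    """Convert RepeatMasker .out rows to GTF lines with feature type 'exon'."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score; skip them.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        score, chrom, start, end = fields[0], fields[4], fields[5], fields[6]
        # RepeatMasker writes 'C' (complement) for the reverse strand.
        strand = "-" if fields[8] == "C" else "+"
        name, family = fields[9], fields[10]
        attributes = (
            f'gene_id "{name}"; transcript_id "{name}"; repeat_class "{family}";'
        )
        # .out coordinates are already 1-based and inclusive, like GTF.
        yield "\t".join(
            [chrom, source, "exon", start, end, score, strand, ".", attributes]
        )
```

Usage would be along the lines of `for row in rmsk_out_to_gtf(open("genome.fa.out")): print(row)`, where `genome.fa.out` is a placeholder file name.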


Also, I just tested the solution provided by liorglic above, and it actually works! It just involved ~4 hours of setting up the necessary Perl libraries (if you haven't set them up yet), but that's still much faster than 3 weeks!

I just put his code in a file called fetch_rmsk.pl and ran the following commands:

perl fetch_rmsk.pl 'homo sapiens' 'vertebrates' rmsk_hs_ensembl.gtf
perl fetch_rmsk.pl 'mus musculus' 'vertebrates' rmsk_mm_ensembl.gtf

Fetching for human took ~50 minutes of downloading at quite a low speed (the final gtf (bed?) file size is 783 MB), and ~30 min for mouse (461 MB).


Thanks e.r.zakiev - the perl scripts sound promising! Have you managed to successfully convert the bed file to GTF?


Well, not yet. There is AGAT, which I haven't tested for this purpose yet (it worked for me on other tasks before, though I can't remember which). Five days ago I also asked liorglic in his topic whether he would be willing to reformat the script so that the output is a GTF file, but so far there has been no response. Admittedly, it is a relatively simple modification: I just need to look up the fields of a standard GTF file and modify the Perl script accordingly. I plan to do it this Friday, so stay tuned!
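For reference, a GTF line has nine tab-separated fields: seqname, source, feature, start, end, score, strand, frame, and attributes. A BED-to-GTF conversion could look roughly like the Python sketch below. It assumes the fetched file follows the standard BED column order (chrom, start, end, name, score, strand), which has not been checked against the actual output of the Perl script. The function name and the `ensembl_rmsk` source label are hypothetical.

```python
def bed_to_gtf(bed_line: str, source: str = "ensembl_rmsk", feature: str = "exon") -> str:
    """Convert one BED line to one GTF line.

    BED is 0-based and half-open, GTF is 1-based and inclusive,
    so the start shifts by one and the end stays the same.
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Fall back to a coordinate-based name when the BED file has no name column.
    name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    return "\t".join(
        [chrom, source, feature, str(start + 1), str(end), score, strand, ".", attributes]
    )
```

The feature type defaults to `exon` only to match the UCSC-style example earlier in the thread; whether a downstream tool needs that is worth checking separately.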


Maybe this post helps with the conversion: "Fastest way to convert BED to GTF/GFF with gene_ids?"

Naive question: how do you define which Ensembl version to use when running the Perl script?


That's a good question. I'd assume it downloads the repeat-masked data for the latest build, which is 112. Come to think of it, you raise a good concern, as I would also be interested in a particular build version, which is definitely not the current one but the one corresponding to the genome my data was aligned to... I am reasonably certain it should be possible to download the info from a specific build, but I don't know how yet.

4.3 years ago
h.mon 35k

The easiest way is to go to the organism Ensembl page, e.g. https://www.ensembl.org/Acanthochromis_polyacanthus/Info/Index for the spiny chromis. There, you will find a "Download DNA sequence (FASTA)" link which will take you directly to the ftp folder containing the complete genome, soft-masked genome, and hard-masked genome.
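If you need the masked genome from one specific release rather than the current one, the FTP path itself carries the release number. Here is a small Python sketch that builds such a URL, assuming the directory layout and file naming used on the main Ensembl FTP site (the same naming as Homo_sapiens.GRCh38.dna_rm.toplevel.fa.gz mentioned in the comments). The function name is made up, and the pattern may not hold for every species or for the Ensembl Genomes sites, so check the directory listing before relying on it.

```python
def ensembl_fasta_url(
    species: str,
    assembly: str,
    release: int,
    masking: str = "sm",
    base: str = "https://ftp.ensembl.org/pub",
) -> str:
    """Build the URL of a masked toplevel genome FASTA for one release.

    masking: "sm" for soft-masked (lower-case repeats) or
             "rm" for hard-masked (repeats replaced with Ns).
    """
    if masking not in ("sm", "rm"):
        raise ValueError("masking must be 'sm' or 'rm'")
    # Directory names are lower case with underscores, e.g. homo_sapiens.
    directory = species.strip().lower().replace(" ", "_")
    # File names start with the same name, first letter capitalised.
    file_prefix = directory.capitalize()
    filename = f"{file_prefix}.{assembly}.dna_{masking}.toplevel.fa.gz"
    return f"{base}/release-{release}/fasta/{directory}/dna/{filename}"
```

For example, `ensembl_fasta_url("homo sapiens", "GRCh38", 112, "rm")` builds the release-112 path of the hard-masked human genome.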
