How can I extract transposable elements from a new genome assembly?
0
0
27 days ago
jaqx008 ▴ 90

Hello All,

I have a bed file containing annotations for transposable elements that were generated from an old genome assembly. However, I have a new assembly which I believe to be superior and would like to use the old annotations to obtain TE coordinates from the new assembly.

What I tried:

I obtained the TE FASTA sequences from the old genome and mapped them with bowtie2 to the new genome to get the aligned reads (--al).

The issue here is that the resulting FASTA contains the old contig names instead of the contig names from the new genome.

I tried RepeatMasker earlier, but the output didn't look right to me and I'd rather just fetch the TEs from the new genome.
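A likely explanation, for what it is worth: bowtie2's --al file holds the input sequences that aligned, so they keep their original names. The positions on the new genome are in the SAM output instead (the RNAME, POS and CIGAR columns). Below is a minimal Python sketch that turns SAM records into BED-style intervals on the new assembly. The function name `sam_to_bed` and the example contig names are made up for illustration, and secondary or supplementary alignments are not treated specially.

```python
import re

# CIGAR operations that consume reference bases (SAM specification).
REF_CONSUMING = set("MDN=X")

def sam_to_bed(sam_lines):
    """Yield (contig, start, end, name, strand) for each mapped SAM record."""
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        name, flag, contig = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 4 or cigar == "*":      # unmapped
            continue
        # Length of the alignment on the reference (new genome).
        ref_len = sum(
            int(n)
            for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
            if op in REF_CONSUMING
        )
        start = pos - 1                   # SAM is 1-based, BED is 0-based
        strand = "-" if flag & 16 else "+"
        yield contig, start, start + ref_len, name, strand
```

Reading the SAM file line by line and writing each tuple tab-separated would give a BED file whose first column holds the new contig names.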

genome transposons assembly annotations • 585 views
1

Why would you not start from scratch with repeatmasker? It will be the most accurate way to mask repeats in your genome.

Simply mapping the 'old' ones onto the new assembly will certainly be sub-optimal (you will likely miss TEs that were not yet present in the old assembly). You can reuse the TE library you have from the previous assembly, so there is no need to rebuild the library itself.

0

Thanks for your input. I did try to repeat-mask the new genome, but the resulting output appears to be very shallow and has only a few hits. I will try to create a library from the TEs; hopefully that will give better output.

0

How did you do the repeat masking then? Which library did you use?

0

I used a public library of TEs from different plants and animals. It was a library that was already available on my work computer, so I am not sure exactly how it was obtained. Anyway, I am running RepeatMasker right now with the annotated TEs from the old genome as the library.
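For reference, one way to build such a library is to cut the annotated intervals out of the old genome and write them as FASTA, which RepeatMasker can take through its -lib option. The following is a rough Python sketch under stated assumptions: the BED file has at least three columns (an optional fourth column is used as the label), the whole genome fits in memory, and strand is ignored. The helper names `read_fasta` and `bed_to_library` are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]    # keep only the first word of the header
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def bed_to_library(bed_lines, genome):
    """Return FASTA text with one record per BED interval (strand ignored)."""
    records = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        contig, start, end = fields[0], int(fields[1]), int(fields[2])
        label = fields[3] if len(fields) > 3 else "TE"
        seq = genome[contig][start:end]   # BED is 0-based, half-open
        records.append(f">{label}::{contig}:{start}-{end}\n{seq}")
    return "\n".join(records) + "\n"
```

A tool such as bedtools getfasta does the same job and is probably the better choice for large genomes; the sketch only makes the coordinate handling explicit.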

0

Align your assemblies to each other using an aligner like lastz (LINK). You could also use blat if you are sure the assemblies are very similar. Once you find the corresponding hits, transfer your annotations.

2

If you can convert the BED to GFF (Galaxy has tools for that), you could also use liftoff, which wraps all of this in a pipeline: https://github.com/agshumate/Liftoff

0

Hey Phillip. I tried to use liftoff since RepeatMasker takes forever (although it's currently running). However, the run terminated with an error saying:

GFF does not contain any gene features. Use -f to provide a list of other feature types to lift over.


See my command below and let me know if I am missing something. I followed the liftoff documentation.

Command

liftoff -g fileGFF -o BFL_TE.bed -chroms TXT -unplaced Unplaced newgenome.fna assembly.fasta

I assumed that what it wants is the types of TEs, so I provided a file containing the list below via the option -f TYPES (TYPES being the file name):

DNA
LINE
LTR
...


but I still get the same error.
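One thing worth checking here, assuming liftoff compares the entries of the -f file against column 3 (the feature type column) of the GFF: the names in that file have to appear literally in column 3. A small Python sketch for listing what column 3 actually contains follows. `feature_type_counts` is a hypothetical helper, not part of liftoff, and it assumes a tab-separated GFF.

```python
from collections import Counter

def feature_type_counts(gff_lines):
    """Count the distinct values found in column 3 of a tab-separated GFF."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip comments, directives, blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

If none of the names listed in the -f file (DNA, LINE, LTR, ...) show up among the keys of the returned counter, that mismatch would be consistent with the error message.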

0

What does your GFF file look like? It needs to follow this format.

0
scaf_1  bed2gff region_0    2   797 0   +   .   region_0;
scaf_1  bed2gff region_1    849 936 0   +   .   region_1;
scaf_1  bed2gff region_2    1237    1369    0   +   .   region_2;
scaf_1  bed2gff region_3    2152    2171    0   +   .   region_3;
scaf_1  bed2gff region_4    2352    3238    0   +   .   region_4;
scaf_1  bed2gff region_5    3230    3413    0   +   .   region_5;


I believe that is the required format.

1

feature - feature type name, e.g. Gene, Variation, Similarity

These include

Many SO feature types are recognized in column 3 and converted to their INSDC equivalents. Commonly used types are:

gene
CDS
mRNA
exon
five_prime_UTR
three_prime_UTR
rRNA
tRNA
ncRNA
tmRNA
transcript
mobile_genetic_element
origin_of_replication
promoter
repeat_region


You don't have any that match this list. You may want to try mobile_genetic_element.
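A minimal Python sketch of that change, assuming a tab-separated, 9-column file like the one posted above. It sets column 3 to a single feature type and, because GFF3 expects column 9 to hold tag=value pairs, also turns a bare value such as `region_0;` into `ID=region_0`. Whether liftoff needs the ID attribute is an assumption here, and `set_feature_type` is a made-up helper name.

```python
def set_feature_type(gff_lines, feature_type="mobile_genetic_element"):
    """Return GFF lines with column 3 replaced and bare attributes turned into ID=..."""
    out = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            out.append(line.rstrip("\n"))     # keep comments and blank lines as-is
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        old_type = fields[2]
        fields[2] = feature_type
        if "=" not in fields[8]:
            # Bare value such as "region_0;" -> "ID=region_0"
            fields[8] = "ID=" + (fields[8].strip(";") or old_type)
        out.append("\t".join(fields))
    return out
```

The file passed to -f would then contain the single line `mobile_genetic_element`.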

0

Many thanks. I changed every entry in the third column to mobile_genetic_element and provided this in a file (TYPE) with the -f option. It is currently running.

Thanks again.

0

So I thought that at the end of the run I'd get the new TEs, but the run ended with the following lines:

[M::main] Real time: 403.799 sec; CPU: 402.876 sec; Peak RSS: 3.957 GB
lifting features
mapping unplaced genes
GFF does not contain any gene features. Use -f to provide a list of other feature types to lift over.


It generated a bunch of files, and I am not sure which one contains the annotated TEs. Can you help?
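As a hedged pointer: liftoff writes the lifted features to the file given with -o, in GFF format regardless of the file extension, so BFL_TE.bed from the command above would be the first place to look once a run completes without the error. Assuming that file exists and is tab-separated GFF, a short Python sketch for converting it back to BED could look like this (`gff_to_bed` is a hypothetical helper name):

```python
def gff_to_bed(gff_lines):
    """Convert tab-separated GFF lines to 6-column BED lines."""
    bed = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        contig, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # Use the ID attribute as the BED name, falling back to the feature type.
        attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
        name = attrs.get("ID", fields[2])
        # GFF is 1-based inclusive; BED is 0-based half-open.
        bed.append("\t".join([contig, str(start - 1), str(end), name, "0", strand]))
    return bed
```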