How can I extract transposable elements from a new genome assembly?
3.0 years ago
jaqx008 ▴ 110

Hello All,

I have a bed file containing annotations for transposable elements that were generated from an old genome assembly. However, I have a new assembly which I believe to be superior and would like to use the old annotations to obtain TE coordinates from the new assembly.

What I tried:

I extracted the TE sequences from the old genome as FASTA and mapped them to the new genome with bowtie2, writing out the aligned reads (--al).

The issue is that the resulting FASTA contains the old contig names rather than the contig names of the new genome.
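
A likely explanation: bowtie2's --al option writes out the input reads that aligned, so those sequences and their names still come from the old genome. The coordinates on the new genome are in the SAM output instead (column 3 is the reference name, column 4 the 1-based leftmost position, column 6 the CIGAR string). Below is a minimal Python sketch of turning SAM records into BED-style intervals. The function name sam_to_bed is hypothetical, and it assumes plain-text SAM input with one alignment per line.

```python
import re

# CIGAR operations; only M, D, N, = and X consume reference bases.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def sam_to_bed(sam_lines):
    """Yield (contig, start, end, read_name, strand) for each mapped SAM record.

    Coordinates are 0-based, half-open (BED convention) on the reference
    the reads were aligned to, i.e. the new assembly.
    """
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete SAM record
            continue
        name, flag, contig = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 4 or contig == "*" or cigar == "*":  # unmapped read
            continue
        ref_len = sum(
            int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING
        )
        start = pos - 1  # SAM is 1-based, BED is 0-based
        strand = "-" if flag & 16 else "+"
        yield contig, start, start + ref_len, name, strand
```

Each tuple can be written out tab-separated to give a BED file whose first column is a contig name from the new genome. Short or repetitive TE fragments may map to several places, so the mapping quality (column 5) is worth checking before trusting a hit.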

Is there a different way to reach my goal without starting the annotation from scratch with RepeatMasker?

I tried RepeatMasker earlier, but the output didn't look right to me, and I'd rather just fetch the TEs from the new genome.

Thanks in advance.

genome transposons assembly annotations • 2.6k views

Why would you not start from scratch with RepeatMasker? It will be the most accurate way to mask repeats in your genome.

Simply mapping the 'old' ones onto the new assembly will certainly be sub-optimal (you will likely miss TEs that were not present in the old assembly). You can reuse the TE library you have from the previous assembly, so there is no need to rebuild the library itself.


Thanks for your input. I did try to repeat-mask the new genome, but the output looks very sparse, with only a few hits. I will try building a library from the annotated TEs; hopefully that will give better output.


How did you do the repeat masking, then? Which library did you use?


I used a public library of TEs from different plants and animals. It was already available on my work computer, so I am not sure exactly how it was obtained. Anyway, I am running RepeatMasker right now with the annotated TEs from the old genome as the library.


Align your assemblies to each other using an aligner like lastz (LINK). You could also use blat if you are sure the assemblies are very similar. Once you find the corresponding hits transfer your annotations.
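
To make the "transfer your annotations" step concrete, here is a minimal Python sketch of lifting one interval through ungapped alignment blocks. The Block type and lift_interval function are hypothetical names, not part of lastz or blat, and the sketch assumes the aligner output has already been parsed into blocks with 0-based, half-open coordinates given on the forward strand of the new assembly. Intervals that span a gap or more than one block are simply reported as not liftable.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Block(NamedTuple):
    """One ungapped alignment block (0-based, half-open coordinates)."""
    old_contig: str
    old_start: int
    new_contig: str
    new_start: int  # on the forward strand of the new assembly
    length: int
    strand: str     # "+" or "-": orientation of old relative to new


def lift_interval(
    contig: str, start: int, end: int, blocks: Sequence[Block]
) -> Optional[Tuple[str, int, int, str]]:
    """Map an old-assembly interval to the new assembly.

    Returns (new_contig, new_start, new_end, strand), or None when no
    single block fully contains the interval.
    """
    for b in blocks:
        if b.old_contig != contig:
            continue
        if start < b.old_start or end > b.old_start + b.length:
            continue
        off_start = start - b.old_start
        off_end = end - b.old_start
        if b.strand == "+":
            return b.new_contig, b.new_start + off_start, b.new_start + off_end, "+"
        # Reverse-strand block: offsets are counted back from the block end.
        new_block_end = b.new_start + b.length
        return b.new_contig, new_block_end - off_end, new_block_end - off_start, "-"
    return None
```

With a single forward block Block("scaf_1", 0, "chr1", 1000, 5000, "+"), the interval scaf_1:1-797 lifts to chr1:1001-1797. Real whole-genome alignments contain gaps and rearrangements, so this only illustrates the coordinate arithmetic.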


If you can convert the BED file to GFF (Galaxy has tools for that), you could also use Liftoff, which wraps all of this in a pipeline: https://github.com/agshumate/Liftoff


Hey Phillip. I tried Liftoff, since RepeatMasker takes forever (although it's currently running). However, the run terminated with an error saying:

GFF does not contain any gene features. Use -f to provide a list of other feature types to lift over.

See my command below and let me know if I am missing something. I followed the Liftoff documentation.

Command

liftoff -g fileGFF -o BFL_TE.bed -chroms TXT -unplaced Unplaced newgenome.fna assembly.fasta

I assumed it wants the TE types, so I passed a file (TYPES) listing them with the -f option:

DNA
LINE
LTR
...

but I still get the same error.


What does your GFF file look like? It needs to follow this format.

scaf_1  bed2gff region_0    2   797 0   +   .   region_0;
scaf_1  bed2gff region_1    849 936 0   +   .   region_1;
scaf_1  bed2gff region_2    1237    1369    0   +   .   region_2;
scaf_1  bed2gff region_3    2152    2171    0   +   .   region_3;
scaf_1  bed2gff region_4    2352    3238    0   +   .   region_4;
scaf_1  bed2gff region_5    3230    3413    0   +   .   region_5;

I believe that is the required format.


feature - feature type name, e.g. Gene, Variation, Similarity

These include

Many SO feature types are recognized in column 3 and converted to their INSDC equivalents. Commonly used types are:

    gene
    CDS
    mRNA
    exon
    five_prime_UTR
    three_prime_UTR
    rRNA
    tRNA
    ncRNA
    tmRNA
    transcript
    mobile_genetic_element
    origin_of_replication
    promoter
    repeat_region

You don't have any that match this list. You may want to try mobile_genetic_element.
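
For reference, the GFF lines posted above have a different value in column 3 on every line (region_0, region_1, ...) and a bare label in column 9 rather than a key=value attribute. A minimal Python sketch of regenerating the GFF3 straight from the original BED file follows. The function name bed_to_gff3, the TE_<n> ID scheme and the default source label are assumptions for illustration, and whether a given tool accepts mobile_genetic_element as a feature type should be checked against its documentation.

```python
def bed_to_gff3(bed_lines, feature_type="mobile_genetic_element", source="bed2gff"):
    """Convert BED lines to GFF3 records with one shared feature type.

    BED is 0-based, half-open; GFF3 is 1-based, inclusive, so only the
    start coordinate shifts by one. Names are assumed to contain no
    characters that need escaping in GFF3 attributes (';', '=', ',').
    """
    records = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        cols = line.rstrip("\n").split("\t")
        contig, start, end = cols[0], int(cols[1]), int(cols[2])
        name = cols[3] if len(cols) > 3 else None
        strand = cols[5] if len(cols) > 5 and cols[5] in ("+", "-") else "."
        attrs = f"ID=TE_{len(records)}"  # unique ID per record
        if name:
            attrs += f";Name={name}"
        records.append(
            "\t".join(
                [contig, source, feature_type, str(start + 1), str(end),
                 ".", strand, ".", attrs]
            )
        )
    return records
```

Writing "##gff-version 3" followed by the returned records, one per line, gives a file in which every feature has the same type in column 3 and a unique ID in column 9.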


Many thanks. I changed the third column of every line to mobile_genetic_element and provided that type in a file (TYPE) with the -f option. It is currently running.

Thanks again.


So I thought that at the end of the run I'd get the new TEs, but the run ended with the following lines:

[M::main] Real time: 403.799 sec; CPU: 402.876 sec; Peak RSS: 3.957 GB
lifting features
mapping unplaced genes
GFF does not contain any gene features. Use -f to provide a list of other feature types to lift over.

It generated a bunch of files, and I am not sure which one contains the annotated TEs. Can you help?
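
Since the error message is about feature types, one sanity check before rerunning is to list what is actually in column 3 of the GFF and compare it, character for character, with the entries in the file passed to -f. The Python sketch below does the counting. The function name feature_type_counts is hypothetical, and it assumes a tab-separated GFF; if the file is space-separated, that alone may be worth fixing first.

```python
from collections import Counter


def feature_type_counts(gff_lines):
    """Count the values found in column 3 (feature type) of a GFF file."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):  # blank or directive/comment
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts
```

For example, `feature_type_counts(open("fileGFF"))` on the file from the command above should return a single key if the third column was edited consistently. Any key that is missing from the -f file, or spelled differently there, would be a candidate cause of the message.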
