To measure divergence of individual insertions from the consensus sequence, begin by extracting the sequences of the transposable element insertions from the dm6 genome using the coordinates provided by RepeatMasker. Obtain the dm6 FASTA file and use bedtools to retrieve the sequences:
bedtools getfasta -fi dm6.fasta -bed repeatmasker.bed -fo insertions.fasta
Next, align each insertion sequence to the consensus sequence using a pairwise alignment tool such as EMBOSS needle for global alignment:
needle -asequence consensus.seq -bsequence insertion.seq -gapopen 10 -gapextend 0.5 -outfile alignment.needle
Parse the alignment output to calculate divergence as the number of mismatches divided by the alignment length, excluding gaps. Repeat this for all insertions using a script in Python or Perl to automate the process. If insertions vary in length, consider using local alignment with smith-waterman instead.
To identify short motifs such as 21-mers that are present across most insertions with minimum mismatch, first perform a multiple sequence alignment of all insertion sequences using MAFFT:
mafft --auto insertions.fasta > aligned_insertions.fasta
Then, use the MEME Suite's STREME tool to discover conserved motifs of specified length, allowing for mismatches:
streme -p aligned_insertions.fasta -dna -minw 21 -maxw 21 -nmotifs 10 -o streme_results
STREME identifies motifs enriched in the sequences relative to a shuffled background. Filter the output motifs based on their occurrence across insertions, requiring presence in at least 90 percent with no more than one mismatch per motif. Verify motif conservation by sliding a 21-nucleotide window across the alignment and computing sequence identity or entropy at each position using tools like seqlogo or a custom Python script with Biopython. For approximate matching, use vsearch with a distance threshold:
vsearch --usearch_global insertions.fasta --db motif_21mers.fasta --id 0.95 --query_cov 1.0 --strand both --uc results.uc
Parse the uc file to count matches per motif.
Kevin
Do you have the locations/sequence of the regions you are interested in?
Since they are transposons, we don’t know the exact location of them. We know their sequence of insertions from repeatmasker in dm6 genome. Also the consensus sequence is available in TE libraries.