Question

How to Identify Conserved 21-mers Across TE Insertions in Drosophila?

2

Entering edit mode

10 weeks ago

manikin_python9f ▴ 20

I am working on a family of transposable elements in Drosophila melanogaster. My goal is to identify short (~21 nt) sequences that are conserved across many insertions so I can design shRNAs for RNAi knockdown. What is the established workflow to:

Measure divergence of individual insertions from the consensus sequence.
Identify short motifs (e.g., 21-mers) that are present across most insertions, if not all, with minimum mismatch.

ConsensusSequence SequenceAlignment ConservedMotifs RNAi TransposableElements • 7.9k views

ADD COMMENT • link updated 12 days ago by Kevin Blighe ★ 90k • written 10 weeks ago by manikin_python9f ▴ 20

0

Entering edit mode

Do you have the locations/sequence of the regions you are interested in?

ADD REPLY • link 10 weeks ago by GenoMax 154k

0

Entering edit mode

Since they are transposons, we don’t know the exact location of them. We know their sequence of insertions from repeatmasker in dm6 genome. Also the consensus sequence is available in TE libraries.

ADD REPLY • link 10 weeks ago by manikin_python9f ▴ 20

score 0 · Answer 1 · 2025-11-19

To measure divergence of individual insertions from the consensus sequence, begin by extracting the sequences of the transposable element insertions from the dm6 genome using the coordinates provided by RepeatMasker. Obtain the dm6 FASTA file and use bedtools to retrieve the sequences:

bedtools getfasta -fi dm6.fasta -bed repeatmasker.bed -fo insertions.fasta

Next, align each insertion sequence to the consensus sequence using a pairwise alignment tool such as EMBOSS needle for global alignment:

needle -asequence consensus.seq -bsequence insertion.seq -gapopen 10 -gapextend 0.5 -outfile alignment.needle

Parse the alignment output to calculate divergence as the number of mismatches divided by the alignment length, excluding gaps. Repeat this for all insertions using a script in Python or Perl to automate the process. If insertions vary in length, consider using local alignment with smith-waterman instead.

To identify short motifs such as 21-mers that are present across most insertions with minimum mismatch, first perform a multiple sequence alignment of all insertion sequences using MAFFT:

mafft --auto insertions.fasta > aligned_insertions.fasta

Then, use the MEME Suite's STREME tool to discover conserved motifs of specified length, allowing for mismatches:

streme -p aligned_insertions.fasta -dna -minw 21 -maxw 21 -nmotifs 10 -o streme_results

STREME identifies motifs enriched in the sequences relative to a shuffled background. Filter the output motifs based on their occurrence across insertions, requiring presence in at least 90 percent with no more than one mismatch per motif. Verify motif conservation by sliding a 21-nucleotide window across the alignment and computing sequence identity or entropy at each position using tools like seqlogo or a custom Python script with Biopython. For approximate matching, use vsearch with a distance threshold:

vsearch --usearch_global insertions.fasta --db motif_21mers.fasta --id 0.95 --query_cov 1.0 --strand both --uc results.uc

Parse the uc file to count matches per motif.

Kevin