I am trying to estimate the amount of double-stranded Alu elements in order to compare two conditions. My very coarse approach was to align the reads to repeat element sequences and then count how many reads map sense or antisense to each element. If I pool all repeat elements, there is a bias towards sense-mapping reads (ratio ~1.3). However, for the Alu elements the ratio is close to 1, with a shift in one of the conditions, which matches the experimental hypothesis.
The issue is that, given the repetitive nature of these elements, I am having second thoughts about whether this approach is valid at all.
Briefly, my approach to calculating strand bias in repeat elements:
- rRNA-depleted RNA-seq, stranded library
- reads were mapped with STAR to transposable element sequences (derived from RepeatMasker, one contig = one element, example below), keeping one random alignment for any read that maps to up to 100 locations
- Alignments in each repeat element sequence were then counted with the Bioconductor package Rsamtools, with the following settings:
  - repeat elements were taken to be those whose name doesn't match "^5S|^7S|_n$|rRNA|^tRNA|^U[0-9]|^RNA"
  - only proper read pairs were counted
  - alignments on the forward strand: isFirstMateRead = TRUE, isMinusStrand = TRUE
  - alignments on the reverse strand: isFirstMateRead = TRUE, isMinusStrand = FALSE
- A sense/antisense read ratio was then calculated for each repeat element sequence.
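To make the counting logic above concrete, here is a rough sketch of it in Python. This is not the Rsamtools code I actually ran: it works on plain SAM flag bits, the function names `classify` and `strand_ratios` are made up for illustration, and the input is a tiny invented example rather than real data. It assumes the ratio is forward over reverse, using the flag settings listed above.

```python
import re
from collections import defaultdict

# Names matching this pattern are excluded, as in the list above.
NON_REPEAT = re.compile(r"^5S|^7S|_n$|rRNA|^tRNA|^U[0-9]|^RNA")

# SAM flag bits, as defined in the SAM specification.
PROPER_PAIR = 0x2
MINUS_STRAND = 0x10
FIRST_MATE = 0x40


def classify(flag):
    """Return 'forward', 'reverse', or None for one SAM flag.

    Mirrors the settings above: only first mates of proper pairs are
    counted; first mate on the minus strand is counted as forward.
    """
    if not (flag & PROPER_PAIR and flag & FIRST_MATE):
        return None
    return "forward" if flag & MINUS_STRAND else "reverse"


def strand_ratios(alignments):
    """Compute the forward/reverse ratio per element.

    alignments: iterable of (element_name, sam_flag) pairs.
    """
    counts = defaultdict(lambda: {"forward": 0, "reverse": 0})
    for name, flag in alignments:
        if NON_REPEAT.search(name):
            continue  # skip rRNA, tRNA, snRNA-like names
        label = classify(flag)
        if label is not None:
            counts[name][label] += 1
    # The ratio is undefined without reverse alignments, so report None.
    return {
        name: (c["forward"] / c["reverse"] if c["reverse"] else None)
        for name, c in counts.items()
    }


# Toy input, invented for illustration only.
toy = [
    ("AluJb", 83),         # proper pair, first mate, minus strand
    ("AluJb", 99),         # proper pair, first mate, plus strand
    ("AluJb", 83),
    ("AluJb", 147),        # second mate, ignored
    ("tRNA-Ala-GCA", 83),  # removed by the name filter
]
ratios = strand_ratios(toy)
```

Note that a per-element ratio like this says nothing about how many of the counted reads were multi-mappers assigned at random, which is exactly the part I am unsure about.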
Does this make sense at all? Is there a better way of doing it?
grep "AluJb" -A 100 all_repeats.hg38.fa | head -20
>AluJb
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNTCCATAAGAATGGAAAGAAAACATGGCCAGGTGCAGTGGC
TCACACCTGTAATCCCACCACTTCAGGAGGCTGAGGCAACATGGCAAAACCTTCTCTTCA
AAAAATTTTTTAAAAGTTAGCTGGATGTTGTGGAGGCAAGAGGATCACTTGAGGATCACT
TGAGTCCATGAGGTCAAGGCTGCAGTGAGTCATGTTTGCACCACTGCACTCTAGCCTAGG
TGACAGAGCTAGTCACTATCAAAAAAAAAAAAAAAAGAATGGAGAGAATGCTACATGAGA
GAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNATAGATTTTTTTAAAAAGAAAACTGGCCAGGTACT
GTGGCTTATGTCTGTAATATCAGCATGTTGGGAGGCCAAGGCAGGATTACTTGAGCCCAG
AAATTCCAGACCAGCCTGAGAATTTGGCAAAACTCTGTCTCTACAAAAAATACAAAAATT
AGCCAAGTTTGGTGGCATGTGCCTGTAGTACCAGCTACTTGGGAGGCTGAGGTGGAAGAA
TAGCTTGAGTCTGGGAGGTCAAGGCTGCAATGAGCTGTGATTGCACCACTGCACTCAAGC
CTGGGTGGTAGAGTAAGACCCTGTCTCAAAAAAAAAAAAAAAAAAAGAAAAATCACTAAG
CAAAATAAGACATGTGAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN