Hi All,
I'm currently conducting tests involving artificial reads with varying proportions of mutations and their impact on mapping results using segemehl
v. 0.3.4. During my experimentation, I observed that certain reads, whose origin I am already aware of, are not aligning correctly to the intended reference. The reference sequences I'm working with consist of two tRNAs. In the first.fa
file, the sequence Pro-CGG-2-1
comes before Pro-TGG-2-1
, as follows:
first.fa
>Pro-CGG-2-1(Pro-CGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGCTTCGGGTGTGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA
>Pro-TGG-2-1(Pro-TGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGGTTTGGGTCCGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA
Upon rearranging the sequence order in the second.fa
file to have Pro-TGG-2-1
before Pro-CGG-2-1
, like so:
>Pro-TGG-2-1(Pro-TGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGGTTTGGGTCCGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA
>Pro-CGG-2-1(Pro-CGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGCTTCGGGTGTGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA
Here the alignment between references:
CLUSTAL 2.1 multiple sequence alignment
Pro-CGG-2-1 GGCTCGTTGGTCTAGGGGTATGATTCTCGCTTCGGGTGTGAGAGGTCCCGGGTTCAAATC
Pro-TGG-2-1 GGCTCGTTGGTCTAGGGGTATGATTCTCGGTTTGGGTCCGAGAGGTCCCGGGTTCAAATC
***************************** ** **** *********************
Pro-CGG-2-1 CCGGACGAGCCCCCA
Pro-TGG-2-1 CCGGACGAGCCCCCA
***************
I generated artificial reads using the Pro-TGG-2-1
sequence as a reference. One read had no mutations, while the other had a single mutation. The read sequences are as follows:
@SYN_100_0_74_0_-_8000_1_._Pro-TGG-2-1(Pro-TGG) 1:
TGGGGGCTCGTCCGGGATTTGAACCCGGGACCTCTCGGACCCAAACCGAGAATCATACCCCTAGACCAACGAGCC
+
@@@CA@CCEBFBEA@ACBAABBE@BBFBCEBDCE@CA@ABFDBBBCEEC@@BACBBCAB@BAAA@B@CDA@@@@@
and including one mutation:
@SYN_100_0_74_0_-_8000_1_._Pro-TGG-2-1(Pro-TGG) 1:
TGGGGGCTCGTCCGGGATTTGAACCCGGGACCTCTCGGACTCAAACCGAGAATCATACCCCTAGACCAACGAGCC
+
@@@CA@CCEBFBEA@ACBAABBE@BBFBCEBDCE@CA@ABFDBBBCEEC@@BACBBCAB@BAAA@B@CDA@@@@@
My code includes functions to index the genomes using segemehl
and bowtie2
, as well as to map reads using both tools, as follows:
#!/usr/bin/env bash
index_genome() {
local rep=$1
segemehl.x -x "${rep}.idx" -d "${rep}.fa"
}
index_genome_bowtie() {
local rep=$1
bowtie2-build "${rep}.fa" "${rep}.bt2"
}
map_reads() {
local idx=$1
local ref=$2
local mutations=$3
segemehl.x -i "${idx}" -d "${ref}.fa" -H 0 -q "read${mutations}.fq" -o "map_${ref}_${mutations}.sam" -u "map_${ref}_${mutations}_unmap.fq"
}
map_reads_bowtie() {
local idx=$1
local ref=$2
local mutations=$3
unmapped="map_${ref}_${mutations}_unmap_bowtie.fq"
reads="read${mutations}.fq"
sam_file="map_${ref}_${mutations}_bowtie.sam"
bowtie2 -p 4 --local --sensitive-local --mp 3,1 --rdg 5,1 --rfg 5,1 --dpad 30 --maxins 800 --ignore-quals --no-unal -x $idx -U $reads --un $unmapped -S $sam_file
}
# Index genomes segemehl
index_genome "first"
index_genome "second"
# Index genomes bowtie
index_genome_bowtie "first"
index_genome_bowtie "second"
# Map reads for 'first' genome
for mutations in 0 1
do
map_reads "first.idx" "first" "${mutations}"
map_reads_bowtie "first.bt2" "first" "${mutations}"
done
# Map reads for 'second' genome
for mutations in 0 1
do
map_reads "second.idx" "second" "${mutations}"
map_reads_bowtie "second.bt2" "second" "${mutations}"
done
Upon running the code, I encountered the following mapping results when checking whether reads were correctly aligning to the Pro-TGG-2-1
reference:
map_first_0.sam False
map_first_1.sam False
map_second_0.sam True
map_second_1.sam True
map_first_0_bowtie.sam True
map_first_1_bowtie.sam True
map_second_0_bowtie.sam True
map_second_1_bowtie.sam True
Interestingly, the mapping using segemehl
fails to correctly align reads when using the first.fa
reference, but it succeeds when using second.fa
. In contrast, bowtie2
consistently provides accurate mappings for both reference orders.
I'm curious about whether the order of references could potentially impact the mapping results, and I'm seeking insights to explain this behavior. Thank you all in advance for your help and suggestions.
Cheers!
PDA: The source code can be downloaded from here