segemehl: How does the arrangement of references influence mapping results?
0
0
Entering edit mode
9 months ago
Cristian • 0

Hi All,

I'm currently conducting tests involving artificial reads with varying proportions of mutations and their impact on mapping results using segemehl v. 0.3.4. During my experimentation, I observed that certain reads, whose origin I am already aware of, are not aligning correctly to the intended reference. The reference sequences I'm working with consist of two tRNAs. In the first.fa file, the sequence Pro-CGG-2-1 comes before Pro-TGG-2-1, as follows:

first.fa

>Pro-CGG-2-1(Pro-CGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGCTTCGGGTGTGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA
>Pro-TGG-2-1(Pro-TGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGGTTTGGGTCCGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA

Upon rearranging the sequence order in the second.fa file to have Pro-TGG-2-1 before Pro-CGG-2-1, like so:

>Pro-TGG-2-1(Pro-TGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGGTTTGGGTCCGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA
>Pro-CGG-2-1(Pro-CGG)
GGCTCGTTGGTCTAGGGGTATGATTCTCGCTTCGGGTGTGAGAGGTCCCGGGTTCAAATC
CCGGACGAGCCCCCA

Here the alignment between references:

CLUSTAL 2.1 multiple sequence alignment


Pro-CGG-2-1      GGCTCGTTGGTCTAGGGGTATGATTCTCGCTTCGGGTGTGAGAGGTCCCGGGTTCAAATC
Pro-TGG-2-1      GGCTCGTTGGTCTAGGGGTATGATTCTCGGTTTGGGTCCGAGAGGTCCCGGGTTCAAATC
                 ***************************** ** ****  *********************

Pro-CGG-2-1      CCGGACGAGCCCCCA
Pro-TGG-2-1      CCGGACGAGCCCCCA
                 ***************

I generated artificial reads using the Pro-TGG-2-1 sequence as a reference. One read had no mutations, while the other had a single mutation. The read sequences are as follows:

@SYN_100_0_74_0_-_8000_1_._Pro-TGG-2-1(Pro-TGG) 1:
TGGGGGCTCGTCCGGGATTTGAACCCGGGACCTCTCGGACCCAAACCGAGAATCATACCCCTAGACCAACGAGCC
+
@@@CA@CCEBFBEA@ACBAABBE@BBFBCEBDCE@CA@ABFDBBBCEEC@@BACBBCAB@BAAA@B@CDA@@@@@

and including one mutation:

@SYN_100_0_74_0_-_8000_1_._Pro-TGG-2-1(Pro-TGG) 1:
TGGGGGCTCGTCCGGGATTTGAACCCGGGACCTCTCGGACTCAAACCGAGAATCATACCCCTAGACCAACGAGCC
+
@@@CA@CCEBFBEA@ACBAABBE@BBFBCEBDCE@CA@ABFDBBBCEEC@@BACBBCAB@BAAA@B@CDA@@@@@

My code includes functions to index the genomes using segemehl and bowtie2, as well as to map reads using both tools, as follows:

#!/usr/bin/env bash

index_genome() {
    local rep=$1
    segemehl.x -x "${rep}.idx" -d "${rep}.fa"
}

index_genome_bowtie() {
    local rep=$1
    bowtie2-build "${rep}.fa" "${rep}.bt2"
}

map_reads() {
    local idx=$1
    local ref=$2
    local mutations=$3
    segemehl.x -i "${idx}" -d "${ref}.fa" -H 0 -q "read${mutations}.fq" -o "map_${ref}_${mutations}.sam" -u "map_${ref}_${mutations}_unmap.fq"
}

map_reads_bowtie() {
    local idx=$1
    local ref=$2
    local mutations=$3
    unmapped="map_${ref}_${mutations}_unmap_bowtie.fq"
    reads="read${mutations}.fq"
    sam_file="map_${ref}_${mutations}_bowtie.sam"
    bowtie2 -p 4 --local --sensitive-local --mp 3,1 --rdg 5,1 --rfg 5,1 --dpad 30 --maxins 800 --ignore-quals --no-unal -x $idx -U $reads --un $unmapped -S $sam_file
}

# Index genomes segemehl
index_genome "first"
index_genome "second"
# Index genomes bowtie
index_genome_bowtie "first"
index_genome_bowtie "second"

# Map reads for 'first' genome
for mutations in 0 1
do
    map_reads "first.idx" "first" "${mutations}"
    map_reads_bowtie "first.bt2" "first" "${mutations}"
done

# Map reads for 'second' genome
for mutations in 0 1
do
    map_reads "second.idx" "second" "${mutations}"
    map_reads_bowtie "second.bt2" "second" "${mutations}"
done

Upon running the code, I encountered the following mapping results when checking whether reads were correctly aligning to the Pro-TGG-2-1 reference:

map_first_0.sam False
map_first_1.sam False
map_second_0.sam    True
map_second_1.sam    True
map_first_0_bowtie.sam  True
map_first_1_bowtie.sam  True
map_second_0_bowtie.sam True
map_second_1_bowtie.sam True

Interestingly, the mapping using segemehl fails to correctly align reads when using the first.fa reference, but it succeeds when using second.fa. In contrast, bowtie2 consistently provides accurate mappings for both reference orders.

I'm curious about whether the order of references could potentially impact the mapping results, and I'm seeking insights to explain this behavior. Thank you all in advance for your help and suggestions.

Cheers!

PDA: The source code can be downloaded from here

segemehl mapping sequencing • 360 views
ADD COMMENT

Login before adding your answer.

Traffic: 1144 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6