Question: do paired end reads both mapped to the same gene appear consecutive to each other in SAM file (using bowtie2)
0
gravatar for Moses
3 months ago by
Moses60
united states/ Bloomingtion/ Indiana University Bloomington
Moses60 wrote:

Hi all,

I'm using bowtie2 to map short illumina reads (101 base long reads) back to genes as my reference database. However I noticed something odd in the sam file earlier. I'm using paired end reads and keeping multimaps also.

I originally thought when there is a paired end read that maps to the same gene, the two pairs of the reads should be listed consecutively, but I have noticed a few cases that are actually not. For example:

 read_1::205766 83  S1_ACR74212.1   667 255 101M    =   653 -115    GGTAAGGCACCTGTAGGACGTCCAGGACCATGTACTCCATGGGGCAAACCTGCACTTGGCTTAAAGACCAGAAAGAAGAACAAACAGTCTAACAAGATGAT   5655;5:5<6?<<85;55;9675=A9C:?;=9?9<?55:86<5@:=@<69A=856A@;A=>?A7<>:=<69:<:::??::<86?E?>;@B?A=6;:;8:5<   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP

read_2::205766  163 S1_ACR74212.1   653 255 101M    =   667 115 ACGGTGGTGGAGAAGGTAAGGCACCTGTAGGACGTCCAGGACCATGTACTCCATGGGGCAAACCTGCACTTGGCTTAAAGACCAGAAAGAAGAACAAACAG   <7:;7:9@=?>@><?@?:?C7@>9=@;=:@;<:6=E><=B7;?BC:=?@9:@?8@96:66;9A7C6699=AD6:;7<6>@869A@6<66C:6@?86=666<   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP

In the example above the read '205766' has both of its pairs mapped to the gene S1_ACR74212.1 and they appear in this sequence with this order. i.e. consecutively by listing pair 1 first and pair 2 following pair 1 immediately.

However this is not the case all the time. I came across the following example:

read_1::205932  81  S1_ACR76435.1   516 0   4M7I5M3I82M =   530 0   TTCAGACGTGTGCTCTTCCGATCTCGGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAA   7678:<>?BB?8A?>B?@ACCA9A=>>>?@=A@BB:?8A>@A9AA@BAAA@7>A:C@D@ABA?>ADACBCCC@:D@DA@?BA?C>CCCAAABB@A>@=;<>   AS:i:-61    XN:i:0  XM:i:5  XO:i:2  XG:i:10 NM:i:15 MD:Z:0A0G5T4T0A77   YT:Z:UP

read_1::205932  337 S1_ACR74576.1   516 0   4M7I5M3I82M S1_ACR76435.1   530 0   TTCAGACGTGTGCTCTTCCGATCTCGGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAA   7678:<>?BB?8A?>B?@ACCA9A=>>>?@=A@BB:?8A>@A9AA@BAAA@7>A:C@D@ABA?>ADACBCCC@:D@DA@?BA?C>CCCAAABB@A>@=;<>   AS:i:-61    XN:i:0  XM:i:5  XO:i:2  XG:i:10 NM:i:15 MD:Z:0A0G5T4T0A77   YT:Z:UP

read_2::205932  161 S1_ACR76435.1   530 0   101M    =   516 0   CCGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAAAGATCGGAAGAGCGTCGTGTAGGG   <<><>>@B@?@@@BBBB?BAA@BBBBCAB@@AB?@?A>AABCDDAA@BA?A@D@B>@C=B?A?C@;@>C?CBC<>=?>>?>==@@:;=><?@>@:?7<798   AS:i:-55    XN:i:0  XM:i:13 XO:i:0  XG:i:0  NM:i:13 MD:Z:1G76A1C0A0C0C1T2A2C1A2C2C0C0   YT:Z:UP

read_2::205932  417 S1_ACR74576.1   530 0   101M    S1_ACR76435.1   516 0   CCGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAAAGATCGGAAGAGCGTCGTGTAGGG   <<><>>@B@?@@@BBBB?BAA@BBBBCAB@@AB?@?A>AABCDDAA@BA?A@D@B>@C=B?A?C@;@>C?CBC<>=?>>?>==@@:;=><?@>@:?7<798   AS:i:-55    XN:i:0  XM:i:13 XO:i:0  XG:i:0  NM:i:13 MD:Z:1G76A1C0A0C0C1T2A2C1A2C2C0C0   YT:Z:UP

As you can see in this second example I have the read with ID '205932' that is multi-mapped between the two genes S1_ACR74576.1 and S1_ACR76435.1. However the order in which the read pairs are listed is not what I expected. Specifically for both maps to the two genes pair one is appearing first in the sam file and then pair 2 for the two genes are following: i.e. the order is 'read_1::205932' then 'read_1::205932' then 'read_2::205932' then 'read_2::205932'

instead of what I expected to be as: read_1::205932' then 'read_2::205932' then 'read_1::205932' then 'read_2::205932'

in most of the cases it is respecting the order however however there are some cases (like this example that I just posted) where this order and the pairs being listed consecutively are not being respected. Is there a bowtie2 parameter that takes care of this or it's usually not necessary for them to be consecutive? Any advice would be appreciated. Thanks.

UPDATE!!

this is the order of appearance of my reads (for this particular read in this case) in my fastq files for pair 1 and pair 2 respectively:

READ 205932 pair1

@read_1::205932 1:Bacteroides_fragilis_638R_uid84217
@read_1::205932 1:Bacteroides_vulgatus_ATCC_8482_uid58253
@read_1::205932 1:Bifidobacterium_dentium_Bd1_uid43091
@read_1::205932 1:Coprococcus_catus_GD_7
@read_1::205932 1:Eubacterium_rectale_ATCC_33656_uid59169
@read_1::205932 1:Haemophilus_parainfluenzae_T3T1_uid72801

READ 205932 pair2

@read_2::205932 2:Bacteroides_fragilis_638R_uid84217
@read_2::205932 2:Bacteroides_vulgatus_ATCC_8482_uid58253
@read_2::205932 2:Bifidobacterium_dentium_Bd1_uid43091
@read_2::205932 2:Coprococcus_catus_GD_7
@read_2::205932 2:Eubacterium_rectale_ATCC_33656_uid59169
@read_2::205932 2:Haemophilus_parainfluenzae_T3T1_uid72801
shortreads sam bowtie2 • 257 views
ADD COMMENTlink modified 3 months ago by Devon Ryan91k • written 3 months ago by Moses60

> --reorder Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file,

even when -p is set greater than 1. Specifying --reorder and setting -p greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory than if --reorder were not specified. Has no effect if -p is set to 1, since output order will naturally correspond to input order in that case.

See Devon's comment below. My bad.

ADD REPLYlink modified 3 months ago • written 3 months ago by ATpoint19k

I got the same result unfortunately :(( I updated my original post and included a snapshot of the reasd as they appear in my fastq files, maybe that would help a bit more. The command I used with bowtie2 now is the following:

bowtie2 --reorder -a -x ../target_gene_indices/S1/S1_target_seqs_db -1 ../metagenome_sim_10/S1/S1_1.fq.gz -2 ../metagenome_sim_10/S1/S1_2.fq.gz -S ../target_gene_maps/S1.mg.sam.reordered
ADD REPLYlink modified 3 months ago • written 3 months ago by Moses60

You can always name-sort the resulting alignment files if you want them to be next to each other.

ADD REPLYlink written 3 months ago by genomax70k
1
gravatar for Rob
3 months ago by
Rob3.4k
United States
Rob3.4k wrote:

You might consider using --no-discordant and --no-mixed to avoid the complicated patterns that can occur when one alignment record (e.g. for read 1) is paied with multiple alignments (for read 2, or vice versa). Basically, for concordant alignments, Bowtie2 will fillow your expected format. With discordant or mixed (orphan) alignments, things can become very complicated and messy to parse.

ADD COMMENTlink written 3 months ago by Rob3.4k

thanks Rob for this guide. I made a postprocessing script where I check if there are reads like that and if both of the pairs are maped to the same gene/genome then I rearrange them such that read1 to gene and read2 to the same gene appear immidiately consecutive in the sam file. Also in the script I added another filtering step where if I encounter cases where a paired end read where both of it's pairs are mapped, based on what's indicated by their flags, however if each one of the pairs are mapped to a different fragment or gene or genome (i.e. to a different reference) then I am ignoring (filtering out these reads from the postpossessed sam file).

ADD REPLYlink written 3 months ago by Moses60
0
gravatar for Devon Ryan
3 months ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

Bowtie2, like most aligners, will always output paired-end reads in chunks where read #1 and its multimapping alignments comes first and then read #2 and its multimapping alignments comes next. That's usually easy enough to parse. If you need the primary and secondary alignment in a different order then you'll need to write a function to do this, as neither samtools nor most aligners will do it for you.

ADD COMMENTlink written 3 months ago by Devon Ryan91k

yeah I'm working on my own script to take care of this. But what's surprising me is that in most of the cases (multimapped cases) Bowtie2 is respecting the order of pairing in the sam file. i.e. Let's say I have a paired end read R1 and R2, that is mapped to 3 different genes g1,g2 and g3. In most of the cases in my sam file the pairing is listed as:

R1 g1 R2 g1 R1 g2 R2 g2 R1 g3 R2 g3

and in some rare cases which I don't understand why it's listing something like this

R1 g1 R1 g2 R2 g1 R2 g2

should the --reorder parameter fix this or is there any other parameter or function that's already defined that will take care of this?

ADD REPLYlink written 3 months ago by Moses60

--reorder is doing something different, it's ensuring that the order of the pairs matches what's in the fastq file, not that pairs themselves have any particular order.

ADD REPLYlink written 3 months ago by Devon Ryan91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1724 users visited in the last hour