Question: bbmerge not joining paired-end reads
1
gravatar for bioplanet
2.0 years ago by
bioplanet60
bioplanet60 wrote:

Hi,

I was using bbmerge with standard parameters to join paired-end sequencing files and it was working very well, e.g. an example:

BBMerge version 37.66
Total time: 17.237 seconds.

 Pairs:                 2478660
 Joined:                2272340     91.676%
 Ambiguous:             197355      7.962%
 No Solution:           8965        0.362%
 Too Short:             0           0.000%

 Avg Insert:            427.5
 Standard Deviation:    8.0
 Mode:                  428

 Insert range:          40 - 481
 90th percentile:       428
 75th percentile:       428
 50th percentile:       428
 25th percentile:       428
 10th percentile:       428

Got some new data now to analyze and, unfortunately, the overall quality was not that good (definitely worse than the first round). I did the trimming but still it did not improve much...Anyway, I was asked to proceed anyway and I need to merge the reads. I tried the same thing, but look at the result:

BBMerge version 37.66
Total time: 2.011 seconds.

 Pairs:                 320821
 Joined:                472         0.147%
 Ambiguous:             320122      99.782%
 No Solution:           227         0.071%
 Too Short:             0           0.000%

 Avg Insert:            424.0
 Standard Deviation:    25.0
 Mode:                  427

 Insert range:          110 - 428
 90th percentile:       427
 75th percentile:       427
 50th percentile:       427
 25th percentile:       427
 10th percentile:       427

As you can see, now bbmerge considers almost all of my reads "ambiguous" and does not join them! Is there a way around it? Maybe by lowering a threshold or something?

alignment bbmerge • 1.3k views
ADD COMMENTlink written 2.0 years ago by bioplanet60
2

Merge first and then trim. That is @Brian's normal recommendation.

That said, if the reads don't overlap (i.e. the inserts are longer than number of sequencing cycles) then merging will not work.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax75k

Thanks genomax (also upvoted some previous posts). The thing is that even if I try to merge before trimming, they still remain "ambiguous" and discarded. As far as I can tell (don't know if this is correct), merging is basically done by comparing the quality strings on each pair.

For example, the pair below was deemed ambiguous:

@M00316:13:000000000-BDR9L:1:1101:26221:16273 1:N:0:CATGCTTA
AGCCACAACTAGAATGCAGTGAAAAGGGTGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCAATAAACAAGTTTAGTTACGCTAGGTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTGCGTTCTCGGTGCTTGGTGCGCCGAGAAGTGGCCAGGAACGACGGCACGCGCCGGGGAAGGCCACTTGTGGGGACGGGGCCAGAGATAGCGTTGTGAGCCAACAGGG
+
>1>A?CFA1CFB113BABB1BA3310000BBBB1DAFBD2BD2DDDDD2BBDD11DD2D1BAD1DD2DD22DA1AD2BB2D11BB01B1111B/B1DBDDDDDBB>BBB0BB1>>>>>/>?/?>0<BB?0BB?0BB11BB0>?/</???FG?<?//<11??/1>><>--<./==<../.../..;-==9-;=-:--;;@@;--;--;-//;:;-;-9-;9;;;-:;;;9/://;;-9-;////;-9;;;-

@M00316:13:000000000-BDR9L:1:1101:26221:16273 2:N:0:CATGCTTA
ACATCACGTCGCACGAGGAGGGCTAGAGCGTCGTGGAGCGGTACGAACGCGCCGGGGGCGGCGGCTCGACCGGCGGGGTGGACGGGCTGTACAAGGGGTGGATTGGGTGAGAGTGAGGTGTGCGGGAGCGGGTGGGTTCGAGCCGGGTCGCGCGATTGTGGTGGGGGGGGTAGTAGCTGCTGTTGGGTGGGGACGATGGGGGGGGCGTGGGGCTGGCGAGTGGGTGTGGCGGGGGGGGGGGGGGGGTGGT
+
>>1>1@@111>>110A0A0AA00000111//BAABB//0/AA/>B//>>>>/>>/>>E//>/---=;:-:-=;-=A--9;;-/;-;;9;;////;-:-;;9;//---;--;//;;/;;/-;;/-;;-;-;;99-;----;99--;-;;999--9-9-;//-;--;-;;9--:-99;////;;9/;;@-;-----;----9---;@----;9--------;9;;;--99:-9;@-;-;9;--;-;;--9--
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by bioplanet60
1

You should look at the bbmap/docs/guides/BBMergeGuide.txt included in the distribution.

Is this NextSeq data? Poly-G's in read 2 may be fantom basecalls.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax75k

Amplicon-seq data, it is the same construct being used (~430 nts long).

Another example:

@M00316:13:000000000-BDR9L:1:1101:22760:19050 1:N:0:CATGCTTA
AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAAAACTACGGCACGCTCCGGGGAAAGCGACTTGTGAGTACGAGGCCAGAGATGTCGTTATGACCCAACAGGA
+
A1AAAAA?AFFB1A3BBBB1BBD31B0A0BDBAD1D22D2BADADB1DDBADDDB2DDDDBADDFGHDDA2211BDDGGHD11BB/>B//11@/B1BBBB1BBBBB???BBBB??>/>/>???/B/BB<<BB?B1BB1BB@>?/</<<FGHGGA/>FFF>>>>><>><---..==///::.:::FC==;99;;9..==AG;;-;;-;;;;;/;9;9:9;;;@-9-;;;;;/;/;;;9;;9/9---;9;;-
@M00316:13:000000000-BDR9L:1:1101:22760:19050 2:N:0:CATGCTTA
ACATCACCTGGCACGACGAGGACTAGACCGTCGTGGAGCAGTACGAACGCGCCGAGGGCCGCCACTGCACCGGCGGCATGGACGAGCTGTACAAGGAGTGAATTGATTAAGAGTGAGGTTTGCGGGAGCGGGTTGCTTCGAACCGCCTTGAAGTATTGTCCTCCGCGGGTGAATAGCTCCTGTTGGGGGGTAACGATATGTGGGGCCTCGTACTCGCGAGTGGGTTGGCCCGGGGCGGGCGGTGGGGTTG
+
1>>>>B@111111111AAAAA00B011A1/ABAABB//A/BDB1B//AAA>>>>>>>E??>>///111B10>>>>B//<???1>->-<><1000=.<./=000=00<0:;=;0;=0==/::=0-==9=.=A-9;;;999;9-9-;-;9;;;///;//;//////;:;--;-;//;;/9/9/;/9;;@99-;-/;;-:--////;;--9-;9-9-/----;-;;9-9--/;-;;;;-;9;;9;9;;-;;--
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by bioplanet60
1

Hi bioplanet, the first one does not merge in leeHom but the second does:

@M00316:13:000000000-BDR9L:1:1101:22760:19050
AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGC
+
A1AAAAAAFFB1A3BBBB1BBD31B0A0BDBAD1D22D2BADADB1DDBADDDB2DDDDBADDFGHDDA2211BDDGGHD11BB/>B//11@/B1BBBB1BBBBB???BBBB??>/>/>???/B/BB<<BB?B1BB1BB@>?/</<<FGHGGA/>FFF>>>>><>><---..==//%JY..Y%]]&\Z$XKZ%MMM\T;/X=%Z

Even then, it seems really weird. It seems that there is an overlap but it is hard to see visually, and blasting it returns cloning vector.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Gabriel R.2.6k

So is this then a software problem? Because it seemed it was working so nicely in the first round of experiments (where I could merge 98% or so of them), so I did not give it too much thought... Is leeHom better then? (I mean, of course for a newly introduced person to genomics like me, I can't really tell, in the paper of BBmerge it said it outperforms everyone else...

ADD REPLYlink written 2.0 years ago by bioplanet60

can you check where you think the second example could merge? because I do not really see it

ADD REPLYlink written 2.0 years ago by Gabriel R.2.6k

If you have bad data going in then no matter what tool you use you are not going to be able to salvage it. Since you say that these are amplicons have you tried to do a multiple sequence alignment with the reference?

Agree with @Gabriel. The read in the example above looks like it is part of some vector. So perhaps not real data?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax75k

@bioplanet could you try running your data through leeHom and check how it performs?

ADD REPLYlink written 2.0 years ago by Gabriel R.2.6k

Gabriel R. : Why is your merged read smaller than the two original reads?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax75k
1

the original reads are 250bp and "merged" is 426 for an overlap of 74 in the middle. Even then, this overlap seems sketchy.

ADD REPLYlink written 2.0 years ago by Gabriel R.2.6k

Hmm. Merged read you posted above is only 206 bp long. Can you check?

ADD REPLYlink written 2.0 years ago by genomax75k

now? AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGCCACTTGTGAGTACGAGGCCCGAGATATCGTTATGCCCCAACAGGAGCTATTCACCCGCGGAGGACAATACTTCAAGGCGGTTCGAAGCAACCCGCTCCCGCAAACCTCACTCTTAATCAATTCACTCCTTGTACAGCTCGTCCATGCCGCCGGTGCAGTGGCGGCCCTCGGCGCGTTCGTACTGCTCCACGACGGTCTAGTCCTCGTCGTGCCAGGTGATGT

ADD REPLYlink written 2.0 years ago by Gabriel R.2.6k
1

I see the overlap when doing a MSA with your merged read and the two originals. Perhaps your tool is able to do the merge because it is meant for ancient DNA. The overlap is not great though. Considering this is amplicon data such gymnastics should not be needed for the overlap.

align

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax75k
1

Many thanks to all of you! I will try leeHom and maybe FLASH (just saw this) and see how it looks, will post again!

ADD REPLYlink written 2.0 years ago by bioplanet60
1

If BBMerge is not merging these reads then I would be skeptical if other packages work (leeHom is a special case since it is allowing for multiple mismatches in short stretches to account for ancient/fragmented DNA). You should very carefully check your results.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax75k

Hello everyone!

So, I also checked FLASH and still I cannot get overlap; so I am guessing it was not a problem of BBmerge, rather there is some problem with the data itself. I will treat them seperately for time being and maybe the next round of experiments will be more informative.

Many thanks to all!

ADD REPLYlink written 2.0 years ago by bioplanet60
2

Please validate comments/answers on your past questions. Otherwise we have no way to know if the solutions suggested worked. Up-voting/accepting answers (green check mark) is the way to do this.

ADD REPLYlink written 2.0 years ago by genomax75k

out of curiosity, could you post the data somewhere?

ADD REPLYlink written 2.0 years ago by Gabriel R.2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1107 users visited in the last hour