0
1
Entering edit mode
5.2 years ago
bioplanet ▴ 60

Hi,

I was using bbmerge with standard parameters to join paired-end sequencing files and it was working very well, e.g. an example:

BBMerge version 37.66
Total time: 17.237 seconds.

Pairs:                 2478660
Joined:                2272340     91.676%
Ambiguous:             197355      7.962%
No Solution:           8965        0.362%
Too Short:             0           0.000%

Avg Insert:            427.5
Standard Deviation:    8.0
Mode:                  428

Insert range:          40 - 481
90th percentile:       428
75th percentile:       428
50th percentile:       428
25th percentile:       428
10th percentile:       428


Got some new data now to analyze and, unfortunately, the overall quality was not that good (definitely worse than the first round). I did the trimming but still it did not improve much...Anyway, I was asked to proceed anyway and I need to merge the reads. I tried the same thing, but look at the result:

BBMerge version 37.66
Total time: 2.011 seconds.

Pairs:                 320821
Joined:                472         0.147%
Ambiguous:             320122      99.782%
No Solution:           227         0.071%
Too Short:             0           0.000%

Avg Insert:            424.0
Standard Deviation:    25.0
Mode:                  427

Insert range:          110 - 428
90th percentile:       427
75th percentile:       427
50th percentile:       427
25th percentile:       427
10th percentile:       427


As you can see, now bbmerge considers almost all of my reads "ambiguous" and does not join them! Is there a way around it? Maybe by lowering a threshold or something?

bbmerge alignment • 4.1k views
2
Entering edit mode

Merge first and then trim. That is @Brian's normal recommendation.

That said, if the reads don't overlap (i.e. the inserts are longer than number of sequencing cycles) then merging will not work.

0
Entering edit mode

Thanks genomax (also upvoted some previous posts). The thing is that even if I try to merge before trimming, they still remain "ambiguous" and discarded. As far as I can tell (don't know if this is correct), merging is basically done by comparing the quality strings on each pair.

For example, the pair below was deemed ambiguous:

@M00316:13:000000000-BDR9L:1:1101:26221:16273 1:N:0:CATGCTTA
AGCCACAACTAGAATGCAGTGAAAAGGGTGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCAATAAACAAGTTTAGTTACGCTAGGTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTGCGTTCTCGGTGCTTGGTGCGCCGAGAAGTGGCCAGGAACGACGGCACGCGCCGGGGAAGGCCACTTGTGGGGACGGGGCCAGAGATAGCGTTGTGAGCCAACAGGG
+

@M00316:13:000000000-BDR9L:1:1101:26221:16273 2:N:0:CATGCTTA
ACATCACGTCGCACGAGGAGGGCTAGAGCGTCGTGGAGCGGTACGAACGCGCCGGGGGCGGCGGCTCGACCGGCGGGGTGGACGGGCTGTACAAGGGGTGGATTGGGTGAGAGTGAGGTGTGCGGGAGCGGGTGGGTTCGAGCCGGGTCGCGCGATTGTGGTGGGGGGGGTAGTAGCTGCTGTTGGGTGGGGACGATGGGGGGGGCGTGGGGCTGGCGAGTGGGTGTGGCGGGGGGGGGGGGGGGGTGGT
+
>>1>1@@111>>110A0A0AA00000111//BAABB//0/AA/>B//>>>>/>>/>>E//>/---=;:-:-=;-=A--9;;-/;-;;9;;////;-:-;;9;//---;--;//;;/;;/-;;/-;;-;-;;99-;----;99--;-;;999--9-9-;//-;--;-;;9--:-99;////;;9/;;@-;-----;----9---;@----;9--------;9;;;--99:-9;@-;-;9;--;-;;--9--

1
Entering edit mode

You should look at the bbmap/docs/guides/BBMergeGuide.txt included in the distribution.

Is this NextSeq data? Poly-G's in read 2 may be fantom basecalls.

0
Entering edit mode

Amplicon-seq data, it is the same construct being used (~430 nts long).

Another example:

@M00316:13:000000000-BDR9L:1:1101:22760:19050 1:N:0:CATGCTTA
AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAAAACTACGGCACGCTCCGGGGAAAGCGACTTGTGAGTACGAGGCCAGAGATGTCGTTATGACCCAACAGGA
+
@M00316:13:000000000-BDR9L:1:1101:22760:19050 2:N:0:CATGCTTA
ACATCACCTGGCACGACGAGGACTAGACCGTCGTGGAGCAGTACGAACGCGCCGAGGGCCGCCACTGCACCGGCGGCATGGACGAGCTGTACAAGGAGTGAATTGATTAAGAGTGAGGTTTGCGGGAGCGGGTTGCTTCGAACCGCCTTGAAGTATTGTCCTCCGCGGGTGAATAGCTCCTGTTGGGGGGTAACGATATGTGGGGCCTCGTACTCGCGAGTGGGTTGGCCCGGGGCGGGCGGTGGGGTTG
+
1>>>>B@111111111AAAAA00B011A1/ABAABB//A/BDB1B//AAA>>>>>>>E??>>///111B10>>>>B//<???1>->-<><1000=.<./=000=00<0:;=;0;=0==/::=0-==9=.=A-9;;;999;9-9-;-;9;;;///;//;//////;:;--;-;//;;/9/9/;/9;;@99-;-/;;-:--////;;--9-;9-9-/----;-;;9-9--/;-;;;;-;9;;9;9;;-;;--

1
Entering edit mode

Hi bioplanet, the first one does not merge in leeHom but the second does:

@M00316:13:000000000-BDR9L:1:1101:22760:19050
AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGC
+


Even then, it seems really weird. It seems that there is an overlap but it is hard to see visually, and blasting it returns cloning vector.

0
Entering edit mode

So is this then a software problem? Because it seemed it was working so nicely in the first round of experiments (where I could merge 98% or so of them), so I did not give it too much thought... Is leeHom better then? (I mean, of course for a newly introduced person to genomics like me, I can't really tell, in the paper of BBmerge it said it outperforms everyone else...

0
Entering edit mode

can you check where you think the second example could merge? because I do not really see it

0
Entering edit mode

If you have bad data going in then no matter what tool you use you are not going to be able to salvage it. Since you say that these are amplicons have you tried to do a multiple sequence alignment with the reference?

Agree with @Gabriel. The read in the example above looks like it is part of some vector. So perhaps not real data?

0
Entering edit mode

@bioplanet could you try running your data through leeHom and check how it performs?

0
Entering edit mode

1
Entering edit mode

the original reads are 250bp and "merged" is 426 for an overlap of 74 in the middle. Even then, this overlap seems sketchy.

0
Entering edit mode

Hmm. Merged read you posted above is only 206 bp long. Can you check?

0
Entering edit mode

now? AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGCCACTTGTGAGTACGAGGCCCGAGATATCGTTATGCCCCAACAGGAGCTATTCACCCGCGGAGGACAATACTTCAAGGCGGTTCGAAGCAACCCGCTCCCGCAAACCTCACTCTTAATCAATTCACTCCTTGTACAGCTCGTCCATGCCGCCGGTGCAGTGGCGGCCCTCGGCGCGTTCGTACTGCTCCACGACGGTCTAGTCCTCGTCGTGCCAGGTGATGT

2
Entering edit mode

I see the overlap when doing a MSA with your merged read and the two originals. Perhaps your tool is able to do the merge because it is meant for ancient DNA. The overlap is not great though. Considering this is amplicon data such gymnastics should not be needed for the overlap.

1
Entering edit mode

Many thanks to all of you! I will try leeHom and maybe FLASH (just saw this) and see how it looks, will post again!

2
Entering edit mode

If BBMerge is not merging these reads then I would be skeptical if other packages work (leeHom is a special case since it is allowing for multiple mismatches in short stretches to account for ancient/fragmented DNA). You should very carefully check your results.

0
Entering edit mode

Hello everyone!

So, I also checked FLASH and still I cannot get overlap; so I am guessing it was not a problem of BBmerge, rather there is some problem with the data itself. I will treat them seperately for time being and maybe the next round of experiments will be more informative.

Many thanks to all!

2
Entering edit mode

Please validate comments/answers on your past questions. Otherwise we have no way to know if the solutions suggested worked. Up-voting/accepting answers (green check mark) is the way to do this.

0
Entering edit mode

out of curiosity, could you post the data somewhere?