Question: Minimus2 dropping reads which perfectly match two adjacent input contigs
0
gravatar for dmathog
2.0 years ago by
dmathog30
dmathog30 wrote:

Minimus2 is for some reason dropping from the final assembly long reads which match some input contigs perfectly. These reads end up in the singletons file.

The input s1_s2.fasta file had 472178 haplotype specific contigs followed by 2215057 Megareads from MaSuRCA2. These are corrected PacBIO reads. Most of the time the long ones are a mixture of the two haplotypes, but in regions where the polymorphisms are especially dense it is very common for the final Megaread to be identical to one of the known haplotype sequences. (Even if the raw PacBIO started out as the other haplotype, but that's another story entirely!) So, doing a very conservative assembly:

nice ./minimus2-blat s1_s2  \
   -D REFCOUNT=472178  \
   -D OVERLAP=400   \
   -D CONSERR=0.01   \
   -D MINID=99 \
   -D MAXTRIM=0 >try.log 2>&1 &

I was hoping to pull the nearby contigs together, many of which are only a couple of hundred base pairs apart.

After fixing two known bugs (blat2nucmer is broken, -tileSize in Minimus2 was the overlap size, had to be 12 or less), running steps 20-22 with a script to run blat in parallel (silly to do it single threaded on a 40 thread machine), and keeping only the 100% overlaps (before OVLTAB is created), the assembly completed.

Here's where things go wrong. Easiest to see the issue with an example. This is a contig with six perfectly matching Megaread overlaps

lastal -j3 -P40 -I 90  -f BlastTab  mr3_10 contig_70973.fasta 
#remove first column and keep only the 100% matches
m120314_064602_00121_c100278602550000001523007907041270_s2_p0/53071/0_4656.0_4236       100.00  2456    0       0       1       2456    1781    4236    0       3.89e+03
m120326_203353_00121_c100309472550000001523013008061241_s1_p0/40121/0_7097.0_6696       100.00  2379    0       0       1       2379    4318    6696    0       3.76e+03
m120403_093653_00121_c100315332550000001523013408241246_s1_p0/70945/0_3885.0_3771       100.00  1456    0       0       1       1456    2316    3771    0       2.3e+03
m120422_031512_00121_c100323202550000001523016709061207_s1_p0/12531/0_4737.0_538        100.00  538     0       0       1919    2456    1       538     5.6e-244        852
m120422_031512_00121_c100323202550000001523016709061207_s1_p0/19555/0_4826.0_3083       100.00  464     0       0       464     1       1       464     1e-208  735
m120420_041440_00121_c100324062550000001523016709061295_s1_p0/22059/0_3066.0_1039       100.00  125     0       0       3002    3126    1       125     3e-47   199

This is an adjacent contig (position known from a separately sequenced BAC which includes this region):

lastal -j3 -P40 -I 90  -f BlastTab  mr3_10 contig_138253.fasta
#remove first column and keep only the 100% matches
m120326_203353_00121_c100309472550000001523013008061241_s1_p0/40121/0_7097.0_6696       100.00  3675    0       0       1098    4772    1       3675    0       5.81e+03
m120422_031512_00121_c100323202550000001523016709061207_s1_p0/19555/0_4826.0_3083       100.00  1977    0       0       4772    2796    1107    3083    0       3.13e+03
m120403_093653_00121_c100315332550000001523013408241246_s1_p0/70945/0_3885.0_3771       100.00  1673    0       0       3100    4772    1       1673    0       2.65e+03
m120420_041440_00121_c100324062550000001523016709061295_s1_p0/7070/0_4649.0_1678        100.00  1561    0       0       1561    1       1       1561    0       2.47e+03
m120316_063132_00121_c100309582550000001523013008061204_s2_p0/76180/0_1450.0_1200       100.00  1200    0       0       3409    2210    1       1200    0       1.9e+03
m120314_064602_00121_c100278602550000001523007907041270_s2_p0/53071/0_4656.0_4236       100.00  1138    0       0       3635    4772    1       1138    0       1.8e+03
m120419_080401_00121_c100323892550000001523016709061236_s2_p0/14792/0_3234.0_929        100.00  929     0       0       2015    1087    1       929     0       1.47e+03
m120421_050218_00121_c100323862550000001523016709061264_s1_p0/15287/0_5319.0_1662       100.00  844     0       0       844     1       1       844     0       1.34e+03
m120314_064602_00121_c100278602550000001523007907041270_s2_p0/3602/2096_4744.0_2408     100.00  744     0       0       744     1       1       744     0       1.18e+03
m120317_064124_00121_c100309522550000001523013008061267_s2_p0/42129/0_1973.0_1728       100.00  239     0       0       1       239     1490    1728    2.5e-101        379

Notice that megareads for cells 53071, 40121, 70945, and 19555 match both contigs perfectly. So one might have thought Minimus2 would have merged them and the original two contigs into a single contig. This did not happen.

Here are two of the contigs that Minimus2 created. This is "simple" output.

36825   ?       m120420_041440_00121_c100324062550000001523016709061295_s1_p0/22059/0_3066.0_1039       1       0
36825   ?       contig_70973    1       914
36825   ?       m120422_031512_00121_c100323202550000001523016709061207_s1_p0/12531/0_4737.0_538        1       1584

66351   ?       contig_138253   0       0
66351   ?       m120419_080401_00121_c100323892550000001523016709061236_s2_p0/14792/0_3234.0_929        1       1086
66351   ?       m120316_063132_00121_c100309582550000001523013008061204_s2_p0/76180/0_1450.0_1200       1       2209

Each has only two megareads, and these from the short ends of the lists. All the others seen in the lastal output landed in s1_s2.singletons.

The first, longest match (cell 53071) was in position 1055895 in s1_s2.fasta. The expected perfect matches are present in the counts, ovl, and OVL files. These are the s1_s2.ovl lines, for instance:

70973   1055895 N       -1780   -670    2456    100.00
138253  1055895 N       3634    3098    1138    100.00

Then tigger runs and the trail is lost, with 1055895 not appearing in any of the log files.

I thought perhaps some of the 6 in the first lastal list might have mismatches against each other where they overlapped. So they were extracted and the full 6 x 6 comparison run with blastn. Where there were overlaps they were 100%, no mismatches, no indels.

So I am at a loss. Why would Minimus2 drop the large perfect reads which should have linked those two contigs. More to the point, how does one force it to "not do that"?

Thanks.

assembly • 655 views
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by dmathog30

I don't think this is related, but between writing the *contig and *singletons files AMOS overwrote the original *.fasta file. The new one has only 297665 entries, named >1 to >297665. So it is probably a good idea to keep a differently named copy of the input file if it is at all difficult to assemble.

ADD REPLYlink written 2.0 years ago by dmathog30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 790 users visited in the last hour