Minimus2 is for some reason dropping from the final assembly long reads which match some input contigs perfectly. These reads end up in the singletons file.
The input s1_s2.fasta file had 472178 haplotype specific contigs followed by 2215057 Megareads from MaSuRCA2. These are corrected PacBIO reads. Most of the time the long ones are a mixture of the two haplotypes, but in regions where the polymorphisms are especially dense it is very common for the final Megaread to be identical to one of the known haplotype sequences. (Even if the raw PacBIO started out as the other haplotype, but that's another story entirely!) So, doing a very conservative assembly:
nice ./minimus2-blat s1_s2 \
-D REFCOUNT=472178 \
-D OVERLAP=400 \
-D CONSERR=0.01 \
-D MINID=99 \
-D MAXTRIM=0 >try.log 2>&1 &
I was hoping to pull the nearby contigs together, many of which are only a couple of hundred base pairs apart.
After fixing two known bugs (blat2nucmer is broken, -tileSize in Minimus2 was the overlap size, had to be 12 or less), running steps 20-22 with a script to run blat in parallel (silly to do it single threaded on a 40 thread machine), and keeping only the 100% overlaps (before OVLTAB is created), the assembly completed.
Here's where things go wrong. Easiest to see the issue with an example. This is a contig with six perfectly matching Megaread overlaps
lastal -j3 -P40 -I 90 -f BlastTab mr3_10 contig_70973.fasta
#remove first column and keep only the 100% matches
m120314_064602_00121_c100278602550000001523007907041270_s2_p0/53071/0_4656.0_4236 100.00 2456 0 0 1 2456 1781 4236 0 3.89e+03
m120326_203353_00121_c100309472550000001523013008061241_s1_p0/40121/0_7097.0_6696 100.00 2379 0 0 1 2379 4318 6696 0 3.76e+03
m120403_093653_00121_c100315332550000001523013408241246_s1_p0/70945/0_3885.0_3771 100.00 1456 0 0 1 1456 2316 3771 0 2.3e+03
m120422_031512_00121_c100323202550000001523016709061207_s1_p0/12531/0_4737.0_538 100.00 538 0 0 1919 2456 1 538 5.6e-244 852
m120422_031512_00121_c100323202550000001523016709061207_s1_p0/19555/0_4826.0_3083 100.00 464 0 0 464 1 1 464 1e-208 735
m120420_041440_00121_c100324062550000001523016709061295_s1_p0/22059/0_3066.0_1039 100.00 125 0 0 3002 3126 1 125 3e-47 199
This is an adjacent contig (position known from a separately sequenced BAC which includes this region):
lastal -j3 -P40 -I 90 -f BlastTab mr3_10 contig_138253.fasta
#remove first column and keep only the 100% matches
m120326_203353_00121_c100309472550000001523013008061241_s1_p0/40121/0_7097.0_6696 100.00 3675 0 0 1098 4772 1 3675 0 5.81e+03
m120422_031512_00121_c100323202550000001523016709061207_s1_p0/19555/0_4826.0_3083 100.00 1977 0 0 4772 2796 1107 3083 0 3.13e+03
m120403_093653_00121_c100315332550000001523013408241246_s1_p0/70945/0_3885.0_3771 100.00 1673 0 0 3100 4772 1 1673 0 2.65e+03
m120420_041440_00121_c100324062550000001523016709061295_s1_p0/7070/0_4649.0_1678 100.00 1561 0 0 1561 1 1 1561 0 2.47e+03
m120316_063132_00121_c100309582550000001523013008061204_s2_p0/76180/0_1450.0_1200 100.00 1200 0 0 3409 2210 1 1200 0 1.9e+03
m120314_064602_00121_c100278602550000001523007907041270_s2_p0/53071/0_4656.0_4236 100.00 1138 0 0 3635 4772 1 1138 0 1.8e+03
m120419_080401_00121_c100323892550000001523016709061236_s2_p0/14792/0_3234.0_929 100.00 929 0 0 2015 1087 1 929 0 1.47e+03
m120421_050218_00121_c100323862550000001523016709061264_s1_p0/15287/0_5319.0_1662 100.00 844 0 0 844 1 1 844 0 1.34e+03
m120314_064602_00121_c100278602550000001523007907041270_s2_p0/3602/2096_4744.0_2408 100.00 744 0 0 744 1 1 744 0 1.18e+03
m120317_064124_00121_c100309522550000001523013008061267_s2_p0/42129/0_1973.0_1728 100.00 239 0 0 1 239 1490 1728 2.5e-101 379
Notice that megareads for cells 53071, 40121, 70945, and 19555 match both contigs perfectly. So one might have thought Minimus2 would have merged them and the original two contigs into a single contig. This did not happen.
Here are two of the contigs that Minimus2 created. This is "simple" output.
36825 ? m120420_041440_00121_c100324062550000001523016709061295_s1_p0/22059/0_3066.0_1039 1 0
36825 ? contig_70973 1 914
36825 ? m120422_031512_00121_c100323202550000001523016709061207_s1_p0/12531/0_4737.0_538 1 1584
66351 ? contig_138253 0 0
66351 ? m120419_080401_00121_c100323892550000001523016709061236_s2_p0/14792/0_3234.0_929 1 1086
66351 ? m120316_063132_00121_c100309582550000001523013008061204_s2_p0/76180/0_1450.0_1200 1 2209
Each has only two megareads, and these from the short ends of the lists. All the others seen in the lastal output landed in s1_s2.singletons.
The first, longest match (cell 53071) was in position 1055895 in s1_s2.fasta. The expected perfect matches are present in the counts, ovl, and OVL files. These are the s1_s2.ovl lines, for instance:
70973 1055895 N -1780 -670 2456 100.00
138253 1055895 N 3634 3098 1138 100.00
Then tigger runs and the trail is lost, with 1055895 not appearing in any of the log files.
I thought perhaps some of the 6 in the first lastal list might have mismatches against each other where they overlapped. So they were extracted and the full 6 x 6 comparison run with blastn. Where there were overlaps they were 100%, no mismatches, no indels.
So I am at a loss. Why would Minimus2 drop the large perfect reads which should have linked those two contigs. More to the point, how does one force it to "not do that"?
Thanks.
I don't think this is related, but between writing the *contig and *singletons files AMOS overwrote the original *.fasta file. The new one has only 297665 entries, named >1 to >297665. So it is probably a good idea to keep a differently named copy of the input file if it is at all difficult to assemble.