Template Size In Mira3
8.3 years ago

Hi,

I am pretty new to genome assembly, and in particular to mira3, and I have a couple of questions about it.

1. What exactly is the template size in mira3? I couldn't find a proper definition in the manual. My fragment size before ligating adapters is ~250 bp, and after library construction it was ~350 bp. My read length is 260 bp. In this scenario, what template size is expected in the configuration file?

2. I am working with bacterial MiSeq genome data, and when I used mira3 to assemble it, I got ~1400 contigs in the final results file. Is that too many? Could this have happened because I specified the wrong template size in the configuration file for the mira3 run?

Thanks Upendra

denovo assembly • 1.9k views
8.3 years ago
cts ★ 1.7k

The template size usually refers to the distance between the 5' ends of paired end data, in other words the length of the DNA between the adaptors. So in your case it would be ~250 bp.


Thanks cts for the detailed explanation. After going through some forums, I found that MIRA actually needs two sizes: the actual insert size (~250 bp) and the fragment size (insert size (250) + total read length (520) = 770). Do you think this is correct? Also, regarding adapter trimming, I found the following in their manual: "Outside MIRA: for heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave Illumina data alone! (really, I mean it)". So I assume I should not trim the adapters from the reads myself. Please let me know what you think.

Thanks Upendra


Hey,

I think there is some confusion about the terminology that different people use for 'insert size'. Many people (including myself) use insert size to mean the distance between the 5' ends of the reads, which is the length of the DNA fragment being sequenced. Other people use it to mean the distance between the 3' ends of the reads, in other words the part of the DNA fragment that lies between the two reads and is not actually sequenced. To illustrate:

>---|                  read1
              |---<    read2
===================    DNA fragment
>>---------------<<    insert size using definition 1
     >>-----<<         insert size using definition 2


>--------|    read1 (260bp)
|--------<    read2 (260bp)
==========    DNA fragment (250bp)


(Please let me know if this interpretation of your data is wrong.)

So what you've done is sequence the same bit of DNA twice, once with each read. In this case the insert size in MIRA's terminology would be 0 and the fragment size would be 250. However, since read1 and read2 will cover mostly the same sequence, you could either assemble just one of them as single-end data and get similar results, or overlap the pairs using seqprep to get higher-quality reads and then assemble those as single-end data.
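To make the two definitions concrete, here is a minimal Python sketch of the arithmetic. The function name `insert_sizes` and the choice to clamp definition 2 at 0 when the reads cover the whole fragment are assumptions made for illustration; they are not taken from the MIRA manual.

```python
def insert_sizes(fragment_len, read_len):
    """Return (definition_1, definition_2) insert sizes in bp for a paired-end library.

    Hypothetical helper for illustration only, not part of MIRA.
    """
    # Definition 1: distance between the 5' ends of the two reads,
    # which is simply the length of the DNA fragment.
    outer = fragment_len
    # Definition 2: the unsequenced stretch between the 3' ends of the reads.
    # Assumption: report 0 when the two reads together cover the whole fragment.
    inner = max(0, fragment_len - 2 * read_len)
    return outer, inner


print(insert_sizes(250, 260))  # the library discussed above: (250, 0)
print(insert_sizes(500, 100))  # a library with a real gap between reads: (500, 300)
```

With 260 bp reads on a 250 bp fragment, definition 2 comes out as 0, which matches the numbers given above.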


Hi, thanks again for the clarification. Your interpretation is spot on, at least in my case. Regarding the first figure, your definition seems correct to me. Actually this is not my experiment; I am helping another postdoc in the lab analyze the data. Anyway, I have just started running the analysis with single-end reads and will let you know if it improves the assembly. If it doesn't help, I will try the "seqprep" method. Thanks Upendra


Hi, even using only single-end reads, MIRA couldn't produce a good assembly. There are around 1300 contigs with an N50 of only 4662. This is much better than the paired-end assembly, but I would like to do better. What do you think I should change to get a better assembly?
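For readers unfamiliar with the metric, N50 is the contig length at which the contigs of that length or longer together contain at least half of the assembled bases. A minimal Python sketch follows; the function name `n50` and the toy contig lengths are made up for illustration and are not from this assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    # Walk the contigs from longest to shortest, accumulating bases
    # until at least half of the total assembly length is covered.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Toy example: total is 30 bp, and 10 + 8 = 18 bp passes the 15 bp halfway mark.
print(n50([10, 8, 6, 4, 2]))  # 8
```

A low N50 alongside a high contig count, as reported here, means most of the assembled sequence sits in short contigs.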


You could try a different assembler, as I mention in my original answer; other than that I'm not sure. Your data is suboptimal because the DNA fragment size is so short, and it may be that what you're sequencing contains a lot of repeats, which break the assembly into many contigs.