Specifying the maximum intron length helps because it limits the search space for the "other end" of a read when it is being aligned to the genome. If the second half of your read maps several Mb away, it is unlikely that this represents a valid, biologically relevant splice junction; it is probably the result of a misalignment. In that case, it makes no sense to spend time looking several Mb away for the mapping position of the second half of a split read.
It is also the case that some reference genomes contain gene models with unreasonably long introns, often ones that merge two genes together (i.e. one half of the junction is in one gene and the other half is in a different gene, usually a different member of the same protein family).
A little bit of knowledge about your genome of interest can help here. In humans we use 2 Mb as our maximum intron length because there is a gene with an intron that long that we are pretty confident is real (I don't remember which one right now).
Otherwise you could trust the reference annotation and use the method outlined by Medhat.
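If you do go the annotation route, here is a rough sketch of one way to pull the longest annotated intron out of a GTF file. I can't say this is the same as Medhat's method; it is just an illustration. The function name `max_intron_length` is made up for this example, and it assumes a standard 9-column GTF with `exon` features carrying a `transcript_id` attribute.

```python
import re
from collections import defaultdict


def max_intron_length(gtf_lines):
    """Return the longest intron (in bp) implied by the exon records.

    Hypothetical helper: assumes tab-separated GTF lines with 1-based,
    inclusive coordinates and a transcript_id attribute on each exon.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features define intron boundaries
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        # Key on (chromosome, transcript) so exons are grouped per transcript
        exons[(fields[0], match.group(1))].append((int(fields[3]), int(fields[4])))

    longest = 0
    for intervals in exons.values():
        intervals.sort()
        # An intron is the gap between consecutive exons of one transcript
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            longest = max(longest, next_start - prev_end - 1)
    return longest
```

You would call it with an open file handle, e.g. `max_intron_length(open("annotation.gtf"))` (the filename is a placeholder). Keep the caveat above in mind: if the annotation contains merged gene models, the value you get back will be inflated, so it is worth looking at the transcripts behind the longest few introns before using the number as your aligner's maximum intron length.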