In the paper "Diversity of Human Copy Number Variation and Multicopy Genes", there is a step in the read preprocessing pipeline before the reads are mapped to the reference genome: "All reads exceeding 36 base pairs (bp) in length were truncated to 36 bp, or divided into their constituent nonoverlapping 36-bp sequences to eliminate potential mapping biases between genomes sequenced at different read lengths." Is this necessary, and why 36 bp? If I have a dataset in which most reads in every sample are about 95-100 bp long after QC, can I just trim all reads to a uniform length such as 95 bp? If I use mrFAST and divide longer reads into 36-bp pieces, which tools can help me handle paired-end sequences so that they stay paired after being divided?
From a quick glance at the abstract of that paper, I'm guessing that they wanted to be able to directly compare the results across many samples that were sequenced with different read lengths. Under typical circumstances, there shouldn't be any reason to split your reads up. In fact, longer reads allow you to map into repetitive regions that shorter reads can't access. This enhances your ability to detect CNV in these potentially unstable regions.
Another possible consideration: certain short-read mappers can accept only a small number of mismatches to the reference before they fail to map the read. Longer reads have a higher probability of accruing mismatches for a given error rate, and for some technologies the error rate increases with read length.
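To put rough numbers on that, here is a minimal sketch that assumes every base has the same, independent error probability and that the mapper allows a fixed number of mismatches regardless of read length. Both are simplifications (real error rates vary along the read, and some mappers scale the allowance with length), and the function name and example values are only illustrative.

```python
from math import comb

def prob_exceeds_mismatches(read_len, error_rate, max_mismatches):
    """Probability that a read carries more than max_mismatches errors.

    Assumes independent, uniform per-base errors (binomial model).
    """
    # Probability of 0..max_mismatches errors, i.e. the read is still mappable.
    p_within_limit = sum(
        comb(read_len, k) * error_rate**k * (1 - error_rate) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )
    return 1.0 - p_within_limit

# Same error rate and mismatch allowance, two read lengths.
for length in (36, 100):
    p = prob_exceeds_mismatches(length, error_rate=0.01, max_mismatches=2)
    print(f"{length} bp: P(more than 2 mismatches) = {p:.4f}")
```

Under this model the 100-bp read is more likely to exceed the allowance than the 36-bp read, which is the effect described above.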
It is possible for longer reads to have a lower mapping rate than shorter ones (of course, reads that are too short have higher mapping ambiguity). A simple method for making samples comparable is to trim the reads as described.
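If you do go the 36-bp route with paired-end data, the splitting itself is easy to script. The sketch below is plain Python, not a feature of mrFAST or of the paper's pipeline. The function names and the `_partN` naming scheme are my own assumptions. It cuts each mate into non-overlapping chunks, drops any trailing remainder shorter than the chunk size, and pairs chunk i of mate 1 with chunk i of mate 2 so the two output files stay in sync.

```python
def split_read(name, seq, qual, chunk=36):
    """Split one read into non-overlapping chunks of exactly `chunk` bases.

    A trailing remainder shorter than `chunk` is discarded.
    """
    pieces = []
    for i in range(len(seq) // chunk):
        start = i * chunk
        end = start + chunk
        # Hypothetical naming scheme: original name plus the chunk index.
        pieces.append((f"{name}_part{i}", seq[start:end], qual[start:end]))
    return pieces

def split_pair(name, mate1, mate2, chunk=36):
    """Split both mates and pair chunks by index.

    mate1 and mate2 are (sequence, quality) tuples. zip() stops at the
    shorter list, so every mate-1 chunk has a mate-2 partner.
    """
    chunks1 = split_read(name, mate1[0], mate1[1], chunk)
    chunks2 = split_read(name, mate2[0], mate2[1], chunk)
    return list(zip(chunks1, chunks2))

# Example: a 100-bp mate and a 95-bp mate each yield two 36-bp chunks.
pairs = split_pair("read1", ("A" * 100, "I" * 100), ("C" * 95, "I" * 95))
for (n1, s1, q1), (n2, s2, q2) in pairs:
    print(n1, len(s1), n2, len(s2))
```

One caveat with pairing by index: chunks taken further into each mate lie closer together on the fragment, so the apparent insert size shrinks by two chunk lengths for each step in the index. Keep that in mind if the mapper is given an expected insert-size range.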