The Bismark user manual says the following:
The script deduplicate_bismark is supposed to remove alignments to the same position in the genome from the Bismark mapping output (both single and paired-end SAM/BAM files), which can arise by e.g. excessive PCR amplification. Sequences which align to the same genomic position but on different strands are scored individually.
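To make the quoted description concrete, here is a minimal Python sketch of position-based deduplication. It is not Bismark's implementation. The `Alignment` record and the `deduplicate` function are hypothetical names made up for illustration, and the sketch assumes single-end reads keyed on chromosome, start position and strand.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (not a Bismark data type)."""
    read_id: str
    chrom: str
    start: int
    strand: str  # "+" or "-"


def deduplicate(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep the first alignment seen for each (chrom, start, strand) key."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.strand)
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept


reads = [
    Alignment("r1", "chr1", 100, "+"),
    Alignment("r2", "chr1", 100, "+"),  # same position and strand as r1: dropped
    Alignment("r3", "chr1", 100, "-"),  # same position, other strand: kept
    Alignment("r4", "chr1", 250, "+"),
]
kept = deduplicate(reads)
print([a.read_id for a in kept])
```

In this toy input, r2 is the kind of read that excessive PCR amplification could produce, and without deduplication its methylation calls would be counted a second time at every cytosine it covers.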
However, can someone explain to me how this deduplication step helps downstream analysis? (I am not working with targeted-enrichment libraries, so I guess I should run the deduplication step, right?)
In addition, about the bedGraph output, the manual says the following:
As the methylation percentage is per se not informative of the actual read coverage of detected methylated or unmethylated reads at a position, bismark2bedGraph also writes out a coverage file (using 1-based genomic coordinates) that features two additional columns: <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
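As an illustration of the six-column layout quoted above, the following Python sketch parses coverage lines and derives the read coverage from the two count columns. The example lines are made up, `parse_cov_line` is a hypothetical helper, and the sketch assumes tab-separated fields.

```python
def parse_cov_line(line):
    """Parse one coverage line (assumed tab-separated) into a dict."""
    chrom, start, end, pct, meth, unmeth = line.rstrip("\n").split("\t")
    meth, unmeth = int(meth), int(unmeth)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "percent": float(pct),
        "methylated": meth,
        "unmethylated": unmeth,
        "coverage": meth + unmeth,  # number of calls behind the percentage
    }


# Made-up example lines: both positions are 50% methylated,
# but one is supported by 2 calls and the other by 100.
lines = [
    "chr1\t1001\t1001\t50\t1\t1\n",
    "chr1\t2002\t2002\t50\t50\t50\n",
]
records = [parse_cov_line(l) for l in lines]
for r in records:
    print(r["chrom"], r["start"], r["percent"], r["coverage"])
```

The two positions have the same methylation percentage, but the counts show that the first value rests on 2 calls and the second on 100. That difference is the information the percentage column cannot carry by itself.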
I don't quite understand what "as the methylation percentage is per se not informative of the actual read coverage of detected methylated or unmethylated reads at a position" means. Can someone explain how the --bedGraph output and the .cov file differ? The downstream analysis I'd like to perform is to calculate the methylation percentage across a gene and several kilobases upstream and downstream of it, and see how the percentage changes. Does the methylation percentage I have in mind match the meaning of the fourth column in the .cov output from Bismark? Sorry for the long question; I am new to methylation analysis. Thanks in advance!
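For reference, this is roughly the calculation I have in mind, as a minimal Python sketch. It assumes the per-position counts have already been read from the .cov file into (chromosome, position, methylated count, unmethylated count) tuples. The gene start, the window size and `pooled_methylation` are hypothetical, and pooling the counts across a window is only one possible way to summarise it (averaging the per-position percentages is another).

```python
def pooled_methylation(records, chrom, lo, hi):
    """Return 100 * methylated / total calls over positions in [lo, hi].

    Returns None when no calls fall inside the window.
    """
    meth = unmeth = 0
    for r_chrom, pos, m, u in records:
        if r_chrom == chrom and lo <= pos <= hi:
            meth += m
            unmeth += u
    total = meth + unmeth
    return 100.0 * meth / total if total else None


# Made-up counts: (chromosome, position, methylated, unmethylated)
records = [
    ("chr1", 900, 8, 2),
    ("chr1", 950, 1, 0),
    ("chr1", 1100, 1, 9),
    ("chr1", 1150, 2, 8),
]
gene_start = 1000  # hypothetical gene start coordinate
window = 200       # hypothetical window size in bp

upstream = pooled_methylation(records, "chr1", gene_start - window, gene_start - 1)
downstream = pooled_methylation(records, "chr1", gene_start, gene_start + window)
print(upstream, downstream)
```

Here the pooled upstream value is 9 methylated calls out of 11, whereas the plain mean of the two per-position percentages (80% and 100%) would be 90%, so the two summaries can differ when coverage is uneven.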