Hi, I'm fairly new to bioinformatics. I have some RNA-seq data for mapping an RNA modification. With a certain treatment, this modification causes the reverse transcriptase (RT) to stop (theoretically at the nucleotide immediately before the modification) or to misincoporate a nucleotide (in place of the modified base).
I have BAM files for the control (no treatment added to induce stops and mismatches) and treated (which should have more stops and mismatches) RNA samples.
To map the modification, I want to count the number of stops or mismatches at each nucleotide, divide that by the total number of reads (i.e. number of reads where there is a stop or mismatch + number of reads without). I want to do this for both the treatment and the control samples, then subtract the events that occur in the control from the treated sample to normalize for other factors that can cause stops or misincorporations. This is based on a method in this paper.
(Note: just to avoid any confusion, since misincorporations and stops are mutually exclusive events (a RT will not stop and misincorporate a base), and one of them occurs immediately before the modification (stop) while the other occurs in its place (mismatch), I would do the calculations for each event separately.)
The issue is that while I know the theory, I really don't know how to put it in practice and extract that information from the BAM files. I read about the SAM/BAM file formats and I know that they contain the information I need, but not sure how to extract it. I have a Windows 10 OS with Python 3 and R installed. I know a bit of R and have just recently started learning Python 3. My PI would like me to do this in Python, so if you can offer some insight as to how to get started with this in Python 3, I would really appreciate it. But if R is better, that's fine too, I just want to get this done. Thank you.