Hi all,
I need to get the number of CpGs that are methylated or, unmethylated per read of WGBS data. I am trying the calculate PDR values as described here. So far I made some progress but need some help from you guys. I can see this methylated or, unmethylated Cs on IGV. But I was wondering how can I get the information per reads.
paste <(samtools view file.bam |  awk '{print $3 "\t" $4 "\t" $4+length($10) "\t" $10}') <(bedtools bamtobed -i file.bam | awk '{print $6, $4}') | head -1000 | intersectBed -a CG.bed
-b stdin -wa -wb | awk '{print $0, length($7), $2-$5,  substr( $7, (($2-$5)), 2)}' | head
Note that CG.bed is the coordinates for CpGs in human genome.
Last three columns are: read length, CpG location in read and 2 bases from that location. Output is:
chr1    10468   10470   chr1    10368   10469   CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCC   + HWI-ST700660:309:C7T40ACXX:6:1202:9409:92454/2 101 100 CC
chr1    10468   10470   chr1    10368   10469   CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTAACCCTAACCCTAACCCTA   + HWI-ST700660:309:C7T40ACXX:5:1107:12258:97118/2 101 100 TA
chr1    10468   10470   chr1    10369   10470   TAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTA   - HWI-ST699:265:C7RM3ACXX:6:2308:9444:13430/1 101 99 CT
chr1    10468   10470   chr1    10369   10470   TAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTATTTTAATTTTAATTTTAATTTTNATTTT   - HWI-ST539:254:C7T4WACXX:2:1204:2095:36010/1 101 99 TT
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTA   - HWI-ST699:265:C7RM3ACXX:7:1307:13501:63031/1 101 98 CC
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC   + HWI-ST700660:309:C7T40ACXX:6:2306:15611:23135/2 101 98 TA
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCGATCCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCTAACCCTAACCCCT   - HWI-ST699:265:C7RM3ACXX:7:2313:19303:78256/1 101 98 CC
chr1    10468   10470   chr1    10371   10472   ATTTTAATTTTTAATTTTTAATTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTCGAATTTTTAATTCGAATTTTATTCGA   - HWI-ST699:265:C7RM3ACXX:6:2101:2576:51036/1 101 97 TT
chr1    10468   10470   chr1    10371   10472   ACCCTAACCCTAACCCTAACCCCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAGCCCTAACCCTAACCCTAACCCT   - HWI-ST700660:310:C7T58ACXX:4:2115:12614:9155/1 101 97 AC
chr1    10468   10470   chr1    10371   10472   ACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAA   - HWI-ST539:254:C7T4WACXX:1:2211:11231:19289/1 101 97 CC
Two issues with this code are:
1) it is taking into consideration of soft clipped reads which I want to ignore. 2) indels may shift the position of CpGs in the reads which I am not considering. As a consequence, in the last column I can see the dyads of CC and TT which is not expected (ignoring SNVs right now) at a CpG location.
In summary, I want to get the number of CpGs methylated or, unmethylation per reads. Thanks!
PDR (Fig C,E) Landau DA et al, Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia. Cancer Cell. 2014 Dec 8;26(6):813-825:
Fig 1: PDR (Fig C,E)
Fig2: IGV screen shot of my data:
Interesting idea. I plan to analyze this, not sure what it might (potentially reflects heterogeneity) mean. Would you minding sharing your insight from your analysis or refer to your paper if it is already published.