Question: counting number of CpGs that are methylated and unmethylated in a read of whole genome bisulfite data
1
gravatar for Rashedul Islam
6 days ago by
Canada
Rashedul Islam310 wrote:

Hi all,

I need to get the number of CpGs that are methylated or, unmethylated per read of WGBS data. I am trying the calculate PDR values as described here. So far I made some progress but need some help from you guys. I can see this methylated or, unmethylated Cs on IGV. But I was wondering how can I get the information per reads.

paste <(samtools view file.bam |  awk '{print $3 "\t" $4 "\t" $4+length($10) "\t" $10}') <(bedtools bamtobed -i file.bam | awk '{print $6, $4}') | head -1000 | intersectBed -a CG.bed
-b stdin -wa -wb | awk '{print $0, length($7), $2-$5,  substr( $7, (($2-$5)), 2)}' | head

Note that CG.bed is the coordinates for CpGs in human genome. Last three columns are: read length, CpG location in read and 2 bases from that location. Output is:

chr1    10468   10470   chr1    10368   10469   CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCC   + HWI-ST700660:309:C7T40ACXX:6:1202:9409:92454/2 101 100 CC
chr1    10468   10470   chr1    10368   10469   CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTAACCCTAACCCTAACCCTA   + HWI-ST700660:309:C7T40ACXX:5:1107:12258:97118/2 101 100 TA
chr1    10468   10470   chr1    10369   10470   TAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTA   - HWI-ST699:265:C7RM3ACXX:6:2308:9444:13430/1 101 99 CT
chr1    10468   10470   chr1    10369   10470   TAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTATTTTAATTTTAATTTTAATTTTNATTTT   - HWI-ST539:254:C7T4WACXX:2:1204:2095:36010/1 101 99 TT
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTA   - HWI-ST699:265:C7RM3ACXX:7:1307:13501:63031/1 101 98 CC
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC   + HWI-ST700660:309:C7T40ACXX:6:2306:15611:23135/2 101 98 TA
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCGATCCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCTAACCCTAACCCCT   - HWI-ST699:265:C7RM3ACXX:7:2313:19303:78256/1 101 98 CC
chr1    10468   10470   chr1    10371   10472   ATTTTAATTTTTAATTTTTAATTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTCGAATTTTTAATTCGAATTTTATTCGA   - HWI-ST699:265:C7RM3ACXX:6:2101:2576:51036/1 101 97 TT
chr1    10468   10470   chr1    10371   10472   ACCCTAACCCTAACCCTAACCCCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAGCCCTAACCCTAACCCTAACCCT   - HWI-ST700660:310:C7T58ACXX:4:2115:12614:9155/1 101 97 AC
chr1    10468   10470   chr1    10371   10472   ACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAA   - HWI-ST539:254:C7T4WACXX:1:2211:11231:19289/1 101 97 CC

Two issues with this code are:

1) it is taking into consideration of soft clipped reads which I want to ignore. 2) indels may shift the position of CpGs in the reads which I am not considering. As a consequence, in the last column I can see the dyads of CC and TT which is not expected (ignoring SNVs right now) at a CpG location.

In summary, I want to get the number of CpGs methylated or, unmethylation per reads. Thanks!

PDR (Fig C,E) Landau DA et al, Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia. Cancer Cell. 2014 Dec 8;26(6):813-825:

Fig 1: PDR (Fig C,E)

Fig2: IGV screen shot of my data:

Fig 1

Fig 2

sequencing next-gen genome • 79 views
ADD COMMENTlink modified 6 days ago • written 6 days ago by Rashedul Islam310
3
gravatar for Devon Ryan
6 days ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

Have a look at MethylDackel perRead, which will do what you want properly.

ADD COMMENTlink modified 6 days ago • written 6 days ago by Devon Ryan90k

Thanks a lot for pointing me such a nice tool. The following command worked for me is:

MethylDackel perRead hg38.fa wgbs.bam

The output example is below. Can I interpret the output like that- read name, chr, start position of the read. For last two columns- there are 3 CpGs in those reads but in first and second reads 2/3 of the CpGs are methylated, for 3rd and 4th reads all 3/3 CpGs are methylated.

^CHS16_68:4:2113:9438:23805 chr10   4412865 66.666667   3
HS16_68:4:2309:8702:70123   chr10   4412866 66.666667   3
HS17_48:2:1103:6100:15523   chr10   4412872 100.000000  3
HS16_68:3:2109:20578:17792  chr10   4412872 100.000000  3
ADD REPLYlink written 6 days ago by Rashedul Islam310

Yes, that's correct.

ADD REPLYlink written 6 days ago by Devon Ryan90k

Thanks again. That was very helpful.

ADD REPLYlink written 6 days ago by Rashedul Islam310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1800 users visited in the last hour