I am trying to conduct a methylation analysis on some bisulfite-seq data. I have to follow (as much as I can) the procedure describe in the following book chapter: https://pubmed.ncbi.nlm.nih.gov/28986896/ which describes a series of steps to be followed in order to obtain the differentially methylated genes in bisulfite seq data. One of the main issues that I encountered when I started following it was that I could not get Bis-SNP to work and instead had to use BS-SNPer.
The output generated by Bis-SNP in the book chapter is piped into methylKit by reformatting the .vcf file and extracting only the columns of interest. As written in the book:
Filter for SNPs using the bisSNP tool suite [...] Finally, process the data for methylKit input. Data from bisSNP are not correctly formatted for methylKit, and must be reformatted.
The output of BisSNP is there fore a .vcf file (described here https://people.csail.mit.edu/dnaase/bissnp2011/BisSNP-UserGuide-latest.pdf ) and I am particularly interested in the columns that signal:
- Number of Unconverted Cytosines in that position or column "CM"
- Number of Converted Cytosines or column "CU"
- Strand of Cytosine relative to reference genome or "CS"
An example of how this output looks like can be seen below:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MAPD_4T chr1 1581713 . A C 57.22 . CS=+;Context=MHH;DP=15;MQ0=1;NS=1;SB=-0.2185 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 0/1:25.2,26.0:0,1,0,3,0,5:0:MHH:1:15:0,5,0,3:57,0,126:57.22:5
The information that is relevant for me here would be:
- CM, which is 0
- CU, which is 1
This means that I have no unconverted Cytosines and 1 converted Cytosine in that particular position (1581713) and chromosome (chr1).
The methylKit converted version of this sample will look as follows:
chr1.1581713 chr1 1581713 + 1 0 1
Where the last two columns are the ones that pertain to the frequency of C and the frequency of T, in this case 0 C and 1 T.
After reading in many places that BS-SNPer is analog to BisSNP and faster, and due to time constraints, I obtained my own .vcf output files, but the issue I am faced is that BS-SNPer is not showing the CS, CM or CU columns whatsoever.
What I have therefore done is tried to interpret the columns that I DO have and try to find out what would be the analog for these three columns and here is where I would benefit from some input.
My .vcf file looks like this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XXXX.bam chr1 567119 . A C 12 . DP=4;ADF=0,0;ADR=0,4;AD=0,4; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 1/1:4:0,0:0,4:0,4:0,0,0,0,0,0,4,0:0,0,0,0,0,0,37,0:0.000,1.000
So the FORMAT description according to documentation and my interpretation of this example is as follows:
- GT"Genotype": 1/1 diploid call, genotype unphased
- DP "Number of high-quality bases"- total number of reads used to call variant:4
- ADF "Allelic depths forward strand": 0,0
- ADR "Allelic depths reverse strand":0,4
- AD "Allelic depths"- total number of reads used to call each allele: 0,4 (A ref allele, C alt allele)
- BSD "Depth ATCG in Watson and in crick":0,0,0,0,0,0,4,0
- BSQ "Average base quality ATCG in watson and in crick":0,0,0,0,0,0,37,0
- ALFR "Allele frequency":0.000,1.000
I understand that DP is the total number of reads used to call the variant, AD is the total number of reads with each allele but only the informative variants. In this example I interpret that 4 reads carry the REF allele (A) and 0 carry the ALT allele (C) so we have an A > C (REF > ALT) transition, and according to the last column ALFR, the allele frequency would be 0 for the C and 1 for A which I interpret as the 4 reads carry just the ref allele and 4 carry the alt (C) " so I have created a methylKit output format that states:
chr1.567119 chr1 567119 x y 0 1
Now, *x is something that I have added just to avoid leaving an empty spot and corresponds to what should be the strand (+ or -). I have been thinking about how to figure out the strand and I am thinking that maybe I can use the ADF (allelic depths forward strand) and ADR columns (allelic depths reverse strand) to identify if we have a C in + or in - strands.
For example, in this case I would think that since ADF is 0, 0 and ADR is 0,4, the strand would be -
My question for you is:
Does my interpretation make sense? Am I terribly wrong in my assumptions? Do you know if there is any comparison WITH EXAMPLES between BS-SNPer and BisSNP outputs that I could check (I have spent weeks looking for one but I have been completely unable to find any...).
Thank you for your time and help, I hope I explained myself enough... Alejandra