I want to identify a list of accessible genes in a DNase-seq experiment. I downloaded the alignment file
ENCFF001CTZ.bam from Encode. I performed peak calling by using macs1.4 and obtained a peaks.bed file containing coordinates of peaks. Next, I performed peak annotation by Homer using
annotatePeaks.pl. The Final annotation file I obtained contains following columns:
"Peak_ID" "Chr" "Start" "End" "Strand" "Peak.Score" "Focus.Ratio.Region.Size" "Annotation" "Detailed.Annotation" "Distance.to.TSS" "Nearest.PromoterID" "Entrez.ID" "Nearest.Unigene" "Nearest.Refseq" "Nearest.Ensembl" "Gene.Name" "Gene.Alias" "Gene.Description" "Gene.Type"
A screenshot of Annotation_File from Homer is here.
QUESTION: How I can finally obtain a list of genes which I can say are accessible? Do I need to only take genes for whose the column 'Annotation' tells me that region is near to 'promoter-TSS' or I can also take those which are 'Intergenic' or so?
What information this (that a gene's intergenic region or intron is accessible) can provide?
Thanks for the response, So you mean talking about 'just accessible genes' I can take all those peaks which fall in intergenic or intron regions as well and say that these genes are accessible and an expression can be detected?
Yes, I want to make a list of all genes for which I can say that these genes are accessible and 'will' be transcribed.
Can you shed some light on peak scores as well? I have the peak score in range of 3205.08 to 50. I was trying to get an idea how the scores are calculated but I couldn't found a satisfactory answer. My question is: for the peak having the score (quite low) i.e 50.24 and distance of -20 from Promoter-TSS of HOXC6 gene. Can I say with confidence that HOXC6 gene is accessible and we can detect its expression?
Not exactly. "Accessible genes" has no biological meaning. If you are looking for transcribed genes or genes that would be amenable to transcription, the DNaseI region should overlap the promoter specifically. I don't think anyone knows what an intergenic or intronic DNaseI site represents from a biological perspective.
I would always be very careful about using arbitrary peak score cutoffs. These can be highly variable depending on experiments, sequencing, etc. In your case, the real key question is: Can you detect expression of HOXC6? If you can, then the DNaseI site is probably real.
I used macs 1.4 and macs 2 for peak calling. Then I used Homer anonotatePeaks.pl) for peak annotation to hg19. Peak scores ranges from 3205.08 to 50 and 342.34 to 2 respectively. The number of peaks identified are 65229 and 74979 respectively. I used the default parameters in both cases. Why is there so much variation in peak scores? what is the role of this peak score ? It is not clear from the Homer website.
High peak score means that there are more chances of true positive peak? If yes, then what could be the right threshold to filter out the peaks?
In the supplementary material of a paper I found the information that they used Homer for chip-seq peaks identification and filter the peaks by peak score >10, but there they are not mentioning a reason behind this threshold.
Is there a standard threshold for DNAse-seq accessibility (ON and OFF regions) based on peak tag density? The peak score columns contains values from 20 to 1639. Can I apply a threshold (let's say 50) to say that those below it are not accessible and those having tag density higher than threshold are accessible. I have seen a couple of publication on ATAC-seq using a defined threshold for this purpose.