Question: How to identify Accessible genes from DNase-seq?
1
gravatar for Bioinformatist Newbie
4.9 years ago by
Germany
Bioinformatist Newbie250 wrote:

I want to identify a list of accessible genes in a DNase-seq experiment. I downloaded the alignment file "ENCFF001CTZ".bam from Encode. I performed peak calling by using macs1.4 and obtained a peaks.bed file containing coordinates of peaks. Next, I performed peak annotation by Homer using annotatePeaks.pl. The Final annotation file I obtained contains following columns:

"Peak_ID" "Chr" "Start" "End" "Strand" "Peak.Score" "Focus.Ratio.Region.Size" "Annotation" "Detailed.Annotation" "Distance.to.TSS" "Nearest.PromoterID" "Entrez.ID" "Nearest.Unigene" "Nearest.Refseq" "Nearest.Ensembl" "Gene.Name" "Gene.Alias" "Gene.Description" "Gene.Type"

A screenshot of Annotation_File from Homer is here.

QUESTION: How I can finally obtain a list of genes which I can say are accessible? Do I need to only take genes for whose the column 'Annotation' tells me that region is near to 'promoter-TSS' or I can also take those which are 'Intergenic' or so?

What information this (that a gene's intergenic region or intron is accessible) can provide?

ADD COMMENTlink modified 4.9 years ago by jotan1.2k • written 4.9 years ago by Bioinformatist Newbie250
2
gravatar for jotan
4.9 years ago by
jotan1.2k
Australia
jotan1.2k wrote:

Strictly speaking, any DNAseI HS site that overlaps a gene would be an "accessible gene".

But I'm guessing that you're looking for a transcribed gene(?); a gene that is accessible to PolII? If that's the case, you want the promoter-TSS overlaps.

DNaseI sites that sit elsewhere in the genome could be enhancers, but you would need more information to define these. Or they could be virtually anything else. I'm not aware of any studies that have tried to characterise the protein profiles associated with HS sites at non-promoter sites.

ADD COMMENTlink modified 10 months ago by RamRS30k • written 4.9 years ago by jotan1.2k

Thanks for the response, So you mean talking about 'just accessible genes' I can take all those peaks which fall in intergenic or intron regions as well and say that these genes are accessible and an expression can be detected?

Yes, I want to make a list of all genes for which I can say that these genes are accessible and 'will' be transcribed.

Can you shed some light on peak scores as well? I have the peak score in range of 3205.08 to 50. I was trying to get an idea how the scores are calculated but I couldn't found a satisfactory answer. My question is: for the peak having the score (quite low) i.e 50.24 and distance of -20 from Promoter-TSS of HOXC6 gene. Can I say with confidence that HOXC6 gene is accessible and we can detect its expression?

ADD REPLYlink written 4.9 years ago by Bioinformatist Newbie250

So you mean talking about 'just accessible genes' I can take all those peaks which fall in intergenic or intron regions as well and say that these genes are accessible and an expression can be detected?

Not exactly. "Accessible genes" has no biological meaning. If you are looking for transcribed genes or genes that would be amenable to transcription, the DNaseI region should overlap the promoter specifically. I don't think anyone knows what an intergenic or intronic DNaseI site represents from a biological perspective.

My question is: for the peak having the score (quite low) i.e 50.24 and distance of -20 from Promoter-TSS of HOXC6 gene. Can I say with confidence that HOXC6 gene is accessible and we can detect its expression?

I would always be very careful about using arbitrary peak score cutoffs. These can be highly variable depending on experiments, sequencing, etc. In your case, the real key question is: Can you detect expression of HOXC6? If you can, then the DNaseI site is probably real.

ADD REPLYlink modified 10 months ago by RamRS30k • written 4.9 years ago by jotan1.2k

I used macs 1.4 and macs 2 for peak calling. Then I used Homer anonotatePeaks.pl) for peak annotation to hg19. Peak scores ranges from 3205.08 to 50 and 342.34 to 2 respectively. The number of peaks identified are 65229 and 74979 respectively. I used the default parameters in both cases. Why is there so much variation in peak scores? what is the role of this peak score ? It is not clear from the Homer website.

High peak score means that there are more chances of true positive peak? If yes, then what could be the right threshold to filter out the peaks?

In the supplementary material of a paper I found the information that they used Homer for chip-seq peaks identification and filter the peaks by peak score >10, but there they are not mentioning a reason behind this threshold.

ADD REPLYlink modified 10 months ago by RamRS30k • written 4.9 years ago by Bioinformatist Newbie250

Is there a standard threshold for DNAse-seq accessibility (ON and OFF regions) based on peak tag density? The peak score columns contains values from 20 to 1639. Can I apply a threshold (let's say 50) to say that those below it are not accessible and those having tag density higher than threshold are accessible. I have seen a couple of publication on ATAC-seq using a defined threshold for this purpose.

ADD REPLYlink written 4.6 years ago by Bioinformatist Newbie250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1628 users visited in the last hour