I am very new at bioinformatics. I have collected bulkATACseq data from 3 different timepoints (2 embryonic timepoints and 1 adult) and a positive and negative for each timepoint.
I am trying to do kmeans clustering on the combined dataset to demonstrate dynamic chromatin accessibility throughout these different timepoints and conditions.
Here is the workflow I have put my data through so far:
- checked quality of ATAC reads using fastqc
- trim adapters with cutadapt
- map to genome with bowtie
- sort sam files by coord and convert to bam
- index bam files
- make bigwigs using bamcoverage (I have tried both CPM and RPGC normalization. I have also included "extend reads", "ignore duplicates", and "center reads", bin size 10, smooth length 30)
- call peaks using Genrich
- Diffbind analysis of each possible pairwise comparison
- combined the results of each diffbind analysis into one large bed file and filtered peaks by fold change greater than 1 or less than -1, and FDR < 0.05
- sorted bed file by chromosome
- used bedtools "merge" to get rid of duplicates
- ran compute matrix (deep tools) with the following inputs: bigwig file from every condition, 1 bed file including high confidence peaks from everywhere Diffbind comparison, --referencePoint center, -b 1500, -a 1500
- used plotHeatmap with the matrix generated with computeMatrix and have tried a bunch of different numbers for kmeans clusters.
Is this a workflow that makes sense? I am not getting clusters that are specific to my adult samples, which is surprising. I am wondering if I didn't normalize my data properly. Also, on my heatmap, depending on the number I input for kmeans, I am getting clusters that are skewed way to the left or right when they should be centered. What does this indicate? I am thinking it has something to do with the bed file I am inputting -- maybe I need to cut down the number of peaks? I have tried everything I can think of to change and am not sure what to try next.
Thank you in advance!