HI, I am working on some ATAC-seq data. We have performed paired-end sequencing with read length of 150 bp on a total of 24 samples (8 conditions in triplicates) To begin with, I will just briefly describe some of the main analysis steps.
I have trimmed the read to remove adapter contaminants and also filter out reads of low quality, using BBmap (version 38.58):
bbduk.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz out1=sample_R1_clean.fastq.gz out2=sample_R2_clean.fastq.gz ref=/home/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
bbmap/bbduk.sh in1=sample_R1_clean.fastq.gz in2=sample_R2_clean.fastq.gz out1=sample_R1_clean2.fastq.gz out2=sample_R2_clean2.fastq.gz qtrim=l trimq=30
Next, I mapped the reads to the human reference genome GRCh38 using Bowtie2 (version 2.3.4.1). For this, pre-built bowtie index files excluding ALT contigs were downloaded from the bowtie2 page at SourceForge. Code used:
bowtie2 -q -p 8 --very-sensitive-local -x GRCh38_noalt_as -1 sample_R1_clean2.fastq.gz -2 sample_R2_clean2.fastq.gz -S sample.sam
The sam files was further converted to bam files, sorted and indexed using samtools (version 1.9)
Performing a number of post-alignment quality control steps I could conclude that the samples contained between 35%-78% duplicates and between 27%-70% mitochondrial reads. Filtering these out together with blacklisted regions at peak calling with Genrich gave about 70k-100k peaks which on further inspection made sense.
However, now to one of my main concerns and the reason I am writing this post. For the bam files before any further filtering, when plotting fragment size distribution (using the R packages ATACseqQC and ggplot2) the graph shows a huge spike at exactly 150 bp (see image below)
First I was wondering whether this spike was due to duplicates, the mitochondrial reads or blcklisted regions. However, even after all of these have been removed the huge spike is still present, now with an even larger relative size compared to the rest of the profile (see image below)
I am hoping that someone could help me understand what is going on here and what may generate this enormous spike in the fragment size distribution. The fact that it occur at 150 bp is interesting, because that is the exact length of the sequencing reads, but I do not know whether that connection is important here.
Also, on a side note, I have tested several tools for calculating and plotting the fragment size distribution, and they gives a little different profiles, though the big spike at 150 bp is always there. However, the "fragSizeDist" function in ATACseqQC also gives a large frequency of fragments with size 0 bp (nearly as many as for the huge spike at 150 bp), which I remove before plotting.
As I have still not received any explanation regarding the strange fragment size distribution described above, I thought to revive this post by adding some additional distribution plots. I tested to use the picard tool CollectInsertSizeMetrics:
Below is the distribution when using an unfiltered bam file. Here, read orientation is also displayed (which I still need to understand what it means). The huge spike at 150 bp (actually 147-150 bp I have realized) is still there.
Next, I did the same for the filtered bam file (mitochondrial reads and reads in blacklisted regions removed) and the generated distribution is below. The change is similar to what is seen in my original post.
Is there anyone that can explain to me what the FR and RF orientation corresponds to? Is it the + and - strands? It looks like the RF-orientation abruptly stops at 150 bp, and I don't know what to make out of this.
Hope this additional information could help solving this frustrating fragment distribution problem.
Don't kill yourself on these things. If you have the library prep QC results (Bioanalyzer/TapeStation) available you can double-check whether this is already in the library (unlikely) or whether the aligner is doing something strange. I always use
--very-sensitive
withbowtie2
. Maybe the spike is some edge case coming from the local alignment together with how the aligner calculates the insert sizes, I do not know. I would proceed with downstream analysis and only go back if something looks very wrong there. Check data on the genome browser, do PCA and MA-plots, and then see if this spike merits a closer look or not.