ATAC-seq fragment size distribution - huge spike at 150 bp
3
1
Entering edit mode
2.1 years ago
Simon ▴ 10

HI, I am working on some ATAC-seq data. We have performed paired-end sequencing with read length of 150 bp on a total of 24 samples (8 conditions in triplicates) To begin with, I will just briefly describe some of the main analysis steps.

I have trimmed the read to remove adapter contaminants and also filter out reads of low quality, using BBmap (version 38.58):

bbduk.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz out1=sample_R1_clean.fastq.gz out2=sample_R2_clean.fastq.gz ref=/home/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

bbmap/bbduk.sh in1=sample_R1_clean.fastq.gz in2=sample_R2_clean.fastq.gz out1=sample_R1_clean2.fastq.gz out2=sample_R2_clean2.fastq.gz qtrim=l trimq=30

Next, I mapped the reads to the human reference genome GRCh38 using Bowtie2 (version 2.3.4.1). For this, pre-built bowtie index files excluding ALT contigs were downloaded from the bowtie2 page at SourceForge. Code used:

bowtie2 -q -p 8 --very-sensitive-local -x GRCh38_noalt_as -1 sample_R1_clean2.fastq.gz -2 sample_R2_clean2.fastq.gz -S sample.sam

The sam files was further converted to bam files, sorted and indexed using samtools (version 1.9)

Performing a number of post-alignment quality control steps I could conclude that the samples contained between 35%-78% duplicates and between 27%-70% mitochondrial reads. Filtering these out together with blacklisted regions at peak calling with Genrich gave about 70k-100k peaks which on further inspection made sense.

However, now to one of my main concerns and the reason I am writing this post. For the bam files before any further filtering, when plotting fragment size distribution (using the R packages ATACseqQC and ggplot2) the graph shows a huge spike at exactly 150 bp (see image below)

enter image description here

First I was wondering whether this spike was due to duplicates, the mitochondrial reads or blcklisted regions. However, even after all of these have been removed the huge spike is still present, now with an even larger relative size compared to the rest of the profile (see image below)

enter image description here

I am hoping that someone could help me understand what is going on here and what may generate this enormous spike in the fragment size distribution. The fact that it occur at 150 bp is interesting, because that is the exact length of the sequencing reads, but I do not know whether that connection is important here.

Also, on a side note, I have tested several tools for calculating and plotting the fragment size distribution, and they gives a little different profiles, though the big spike at 150 bp is always there. However, the "fragSizeDist" function in ATACseqQC also gives a large frequency of fragments with size 0 bp (nearly as many as for the huge spike at 150 bp), which I remove before plotting.

ATAC-seq • 3.0k views
ADD COMMENT
2
Entering edit mode
13 months ago
pgp.martin ▴ 30

I have the same observation in my ATAC-seq data.
DNBseq paired end 2x100bp reads after alignment with bowtie2 generate an overabundance of fragments of 101 and 102 bp and a depletion of fragments of 98bp and 99bp. The proportion of 100bp fragments is as expected. It is thus independent of the technology (or duplicate fragments as you mention) but rather due to the analysis.
I believe it comes from a combination of the read trimming (cutadapt in my case) and using the --dovetail option in bowtie2. If the DNA fragment is actually 98 or 99bp long, the trimming algorithm will not trim the first 1-2 bases coming from the adapters from the reads (too few bases representing the adapter). The reads will still be 100bp long (in my case) and the alignment will thus generate fragments of 101 or 102bp (1 or 2 extra bases at the end of each read = +2 or +4bp in the aligned fragment if the fragment is respectively 99bp or 98bp), possibly with mismatches at the end of the alignments.
Hope this helps understand your observation.

ADD COMMENT
0
Entering edit mode

Great catch!

ADD REPLY
1
Entering edit mode
2.1 years ago

Could be the core nucleosome particles?

Should be at about 151-152 bp

If it was over saturated with the transposase it would theoretically make sense to see this... but is 150 bp your read size aswell?

ADD COMMENT
0
Entering edit mode

Hi benformatics, I have seen others mention this, but as you said it should be at about 151-152 bp.

Yes, 150 bp is also the read size used in this experiment.

The frequency table of fragment sizes for the full bam (first image in the post) looks like this:

144 210926
145 229622
146 222414
147 283098
148 287900
149 353896
150 1721198
151 302068
152 247526
153 235998
154 222304
155 221544
156 224576
157 218348
158 215728

Here it is clear that 150 bp is the only size standing out. I have been thinking that maybe there are some unpaired reads that are still retained in the bam file and that ATACseqQC consider these in its calculation as 150 bp fragments. However, according to the alignment output from Bowtie2 (below) all the reads in the bam file were paired.

40043473 reads; of these:
  40043473 (100.00%) were paired; of these:
    12868345 (32.14%) aligned concordantly 0 times
    6522844 (16.29%) aligned concordantly exactly 1 time
    20652284 (51.57%) aligned concordantly >1 times
    ----
    12868345 pairs aligned concordantly 0 times; of these:
      2290511 (17.80%) aligned discordantly 1 time
    ----
    10577834 pairs aligned 0 times concordantly or discordantly; of these:
      21155668 mates make up the pairs; of these:
        1258245 (5.95%) aligned 0 times
        964052 (4.56%) aligned exactly 1 time
        18933371 (89.50%) aligned >1 times
98.43% overall alignment rate 
ADD REPLY
0
Entering edit mode

I was thinking that maybe one possibility is that you have a lot of degradation and so you have a bunch of short fragment that are getting sequenced from both ends for completely overlapping pairs... you should take a look at a where the alignments of these 150-bp reads are and take a look at the error rates in IGV

ADD REPLY
0
Entering edit mode

I extracted only the 150bp fragments from the bam file (who's fragment profile I have shown above) using the following command:

samtools view -h sample.bam | awk 'substr($0,1,1)=="@" || ($9==150) || ($9==-150)' | samtools view -b > sample_150bp.bam

I then visualized the sample_150bp.bam file in IGV. It seems these 150bp fragments are not enriched at any certain places. They seems to be quite evenly spread out throughout the genome. I also took a look at the information given by samtools stats, which reported an error rate of 4.35%.

Reads
total:  1,721,198   
filtered:   0   (0.0%)
non-primary:    0   
duplicated:     0   (0.0%)
mapped:     1,721,198   (100.0%)
zero MQ:    13,816  (0.8%)
avg read length:    125     
Bases
total:  215,530,686     
mapped:     169,547,505     (78.7%)
error rate:     4.35%

I am no way closer to understanding from where this spike at 150bp comes from. Its apparently not duplicates because its still there when removing duplicates.

I am now thinking to regard this as some kind of artifact and perhaps to just remove all 150bp fragments and see whether it have an impact on my downstream analyses or not. Judging by the fact that they seems to be evenly spread out in in the genome I would guess they are just contributing with background signal, so the removal of them may even improve my analyses such as peak calling etc.

Any comment on this idea?

ADD REPLY
0
Entering edit mode

Can you download some published ATAC-seq data in the same species and just check in IGV at a few locations if your reads line up with the expected nucleosome positions?

ADD REPLY
0
Entering edit mode
24 months ago
Simon ▴ 10

As I have still not received any explanation regarding the strange fragment size distribution described above, I thought to revive this post by adding some additional distribution plots. I tested to use the picard tool CollectInsertSizeMetrics:

picard CollectInsertSizeMetrics I=sample.bam O=sample_fragdist_metrics.txt H=sample_fragdisthisto.pdf

Below is the distribution when using an unfiltered bam file. Here, read orientation is also displayed (which I still need to understand what it means). The huge spike at 150 bp (actually 147-150 bp I have realized) is still there.

enter image description here

Next, I did the same for the filtered bam file (mitochondrial reads and reads in blacklisted regions removed) and the generated distribution is below. The change is similar to what is seen in my original post.

enter image description here

Is there anyone that can explain to me what the FR and RF orientation corresponds to? Is it the + and - strands? It looks like the RF-orientation abruptly stops at 150 bp, and I don't know what to make out of this.

Hope this additional information could help solving this frustrating fragment distribution problem.

ADD COMMENT
0
Entering edit mode

Don't kill yourself on these things. If you have the library prep QC results (Bioanalyzer/TapeStation) available you can double-check whether this is already in the library (unlikely) or whether the aligner is doing something strange. I always use --very-sensitive with bowtie2. Maybe the spike is some edge case coming from the local alignment together with how the aligner calculates the insert sizes, I do not know. I would proceed with downstream analysis and only go back if something looks very wrong there. Check data on the genome browser, do PCA and MA-plots, and then see if this spike merits a closer look or not.

ADD REPLY

Login before adding your answer.

Traffic: 1675 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6