Question: ATAC-seq fragment length distribution
0
gravatar for Grinch
11 months ago by
Grinch60
Germany
Grinch60 wrote:

Hi,

I'm analyzing human cancer cell line ATAC-seq data (PE 75bp Illumina) and have encountered a weird insert size distribution. For ATAC-seq PE data, insert size distribution is a good quality control of how well the experiment worked (This is how it would ideally look https://dbrg77.wordpress.com/2017/02/10/atac-seq-insert-size-plotting/). I have used several mapping strategies and always come to a similar distribution plot.

Sequencing is good quality with 50-60 million reads per sample with min. 3 replicates and mapping >80% for all samples. > 95% of forward and reverse reads are paired.

Analysis

java -jar $tool_path/trimmomatic-0.36.jar PE -phred33 -threads 20 $forward $reverse $forward_1_paired.fastq $forward_unpaired.fastq $reverse_paired.fastq $reverse_unpaired.fastq ILLUMINACLIP:$tool_path/adapters/TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:10 MINLEN:50

bowtie2 --threads 20 --very-sensitive -x $bowtie2index -1 $forward_paired -2 $reverse_paired -S $filename_aligned.sam

Plotting

samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt

Alternative way of plotting

java -Xmx20g -jar picard.jar CollectInsertSizeMetrics I=$file O=insert_size_metrics.txt H=insert_size_histogram.pdf

Insert size distribution *In the plot it should be written insert length, not fragment length.

Does anyone have an idea what went wrong here? Could this data be still useful to analysis? I was thinking to extract PE reads with insert size between 0-300nt and proceed only with these reads for downstream analysis?

atac-seq • 1.7k views
ADD COMMENTlink modified 9 weeks ago • written 11 months ago by Grinch60

Please share your code for adapter trimming, mapping and generating this figure. Then we'll see. Still, this looks suspicious and I've never seen it in our ATAC-seq data, be it mouse or human.

ADD REPLYlink modified 11 months ago • written 11 months ago by ATpoint19k

I've added the information in the description.

ADD REPLYlink written 11 months ago by Grinch60

The only odd thing I see is that you trimmed for the TruSeq instead of the Nextera adapter. Sequence to trim on both ends is CTGTCTCTTATACACATCT. This makes a difference when calculating insert sizes, but typically the nucleosomal pattern should still be visible. How did the gel or Bioanalyzer picture look prior to sequencing (hope you did that quality control)?

ADD REPLYlink written 11 months ago by ATpoint19k

Hi , There is few question to ask ; -is your sequencing is good ? ( high coverage of you insertion ? ) -is your model organism got a "complete" genome ref ? ( correct assembly? high concentration of TE/ high genome dynamic ? )

ADD REPLYlink written 11 months ago by Titus890

I've added the information in the description.

ADD REPLYlink written 11 months ago by Grinch60

So you are with data with hi heterogeneity/ high rearrangement because of cancer cells , did you check with for example IGV the coverage of one insertion ?

ADD REPLYlink written 11 months ago by Titus890
1

I do not think that the cell-of-origin is the reason here. We've done ATAC-seq in different cancer cell lines, and the pattern was always clear and distinct. ATAC-seq still looks at the genome/regulome, and single rearrangements IMHO do not influence the overall picture.

ADD REPLYlink written 11 months ago by ATpoint19k

Ok i taught it could explain the differences ...

ADD REPLYlink written 11 months ago by Titus890

Was the plot from your custom code or the data from Picard? What does the output of bamPEFragmentSize from deepTools look like? What does the output of samtools idxstats look like? Are you sure this is actually your sample and not someone else's (it happens, though rarely)?

ADD REPLYlink written 11 months ago by Devon Ryan91k
3
gravatar for Grinch
9 weeks ago by
Grinch60
Germany
Grinch60 wrote:

For anyone looking at this post in the future, I will explain how I solved this issue: The issue was in the raw data I was given. The data, as i mentioned above was PE sequencing and i was given two files per sample, marked as _1 and _2, and the information that these are forward and reverse reads. I processed them as such and gotten no errors and good mapping results. But in fact, what I realised later is that these were lane 1 and lane 2 files and both files contained both forward and reverse reads. Oddly i didn't get any errors while removing adapters or mapping, despite the fact that i treated them as if file 1 contained forward reads and file 2 reverse reads. The data now looks as it should, with the typical nucleosome phasing pattern and good quality.

So lesson learned, always double check the information you are given together with the files.

ADD COMMENTlink written 9 weeks ago by Grinch60

What was the percentage of properly-paired reads in the original alignment? I guess something close to zero?

ADD REPLYlink written 9 weeks ago by ATpoint19k

Surprisingly not. More than 95% were paired, out of those more than 50% were properly paired. I would also expect much lower numbers. To me, it looks as if someone has half way sorted the files by forward/reverse reads, but not completely. I find it hard to believe that these are the files directly from the sequencer.

ADD REPLYlink written 9 weeks ago by Grinch60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1553 users visited in the last hour