I'm analyzing human cancer cell line ATAC-seq data (PE 75bp Illumina) and have encountered a weird insert size distribution. For ATAC-seq PE data, insert size distribution is a good quality control of how well the experiment worked (This is how it would ideally look https://dbrg77.wordpress.com/2017/02/10/atac-seq-insert-size-plotting/). I have used several mapping strategies and always come to a similar distribution plot.
Sequencing is good quality with 50-60 million reads per sample with min. 3 replicates and mapping >80% for all samples. > 95% of forward and reverse reads are paired.
java -jar $tool_path/trimmomatic-0.36.jar PE -phred33 -threads 20 $forward $reverse $forward_1_paired.fastq $forward_unpaired.fastq $reverse_paired.fastq $reverse_unpaired.fastq ILLUMINACLIP:$tool_path/adapters/TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:10 MINLEN:50 bowtie2 --threads 20 --very-sensitive -x $bowtie2index -1 $forward_paired -2 $reverse_paired -S $filename_aligned.sam
samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt
Alternative way of plotting
java -Xmx20g -jar picard.jar CollectInsertSizeMetrics I=$file O=insert_size_metrics.txt H=insert_size_histogram.pdf
Insert size distribution *In the plot it should be written insert length, not fragment length.
Does anyone have an idea what went wrong here? Could this data be still useful to analysis? I was thinking to extract PE reads with insert size between 0-300nt and proceed only with these reads for downstream analysis?