I have ATAC-Seq data of 75bp length mapped using bowtie 2, and I had posted earlier (Insert Size Metrics of ATAC-Seq data) because I had the issue of flipped reads. This was fixed when I trimmed the reads to 36bp as recommended, reads were trimmed after removing for adapter contamination. I would like to understand how trimming the reads to 36bp fixed the problem. I do see from ENCODE pipeline that they trim the reads but do not see many publications doing this. Just want to see what is standard, and if it would be an issue for downstream analysis or publishing?
From fastqc had nextera adaptors
##used trim galore to remove them
java -jar trimmomatic-0.38.jar PE P1.fastq P2.fastq P1_output_forward_paired.fastq P1_output_forward_unpaired.fq
P2_output_reverse_paired.fastq P2_output_reverse_unpaired.fq ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:1:true TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25
## used the paired-reads for bowtie 2
bowtie2 -p 4 -q --very-sensitive -x $TRANS_DIR/genome -1 P_output_forward_paired.fastq -2 P_output_reverse_paired.fastq -S P_aln_unsorted.sam
##remove MT chromosomes
grep -v 'MT' P_unsorted.sam | samtools view -b -h -F 4 -f 0x2 - >P_aln_unsorted.bam
##sam to bam
java -jar picard.jar SortSam I=P_aln_unsorted.bam O=P_aln_sorted.bam SORT_ORDER=coordinate
##markduplicates
java -jar picard.jar MarkDuplicates INPUT=P_aln_sorted.bam OUTPUT=P_aln_sorted_rem.bam M=P_metrics.txt REMOVE_DUPLICATES=true
## index and flagstat
samtools index P_aln_sorted_rem.bam
samtools flagstat P_aln_sorted_rem.bam >P_aln_sorted_rem
##insertsizemetrics
java -jar picard.jar CollectInsertSizeMetrics I=P_aln_sorted_rem.bam O=P_insert_size_metrics.txt H=P_histogram.pdf
I agree with that. As I already said in that original thread, I personally never had these issues you mentioned. Trimming to 36bp only takes away observations and by this power during the alignment step as longer reads (which are still the majority in ATAC-seq, only like 20% come from nucleosome-free regions = < 100bp) align better than these ultrashort reads. I still suspect that these was something flawed during your lowlevel processing. In the last month I've analyzed dozens of ATAC-seq samples from human and mouse, primary and cell lines, published and our own data, never had that issue.
I did try only the adaptor trimming, but still had issues with flipped reads, and only then resorted to trimming the reads. I am going to redo the entire pipeline again to see where am going wrong. Do you have a preference between bowtie2 and BWA?
I use
bowtie2
but only because rarelybwa
needs a lot of memory (tens to little below 100Gb). Since my scripts are highly parallelized, it happened at times that multiplebwa
threads consumed so much memory that it crashed the node. Withbowtie2
I never had that. Beyond that no preference. Both are totally fine.Did you check the FastQC report what it has to say about over-represented sequences? Make sure all of these are captured in whatever you're trimming off.
edited the post to include what I did (did not include the part where I trimmed the reads). The fastqc report had adapter content that I removed, and checked fastqc again. unable to figure out what am doing wrong as it is very straight forward.
Did the adapter content vanish after adapter removal? I am not familiar with trimmomatic but you did not specify which adapter to trim (Nextera).
Think it is an editing issue, it shows up now. I did remove the adapters,d id a fastqc, and do not see it anymore. I redid the whole thing without trimming the reads, and I have the Read flips again with the blue region
What was the read length distribution after trimmomatic? (FastQC will tell you)
If you pull out some of those "flipped" reads after alignment, what does their sequence/alignment look like?
Sequence length is between 25-75 after trimming for adapters because I had specified a minimum length of 25. The adapter is completely gone. Last time when I trimmed it to 36 bp, those flipped reads were gone. Am unable to figure out where it comes from.
Can you identify the flipped reads in the BAM file? And actually look at their sequence? Are you sure they were actually aligned when you trimmed to 36bp? (Odds are they weren't aligned)
Am going to pull out those reads, and also going to check mapping parameters for bowtie2. As Igor mentioned, the fragments could be shorter than the read length, and hence I see the flip whereas once I trim to 36bp I no longer see it.
If the only issue was that some fragments are shorter than read length, proper adapter and primer trimming should have taken care of that. There's something else going on here -- either the adapter trimming isn't sufficient/comprehensive or...else...
I see that the adapter is completely gone, no overrep sequences, and everything else in fastqc is a green.
Was just going through the ENCODE pipeline where they trim the reads to 30 bp and this discussion at What is the reason for trimming reads to 30 bp for ATAC-seq aligning? https://www.encodeproject.org/pipelines/ENCPL035XIO/
So, what is your alignment rate? (the discussion you just referenced seems to go into the direction of my gut feeling, too: it's not that you're really gaining anything by trimming to 36bp, you're just making sure you're definitely losing the ones with funny behavior.
The "flipped" reads may not be abundant enough to trigger the diagnostics in FastQC. You will have to look at a couple of examples of the flipped reads to see what's going on.
the alignment rate is between 97-99% using untrimmed reads, and 100% in terms of trimmed reads
I also checked for chrM using Samtools idxstats of the sorted bam file, and there are no MT reads present.
Thanks for the detailed explanation