I'm looking at the ENCODE ATAC-seq data analysis pipeline (https://www.encodeproject.org/pipelines/ENCPL035XIO/), and it recommends trimming the reads to 30bp. I'm having trouble understanding the reasoning behind this. You'll get a better alignment rate with shorter reads, but wouldn't that introduce bias?
My 2p, not really answering the question about the 30bp trimming: I had good (= sensible) results processing ATAC-seq pretty much as standard ChIP-seq. This is my pipeline for 75 bp reads:
Trim adapters with
bwa mem with default settings
Discard alignments with flag 2304 (secondary, i.e. not primary, or supplementary alignments) and MAPQ < 10
If a "blacklist" of ChIP-seq genomic regions is available, remove reads in these regions
Mark duplicates with picard
Remove duplicates and reads mapping to the mitochondrial genome (ATAC-seq maps a lot of reads to chrM, up to 40-60%, and I prefer to remove them before peak calling)
Call peaks with macs2. For human samples I find tens to hundreds of thousands of peaks, looking pretty sharp and often in proximity of TSSs.
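The flag/MAPQ filter in the steps above is what `samtools view -F 2304 -q 10` does; here is a minimal Python sketch of the bit arithmetic (the function name is just for illustration), showing why 2304 excludes exactly the secondary and supplementary records:

```python
# 2304 = 256 (secondary, i.e. "not primary") + 2048 (supplementary).
NOT_PRIMARY = 0x100      # 256
SUPPLEMENTARY = 0x800    # 2048
EXCLUDE_MASK = NOT_PRIMARY | SUPPLEMENTARY  # 2304

def keep_alignment(flag: int, mapq: int, min_mapq: int = 10) -> bool:
    """Keep primary, non-supplementary alignments with MAPQ >= min_mapq."""
    return (flag & EXCLUDE_MASK) == 0 and mapq >= min_mapq

# A primary, properly paired read (flag 99) with MAPQ 30 is kept;
# a secondary alignment (flag 256) is dropped regardless of MAPQ.
print(keep_alignment(99, 30))   # True
print(keep_alignment(256, 60))  # False
```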
More detail here https://github.com/sblab-bioinformatics/dna-secondary-struct-chrom-lands/blob/master/Methods.md where I put this as supplementary material to a paper (1) we recently published using ATAC-Seq (and other stuff).
(1) Hansel-Hertsch et al. G-quadruplex structures mark human regulatory chromatin. Nat Genet 2016.
The reason for that is to get rid of the Nextera transposase adapter sequences. In the experimental workflow of ATAC-seq, you use a transposase that integrates Illumina adapters into regions of open chromatin while simultaneously fragmenting them. The result is a pool of fragments from these open regions, flanked by adapters that can be targeted by suitable PCR primers to make the fragments ready for sequencing. The workflow creates fragments of different lengths; see Figures 1 and 2 of the original paper.
If you then sequence your samples with, let's say, 75 bp cycles on a NextSeq, but your actual insert (the sequence between the adapters) is only 60 bp long, then you also pick up 15 bp of adapter, which will lead to spurious alignments.
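The read-through arithmetic is simply read length minus insert length, floored at zero; a one-liner makes the 75/60 example above concrete (function name is mine, for illustration):

```python
def adapter_bases(read_len: int, insert_len: int) -> int:
    # If the insert is shorter than the read length, the sequencer
    # reads through into the adapter on the 3' end of the read.
    return max(0, read_len - insert_len)

print(adapter_bases(75, 60))   # 15 bp of adapter at the 3' end
print(adapter_bases(75, 100))  # 0 -- insert longer than the read, no read-through
```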
=> Therefore, you must get rid of any adapter content in the FASTQ files prior to alignment. The actual Nextera adapter sequence (if using the standard Illumina Nextera sample prep kit) is CTGTCTCTTATA on both strands. You can use tools such as Cutadapt (Trim Galore) or Skewer. I personally use Skewer with these options to trim the Nextera transposase sequences:
skewer -x CTGTCTCTTATA -y CTGTCTCTTATA -m pe -q 20
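Conceptually, what these trimmers do at the 3' end can be sketched in a few lines of Python; note this is a deliberately simplified exact-match version (real tools like Cutadapt and Skewer also match partial adapter prefixes at the read end and tolerate mismatches):

```python
NEXTERA = "CTGTCTCTTATA"  # Nextera transposase adapter, same on both strands

def trim_3prime_adapter(seq: str, adapter: str = NEXTERA) -> str:
    """Clip everything from the first full adapter occurrence onward.
    Simplified sketch: no partial-match handling, no quality trimming."""
    i = seq.find(adapter)
    return seq if i == -1 else seq[:i]

read = "ACGTACGTACGT" + NEXTERA + "GGGG"  # short insert with read-through
print(trim_3prime_adapter(read))  # ACGTACGTACGT
```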
I think they used 30bp because 30 is probably the minimal fragment length the assay produced in their hands, so it is the read length that is guaranteed not to contain adapter contamination. Still, I would use adapter-trimming software instead of a fixed trimming length (which is what pretty much all publications have done so far).
I cannot speak for the people who developed the ENCODE pipeline, but I think that one of their reasons for trimming the reads is that they use a short-read aligner.
They use Bowtie 1, and the FAQ of its successor, Bowtie 2, notes: "for reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive". So if one sticks to Bowtie 1, that definitely calls for trimming, at least down to 50 bp.
Moreover, the parameters of Bowtie 1 in the ENCODE pipeline are
--chunkmbs 1024 -y -v 2 -p 7 --best --strata -m 3 -k 1 --sam-nohead --sam, where
-v 2 sets the maximum number of mismatches to 2, and I certainly would not align longer, untrimmed reads with such a setting, because it would start to discard reads over just a few sequencing errors or polymorphisms.
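To see why the -v 2 cap hurts longer reads more, a quick binomial calculation (assuming, as a simplification, independent errors at a uniform per-base rate, here 1%) compares how often a 30 bp and a 75 bp read exceed 2 mismatches from sequencing error alone:

```python
from math import comb

def p_more_than_k_errors(read_len: int, per_base_err: float = 0.01, k: int = 2) -> float:
    """Probability that a read carries more than k sequencing errors,
    under a simple binomial model with a uniform per-base error rate."""
    p_at_most_k = sum(
        comb(read_len, i) * per_base_err**i * (1 - per_base_err)**(read_len - i)
        for i in range(k + 1)
    )
    return 1 - p_at_most_k

# A 75 bp read trips the >2-mismatch cap an order of magnitude more
# often than a 30 bp read, before any true polymorphisms are counted.
print(round(p_more_than_k_errors(30), 4))
print(round(p_more_than_k_errors(75), 4))
```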