What is the reason for trimming reads to 30 bp for ATAC-seq aligning?
3
10
Entering edit mode
6.3 years ago

I'm looking at the ENCODE ATAC-seq data analysis pipeline (https://www.encodeproject.org/pipelines/ENCPL035XIO/) and they recommend trimming the reads to 30bp. I'm having trouble understanding the reasoning behind this. You'd get a better alignment rate with shorter reads, but wouldn't that introduce bias?

ATAC-seq alignment • 9.0k views
2
Entering edit mode

I don't think I've seen any other ATAC-seq papers where they do that, so it doesn't seem to be a common practice. Would be curious to know why ENCODE does it.

0
Entering edit mode

Yeah, it seems a little weird to me as well, but ENCODE is supposed to be the gold standard, so I'm a little confused. Our ATAC project has resulted in a low alignment rate (56%), so we were gonna try their pipeline, but I can't rationalize changing something that drastic without a legitimate reason.

4
Entering edit mode

ENCODE is more like a tin standard...

0
Entering edit mode

Have you tried to see what is in the 44% of reads that are not aligning (and why)?

0
Entering edit mode

Yes, we have tried that a little, but with little to show for it. To give some background on where we've gotten so far:

1. This is a pilot study for ATAC-seq in dogs (as far as we know, it hasn't been done in canines), so some of this could be due to a technical prep problem.
2. In the literature, there is some variability between breeds as compared to the reference genome (CanFam3.1). The blood samples we have are from a greyhound and a pit bull, so we thought maybe this could be because the genome is built from a Boxer. We are less inclined to believe this, however, because we'd expect the alignment rate to then vary among our samples, and it does not: all samples align between 54-56%.

Any other suggestions? Thanks!

0
Entering edit mode

I've had good luck using STAR in cases like this. It has the nice property of reporting why it doesn't align a good chunk of the reads, so you can get an idea of whether you need to allow more soft-clipping. I think this is preferable since you end up discarding less of the sequencing data.

0
Entering edit mode

Thanks for your reply. We tried that but didn't notice anything too important. We checked the quality scores of all samples and they are good. Here is the STAR output; do you see anything else that would suggest a problem? https://postimg.org/image/plsm95551/. I've only recently started using STAR, so I'm still trying to fully understand the output.

Right off the bat, I noticed reads don't align because of being 'too short', and after consulting Alex, the developer, it seemed it would be a quality score issue, but that wasn't the case. Here is my exchange with Alex.

0
Entering edit mode

The "too short" thing can be remedied by setting --outFilterMatchNminOverLread to a smaller fraction.
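
For reference, a hypothetical STAR call along these lines might look like the sketch below. The index and file names are placeholders, and the 0.3 value is just an illustration of "a smaller fraction" than STAR's default of 0.66; the intron and filter settings are my own assumptions for genomic (non-spliced) data, not anything from this thread.

```shell
# Sketch only: lowering --outFilterMatchNminOverLread keeps reads whose
# matched bases cover a smaller fraction of the read length, rescuing
# reads that STAR would otherwise report as "too short".
STAR --runThreadN 8 \
     --genomeDir canFam3_index \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --alignIntronMax 1 \
     --outFilterMatchNminOverLread 0.3 \
     --outFilterScoreMinOverLread 0.3 \
     --outSAMtype BAM SortedByCoordinate
```

Note that --alignIntronMax 1 disables spliced alignment, which is one common way people adapt the RNA-seq-oriented defaults for DNA-type libraries.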

0
Entering edit mode

So, my default was bowtie2 with the sensitive setting; I then changed it to sensitive-local mapping, which relaxes the alignment criteria. From the looks of it, --outFilterMatchNminOverLread is similar to this. While I got an increased alignment percentage, when I reran MACS2 I ended up getting significantly fewer peaks across all samples, leading me to believe that changing the parameter led to more noise.

I think you responded to my inquiry @DevonRyan when I originally asked about the bowtie parameter. Do you think it would be worth rerunning STAR to try it?

Further, my worry going forward with STAR, besides looking into alignment stats, is that STAR is designed for RNA-seq, so I'm assuming some of its design choices will make it suboptimal in theory for ATAC-seq; an example being the varying fragment lengths that ATAC libraries create.

0
Entering edit mode

If you are running out of options I suggest trying BBMap. You may be surprised at what it can do. It is easy to use and fast to boot. Brian Bushnell (author) actively participates here and would be a good resource.

0
Entering edit mode

Interesting, thank you, I will look into it. Our worry at the moment isn't necessarily identifying the problem so much as understanding why the alignment rate is low. The thing is, even with the 50% alignment rate we are still getting good peaks, the majority of which map within or near a TSS as you'd expect, and the data actually looks good.

Our main problem right now is deciding what is worth doing, and spending time researching/looking into, when we could be doing more analysis with the data we have.

Once again, I have to thank you guys for actively responding; I appreciate all the questions you've answered for me.

0
Entering edit mode

In order to diagnose that 50% alignment rate properly, you do need to look at the pile of reads that are not aligning.

You still have not said anything about what the 50% of reads that are not aligning contain (unless I missed that in this already long thread). Have you collected those reads and BLASTed a few to see what comes up? You want to be sure that they are not weird chimeras, etc.
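
A sketch of how one might collect those reads with samtools (file names are placeholders; this assumes the aligner wrote unmapped reads into the BAM):

```shell
# SAM flag 4 (bit 3, value 1 << 2) marks an unmapped read;
# -f 4 keeps only records with that flag set.
samtools view -b -f 4 sample.bam > unmapped.bam

# Convert back to FASTQ so a few reads can be pulled out and BLASTed.
samtools fastq unmapped.bam > unmapped.fastq

# Peek at the first couple of records.
head -n 8 unmapped.fastq
```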

0
Entering edit mode

Oh yeah, sorry, I forgot to add that. I did try that: I BLASTed the first thirty sequences that didn't align against the canine genome, and 12/30 hits mapped to one single area of the genome.

0
Entering edit mode

I see, so perhaps these reads are being excluded because they are multi-mapping? That would be a logical guess.

0
Entering edit mode

Yeah, that's our guess. When I reduced the stringency of the mapping parameters in bowtie by doing a local alignment, I got an increased alignment percentage, but when I reran MACS2 I ended up getting significantly fewer peaks across all samples, leading me to believe that changing the parameter led to more noise.

0
Entering edit mode

STAR should work better; the varying fragment size is a non-issue.

0
Entering edit mode

Are you sure that's the case? The STAR developer seemed to say otherwise.

1
Entering edit mode

That's not him saying otherwise, just that 3 years ago he didn't personally have much experience with it.

1
Entering edit mode

We've seen a few ATAC-Seq libraries with small inserts (~40bp), so (just a guess) maybe ENCODE used fixed-length trimming to accommodate such libraries and avoid adapter readthrough.

0
Entering edit mode

Usually when people do this it's because their underlying read quality is absolute crap. If your library prep didn't wreck your sequencing quality, then just ignore everything ENCODE did and do something that makes more sense.

0
Entering edit mode

Looks like @datascientist28 already tried that, so ENCODE appears to be plan B.

0
Entering edit mode

I would try to also ask on the official ENCODE help mailing list: https://mailman.stanford.edu/mailman/listinfo/encode-help

They were very helpful the one time I posted there.

0
Entering edit mode

I know of another study that used reads of 25bp, but not ATAC-seq.

By using shorter reads coupled with the stringency for 'unique mapping', you're merely increasing the likelihood that your read mappings are indeed 'perfectly aligned'. As it's more difficult to uniquely align shorter reads, any reads that do make it through should genuinely map uniquely.

Using shorter reads coupled with the subsequent QC requirements for 'uniqueness' thus helps to reduce alignment ambiguity and increase the precision of peak identification, but at the expense of read depth, which, for modern-day sequencers, is not much of an issue.

9
Entering edit mode
6.3 years ago

My 2p, not really answering the question about the 30bp trimming. I had good results (i.e. sensible ones) processing ATAC-Seq pretty much as standard ChIP-Seq. This is my pipeline for reads of 75 bp:

• Trim adapters with cutadapt (or similar)

• Align with bwa mem with default settings

• Discard alignments with flag 2304 set (secondary or supplementary alignments) and MAPQ < 10

• If a "blacklist" of ChIP-seq genomic regions is available, remove reads in these regions

• Mark duplicates with picard

• Remove duplicates and reads on mitochondrial genome (ATAC-Seq maps a lot of reads on chrM, up to 40-60% and I prefer to remove them before peak calling)

• Call peaks with macs2. For human samples I find tens to hundreds of thousands of peaks, looking pretty sharp and often in proximity of TSSs.

More detail here https://github.com/sblab-bioinformatics/dna-secondary-struct-chrom-lands/blob/master/Methods.md where I put this as supplementary material to a paper (1) we recently published using ATAC-Seq (and other stuff).

(1) Hansel-Hertsch et al. G-quadruplex structures mark human regulatory chromatin. Nat Genet 2016.

0
Entering edit mode

Awesome, thank you very much! Is there a reason you use bwa mem over bowtie2? Just curious.

0
Entering edit mode

About bwa mem vs bowtie2: I don't have a strong reason to prefer one over the other, really. I think they both work well. At some point I had a reason to prefer bwa mem, which I can't remember and probably wasn't very compelling.

5
Entering edit mode
6.3 years ago
ATpoint 68k

The reason for that is to get rid of the Nextera transposase adapter sequences. In the experimental workflow of ATAC-seq, you use a transposase that integrates Illumina adapters into regions of open chromatin while simultaneously fragmenting these regions. The result is fragments from these open regions, flanked with adapters that can be targeted by suitable PCR primers to make these regions ready for sequencing. The workflow creates fragments of different lengths; see Figures 1 and 2 of the original paper.

If you then sequence your samples with, let's say, a NextSeq 75bp run, but your actual insert (the sequence between the adapters) is only 60bp long, then you also pick up 15bp of the adapters, which will lead to false alignment results.

Therefore, you must get rid of any adapter content in the fastq files prior to alignment. The actual Nextera adapter sequence (if using the standard Illumina Nextera Sample Prep Kit) is CTGTCTCTTATA on both strands. You can use tools such as Cutadapt (TrimGalore) or Skewer. I personally use Skewer with these options to trim the Nextera transposase sequences:

skewer -x CTGTCTCTTATA -y CTGTCTCTTATA -m pe -q 20


I think they used 30bp because 30 is probably the minimal fragment length the assay created in their hands, so it is a length that is guaranteed not to contain adapter contamination. Still, I would use adapter trimming software instead of a fixed trimming size (which is what pretty much all publications have done so far).
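
If Cutadapt is preferred over Skewer, an equivalent paired-end call might look like the sketch below (file names are placeholders):

```shell
# -a/-A give the 3' adapter for read 1 and read 2 (Nextera: CTGTCTCTTATA),
# -q 20 trims low-quality 3' ends, and -m 20 drops read pairs shorter
# than 20 bp after trimming.
cutadapt -a CTGTCTCTTATA -A CTGTCTCTTATA -q 20 -m 20 \
    -o out_R1.fastq.gz -p out_R2.fastq.gz in_R1.fastq.gz in_R2.fastq.gz
```

This removes exactly the readthrough described above: a 75-cycle read over a 60bp insert carries 15bp of adapter, which the trimmer clips off.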

1
Entering edit mode

If the concern was adapters, they would just do adapter trimming. However, they are trimming all sequences. That's the odd part.

0
Entering edit mode

Well, if they sequenced paired-end and 30bp is still enough for a proper alignment, then it should not matter so much.

1
Entering edit mode

Even if that were the case, I'd still ask why ENCODE, which is supposed to be institutions coming together to help people like me who have questions, would publish this random step in the official ENCODE pipeline? haha

Just weird to me that they didn't even explain that step, and I can't find anything resembling it in the literature.

1
Entering edit mode
6.3 years ago
Charles Plessy ★ 2.9k

I cannot speak for the people who developed the ENCODE pipeline, but I think that one of their reasons for trimming the reads is that they use a short-read aligner.

They use Bowtie 1, and the FAQ of its successor, Bowtie 2, notes: "for reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive". So if one sticks to Bowtie 1, that definitely calls for trimming, at least to 50 bp.

Moreover, the Bowtie 1 parameters in the ENCODE pipeline are --chunkmbs 1024 -y -v 2 -p 7 --best --strata -m 3 -k 1 --sam-nohead --sam, where -v 2 caps the number of mismatches at 2, and I certainly would not align longer, untrimmed reads with such a setting, because it would start to discard reads over just a few sequencing errors or polymorphisms.
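
To see why -v 2 gets harsher as reads get longer, here is a back-of-the-envelope calculation (my own illustration, assuming a uniform 1% per-base error rate): the probability that a read carries 3 or more mismatches, and is therefore discarded under -v 2, grows quickly with read length.

```shell
# P(>=3 errors in a read of length n) = 1 - P(0) - P(1) - P(2),
# with the error count ~ Binomial(n, p). Compare 30 bp vs 76 bp reads
# at p = 0.01.
awk 'function pge3(n, p,    q, p0, p1, p2) {
         q  = 1 - p
         p0 = q^n
         p1 = n * p * q^(n - 1)
         p2 = n * (n - 1) / 2 * p^2 * q^(n - 2)
         return 1 - p0 - p1 - p2
     }
     BEGIN {
         printf "30 bp: %.4f\n", pge3(30, 0.01)
         printf "76 bp: %.4f\n", pge3(76, 0.01)
     }'
```

Under these illustrative numbers, over ten times as many 76 bp reads as 30 bp reads would exceed the 2-mismatch cap, before even considering polymorphisms between the sample and the reference.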

0
Entering edit mode

Yeah, I agree; that was my original thought, because bowtie1 was made for shorter read lengths. It still makes me wonder: why would they trim an additional 20 bp lower? And why is this step in the 'official' pipeline for our field? (not literally asking you, just letting out some code-rage lol)

2
Entering edit mode

Two quite wild guesses for the sake of the brainstorm:

• they wanted to save disk space, and thought that paired 2 × 30 alignments were already stringent enough;
• they wanted a standard size that is shorter than any of their sequencers' output (and with some MiSeq kits this can be 2 × 35), so that comparisons across experiments are easier.

1
Entering edit mode

That could very well be true. I work at the University of Washington, and Cole Trapnell (creator of TopHat, Cufflinks, etc.) is a PI here; I've been consulting with him and some of his postdocs on their best practices for ATAC pipelines. They recently told me that they trim their reads to 36 basepairs for ATAC alignment. I'm waiting on his reply and I'll let you know his explanation for doing so.

1
Entering edit mode

Any news? Everybody has their own speculation, but I am very curious to learn the historical reason!

0
Entering edit mode

Still waiting on the lab's reply. I'll be sure to let you know when I hear back!

0
Entering edit mode

I'm also suspicious of these instructions in their pipeline, since the "--strata" and "-k 1" options are incompatible. I'm not sure if we should ditch "--strata" or use a higher value for "-k" in case of overlaps. What do you guys normally use to align ATAC-Seq data?

1
Entering edit mode

At the moment I am using BWA, but this is just a first start and probably a naive choice. I have just added a description on GitHub at the following location: https://github.com/Population-Transcriptomics/scATAC-mouse-human-control/blob/master/sequenceAlignment.md.

0
Entering edit mode

BWA mem, if that's what you're using, should not be used for reads shorter than 70bp.