Question: What is the reason for trimming reads to 30 bp for ATAC-seq alignment?
datascientist28 (University of Washington) wrote 2.2 years ago:

I'm looking at the ENCODE ATAC-seq data analysis pipeline (https://www.encodeproject.org/pipelines/ENCPL035XIO/), and they recommend trimming the reads to 30 bp. I'm having trouble understanding the reasoning behind this. You'll get a better alignment rate with shorter reads, but wouldn't that introduce bias?
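For concreteness, the fixed-length trimming the pipeline prescribes amounts to truncating each read (and its quality string) to the first 30 bases. A minimal Python sketch of that operation (the record layout here is my own illustration, not ENCODE's actual code):

```python
def trim_fastq_record(record, length=30):
    """Truncate one FASTQ record (header, sequence, separator, quality)
    to its first `length` bases; sequence and quality must stay in sync."""
    header, seq, plus, qual = record
    return (header, seq[:length], plus, qual[:length])

# A 76 bp read becomes a 30 bp read; shorter reads are left untouched.
record = ("@read1", "ACGT" * 19, "+", "I" * 76)
trimmed = trim_fastq_record(record)
assert len(trimmed[1]) == 30 and len(trimmed[3]) == 30
```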

atac-seq alignment • 5.0k views
modified 9 months ago by Kevin Blighe • written 2.2 years ago by datascientist28

I don't think I've seen any other ATAC-seq papers where they do that, so it doesn't seem to be a common practice. Would be curious to know why ENCODE does it.

modified 2.2 years ago • written 2.2 years ago by igor

Yeah, it seems a little weird to me as well, but ENCODE is supposed to be the gold standard, so I'm a little confused. Our ATAC project has resulted in a low alignment rate (56%), so we were going to try their pipeline, but I can't rationalize changing something that drastic without a legitimate reason.

written 2.2 years ago by datascientist28

ENCODE is more like a tin standard...

written 2.2 years ago by Devon Ryan

Have you tried to see what is in the 44% of reads that are not aligning (and why)?

modified 2.2 years ago • written 2.2 years ago by genomax

Yes, we have tried a little, but with little to show for it. To give some background on where we've gotten so far:

  1. This is a pilot study for ATAC-seq in dogs (as far as we know, it hasn't been done in canines), so some of this could be due to a technical prep problem.
  2. In the literature, there is some variability between breeds compared to the reference genome (CanFam3.1). The blood samples we have are from a greyhound and a pit bull, so we thought maybe this could be because the reference is built from a Boxer. We are less inclined to believe this, however, because we'd then expect the alignment rate to vary among our samples, and it does not: all samples align at 54-56%.

Any other suggestions? Thanks!

modified 2.2 years ago • written 2.2 years ago by datascientist28

I've had good luck using STAR in cases like this. It has the nice property of saying why it doesn't align a good chunk of the reads, so you can get an idea of whether you need to allow more soft-clipping or not. I think this is preferable since you end up tossing less of the sequencing.

written 2.2 years ago by Devon Ryan

Thanks for your reply; we tried that but didn't notice anything too important. We checked the quality scores of all samples and they are good. Here is the STAR output, do you see anything else that looks like a problem? https://postimg.org/image/plsm95551/ I've only recently started using STAR, so I'm still trying to fully understand the output.

Right off the bat, I noticed reads don't align because of being 'too short', and after consulting Alex, the developer, it seemed it would be a quality score issue, but that wasn't the case. Here is my exchange with Alex.

modified 2.2 years ago • written 2.2 years ago by datascientist28

The "too short" thing can be remedied by setting --outFilterMatchNminOverLread to a smaller fraction.
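As a toy illustration (not STAR's actual implementation): --outFilterMatchNminOverLread is the minimum allowed ratio of matched bases to read length (0.66 by default), and alignments falling below it get reported as "too short".

```python
def passes_star_length_filter(matched_bases, read_length, min_frac=0.66):
    """Mimic the criterion behind STAR's --outFilterMatchNminOverLread:
    keep an alignment only if matched bases / read length >= min_frac
    (0.66 is STAR's documented default)."""
    return matched_bases / read_length >= min_frac

# A 76 bp read with 40 matched bases fails the default filter ("too short")
# but passes once the fraction is lowered, e.g. to 0.4.
assert not passes_star_length_filter(40, 76)
assert passes_star_length_filter(40, 76, min_frac=0.4)
```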

written 2.2 years ago by Devon Ryan

So, my default was bowtie2 with the sensitive setting; I then changed it to sensitive-local mapping, which relaxes how stringently reads must map. From the looks of it, --outFilterMatchNminOverLread is similar to this. While I got an increased alignment percentage, when I reran MACS2 I ended up getting significantly fewer peaks across all samples, leading me to believe that relaxing the parameter let in more noise.

I think you responded to my inquiry, @DevonRyan, when I originally asked about the bowtie parameter. Do you think it would be worth rerunning STAR to try it?

Further, my worry going forward with STAR, beyond looking into alignment stats, is that STAR is designed for RNA-seq, so I'm assuming some of its design choices will make it suboptimal in theory for ATAC-seq, an example being the varying fragment lengths that ATAC libraries create.

written 2.2 years ago by datascientist28

If you are running out of options I suggest trying BBMap. You may be surprised at what it can do. It is easy to use and fast to boot. Brian Bushnell (author) actively participates here and would be a good resource.

modified 2.2 years ago • written 2.2 years ago by genomax

Interesting, thank you, I will look into it. Our worry at the moment isn't necessarily what the exact problem is, but rather the low alignment rate itself. The thing is that even with the ~50% alignment rate, we are still getting good peaks, the majority of which map within or near a TSS, as you'd expect, and the data actually looks good.

Our main problem right now is deciding what is worth doing, and spending time researching/looking into, when we could be doing more analysis with the data we have.

Once again, I have to thank you guys for actively responding; I appreciate all the questions you've answered for me.

written 2.2 years ago by datascientist28

In order to diagnose that 50% alignment rate properly, you do need to look at the pile of reads that are not aligning.

You still have not said anything about what the ~50% of reads that are not aligning contain (unless I missed it in this already long thread). Have you collected those reads and BLASTed a few to see what comes up? You want to be sure that they are not weird chimeras, etc.

written 2.2 years ago by genomax

Oh yeah, sorry, forgot to add that. I did try that: I BLASTed the first thirty sequences that didn't align against the canine genome, and 12/30 hits mapped to a single region of the genome.

written 2.2 years ago by datascientist28

I see, so perhaps these reads are being excluded because they are multi-mapping? That would be a logical guess.

written 2.2 years ago by genomax

Yeah, that's our guess. When I reduced the stringency of the mapping parameters in bowtie by doing a local alignment, I got an increased alignment percentage, but when I reran MACS2 I ended up getting significantly fewer peaks across all samples, leading me to believe that changing the parameter let in more noise.

written 2.2 years ago by datascientist28

STAR should work better; the varying fragment size is a non-issue.

written 2.2 years ago by Devon Ryan

Are you sure that's the case? The developer of STAR seemed to say otherwise.

modified 2.2 years ago • written 2.2 years ago by datascientist28

That's not him saying otherwise, just that 3 years ago he didn't personally have much experience with it.

written 2.2 years ago by Devon Ryan

We've seen a few ATAC-Seq libraries with small inserts (~40bp), so (just a guess) maybe ENCODE used fixed-length trimming to accommodate such libraries and avoid adapter readthrough.

written 2.2 years ago by harold.smith.tarheel

Usually when people do this, it's because their underlying read quality is absolute crap. If your library prep didn't wreck your sequencing quality, then just ignore everything ENCODE did and do something that makes more sense.

written 2.2 years ago by Devon Ryan

Looks like @datascientist28 already tried that, so ENCODE appears to be plan B.

written 2.2 years ago by genomax

I would try to also ask on the official ENCODE help mailing list: https://mailman.stanford.edu/mailman/listinfo/encode-help

They were very helpful the one time I posted there.

modified 2.2 years ago • written 2.2 years ago by igor

I know of another study that used reads of 25bp, but not ATAC-seq.

By using shorter reads coupled with a stringency requirement for 'unique mapping', you're merely increasing the likelihood that your read mappings are indeed 'perfectly aligned'. As it's more difficult to uniquely align shorter reads, any reads that do make it through should genuinely map uniquely.

Using shorter reads coupled with the subsequent QC requirements for 'uniqueness' thus helps to reduce alignment ambiguity and increase the precision of peak identification, but at the expense of read depth, which, for modern-day sequencers, is no issue.
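A toy example of the uniqueness argument (exact matching on a made-up reference, nothing like a real aligner): a short read sitting inside a repeat maps to several positions, while a longer read spanning the repeat maps once.

```python
def mapping_positions(read, reference):
    """Return every start position where `read` occurs exactly in `reference`."""
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

# "ACGTAC" occurs twice in this toy reference (a repeat), so a short read
# is ambiguous; extending it past the repeat makes the mapping unique.
ref = "ACGTACGGTTACGTACAA"
assert len(mapping_positions("ACGTAC", ref)) == 2    # multi-mapping
assert len(mapping_positions("ACGTACGG", ref)) == 1  # unique
```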

modified 9 months ago • written 9 months ago by Kevin Blighe
dariober (Glasgow, UK) wrote 2.2 years ago:

My 2p, not really answering the question about the 30bp trimming: I've had good (i.e., sensible) results processing ATAC-Seq pretty much as standard ChIP-Seq. This is my pipeline for 75 bp reads:

  • Trim adapters with cutadapt (or similar)

  • Align with bwa mem with default settings

  • Discard alignments with flag 2304 (secondary or supplementary alignments) and MAPQ < 10

  • If a "blacklist" of ChIP-seq genomic regions is available, remove reads in these regions

  • Mark duplicates with picard

  • Remove duplicates and reads on the mitochondrial genome (ATAC-Seq maps a lot of reads to chrM, up to 40-60%, and I prefer to remove them before peak calling)

  • Call peaks with macs2. For human samples I find tens to hundreds of thousands of peaks, looking pretty sharp, often in proximity to TSSs.
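The post-alignment filters above can be sketched as a single predicate over parsed SAM fields (a toy illustration of the criteria, not a substitute for samtools/picard):

```python
def keep_alignment(flag, mapq, rname, is_duplicate, min_mapq=10):
    """Apply the filters from the pipeline above: drop secondary and
    supplementary alignments (flag mask 2304 = 256 + 2048), low-MAPQ
    reads, marked duplicates, and mitochondrial (chrM) reads."""
    if flag & 2304:
        return False                      # secondary/supplementary
    if mapq < min_mapq:
        return False                      # ambiguous mapping
    if is_duplicate or rname == "chrM":
        return False                      # PCR duplicate or mitochondrial
    return True

assert keep_alignment(flag=0, mapq=30, rname="chr1", is_duplicate=False)
assert not keep_alignment(flag=256, mapq=30, rname="chr1", is_duplicate=False)
```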

More detail here https://github.com/sblab-bioinformatics/dna-secondary-struct-chrom-lands/blob/master/Methods.md where I put this as supplementary material to a paper (1) we recently published using ATAC-Seq (and other stuff).

(1) Hänsel-Hertsch et al. G-quadruplex structures mark human regulatory chromatin. Nat Genet 2016.

modified 2.2 years ago • written 2.2 years ago by dariober

Awesome, thank you very much! Is there a reason you use BWA mem over bowtie2? Just curious.

written 2.2 years ago by datascientist28

Also, do you have a link to your paper? I'd love to read it!

written 2.2 years ago by datascientist28

(Link to paper added, thanks for your interest!)

About bwa mem vs bowtie2, I don't have a strong reason to prefer one or the other, really. I think they both work well. At some point I had a reason to prefer bwa mem which I can't remember and probably wasn't very compelling.

written 2.2 years ago by dariober
ATpoint (Germany) wrote 2.2 years ago:

The reason for that is to get rid of the Nextera transposase adapter sequences. In the experimental workflow of ATAC-seq, you use a transposase that integrates Illumina adapters into regions of open chromatin while simultaneously fragmenting them. The result is fragments from these open regions, flanked by adapters that can be targeted by suitable PCR primers to make them ready for sequencing. The workflow creates fragments of different lengths; see Figures 1 and 2 of the original paper.

If you then sequence your samples with, let's say, NextSeq 75 bp cycles, but your actual insert (the sequence between the adapters) is only 60 bp long, then you also pick up 15 bp of adapter, which will lead to false alignment results.

=> Therefore, you must get rid of any adapter content in the fastq files prior to alignment. The actual Nextera adapter sequence (if using the standard Illumina Nextera sample prep kit) is CTGTCTCTTATA on both strands. You can use tools such as Cutadapt (Trim Galore) or Skewer. I personally use Skewer with these options to trim the Nextera transposase sequences:

skewer -x CTGTCTCTTATA -y CTGTCTCTTATA -m pe -q 20

I think they used 30bp because 30 is probably the minimal fragment length the assay created in their hands, so it is the read length that is guaranteed not to contain adapter contamination. Still, I would use adapter trimming software instead of fixed trimming sizes (which is what pretty much all publications have done so far).
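As a toy version of what such trimmers do, here is a naive Python sketch that cuts a read at the first exact occurrence of the Nextera sequence; real tools (Skewer, Cutadapt) also handle partial and error-containing adapter matches at the 3' end, which this sketch does not.

```python
NEXTERA = "CTGTCTCTTATA"

def trim_nextera(seq, qual, adapter=NEXTERA):
    """Cut sequence and quality at the first exact adapter occurrence;
    reads without the adapter are returned unchanged."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

# A 12 bp insert followed by adapter readthrough: only the insert survives.
seq = "ACGTACGTACGT" + NEXTERA + "GGGG"
trimmed_seq, trimmed_qual = trim_nextera(seq, "I" * len(seq))
assert trimmed_seq == "ACGTACGTACGT" and len(trimmed_qual) == 12
```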

written 2.2 years ago by ATpoint

If the concern was adapters, they would just do adapter trimming. However, they are trimming all sequences. That's the odd part.

written 2.2 years ago by igor

Well, if they sequenced paired-end and 30bp is still enough for a proper alignment, then it should not matter so much.

written 2.2 years ago by ATpoint

Even if that were the case, I'd still ask why ENCODE, which is supposed to be institutions coming together to help people like me who have questions, would publish this random step in the official ENCODE pipeline, haha.

It's just weird to me that they didn't even explain that step, and I can't find anything resembling it in the literature.

modified 2.2 years ago • written 2.2 years ago by datascientist28
Charles Plessy (Japan) wrote 2.2 years ago:

I cannot speak for the people who developed the ENCODE pipeline, but I think that one of their reasons for trimming the reads is that they use a short-read aligner.

They use Bowtie 1, and the FAQ of its successor, Bowtie 2, notes: "for reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive". So if one sticks to Bowtie 1, that definitely calls for trimming, at least to 50 bp.

Moreover, the parameters for Bowtie 1 in the ENCODE pipeline are --chunkmbs 1024 -y -v 2 -p 7 --best --strata -m 3 -k 1 --sam-nohead --sam, where -v 2 sets the maximum number of mismatches to 2. I certainly would not align longer, untrimmed reads with such a setting, because it would start to discard reads for just a few sequencing errors or polymorphisms.
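To see why -v 2 penalizes longer reads, assume independent sequencing errors at a fixed per-base rate (a simplification; real error rates vary along the read): the chance that a read from a perfectly matching locus exceeds a 2-mismatch budget grows with read length. A quick Python check:

```python
from math import comb

def p_rejected(read_len, per_base_error=0.01, max_mismatches=2):
    """Probability of more than `max_mismatches` errors in a read of
    `read_len` bases, under an independent-errors (binomial) model."""
    p_kept = sum(comb(read_len, k)
                 * per_base_error ** k
                 * (1 - per_base_error) ** (read_len - k)
                 for k in range(max_mismatches + 1))
    return 1 - p_kept

# An untrimmed 76 bp read is more likely to blow the -v 2 budget
# than a 30 bp one, purely from sequencing errors.
assert p_rejected(76) > p_rejected(30)
```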

written 2.2 years ago by Charles Plessy

Yeah, I agree, that was my original thought, since bowtie1 was made for shorter read lengths. It still makes me question why they would trim an additional 20 bp lower. Also, why is this step in the 'official' pipeline for our field? (Not literally asking you, just letting out some code-rage, lol.)

modified 2.2 years ago • written 2.2 years ago by datascientist28

Two quite wild guesses for the sake of the brainstorm:

  • they wanted to save disk space, and thought that paired 2 × 30 alignments were already stringent enough;
  • they wanted a standard size that is lower than any of their sequencers' outputs (with some MiSeq kits this can be 2 × 35), so that comparisons across experiments are easier.
modified 2.2 years ago • written 2.2 years ago by Charles Plessy

That could very well be true. I work at the University of Washington, and Cole Trapnell (creator of TopHat, Cufflinks, etc.) is a PI here; I've been consulting with him and some of his postdocs on their best practices for ATAC pipelines. They recently told me that they trim their reads to 36 bp for ATAC alignment. I'm waiting on his reply and I'll let you know his explanation for doing so.

modified 2.2 years ago • written 2.2 years ago by datascientist28

Any news? Everybody has their own speculation, but I am very curious to learn the historical reason!

written 2.2 years ago by Charles Plessy

Still waiting on the lab's reply. I'll be sure to let you know when I hear back!

written 2.2 years ago by datascientist28

I'm also suspicious of these instructions in their pipeline since the "--strata" and "-k 1" options are incompatible. Not sure if we should ditch "--strata" or use a higher value for "-k" in case of overlaps. What do you guys normally use to align ATAC-Seq data?

written 21 months ago by phosphodiester_bond

At the moment I am using BWA, but this is just a first attempt and probably a naive choice. I have just added a description on GitHub at the following location: https://github.com/Population-Transcriptomics/scATAC-mouse-human-control/blob/master/sequenceAlignment.md

written 21 months ago by Charles Plessy

BWA mem, if that's what you're using, should not be used for reads shorter than 70bp.

written 9 months ago by Kevin Blighe