Question

ATAC-Seq trimming reads to 36bp

0

Entering edit mode

4.8 years ago

Bioinfo_2006 ▴ 160

I have ATAC-Seq data of 75bp length mapped using bowtie 2, and I had posted earlier (Insert Size Metrics of ATAC-Seq data) because I had the issue of flipped reads. This was fixed when I trimmed the reads to 36bp as recommended, reads were trimmed after removing for adapter contamination. I would like to understand how trimming the reads to 36bp fixed the problem. I do see from ENCODE pipeline that they trim the reads but do not see many publications doing this. Just want to see what is standard, and if it would be an issue for downstream analysis or publishing?

From fastqc had nextera adaptors

##used trim galore to remove them

java -jar trimmomatic-0.38.jar PE P1.fastq P2.fastq P1_output_forward_paired.fastq P1_output_forward_unpaired.fq                                                                                                         
P2_output_reverse_paired.fastq  P2_output_reverse_unpaired.fq  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:1:true TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25

## used the paired-reads for bowtie 2

bowtie2 -p 4 -q --very-sensitive  -x $TRANS_DIR/genome -1 P_output_forward_paired.fastq -2 P_output_reverse_paired.fastq -S P_aln_unsorted.sam

##remove MT chromosomes

grep -v 'MT' P_unsorted.sam | samtools view -b -h -F 4 -f 0x2 - >P_aln_unsorted.bam

##sam to bam

java -jar picard.jar SortSam I=P_aln_unsorted.bam O=P_aln_sorted.bam SORT_ORDER=coordinate

##markduplicates

java -jar picard.jar MarkDuplicates INPUT=P_aln_sorted.bam OUTPUT=P_aln_sorted_rem.bam M=P_metrics.txt REMOVE_DUPLICATES=true

## index and flagstat

samtools index P_aln_sorted_rem.bam
samtools flagstat P_aln_sorted_rem.bam >P_aln_sorted_rem

##insertsizemetrics

java -jar picard.jar CollectInsertSizeMetrics I=P_aln_sorted_rem.bam O=P_insert_size_metrics.txt H=P_histogram.pdf

ATAC-SEQ • 3.8k views

ADD COMMENT • link 4.8 years ago by Bioinfo_2006 ▴ 160

score 1 · Answer 1 · 2019-07-29

1

Entering edit mode

4.8 years ago

Friederike 8.9k

You should probably not be trimming to a fixed size of 36 bp (that was just the test that Igor recommended in the post you're referencing). What you should be trimming off are adapters and sequencing primers, which will be part of your 75bp-read if the insert size of your fragment of interest is small. You need to remember that the entire piece of DNA, which is subjected to sequencing, does not just contain your cDNA of interest; in order to (a) physically attach it to the flowcell and (b) initiate the DNA amplification step that underlies the sequencing-by-synthesis approach that Illumina uses your cDNA of interest has been complemented with adapters and primers (read more about it, e.g. here or have a look at Illumina's great poster here). Therefore, if your cDNA of interest is shorter than the read length, the final read will be a mix of your cDNA of interest as well as the sequences that were attached to its ends.

Consider this extreme example where I'm using fictitious lengths to roughly illustrate what's happening if a given insert size is below the 75 bp that your read is long.

adapter | seq. primer 1 |        insert/fragment                  | seq. primer 2
(10bp)     (20 bp)                         (55bp)                      (20bp)

------------===========.........................................====================

                        --------------------------------------------------------> read 1 (75 bp)
              <----------------------------------------------------  read 2 (75 bp)

ADD COMMENT • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

I agree with that. As I already said in that original thread, I personally never had these issues you mentioned. Trimming to 36bp only takes away observations and by this power during the alignment step as longer reads (which are still the majority in ATAC-seq, only like 20% come from nucleosome-free regions = < 100bp) align better than these ultrashort reads. I still suspect that these was something flawed during your lowlevel processing. In the last month I've analyzed dozens of ATAC-seq samples from human and mouse, primary and cell lines, published and our own data, never had that issue.

ADD REPLY • link 4.8 years ago by ATpoint 82k

0

Entering edit mode

I did try only the adaptor trimming, but still had issues with flipped reads, and only then resorted to trimming the reads. I am going to redo the entire pipeline again to see where am going wrong. Do you have a preference between bowtie2 and BWA?

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

I use bowtie2 but only because rarely bwa needs a lot of memory (tens to little below 100Gb). Since my scripts are highly parallelized, it happened at times that multiple bwa threads consumed so much memory that it crashed the node. With bowtie2 I never had that. Beyond that no preference. Both are totally fine.

ADD REPLY • link 4.8 years ago by ATpoint 82k

0

Entering edit mode

Did you check the FastQC report what it has to say about over-represented sequences? Make sure all of these are captured in whatever you're trimming off.

ADD REPLY • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

edited the post to include what I did (did not include the part where I trimmed the reads). The fastqc report had adapter content that I removed, and checked fastqc again. unable to figure out what am doing wrong as it is very straight forward.

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

Did the adapter content vanish after adapter removal? I am not familiar with trimmomatic but you did not specify which adapter to trim (Nextera).

ADD REPLY • link 4.8 years ago by ATpoint 82k

0

Entering edit mode

Think it is an editing issue, it shows up now. I did remove the adapters,d id a fastqc, and do not see it anymore. I redid the whole thing without trimming the reads, and I have the Read flips again with the blue region

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

What was the read length distribution after trimmomatic? (FastQC will tell you)

If you pull out some of those "flipped" reads after alignment, what does their sequence/alignment look like?

ADD REPLY • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

Sequence length is between 25-75 after trimming for adapters because I had specified a minimum length of 25. The adapter is completely gone. Last time when I trimmed it to 36 bp, those flipped reads were gone. Am unable to figure out where it comes from.

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

Can you identify the flipped reads in the BAM file? And actually look at their sequence? Are you sure they were actually aligned when you trimmed to 36bp? (Odds are they weren't aligned)

ADD REPLY • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

Am going to pull out those reads, and also going to check mapping parameters for bowtie2. As Igor mentioned, the fragments could be shorter than the read length, and hence I see the flip whereas once I trim to 36bp I no longer see it.

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

If the only issue was that some fragments are shorter than read length, proper adapter and primer trimming should have taken care of that. There's something else going on here -- either the adapter trimming isn't sufficient/comprehensive or...else...

ADD REPLY • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

I see that the adapter is completely gone, no overrep sequences, and everything else in fastqc is a green.

Was just going through the ENCODE pipeline where they trim the reads to 30 bp and this discussion at What is the reason for trimming reads to 30 bp for ATAC-seq aligning? https://www.encodeproject.org/pipelines/ENCPL035XIO/

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

So, what is your alignment rate? (the discussion you just referenced seems to go into the direction of my gut feeling, too: it's not that you're really gaining anything by trimming to 36bp, you're just making sure you're definitely losing the ones with funny behavior.

The "flipped" reads may not be abundant enough to trigger the diagnostics in FastQC. You will have to look at a couple of examples of the flipped reads to see what's going on.

ADD REPLY • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

the alignment rate is between 97-99% using untrimmed reads, and 100% in terms of trimmed reads

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

I also checked for chrM using Samtools idxstats of the sorted bam file, and there are no MT reads present.

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

Mapping_stats

Insert_size_metrics

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

Thanks for the detailed explanation

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160

score 0 · Answer 2 · 2019-07-31

0

Entering edit mode

4.8 years ago

Bioinfo_2006 ▴ 160

Adding --nocontain option as mapping parameter in Bowtie2 solves the problem (without trimming the reads), however I have an equal percentage of multi-mappers, and unique ones. Would that be an issue?

ADD COMMENT • link 4.8 years ago by Bioinfo_2006 ▴ 160

0

Entering edit mode

what are the actual percentages? 50-50? 1%?

ADD REPLY • link 4.8 years ago by Friederike 8.9k

0

Entering edit mode

almost 50-50

ADD REPLY • link 4.8 years ago by Bioinfo_2006 ▴ 160