Question: ATAC-Seq trimming reads to 36bp
0
gravatar for Bioinfo_2006
10 months ago by
Bioinfo_2006140
Bioinfo_2006140 wrote:

I have ATAC-Seq data of 75bp length mapped using bowtie 2, and I had posted earlier (Insert Size Metrics of ATAC-Seq data) because I had the issue of flipped reads. This was fixed when I trimmed the reads to 36bp as recommended, reads were trimmed after removing for adapter contamination. I would like to understand how trimming the reads to 36bp fixed the problem. I do see from ENCODE pipeline that they trim the reads but do not see many publications doing this. Just want to see what is standard, and if it would be an issue for downstream analysis or publishing?

From fastqc had nextera adaptors

##used trim galore to remove them

java -jar trimmomatic-0.38.jar PE P1.fastq P2.fastq P1_output_forward_paired.fastq P1_output_forward_unpaired.fq                                                                                                         
P2_output_reverse_paired.fastq  P2_output_reverse_unpaired.fq  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:1:true TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25

## used the paired-reads for bowtie 2

bowtie2 -p 4 -q --very-sensitive  -x $TRANS_DIR/genome -1 P_output_forward_paired.fastq -2 P_output_reverse_paired.fastq -S P_aln_unsorted.sam

##remove MT chromosomes

grep -v 'MT' P_unsorted.sam | samtools view -b -h -F 4 -f 0x2 - >P_aln_unsorted.bam

##sam to bam

java -jar picard.jar SortSam I=P_aln_unsorted.bam O=P_aln_sorted.bam SORT_ORDER=coordinate

##markduplicates

java -jar picard.jar MarkDuplicates INPUT=P_aln_sorted.bam OUTPUT=P_aln_sorted_rem.bam M=P_metrics.txt REMOVE_DUPLICATES=true

## index and flagstat

samtools index P_aln_sorted_rem.bam
samtools flagstat P_aln_sorted_rem.bam >P_aln_sorted_rem

##insertsizemetrics

java -jar picard.jar CollectInsertSizeMetrics I=P_aln_sorted_rem.bam O=P_insert_size_metrics.txt H=P_histogram.pdf
atac-seq • 635 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by Bioinfo_2006140
1
gravatar for Friederike
10 months ago by
Friederike5.7k
United States
Friederike5.7k wrote:

You should probably not be trimming to a fixed size of 36 bp (that was just the test that Igor recommended in the post you're referencing). What you should be trimming off are adapters and sequencing primers, which will be part of your 75bp-read if the insert size of your fragment of interest is small. You need to remember that the entire piece of DNA, which is subjected to sequencing, does not just contain your cDNA of interest; in order to (a) physically attach it to the flowcell and (b) initiate the DNA amplification step that underlies the sequencing-by-synthesis approach that Illumina uses your cDNA of interest has been complemented with adapters and primers (read more about it, e.g. here or have a look at Illumina's great poster here). Therefore, if your cDNA of interest is shorter than the read length, the final read will be a mix of your cDNA of interest as well as the sequences that were attached to its ends.

Consider this extreme example where I'm using fictitious lengths to roughly illustrate what's happening if a given insert size is below the 75 bp that your read is long.

adapter | seq. primer 1 |        insert/fragment                  | seq. primer 2
(10bp)     (20 bp)                         (55bp)                      (20bp)

------------===========.........................................====================

                        --------------------------------------------------------> read 1 (75 bp)
              <----------------------------------------------------  read 2 (75 bp)
ADD COMMENTlink modified 10 months ago • written 10 months ago by Friederike5.7k

I agree with that. As I already said in that original thread, I personally never had these issues you mentioned. Trimming to 36bp only takes away observations and by this power during the alignment step as longer reads (which are still the majority in ATAC-seq, only like 20% come from nucleosome-free regions = < 100bp) align better than these ultrashort reads. I still suspect that these was something flawed during your lowlevel processing. In the last month I've analyzed dozens of ATAC-seq samples from human and mouse, primary and cell lines, published and our own data, never had that issue.

ADD REPLYlink written 10 months ago by ATpoint35k

I did try only the adaptor trimming, but still had issues with flipped reads, and only then resorted to trimming the reads. I am going to redo the entire pipeline again to see where am going wrong. Do you have a preference between bowtie2 and BWA?

ADD REPLYlink written 10 months ago by Bioinfo_2006140

I use bowtie2 but only because rarely bwa needs a lot of memory (tens to little below 100Gb). Since my scripts are highly parallelized, it happened at times that multiple bwa threads consumed so much memory that it crashed the node. With bowtie2 I never had that. Beyond that no preference. Both are totally fine.

ADD REPLYlink written 10 months ago by ATpoint35k

Did you check the FastQC report what it has to say about over-represented sequences? Make sure all of these are captured in whatever you're trimming off.

ADD REPLYlink written 10 months ago by Friederike5.7k

edited the post to include what I did (did not include the part where I trimmed the reads). The fastqc report had adapter content that I removed, and checked fastqc again. unable to figure out what am doing wrong as it is very straight forward.

ADD REPLYlink written 10 months ago by Bioinfo_2006140

Did the adapter content vanish after adapter removal? I am not familiar with trimmomatic but you did not specify which adapter to trim (Nextera).

ADD REPLYlink written 10 months ago by ATpoint35k

Think it is an editing issue, it shows up now. I did remove the adapters,d id a fastqc, and do not see it anymore. I redid the whole thing without trimming the reads, and I have the Read flips again with the blue region

ADD REPLYlink written 10 months ago by Bioinfo_2006140

What was the read length distribution after trimmomatic? (FastQC will tell you)

If you pull out some of those "flipped" reads after alignment, what does their sequence/alignment look like?

ADD REPLYlink written 10 months ago by Friederike5.7k

Sequence length is between 25-75 after trimming for adapters because I had specified a minimum length of 25. The adapter is completely gone. Last time when I trimmed it to 36 bp, those flipped reads were gone. Am unable to figure out where it comes from.

ADD REPLYlink written 10 months ago by Bioinfo_2006140

Can you identify the flipped reads in the BAM file? And actually look at their sequence? Are you sure they were actually aligned when you trimmed to 36bp? (Odds are they weren't aligned)

ADD REPLYlink written 10 months ago by Friederike5.7k

Am going to pull out those reads, and also going to check mapping parameters for bowtie2. As Igor mentioned, the fragments could be shorter than the read length, and hence I see the flip whereas once I trim to 36bp I no longer see it.

ADD REPLYlink written 10 months ago by Bioinfo_2006140

If the only issue was that some fragments are shorter than read length, proper adapter and primer trimming should have taken care of that. There's something else going on here -- either the adapter trimming isn't sufficient/comprehensive or...else...

ADD REPLYlink written 10 months ago by Friederike5.7k

I see that the adapter is completely gone, no overrep sequences, and everything else in fastqc is a green.

Was just going through the ENCODE pipeline where they trim the reads to 30 bp and this discussion at What is the reason for trimming reads to 30 bp for ATAC-seq aligning? https://www.encodeproject.org/pipelines/ENCPL035XIO/

ADD REPLYlink written 10 months ago by Bioinfo_2006140

So, what is your alignment rate? (the discussion you just referenced seems to go into the direction of my gut feeling, too: it's not that you're really gaining anything by trimming to 36bp, you're just making sure you're definitely losing the ones with funny behavior.

The "flipped" reads may not be abundant enough to trigger the diagnostics in FastQC. You will have to look at a couple of examples of the flipped reads to see what's going on.

ADD REPLYlink written 10 months ago by Friederike5.7k

the alignment rate is between 97-99% using untrimmed reads, and 100% in terms of trimmed reads

ADD REPLYlink modified 10 months ago • written 10 months ago by Bioinfo_2006140

I also checked for chrM using Samtools idxstats of the sorted bam file, and there are no MT reads present.

ADD REPLYlink written 10 months ago by Bioinfo_2006140

Mapping_stats

Insert_size_metrics

ADD REPLYlink written 10 months ago by Bioinfo_2006140

Thanks for the detailed explanation

ADD REPLYlink written 10 months ago by Bioinfo_2006140
0
gravatar for Bioinfo_2006
10 months ago by
Bioinfo_2006140
Bioinfo_2006140 wrote:

Adding --nocontain option as mapping parameter in Bowtie2 solves the problem (without trimming the reads), however I have an equal percentage of multi-mappers, and unique ones. Would that be an issue?

ADD COMMENTlink written 10 months ago by Bioinfo_2006140

what are the actual percentages? 50-50? 1%?

ADD REPLYlink written 10 months ago by Friederike5.7k

almost 50-50

ADD REPLYlink written 10 months ago by Bioinfo_2006140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 814 users visited in the last hour