Question: Peak calling on FastX-collapser processed data
kevt1999 (United States) wrote, 4.4 years ago:

I am processing IP data by first aligning with Bowtie and then calling peaks with MACS. To save CPU cycles, I was told that I should use the FastX-collapser tool ( http://hannonlab.cshl.edu/fastx_toolkit/ ) to remove duplicate reads before feeding my reads into Bowtie. The collapser tool takes FASTA entries of the same length and sequence and combines them into a single entry, with the number of occurrences appended to the end of the ID after a "-". For example:
>1
GGAC
>2
GGAC
>3
GGAC
>4
ATCGTTT
Becomes:
>1-3
GGAC
>2-1
ATCGTTT
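
The collapsing step shown above can be sketched in a few lines of Python. This is only an illustration of the ID scheme in the example, not the FASTX-Toolkit's own code; the function name `collapse_fasta` and the choice to rank sequences by descending count are assumptions made to reproduce the example output.

```python
from collections import Counter


def collapse_fasta(records):
    """Collapse identical sequences into (new_id, sequence) pairs.

    `records` is an iterable of (read_id, sequence) tuples. Each new ID has
    the form "<rank>-<count>", matching the example in the question.
    """
    counts = Counter(seq for _, seq in records)
    # most_common() orders by count, highest first; ties keep first-seen order.
    return [
        (f"{rank}-{n}", seq)
        for rank, (seq, n) in enumerate(counts.most_common(), start=1)
    ]


reads = [("1", "GGAC"), ("2", "GGAC"), ("3", "GGAC"), ("4", "ATCGTTT")]
collapsed = collapse_fasta(reads)
for new_id, seq in collapsed:
    print(f">{new_id}")
    print(seq)
```

With the four reads from the example, this prints the two collapsed entries `>1-3` and `>2-1`.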

My question is, does MACS 1.4 ( http://liulab.dfci.harvard.edu/MACS/README.html ) take the "-"-appended read count from collapsed data into account? I assume it doesn't, and I think it needs this information to calculate peak enrichment correctly. However, MACS seems to go through its own process of removing duplicate reads, suggesting that duplicate reads might not be important after all.

Does anybody know if the read count matters? Do I need to re-expand my data set after Bowtie alignment before feeding it to MACS?
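
For reference, "re-expanding" a collapsed data set would amount to something like the sketch below: read the count back out of each collapsed ID and emit the record that many times. The helper names `read_count` and `expand`, the `_<copy>` suffix on the new IDs, and the use of a generic `payload` in place of a real alignment record are all hypothetical. Whether this step is needed at all depends on how the peak caller treats duplicates.

```python
def read_count(collapsed_id):
    """Return the count encoded in a collapsed ID such as "1-3" (count 3)."""
    # Split on the last "-" so the count is always the final field.
    _, _, count = collapsed_id.rpartition("-")
    return int(count)


def expand(records):
    """Yield each (read_id, payload) record once per original read.

    `payload` stands in for whatever is attached to the read, for example
    an alignment line. The "_<copy>" suffix is an assumed naming scheme
    that keeps the expanded IDs unique.
    """
    for read_id, payload in records:
        for copy in range(1, read_count(read_id) + 1):
            yield f"{read_id}_{copy}", payload


collapsed_records = [("1-3", "GGAC"), ("2-1", "ATCGTTT")]
expanded = list(expand(collapsed_records))
for read_id, payload in expanded:
    print(read_id, payload)
```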
 

Ian (University of Manchester, UK) wrote, 4.4 years ago:

Personally, I do not see much to be gained by processing the reads in the way you describe. MACS/MACS2 does remove redundant reads sharing the same strand and 5' coordinate; however, the --keep-dup option (an integer N, or auto) can allow some level of redundancy, which is useful, for example, when you have high read coverage and a short genome. I hope that helps.
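
The duplicate filtering described above can be pictured with a short Python sketch. It is not MACS's implementation: the function name `filter_duplicates`, the tuple layout, and the `keep_dup` argument are assumptions for illustration, and the `auto` setting is not modelled. The idea is simply to keep at most N reads for each combination of chromosome, strand and 5' coordinate.

```python
from collections import defaultdict


def filter_duplicates(reads, keep_dup=1):
    """Keep at most `keep_dup` reads per (chrom, strand, five_prime) key.

    `reads` is an iterable of (chrom, strand, five_prime) tuples. Reads at
    the same 5' coordinate but on opposite strands are not duplicates.
    """
    seen = defaultdict(int)
    kept = []
    for chrom, strand, five_prime in reads:
        key = (chrom, strand, five_prime)
        if seen[key] < keep_dup:
            kept.append(key)
        seen[key] += 1
    return kept


aligned = [
    ("chr1", "+", 100),
    ("chr1", "+", 100),
    ("chr1", "+", 100),
    ("chr1", "-", 100),
    ("chr2", "+", 100),
]
print(len(filter_duplicates(aligned, keep_dup=1)))
print(len(filter_duplicates(aligned, keep_dup=2)))
```

Because only one (or N) reads survive per position and strand, a per-sequence count carried in the read ID has no effect on what such a filter keeps.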


Thank you for the reply!

By "do not see much to be gained", do you mean I should not worry about CPU cycles, and should feed the full data set, duplicates included, through Bowtie and then MACS? I see I have about 1-10 million duplicates for each read.

(reply by kevt1999, 4.4 years ago)

That is quite exceptional duplication. Is it specific to the protocol? If it is normal ChIP-seq then something has gone wrong. Have you checked the reads with FastQC, or the like?

(reply by Ian, 4.4 years ago)

You are right, I am doing an RNA-IP, which involves fragmenting total RNA and pulling down RNA with an antibody targeting methylated RNA. I posted in the ChIP-seq section because my analysis pipeline is closer to ChIP-seq than RNA-seq.

Yes, I'll run FastQC on the data after index trimming.
 

(reply by kevt1999, 4.4 years ago)