Next Generation Sequencing Pipeline
6.9 years ago

Hello everyone,

I am currently writing a Python script to automate a next-generation sequencing pipeline for super-enhancer calling, with some configurable and default parameters along the way, but I am facing several problems and doubts.

• Is there any way to accelerate/parallelize the MACS peak-calling process?
• Does it matter for bedtools whether the inputs are in BED or BAM format? Can I intersect or subtract BED with BAM files and vice versa? I read in the documentation that both types of files are accepted as if they were the same, but I think I am missing something...

Any comment on these matters would be appreciated! :)

iGC

ChIP-Seq next-gen sequencing • 1.6k views
6.9 years ago
DG 7.3k

I don't know about MACS peak calling, but yes, with bedtools you can mix and match BAMs and BEDs (and VCFs, etc.). So if you want all reads from the BAM that fall inside regions in a BED file, something like:

bedtools intersect -wa -abam file.bam -b file.bed > newfile.bam
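Since the OP is writing a Python driver script, the command above can be wrapped so that either input format works. This is a minimal sketch; the function name `intersect_cmd` and the extension-based format detection are my own illustration, not part of the thread.

```python
import subprocess

def intersect_cmd(a_file, b_file, write_a=True):
    # bedtools intersect takes BED/GFF/VCF via -a; a BAM A-file is
    # passed with -abam so the output stays in BAM format.
    a_flag = "-abam" if a_file.endswith(".bam") else "-a"
    cmd = ["bedtools", "intersect"]
    if write_a:
        cmd.append("-wa")  # report the original A feature/read
    cmd += [a_flag, a_file, "-b", b_file]
    return cmd

cmd = intersect_cmd("file.bam", "file.bed")
# Uncomment to run (requires bedtools on PATH):
# with open("newfile.bam", "wb") as out:
#     subprocess.run(cmd, stdout=out, check=True)
```

The same builder covers the reverse direction (a BED A-file against a BAM B-file), which is the "vice versa" case the OP asked about.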


Dan, is there no need for -abam? Or is the input file auto-detected as BAM/BED?


You're right, it is -abam file.bam as opposed to just -a. I've edited my answer to reflect that.

6.9 years ago
ivivek_ngs ★ 5.2k

Yes, you can. For the first part of your question I have an answer, though in bash scripting, which is easy to convert into a Python script. You can take a look at Tommy Tang's blog, which is quite informative about MACS parallelization.
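Staying in Python rather than bash, the same idea can be sketched with `multiprocessing`: since MACS2 `callpeak` runs single-threaded, the speedup comes from launching one job per sample. The sample tuples and the `peaks` output directory below are made up for illustration; tune the `callpeak` flags for your own experiment.

```python
import subprocess
from multiprocessing import Pool

def macs2_cmd(treat_bam, control_bam, name, genome="hs"):
    # -t/-c/-n/-g/--outdir are standard macs2 callpeak options;
    # adjust genome size and add model options as needed.
    return ["macs2", "callpeak", "-t", treat_bam, "-c", control_bam,
            "-n", name, "-g", genome, "--outdir", "peaks"]

def run_one(sample):
    treat, control, name = sample
    return subprocess.run(macs2_cmd(treat, control, name)).returncode

samples = [("s1_chip.bam", "s1_input.bam", "s1"),
           ("s2_chip.bam", "s2_input.bam", "s2")]

# Run several single-threaded callpeak jobs concurrently
# (requires macs2 on PATH):
# with Pool(processes=4) as pool:
#     exit_codes = pool.map(run_one, samples)
```

This is the Python equivalent of dispatching jobs with xargs or GNU parallel, as discussed below.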

For the second part, yes, you can do the same; or simply adding a command line to convert BAM to BED is not a problem, I suppose, so you can then use bedtools or bedops to do your desired manipulations.


I am not sure what the OP meant by parallelizing MACS; it could mean splitting one MACS job across multiple processors, which would have to be handled in the original source code.


I do not think the OP wants to meddle with the source code of MACS, but rather wants a Python script that can handle a large number of peak-calling jobs with lower CPU usage and shorter run times. I believe that can be done by assigning a different processor to each peak-calling run, so xargs or GNU parallel can both be used, as described in the blog, to maximize parallelization. That is what I thought, and so I answered how to use parallelism with MACS2; a few years back this was in fact not possible with MACS2. I have not tested it with MACS 2.1, though.


What I really meant by parallelizing the MACS process was splitting one MACS job across multiple processors. However, my real final goal is to minimize the computation time during 'batch' data processing.

Now, knowing that I can either run multiple MACS jobs at the same time or assign more processors to a single one, which would be better to achieve the fastest processing time for the data batch?

• Split each of the MACS jobs across many processors (up to 12)?
• Or run multiple MACS jobs at the same time, with fewer processors assigned to each one?


It is typically easiest to just farm out multiple jobs at the same time if a program is only single-threaded, as that doesn't require trying to alter the original program. If you have multiple samples, this is easy to do. If you have a single sample, you can often split your BAM files by chromosome or by windows of the genome, send those out as independent jobs, and then merge them into a final result at the end (this is called scatter-gather, and it is how many genomics programs are made parallel internally in the first place).

What works best for you depends not only on the number of cores you will be processing on, but also on the maximum RAM usage of a single instance of the program and how much is available on your system. If you only have 16G, for instance, and each run takes a peak load of 2G, then you can realistically run fewer than 8 samples or instances in parallel. If you have a very large amount of RAM, this may not be an issue.
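The RAM reasoning above can be reduced to a one-line budget calculation. This is a simple sketch that ignores OS overhead (which is why the answer says "fewer than" the RAM-bound figure); the function name is my own.

```python
def max_parallel_jobs(total_ram_gb, peak_ram_per_job_gb, cores):
    # Concurrency is capped by whichever is scarcer: cores or RAM.
    by_ram = total_ram_gb // peak_ram_per_job_gb
    return max(1, min(cores, by_ram))

print(max_parallel_jobs(16, 2, 12))   # -> 8  (RAM-bound)
print(max_parallel_jobs(128, 2, 12))  # -> 12 (core-bound)
```

In practice you would measure peak RAM per job on one trial run, then leave a core or two and a few GB of headroom for the OS.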


Yes, that is true; even single-threaded, it might run fast depending on RAM usage. I agree with that. Having said that, it is also useful to set a memory limit for the script when running with threads, after estimating the total RAM, CPUs, and cores available on the system. One can estimate this by running trials with different parameters and timing each peak-calling run. I am running MACS 2.1 in a virtual machine on my cluster, and I recently ran around 40 peak callings. I did not run them in parallel: we have 24 CPUs with 2 cores per CPU, so ideally 48 processors, but not all of them are free for my job alone, so I estimated that I could not utilize them all. Instead I ran all the samples in a loop, with 12 processors per run and 25G for h_vmem (definitely more than needed, since it requires quite a bit less, but that is the maximum I allow in case the load is higher), and found each run finished within 15-20 minutes. In my case I had around 30-40 million mapped reads per BAM file.


Okay, I think I can manage my issues with the pipeline. Thank you all for the fast responses; they were all very helpful :)


I am not sure if there is a limit on processors in MACS; in that case you would have to dig into the source code, or even the parameters, to see if there is some limit. I have never tested both scenarios, but I have run MACS on 40 different samples (or should I say 40 peak callings, with coverage of 30x and 60x, and I was pretty happy to see each peak calling finish within 20 minutes). I never estimated the processor dependency, though. You first have to check how many CPUs you have in your cluster, and then how many threads each one has; that will let you decide how many threads to use for each job. Running them all at the same time with parallel processing means you must not exceed your CPU memory, so you have to carefully work out how much you can allocate to each run. Even allocating one peak calling to multiple processors will speed things up, and you could test on a sample size of 20, run both tests, and compare completion times to your satisfaction. If you have looked through Tommy Tang's blog, it quite definitely covers how long both job processing and parallel processing take.