How to parallelize bcftools mpileup with GNU parallel?
2
1
Entering edit mode
22 months ago
beausoleilmo ▴ 400

In this thread I saw a line that would run bcftools across multiple threads on a computer:

parallel 'bcftools mpileup -Ou -f ref.fa -r {} in.bam | bcftools call -m -v -Oz -o {}.vcf.gz' ::: {1..22} X Y


Using the -r flag ("-r, --regions chr|chr:pos|chr:from-to|chr:from-[,…]") it is possible to specify a region to look at. But my reference genome is not separated into chromosomes (it is only at scaffold level => see the list here).

I get that the command divides the task between the 24 chromosomes ({1..22} X Y), but how would this be done if my genome doesn't have assembled chromosomes? Also, I'm not sure I understand what the {}.vcf.gz, or simply the {}, means in the output file. Should I put it in the <output.bcf>? Finally, the --jobs flag in parallel determines the number of cores, but since it's not set here, how do I know how many cores parallel is using (if I have access to 16 cores max)?

This is how I think it should look, but how do I specify the division of the task (is it better to list the scaffolds, even though there are 1000s of them)? Do I need to put a {} in the output file?

parallel 'bcftools mpileup -C 50 -E -f <ref.genome> --regions {} -b <bam_list>  > <output.bcf>' ::: {SCAFFOLDS??}

bcftools mpileup parallel • 2.8k views
0
Entering edit mode

same comment as previously: C: Running BWA mem in a for loop

3
Entering edit mode
22 months ago
ATpoint 55k

The trick here is to use random access (via the BAM index) to pull out only the reads you want from the BAM file in order to save time. This code snippet will parallelize mpileup per scaffold, given you have a BED file with the chr-start-end coordinates. I only did minimal testing and the script looks a bit bulky, but it should work.

The idea is the following: samtools view uses the region information to extract only the overlapping reads and sends them to mpileup. Therefore mpileup processes exactly these reads, so in your case the contigs one-by-one, but parallelized via GNU parallel. Eventually, the intermediate BCFs are concatenated.

YOUR_BAM="your.bam"
CONTIGS="your_file_with_the_contigs.bed"
NUMBER_OF_JOBS=8

## Index the BAM file:
samtools index ${YOUR_BAM}

## A function for bcftools where samtools view extracts only the indicated region:
function MPILEUP {
  BAM=$1
  REGION=$2
  samtools view -bu -r ${REGION} $BAM \
  | bcftools mpileup (options...) -o output_fromBAM_${BAM%.bam}_region_${REGION}.bcf -O b -
}; export -f MPILEUP

cat $CONTIGS | awk '{print $1":"$2"-"$3}' | parallel -j ${NUMBER_OF_JOBS} "MPILEUP ${YOUR_BAM} {}"

## Concatenate everything:
bcftools concat -O b output_fromBAM_${YOUR_BAM%.bam}_region_*.bcf > concat_fromBAM_${YOUR_BAM%.bam}.bcf

## Remove temporary files:
rm output_fromBAM_${YOUR_BAM%.bam}_region_*.bcf


Here NUMBER_OF_JOBS would be 8, so 8 parallel jobs working on the task. I would check with top whether your hard drive can handle this in terms of I/O bottleneck. This works well on an HPC, not so sure on a workstation with an HDD. Also, since mpileup reads the BAM from stdin, the sample name in the 10th column of the VCF/BCF will be -, so you would need to check whether mpileup can take an external sample name via the command line, or you have to awk it in manually.

0
Entering edit mode

Hi, sorry to re-open a relatively old post, but this looks like exactly what I need. How would I alter this script to pass 2 BAM files into bcftools mpileup? Thank you so much for your help.

0
Entering edit mode

Just a quick note for anyone using this in the future: the -r flag of samtools view specifies the read group (at least in samtools v1.13), not the region, so it looks like you will need to use samtools view -bu $BAM ${REGION} instead of samtools view -bu -r ${REGION} $BAM.

3
Entering edit mode
22 months ago
beausoleilmo ▴ 400

I think that this would be a solution:

I started with a file that contains the scaffold name and lengths:

JH739887    30495534
JH739888    29527584
JH739889    22321128
[...]


I then added a column with "1" in it to make it the "start" position. Since the scaffolds are not placed on the genome, they all start at one (at least from what I understand)

awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes


Remove all the spaces to create a tab delimited file

tr ' ' '\t' < chrom.start.stop.sizes > chrom.start.stop_tabs.sizes
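The two steps above (insert the start column, then convert to tabs) could also be sketched as a single awk call; the demo input file below is made up to mimic chrom.sizes:

```shell
# Demo input mimicking chrom.sizes (name and length, space-separated):
printf 'JH739887 30495534\nJH739888 29527584\n' > chrom.sizes.demo

# One step: print name, start=1, end; OFS='\t' makes the output tab-delimited.
awk -v OFS='\t' '{print $1, 1, $2}' chrom.sizes.demo > chrom.start.stop_tabs.demo
cat chrom.start.stop_tabs.demo
```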


The file should look like this:

JH739887    1   30495534
JH739888    1   29527584
JH739889    1   22321128
[...]


Then split the file (gsplit because I'm on a Mac!) so that each piece contains about 2724 scaffolds (the total number of scaffolds is 27239, and 27239/10 ≈ 2724, so the last file will have roughly one scaffold less than the others).

gsplit --numeric-suffixes=1 -n l/10 --additional-suffix='.txt' chrom.start.stop_tabs.sizes scaff


The first file should have about 2724 lines which correspond to the scaffolds and look like this:

JH739887    1   30495534
JH739888    1   29527584
JH739889    1   22321128
[...]


When I used wc -l on the 10 files, I had that:

2501 + 2618 + 2618 + 2749 + 2792 + 2792 + 2792 + 2793 + 2792 + 2792 = 27239 total


Then, after the generation of bam files AND the samtools indexing of the bam files, use this code to make it parallel.

parallel -j 15 'bcftools mpileup -Ou -f ref.fa -R scaff$(printf "%02d" {}).txt -b bam_list.txt | bcftools call -m -v -Oz -o myfile{}.vcf.gz' ::: {1..10}


Some explanations:

- -j 15: use 15 cores
- -R scaff$(printf "%02d" {}).txt: prints the file name padded with zeros (so it's going to be scaff01.txt, etc.)
- {}: a placeholder in parallel which is replaced by the arguments after the :::, here {1..10}.


Possible future improvement: I realized that the scaffolds were ordered by size, so the scaffolds at the end of the list will run VERY fast and the ones at the top of the list VERY slowly. Mixing the scaffolds across the gsplit files would be nice. In the end this might not be ideal, but at least it's a solution for people who want to experiment with this code!

UPDATE: To shuffle, run this before the line

awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes


shuf chrom.sizes > chrom.start.stop.sizes.shuffled


And change the name of the file accordingly. You can set a seed by looking at this post; I paste the function here:

get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)
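An alternative to shuffling (a sketch only, GNU split/gsplit specific): -n r/N deals lines out round-robin instead of in contiguous blocks, which interleaves large and small scaffolds across the chunks. Here seq stands in for the size-sorted scaffold list:

```shell
cd "$(mktemp -d)"
# seq stands in for the size-ordered scaffold list (10 "scaffolds"):
seq 1 10 > sizes.demo.txt
# r/3 = deal lines round-robin into 3 chunks (use gsplit on a Mac):
split --numeric-suffixes=1 -n r/3 --additional-suffix='.txt' sizes.demo.txt scaff
cat scaff01.txt  # every 3rd line starting at line 1
```

Chunk 1 gets lines 1, 4, 7, 10; chunk 2 gets 2, 5, 8; and so on, so each chunk mixes scaffolds from the top and bottom of the size-sorted list.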


Concatenating the files: at the end of the parallel command, there will be 10 different files that need to be put together. Here is a small script that will do it:

for i in *.vcf.gz
do
  basename "$i"
done | sort -V >> vcf_list.txt

bcftools concat --file-list vcf_list.txt -Ob --output output.bcf
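The loop that builds vcf_list.txt matters mainly because of sort -V (version sort), which orders myfile10.vcf.gz after myfile2.vcf.gz, unlike plain lexicographic sort. A sketch with made-up file names:

```shell
# Version sort: "10" sorts after "2", so chunks stay in numeric order.
printf 'myfile10.vcf.gz\nmyfile2.vcf.gz\nmyfile1.vcf.gz\n' | sort -V
# In the real directory the loop is equivalent to:
#   ls *.vcf.gz | sort -V > vcf_list.txt
```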