Parallelized mpileup script
3
4
Entering edit mode
6.4 years ago
Lee Katz ★ 3.1k

Hi, I am wondering if anyone has made a script to make `samtools mpileup` run on one cpu per contig.  I saw this idea on biostars earlier, at A: Best Practices / Faster Samtools Mpileup, Any Tips?, and it works really well.

Adding in the new bcftools call command, this is the basic script:

BAM="yourFile.bam"
REF="reference.fasta"
samtools view -H $BAM | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf $REF -r \"{}\" $BAM | bcftools call -cv > \"{}\".vcf"

I think a script could be useful because there are other things that the one-liner doesn't address like merging the output vcf files and adding in user-generated regions.  In other words, how would the user specify a list of regions and merge them with the embedded contig-level regions?

Combined with all these reasons, I think there is room for a good multithreaded mpileup script.

mpileup parallelized xargs multithreaded • 7.2k views
ADD COMMENT
0
Entering edit mode

Maybe the bcftools/vcf file component is a bit superfluous actually so please ignore it if it helps simplify a script.

ADD REPLY
1
Entering edit mode
6.4 years ago

You can speedup mpilup by cutting your genome into regions of interest

see my demo project https://github.com/lindenb/ngsxml

in this project, I declare regions of interest:

<?xml version="1.0" encoding="UTF-8"?>
<segments>
 <segment chrom="chr4_gl000194_random" start="0" end="191469"/>
 <segment chrom="chr1_gl000192_random" start="0" end="547496"/>
</segments>

here two segments are declared and called in parallel:

$(call  vcf_segment,Proj1,samtools,chr1_gl000192_random,0,547496) : $(call project_dir,Proj1)/BAM/${tmp.prefix}Proj1.bam.list \
        $(addsuffix .fai,${REFERENCE}) ${samtools.exe}  ${bcftools.exe}
        mkdir -p $(dir $@) && \
        ${samtools.exe} mpileup -uf ${REFERENCE} -b $< -r chr1_gl000192_random:0-547496 | \
        ${bcftools.exe} call  --variants-only --multiallelic-caller --output-type z --output $@


$(call  vcf_segment,Proj1,samtools,chr4_gl000194_random,0,191469) : $(call project_dir,Proj1)/BAM/${tmp.prefix}Proj1.bam.list \
        $(addsuffix .fai,${REFERENCE}) ${samtools.exe}  ${bcftools.exe}
        mkdir -p $(dir $@) && \
        ${samtools.exe} mpileup -uf ${REFERENCE} -b $< -r chr4_gl000194_random:0-191469 | \
        ${bcftools.exe} call  --variants-only --multiallelic-caller --output-type z --output $@

we create a list of vcf

$$(call vcf_list,$(1),$(2)) :  \
        $(call  vcf_segment,$(1),$(2),chr4_gl000194_random,0,191469)  \
        $(call  vcf_segment,$(1),$(2),chr1_gl000192_random,0,547496)
        mkdir -p $$(dir $$@)
        rm -f $$(addsuffix .tmp,$$@)
        echo "$(call  vcf_segment,$(1),$(2),chr4_gl000194_random,0,191469)" >> $$(addsuffix .tmp,$$@)
        echo "$(call  vcf_segment,$(1),$(2),chr1_gl000192_random,0,547496)" >> $$(addsuffix .tmp,$$@)
        mv $$(addsuffix .tmp,$$@) $$@

and we merge the VCFs

$$(call vcf_final,$(1),$(2)) : $$(call vcf_list,$(1),$(2)) ${picard.jar}
        mkdir -p $$(dir $$@) && \
        ${java.exe} -jar $$(filter %.jar,$$^) GatherVcfs I=$$< O=$$(addsuffix .tmp.vcf,$$@) && \
        ${bgzip.exe} -f $$(addsuffix .tmp.vcf,$$@) && \
        ${tabix.exe} -f -p vcf $$(addsuffix .tmp.vcf.gz,$$@) && \
        mv $$(addsuffix .tmp.vcf.gz,$$@) $$@ && \
        mv $$(addsuffix .tmp.vcf.gz.tbi,$$@) $$(addsuffix .tbi,$$@)
ADD COMMENT
0
Entering edit mode
6.1 years ago
jgbradley1 ▴ 100

I just want to point readers to new tool, Sambamba that I found, It's essentially a wrapper around Samtools and Bcftools that parallelizes jobs when possible. It's currently in development but I have been happy with the results so far. It's worth checking out. It also parallelizes 'bcftools view' options currently.

ADD COMMENT
0
Entering edit mode
6.0 years ago
sun.nation ▴ 120

Did you figure out how to combine bcf files generated by multi thread mpileup? Multi thread worked for me too and generated bcf files for each chromosome (18) for 30 samples but not able to combine bcf files to single file. I need to combine 18 files together. Thanks SS

ADD COMMENT
0
Entering edit mode

I haven't followed through yet but the answers are very promising here

ADD REPLY

Login before adding your answer.

Traffic: 1535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6