bcftools mpileup parallel run in BASH - creating too many files
1
0
Entering edit mode
2.2 years ago
Jamie ▴ 10

Hi, I have been modifying this script and it runs, allowing me to produce the desired output. However, I also get unwanted output as well. For example, the script seems to take the ends of files in the current directory (.txt, .bcf, .bam) and combines them into files ending .txt.bam or .bam.bcf. So the problem is that it creates a large number of excess files which are not needed.

#!/bin/bash

set - eu


genome=/storage1/ref_genome/GRCh38.dxv.fa
export genome


function parallel_call {
    bcftools mpileup \
        --fasta-ref ${genome} \
        --regions $2 \
        --output-type u \
        $1 | \
    bcftools call --multiallelic-caller \
              --variants-only \
              --output-type u - > ${1/.bam/}.$2.bcf
}

export -f parallel_call


function bam_chromosomes {
        samtools idxstats $1 | cut -f 1 | grep -w -f search_file.txt
}

export -f bam_chromosomes



chrom_set=`bam_chromosomes test_file.bam`
parallel --verbose -j 90% parallel_call test_file.bam ::: ${chrom_set}

I have tried specifying to grep the exact contigs I want (chr1,chr2,chr3 ... chr22) but am not sure how to get around this problem. Any ideas would be great. The original code was found at: https://www.danielecook.com/using-gnu-parallel-for-bioinformatics/

BASH BCFtools parallel • 1.3k views
ADD COMMENT
0
Entering edit mode

I'm not sure parallel is the right tool here. While it's powerful you should have look at Make and the workflow managers like snakemake or nextflow

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
2.2 years ago

You are instructing the tool to generate all possible combinations:

parallel echo {} ::: A B C ::: 1 2 3

prints:

A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3

then inside the function, you are saving into each of your combinations. Hence the many files you observe.

So the output is what you instructed it to do. It is not clear from your description what you want. Would you rather have it skip some chromosomes?

If you wish to merge the results then you will need to add another step that does just that with say bcftools concat that concatenates the same samples for different chromosomes. Then you can delete the intermediate files.

PS: note that your invocation seems incorrect, you are including test_bam.bam in both inputs and that probably you meant to write :::: bam_chromosomes to get the input from a file.

ADD COMMENT
0
Entering edit mode

Okay I see, I'll merge only desired results using bcftools concat and remove the intermediate files. Thanks, I'll edit the test_bam.bam inputs.

One of the problems (which is not obvious from here) was the contents of search_file.txt. I had two extra spaces at the end of the file and for some reason, when running the following command:

samtools idxstats test_bam.bam | cut -f 1 | grep -w -f search_file.txt

The output has a * at the end of the list.

Correcting this error and encasing all instances of $1 and $2 in "" seems to have the desired result.

ADD REPLY

Login before adding your answer.

Traffic: 1281 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6