Question: bcftool (mpileup, call) extremely slow on 60 BAMs (WGS 20x)
0
gravatar for serpalma.v
4 months ago by
serpalma.v20
Germany
serpalma.v20 wrote:

I am trying to run bcftools (v1.9) mpileup on 60 BAM files from WGS (~20x).

bcftools mpileup -Ou -f ref.fa $(echo $BAM_LIST) | bcftools call -vmO z -o out.vcf.gz

The job started 20 days ago. The out.vcf.gz shows that it has gone through 10 chr (~2ds/chr).

There are still 9 autosomes, 2 sex chrs, MT chr and 44 scaffolds to go through, which I estimate will take another 20 days to complete.

The log shows only the following (not updated since day of submission):

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 60 samples in 60 input files

However, the out.vcf.gz is being written on daily.

I am looking for a way to make this process to go faster, say, within 2 weeks instead of 40 days.

I found a related solution here but it's for one bam at a time and I would like to call variants on all 60 samples at once.

Any help will be appreciated!

wgs snp mpileup samtools bcftools • 217 views
ADD COMMENTlink modified 4 months ago • written 4 months ago by serpalma.v20
1
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

split your analysis per chromosome/contig using the option --regions-file and concatenate the vcfs later.

ADD COMMENTlink written 4 months ago by Pierre Lindenbaum118k

I created a test tab delimited file regions.txt for chr1 and chr2 that looks as follows:

1       1   195471971
2       1   182113224

Column 1 corresponds to CHROM, column2 to POS and column3 to POS_TO. According to the corresponding documentation:

The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive.

For POS_TO I used the length of the chromosome.

Then I call bcftools and it is now running:

bcftools mpileup -Ou -f ref.fa samp1.bam samp2.bam --regions-file regions.txt | bcftools call -vmO z -o out.vcf.gz

Could you please tell me if the regions.txt was properly constructed in order to span fully each chromosome?

Thanks!

ADD REPLYlink written 4 months ago by serpalma.v20

no

a bed is 0-based and the start position should start with '0', not '1'

furthermore the idea is to launch 'X' jobs for one chromosome in parallel , not to run one job for 'X' chromosomes.

ADD REPLYlink written 4 months ago by Pierre Lindenbaum118k

As I understand it, in order to be cosidered a bed file, the bed extension should be specified:

To indicate that a file be treated as BED rather than the 1-based tab-delimited file, the file must have the ".bed" or ".bed.gz" suffix (case-insensitive).

Otherwise, the --regions-file is treated as tab-delimited file by default:

Regions can be specified either on command line or in a VCF, BED, or tab-delimited file (the default).

In which case positions should start with '1':

The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive.

As for your second point, I am not really sure what you suggest. Do you mean I should provide one chromosome at a time and run the job in parallel?

ADD REPLYlink written 4 months ago by serpalma.v20

As for your second point, I am not really sure what you suggest. Do you mean I should provide one chromosome at a time and run the job in parallel?

yes

ADD REPLYlink written 4 months ago by Pierre Lindenbaum118k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1479 users visited in the last hour