bcftool (mpileup, call) extremely slow on 60 BAMs (WGS 20x)
1
1
Entering edit mode
5.4 years ago
serpalma.v ▴ 80

I am trying to run bcftools (v1.9) mpileup on 60 BAM files from WGS (~20x).

bcftools mpileup -Ou -f ref.fa $(echo $BAM_LIST) | bcftools call -vmO z -o out.vcf.gz

The job started 20 days ago. The out.vcf.gz shows that it has gone through 10 chr (~2ds/chr).

There are still 9 autosomes, 2 sex chrs, MT chr and 44 scaffolds to go through, which I estimate will take another 20 days to complete.

The log shows only the following (not updated since day of submission):

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 60 samples in 60 input files

However, the out.vcf.gz is being written on daily.

I am looking for a way to make this process to go faster, say, within 2 weeks instead of 40 days.

I found a related solution here but it's for one bam at a time and I would like to call variants on all 60 samples at once.

Any help will be appreciated!

bcftools samtools mpileup WGS snp • 3.5k views
ADD COMMENT
1
Entering edit mode
5.4 years ago

split your analysis per chromosome/contig using the option --regions-file and concatenate the vcfs later.

ADD COMMENT
0
Entering edit mode

I created a test tab delimited file regions.txt for chr1 and chr2 that looks as follows:

1       1   195471971
2       1   182113224

Column 1 corresponds to CHROM, column2 to POS and column3 to POS_TO. According to the corresponding documentation:

The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive.

For POS_TO I used the length of the chromosome.

Then I call bcftools and it is now running:

bcftools mpileup -Ou -f ref.fa samp1.bam samp2.bam --regions-file regions.txt | bcftools call -vmO z -o out.vcf.gz

Could you please tell me if the regions.txt was properly constructed in order to span fully each chromosome?

Thanks!

ADD REPLY
0
Entering edit mode

no

a bed is 0-based and the start position should start with '0', not '1'

furthermore the idea is to launch 'X' jobs for one chromosome in parallel , not to run one job for 'X' chromosomes.

ADD REPLY
0
Entering edit mode

As I understand it, in order to be cosidered a bed file, the bed extension should be specified:

To indicate that a file be treated as BED rather than the 1-based tab-delimited file, the file must have the ".bed" or ".bed.gz" suffix (case-insensitive).

Otherwise, the --regions-file is treated as tab-delimited file by default:

Regions can be specified either on command line or in a VCF, BED, or tab-delimited file (the default).

In which case positions should start with '1':

The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive.

As for your second point, I am not really sure what you suggest. Do you mean I should provide one chromosome at a time and run the job in parallel?

ADD REPLY
0
Entering edit mode

As for your second point, I am not really sure what you suggest. Do you mean I should provide one chromosome at a time and run the job in parallel?

yes

ADD REPLY

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6