Question: bbmap callvariants - how to add sample names, how to get 0/0 alleles back?
0
gravatar for Philipp Bayer
6 months ago by
Philipp Bayer5.6k
Australia/Perth/UWA
Philipp Bayer5.6k wrote:

I'm running bbmap's callvariants.sh this way for a diploid population of several samples:

callvariants.sh -Xmx20g ploidy=2 nopassdot=t multisample=t in=file1.bam,file2.bam,... vcf=results.vcf

Each bam file represents one individual.

Now I have two problems that I'm fixing manually with a script, but I wonder how to fix this and the help doesn't get me far (p(I'm too tired) < 0.05)

1) My header looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  FORMAT  .....

each FORMAT should be the filename of the bam file, how do I tell callvariants.sh do directly take the filename?

2) I have very few reference alleles, in other words, I have very few 0/0 in my output:

0/0:26:5:0.1923:0.2403:0.9980:20.82:PASS

looks good! Depth is 26, seems to be well supported

 ./.:31:4:0.1290:0.1599:0.9980:15.93:FAIL

why is this missing even though heterozygotes like '0/1:14:5:0.3571:0.4590:0.9980:34.58:PASS' are PASS? For a 38GB vcf file I have 5,201 SNPs with at least one 0/0 but 3,253,627 SNPs with at least one 1/1, that seems imbalanced. (I realise that a large part of these 3 million are garbage, but this is a B. oleracea chromosome, there should be more than 5000 SNPs)

If I set nopassdot=f then the reference alleles and the failing alleles are both labelled 0/0 if I see this correctly. Currently my script replaces ./. cases like the above with a decent depth > 4 by 0/0, but I wonder whether bbmap has an option which I'm not seeing.

(The format string is GT:DP:AD:AF:RAF:SB:SC:PF)

I'm using the standard cutoffs of bbmap:

minreads=2              Ignore variants seen in fewer reads.
minqualitymax=15        Ignore variants with lower max base quality.
minedistmax=20          Ignore variants with lower max distance from read ends.
minmapqmax=0            Ignore variants with lower max mapq.
minidmax=0              Ignore variants with lower max read identity.
minpairingrate=0.1      Ignore variants with lower pairing rate.
minstrandratio=0.1      Ignore variants with lower plus/minus strand ratio.
minquality=12.0         Ignore variants with lower average base quality.
minedist=10.0           Ignore variants with lower average distance from ends.
minavgmapq=0.0          Ignore variants with lower average mapq.
minallelefraction=0.1   Ignore variants with lower allele fraction.  This
                        should be adjusted for high ploidies.
minid=0                 Ignore variants with lower average read identity.
minscore=20.0           Ignore variants with lower Phred-scaled score.

I'm using bbmap v37.90 with java v1.8.0_151

edit: If I run the same files with the bcftools mpileup as described at http://www.htslib.org/workflow/ I get a 0/0, 1/1 ratio that is much much closer to 1:1

bbmap callvariants.sh • 465 views
ADD COMMENTlink modified 6 months ago • written 6 months ago by Philipp Bayer5.6k
2
gravatar for jean.elbers
6 months ago by
jean.elbers340
jean.elbers340 wrote:

I can't comment on the ./. part of your question, but for the naming part, I have used the list= option. Note that you need to do base quality score recalibration then another round of variant calling. Also note that you need to mark duplicates before SNP calling with callvariants.sh

Also, based on prior experience with simulating di-allelic SNPs, I usually keep SNPs or indels that have Phred score of >=Q27 with callvariants.sh following Step 4 below.

Note that I am using bbmap v37.56

1.Call initial variants

ls *-CL-RG-MD.bam > bamlist
~/bin/bbmap-37.56/callvariants.sh list=bamlist out=ALL-samples-raw.vcf \
ref=/work/jelber2/kemps/Cmydas_genome_plus_kemps_mitogenome.fasta ploidy=2 multisample
rm *.vcf.gz

2.Perform quality score recailibration

~/bin/bbmap-37.56/calctruequality.sh \
vcf=ALL-samples-raw.vcf ref=/work/jelber2/kemps/Cmydas_genome_plus_kemps_mitogenome.fasta ploidy=2 \
in=SL266950-CL-RG-MD.bam,SL266957-CL-RG-MD.bam,SL266964-CL-RG-MD.bam,SL266971-CL-RG-MD.bam,\
SL266978-CL-RG-MD.bam,SL266985-CL-RG-MD.bam,SL266992-CL-RG-MD.bam,SL266951-CL-RG-MD.bam,\
SL266958-CL-RG-MD.bam,SL266965-CL-RG-MD.bam,SL266972-CL-RG-MD.bam,SL266979-CL-RG-MD.bam,\
SL266986-CL-RG-MD.bam,SL266993-CL-RG-MD.bam,SL266952-CL-RG-MD.bam,SL266959-CL-RG-MD.bam,\
SL266966-CL-RG-MD.bam,SL266973-CL-RG-MD.bam,SL266980-CL-RG-MD.bam,SL266987-CL-RG-MD.bam,\
SL266994-CL-RG-MD.bam,SL266953-CL-RG-MD.bam,SL266960-CL-RG-MD.bam,SL266967-CL-RG-MD.bam,\
SL266974-CL-RG-MD.bam,SL266981-CL-RG-MD.bam,SL266988-CL-RG-MD.bam,SL266995-CL-RG-MD.bam,\
SL266954-CL-RG-MD.bam,SL266961-CL-RG-MD.bam,SL266968-CL-RG-MD.bam,SL266975-CL-RG-MD.bam,\
SL266982-CL-RG-MD.bam,SL266989-CL-RG-MD.bam,SL266996-CL-RG-MD.bam,SL266955-CL-RG-MD.bam,\
SL266962-CL-RG-MD.bam,SL266969-CL-RG-MD.bam,SL266976-CL-RG-MD.bam,SL266983-CL-RG-MD.bam,\
SL266990-CL-RG-MD.bam,SL266997-CL-RG-MD.bam,SL266956-CL-RG-MD.bam,SL266963-CL-RG-MD.bam,\
SL266970-CL-RG-MD.bam,SL266977-CL-RG-MD.bam,SL266984-CL-RG-MD.bam,SL266991-CL-RG-MD.bam

3.Apply recalibration

while read i;do
    ~/bin/bbmap-37.56/bbduk.sh in=$i out=$i.bam recalibrate
done < bamlist

4.Call variants again

ls *-CL-RG-MD.bam.bam > bamlist2
~/bin/bbmap-37.56/callvariants.sh list=bamlist2 out=ALL-samples-recal-raw.vcf \
ref=/work/jelber2/kemps/Cmydas_genome_plus_kemps_mitogenome.fasta ploidy=2 multisample
ADD COMMENTlink modified 6 months ago • written 6 months ago by jean.elbers340

Thank you for your very detailed answer with commands! I'll give recalibration a try!

ADD REPLYlink written 6 months ago by Philipp Bayer5.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 974 users visited in the last hour