Question

Bbtools callvariant multisample mode, + base recalibration

0

Entering edit mode

5 months ago

Axzd ▴ 70

Hello, it's not clear to me if the multisample mode is actually a join calling, or if it's simply the equivalent of bcftools norm + bcftools merge. I am specifically interested in NOT doing join calling. I am also wondering if base recalibration is still necessary? I find tutorials where it's done, but the doc of callvariants doesn't say anything about it. I am aware quality recalibration in itself is not trivial and there are probably pros and cons in doing it.

Sorry the questions are rather general, I hope it's acceptable. I know some people here use bbtools, so I hope to get an answer.

Thanks for any feedback you could give me.

Alex

Bbtools • 1.7k views

ADD COMMENT • link 5 months ago by Axzd ▴ 70

score 4 · Accepted Answer · 2023-11-18

4

Entering edit mode

5 months ago

Brian Bushnell 20k

CallVariants in multisample mode uses 2 passes. First, variants are called on all samples individually. Second, the results are recompiled such that everywhere any individual has a variant, the exact genotype of all individuals is displayed. As such, the pass/fail decision of individual variants in each sample remains the same, but some downstream tools need the genotype of every sample and furthermore, it's often useful to know if a variant did not show up in an individual simply because there was zero coverage. For my purposes I only use it in single-sample mode though.

Quality score recalibration is always a good idea for variant-calling, especially for Illumina data where there are only 3 quality scores on modern platforms. But even on older platforms with the full range the quality scores were pretty inaccurate. Personally I consider it quite easy... you do this once per mapped sample:

#first do preprocessing like adapter trimming, then alignment

calctruequality.sh in=mapped.sam ref=ref.fa callvars ploidy=2

bbduk.sh in=mapped.sam out=recal.sam ordered recalibrate

#now call variants on the recalibrated reads

callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf ploidy=2

If you're unsure whether it's important for your data, I encourage you to run variant-calling on raw and recalibrated reads and compare the VCFs (e.g., comparevcf.sh can do set intersections and subtractions).

ADD COMMENT • link 5 months ago by Brian Bushnell 20k

1

Entering edit mode

Should all aligned files be supplied to calctruequality.sh at the same time? Then each file is recalibrated one at a time?

ADD REPLY • link 5 months ago by GenoMax 141k

0

Entering edit mode

I was about to ask the same question.

ADD REPLY • link 5 months ago by Axzd ▴ 70

0

Entering edit mode

Oh actually from what I understand from the doc, if the sam come from the demultiplexing of the same sequencing it advices to supply everything at once.

ADD REPLY • link 5 months ago by Axzd ▴ 70

0

Entering edit mode

Specifically, if the samples come from the same lane you can run them together with calctruequality.sh, however, in that case, variant-calling will be done by calctruequality on the merged samples unless you supply a VCF. So if you want to run calctruequality on multiple samples at once, it's best to first call variants on them, merge the vcfs (or use callvariants' multisample flag), then supply that vcf to calctruequality instead of giving it the callvars flag. In other words:

#Make an initical VCF for use in recalibration only
callvariants.sh in=a.sam,b.sam,c.sam ref=ref.fa out=multi.vcf multisample

#Generate a recalibration matrix jointly, using the vcf
calctruequality.sh in=a.sam,b.sam,c.sam ref=ref.fa vcf=multi.vcf

#Recalibrate the files individually
bbduk.sh in=a.sam out=a_recal.sam recalibrate
bbduk.sh in=b.sam out=b_recal.sam recalibrate
bbduk.sh in=c.sam out=c_recal.sam recalibrate

#Then call variants again individually on the recalibrated sams...
callvariants.sh in=a_recal.sam ref=ref.fa out=a_recal.vcf
callvariants.sh in=b_recal.sam ref=ref.fa out=b_recal.vcf
callvariants.sh in=c_recal.sam ref=ref.fa out=c_recal.vcf
#or together:
calvariants.sh in=a_recal.sam,b_recal.sam,c_recal.sam ref=ref.fa out=multi_recal.vcf multisample

But I think it's simpler to run calctruequality once per sample.

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

variant-calling will be done by calctruequality on the merged samples unless you supply a VCF.

Can you explain a bit more as to what you mean?

In your original answer you were calculating quality first before calling variants.Why has the order of commands changed above? This discussion is timely since it is going to help with something I am doing now.

ADD REPLY • link 5 months ago by GenoMax 141k

1

Entering edit mode

I edited the post's code section for clarity. Calctruequality does not understand lanes or samples, just alignments. So if you give it multiple sam files and set the 'callvars' flag, it will call variations assuming all input reads are from the same sample, which can wipe out some alleles found in only a minority of samples. I may add support for CallVariants' rarity flag to help circumvent this, but for now, if you want to recalibrate quality scores, you should either run calctruequality on one sample at a time (in which case you can use the callvars flag), or else make a vcf prior to calctruequality which contains all the variants for all samples. This initial vcf just tells it which locations to ignore when counting mismatches and can be discarded after running calctruequality; it's just part of the boot-strapping process. That said, it should, hopefully, be very similar to final VCF.

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

In the interest of making sure this is done right, if we run calctruequality for multiple samples as separate processes, are these all going to try and update the same matrix file in ref? Should separate dirs be created for each sample run using path=/ref/sample option?

Edit: ~~I ran a couple of additional `calctruequality` jobs on additional samples but they appear to have not touched the matrix files created in `ref/qual` earlier. Jobs completed and wrote logs.~~ I do see the files getting overwritten. Perhaps I was impatient.

ADD REPLY • link 5 months ago by GenoMax 141k

1

Entering edit mode

Every time you run calctruequality, it should overwrite the matrix files in /ref/qual/

I just tested it, and it does overwrite for me... but yes, you can add "path=sample_a" (for both calctruequality and bbduk) and it will write or read the quality matrices from /sample_a/ref/qual/*.

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

Awesome, thanks Brian! A last question if you don't mind. I know some tools try by themselves to "normalize" variants. I am aware variant normalisation is actually a very complicated concept. What about your tool? Does it "try to normalise" or do you still advice to use bcftools norm? (Or maybe there is a bbtools utility for that)? Thanks so much for your answers. Bbtools is such an awesome suite of utilities, I really love it.

ADD REPLY • link 5 months ago by Axzd ▴ 70

2

Entering edit mode

BBMap already should be producing normalized cigar strings, and CallVariants calls variants directly from the cigar strings with no attempts to impute or alter them; so, if you're calling from reads aligned with a different aligner, you may want to normalize the vcf with another tool; I have not written one. You CAN use the realign flag in CallVariants, though.

You should probably use normalization on the output CallVariants and all other variant callers output in any situation when you are merging them into a single VCF or planning plan to compare them directly to each other.

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

bbduk returns an error resulting in a very small output bam (a few ko)

 bbduk.sh in=${p}.slow.sorted.bam out=bases_recalibration/${p}.recal.bam ordered recalibrate ref=../ref.fsa -overwrite  -Xmx40g
java -Djava.library.path=/home/alessandro/software/bbmap/jni/ -ea -Xmx40g -Xms40g -cp /home/alessandro/software/bbmap/current/ jgi.BBDukF in=GC047403.slow.sorted.bam out=bases_recalibration/GC047403.recal.bam ordered recalibrate ref=../ref.fsa -overwrite -Xmx40g
Executing jgi.BBDukF [in=GC047403.slow.sorted.bam, out=bases_recalibration/GC047403.recal.bam, ordered, recalibrate, ref=../ref.fsa, overwrite, -Xmx40g]
Version 38.00 [in=GC047403.slow.sorted.bam, out=bases_recalibration/GC047403.recal.bam, ordered, recalibrate, ref=../ref.fsa, overwrite, -Xmx40g]

Set ORDERED to true
0.024 seconds.
Loading ref/qual/qbpmatrix_p0.txt.gz.
Loading ref/qual/qb123matrix_p0.txt.gz.
Loading ref/qual/qbpmatrix_p1.txt.gz.
Initial:
Memory: max=42949m, total=42949m, free=42823m, used=126m

Added 96946868 kmers; time:     9.025 seconds.
Memory: max=42949m, total=42949m, free=37111m, used=5838m

Found sambamba.

sambamba 1.0.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2023
    LDC 1.32.0 / DMD v2.102.2 / LLVM14.0.6 / bootstrap LDC - the LLVM D compiler (1.32.0)

Found samtools 1.18
Waiting on header to be read from a sam file.
Started output streams: 0.026 seconds.
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag and no ScafMap loaded.
This can normally be resolved by adding the flag ref=file, where file is the fasta file to which the reads were mapped.

SN571:1749:HLCGCBCXY:2:2214:6533:100652 1:N:0:CCGTCC    147 Chrom_3 9427    46  106=1M144=  =   9193    -485    AATAATAACAAACATTGAACAGCAATACGATGGGAAACATGTTCAAAACTATACATTTGGTGTTAATGGAAAAAAATTGGAGGTCATGATATTAATATGAATAGATNTCGTACAAATATATATATATAGGATTAATGTGCCAACAAAAGTGGTAGACGTCCTTTTGTACGACACCCCTTACATATATGTCTCAATAATATTTGATCTGCTATCATATCATTTCCTTTTTCTTGTTCTTTTTCTTTATCTCT GCIHFEHHEHHAEHC?HHHIIHF?HGHG6-E@@77HEEHGHD-IHFFCIHHHIIHGAHHGCHFFHCHFHHHIGIIGIIGHEHIIGIIIIIIHIHHHFHIFIHHD<<!DIHIHIIIHIHIIGHIIIIIIIIIIIIIIHHIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIHIHHIIIIIIIHIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDD NM:i:1  AM:i:46
    at stream.SamLine.toShortMatch(SamLine.java:1214)
    at stream.SamLine.toRead(SamLine.java:2017)
    at stream.SamLine.toRead(SamLine.java:1877)
    at stream.SamReadInputStream.toReadList(SamReadInputStream.java:118)
    at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:89)
    at stream.SamReadInputStream.nextList(SamReadInputStream.java:73)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:677)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:653)
Processing time:        0.212 seconds.

Input:                      10400 reads         2602448 bases.
Contaminants:               9963 reads (95.80%)     2493185 bases (95.80%)
Total Removed:              9963 reads (95.80%)     2493185 bases (95.80%)
Result:                     437 reads (4.20%)   109263 bases (4.20%)

Time:                           9.266 seconds.
Reads Processed:       10400    1.12k reads/sec
Bases Processed:       2602k    0.28m bases/sec
Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt.
    at jgi.BBDukF.process(BBDukF.java:924)
    at jgi.BBDukF.main(BBDukF.java:69)

I tried by adding the ref flag as suggested but it doesn't help.

The reads were mapped with bbmap.sh.

Any idea? Thanks

EDIT: I tried this solution calctruequality in bbmap

no more error message but the bam file still shrinks, from 22 Gb to 70 Mo.

ADD REPLY • link 5 months ago by Axzd ▴ 70

1

Entering edit mode

The key is this error message:

java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag and no ScafMap loaded.
This can normally be resolved by adding the flag ref=file, where file is the fasta file to which the reads were mapped.

I'll see if I can fix this quickly... it looks like it should not be occurring. Probably it would go away if MD tags were present in the reads. You can add the BBMap flag "mdtag" during mapping; I'll probably make that the default.

By the way, what version of BBTools are you using?

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

Thanks for your answer. I can share the files with you if this helps. It's the version 38.00

EDIT: trying the 39.05, will report back by tomorrow if it works.

ADD REPLY • link 5 months ago by Axzd ▴ 70

1

Entering edit mode

Just checked, and it works fine for me. This is an old bug that was fixed a while ago, so 39.05 should work fine for you too. Also, you don't need to add a "ref" flag for bbduk in this case; that will just waste time and memory.

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

With 39.05, no bug. However, the bam files still shrink, for example, from 22 Go to 1,6 Go. If I understand well, bbduk is also used to remove reads that are contaminants, could it be there is a bug and instead of recalibrating the bases, it's actually ditching reads? Here is the output from the console from one of my bam:

Input:                      65536978 reads      15652702978 bases.
Contaminants:               54462219 reads (83.10%)     12983423910 bases (82.95%)
Total Removed:              54462219 reads (83.10%)     12983423910 bases (82.95%)
Result:                     11074759 reads (16.90%)     2669279068 bases (17.05%)

Why is it counting "Contaminants"? I know bbduk can be used for contaminant filtering, but I understand that in recalibration mode, it just rewrites the input bam with the recalibrated base score. Or is it a "two in one"? And if this is the case, is there a way to toggle off contaminant filtering?

ADD REPLY • link 5 months ago by Axzd ▴ 70

2

Entering edit mode

Right, it's evicting all reads that match the reference. You need to remove the "ref=../ref.fsa" flag. If there is a reference, the default behavior of BBDuk is to send everything matching the reference to "outu" and everything not matching the reference to "out".

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

Thanks so much Brian. Your software suite is fantastic; the more I use it, the more I love it. It's incredible all the problems you figured out and how to solve them. I have a very last question: in multisample mode in callvariants.sh, how is the QUAL score reported? Is it simply the highest? You don't recompute it?

I am sorry for how my questions are spread out over several days. I have some personal health issues preventing me from dedicating myself fully to my job here. Thanks again Brian.

ADD REPLY • link 5 months ago by Axzd ▴ 70

2

Entering edit mode

The qual score is the max of all of the sample qual scores, and the pass filter is pass if any sample passes.

ADD REPLY • link 5 months ago by Brian Bushnell 20k

0

Entering edit mode

Oh ok, so mmm, I will work file by file since I want to be sure for every file. I have to know if sample A passes but sample B doesn't. Thanks, Brian

ADD REPLY • link 5 months ago by Axzd ▴ 70