Question: Genotyping with samtools/bcftools
0
gravatar for SOHAIL
17 months ago by
SOHAIL300
Beijing Institute of Genomics, CAS.
SOHAIL300 wrote:

Hi everyone,

I have BAM files of some ancient genomes (for example: 17X), authors mentioned in the paper that they genotyped all genomes individually using samtools (v1.3.1) mpileup –C50 and bcftools (v1.3.1) using the consensus caller. Briefly, the steps are as follows:

  1. Run samtools mpileup with the -C parameter set to 50 and only included sites with a depth of coverage greater than 1/3 the average depth and lower than 2 times the average depth.
  2. Filtered out variant calls with a phred posterior probability lower than 30.
  3. Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.
  4. Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.

My initial command-line is:

samtools mpileup -C50 -uf ref.fa aln.bam | bcftools call -cO v -o study.vcf

I have little idea to design the simple samtools command-line with bcftools piping consensus caller but don't know how to apply filters.. can anyone please help me in command-line design.

Thanks a lot!

-sohail

bcftools ngs samtools • 1.6k views
ADD COMMENTlink modified 17 months ago by Kevin Blighe52k • written 17 months ago by SOHAIL300
2
gravatar for Kevin Blighe
17 months ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

Using the recent versions of SAMtools and BCFtools, you can do it like this:

1, call variants

samtools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF --output-tags DP,AD -f "${Ref_FASTA}" \
--BCF BAM1.bam BAM2.bam BAM3.bam BAM4.bam | \
bcftools call --multiallelic-caller --variants-only -Ob > Out.bcf ;

Note that here I have 4 replicates, which samtools mpileup accepts.

For less stringent calling (outputs pretty much anything that can possibly be reported, then pipe it into:

bcftools call --consensus-caller --variants-only --pval-threshold 1.0 -Ob

2, split multi-alleles (1st pipe) and left-align indels (2nd pipe)

bcftools norm -m-any Out.bcf | bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > Out.norm.bcf ;
bcftools index Out.norm.bcf ;

3, add all available tags to the BCF

See my answer here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

4, further filter the variants

#Options:
#   -Q INT  minimum RMS mapping quality for SNPs [10]
#   -d INT  minimum read depth [2]
#   -D INT  maximum read depth [10000000]
#   -a INT  minimum number of alternate bases [2]
#   -w INT  SNP within INT bp around a gap to be filtered [3]
#   -W INT  window size for filtering adjacent gaps [10]
#   -1 FLOAT    min P-value for strand bias (given PV4) [0.0001]
#   -2 FLOAT    min P-value for baseQ bias [1e-100]
#   -3 FLOAT    min P-value for mapQ bias [0]
#   -4 FLOAT    min P-value for end distance bias [0.0001]
#   -e FLOAT    min P-value for HWE (plus F<0) [0.0001]
#   -p      print filtered variants
echo "Applying further filtering to called variants..." ;
bcftools view Out.norm.bcf | \
misc/vcfutils.pl varFilter -d "${ReadDepthCutoff} \
-w 1 -W 3 -a 1 -1 0.05 -2 0.05 -3 0.05 -4 0.05 -e 0.05 -p \
> Out.norm.filt.vcf  ;

echo "Variants filtered out:"

Note that, in the final part (#4), I use a perl executable (vcfutils.pl) that comes bundled with BCFtools in the misc directory.

Kevin

ADD COMMENTlink modified 15 months ago • written 17 months ago by Kevin Blighe52k

Thanks @kevin for your kind response..

it's really helpful. however, i have some confusions. The steps that you mentioned, have you tested while analyzing ancient genomes data?

After reading your guidelines i designed my command-line (based on the criteria that i mentioned in question previously) for ancient BAM samples as follows:

STEP1: I am interested in calling all sites variants/non-variant omitting --variants-only option in BCFTOOLS Since i am using single sample BAM omitting --per-sample-mF

samtools mpileup -C50 --redo-BAQ --min-BQ 30 --output-tags DP,AD -f ref.fa \
 --BCF aln.bam | bcftools call --consensus-caller -Ob > Out.bcf

STEP2 and STEP3 as you suggested.

what does this -m-any used for in step 2 and what is --pval-threshold 1.0 in STEP 1 ?

In STEP4 i am still not sure how apply the following filters:

2. Filtered out variant calls with a phred posterior probability lower than 30.
3. Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.
4. Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.

Why you set most of the options 0.05 in vcfutils.pl command? can you please guide me in applying above mentioned filters? or if you have tested your set parameters performed better in ancient DNA analysis, please share your experience.

Thanks a lot!

ADD REPLYlink modified 17 months ago • written 17 months ago by SOHAIL300

Hey SOHAIL, this has not been tested on ancient DNA, however, it has been tested rigorously in a clinical genetics laboratory, so, on varying tissue types and of varying qualities. It is tailoured for detecting germline calls, but it can still identify somatic and low frequency variants (like one may expect from circulating free DNA, FFPE DNA, or ancient DNA) when the parameters are altered.

In STEP 2 / 3, -m-any left-aligns indels and splits multi-allelic calls. --pval-threshold is used with --consensus-caller and is, to put it simply, the probabiltiy threshold in order to break the null hypothesis that the alternate allele is the reference allele. When it is set to 1.0, the threshold is relaxed, therefore, more variants will be called (but these will be less reliable calls). Sometimes this is necessary, and I have detected even Sanger-confirmed variants this way.

For STEP4,

  1. Filtered out variant calls with a phred posterior probability lower than 30.

I believe that this is mentioned here: SAMtools/BCFtools specific information (please \read through the various flags)

  1. Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.

-1 FLOAT min P-value for strand bias (given PV4) [0.0001]

  1. Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.

You can just do this with BCFtools (bcftools view, I believe), by filtering on the AF (allele frequency) tag.

Thanks!

ADD REPLYlink written 17 months ago by Kevin Blighe52k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2305 users visited in the last hour