Genotyping with samtools/bcftools
1
0
Entering edit mode
5.8 years ago
SOHAIL ▴ 400

Hi everyone,

I have BAM files of some ancient genomes (for example: 17X), authors mentioned in the paper that they genotyped all genomes individually using samtools (v1.3.1) mpileup –C50 and bcftools (v1.3.1) using the consensus caller. Briefly, the steps are as follows:

  1. Run samtools mpileup with the -C parameter set to 50 and only included sites with a depth of coverage greater than 1/3 the average depth and lower than 2 times the average depth.
  2. Filtered out variant calls with a phred posterior probability lower than 30.
  3. Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.
  4. Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.

My initial command-line is:

samtools mpileup -C50 -uf ref.fa aln.bam | bcftools call -cO v -o study.vcf

I have little idea to design the simple samtools command-line with bcftools piping consensus caller but don't know how to apply filters.. can anyone please help me in command-line design.

Thanks a lot!

-sohail

samtools bcftools ngs • 4.8k views
ADD COMMENT
2
Entering edit mode
5.8 years ago

Using the recent versions of SAMtools and BCFtools, you can do it like this:

1, call variants

samtools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF --output-tags DP,AD -f "${Ref_FASTA}" \
--BCF BAM1.bam BAM2.bam BAM3.bam BAM4.bam | \
bcftools call --multiallelic-caller --variants-only -Ob > Out.bcf ;

Note that here I have 4 replicates, which samtools mpileup accepts.

For less stringent calling (outputs pretty much anything that can possibly be reported, then pipe it into:

bcftools call --consensus-caller --variants-only --pval-threshold 1.0 -Ob

2, split multi-alleles (1st pipe) and left-align indels (2nd pipe)

bcftools norm -m-any Out.bcf | bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > Out.norm.bcf ;
bcftools index Out.norm.bcf ;

3, add all available tags to the BCF

See my answer here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

4, further filter the variants

#Options:
#   -Q INT  minimum RMS mapping quality for SNPs [10]
#   -d INT  minimum read depth [2]
#   -D INT  maximum read depth [10000000]
#   -a INT  minimum number of alternate bases [2]
#   -w INT  SNP within INT bp around a gap to be filtered [3]
#   -W INT  window size for filtering adjacent gaps [10]
#   -1 FLOAT    min P-value for strand bias (given PV4) [0.0001]
#   -2 FLOAT    min P-value for baseQ bias [1e-100]
#   -3 FLOAT    min P-value for mapQ bias [0]
#   -4 FLOAT    min P-value for end distance bias [0.0001]
#   -e FLOAT    min P-value for HWE (plus F<0) [0.0001]
#   -p      print filtered variants
echo "Applying further filtering to called variants..." ;
bcftools view Out.norm.bcf | \
misc/vcfutils.pl varFilter -d "${ReadDepthCutoff} \
-w 1 -W 3 -a 1 -1 0.05 -2 0.05 -3 0.05 -4 0.05 -e 0.05 -p \
> Out.norm.filt.vcf  ;

echo "Variants filtered out:"

Note that, in the final part (#4), I use a perl executable (vcfutils.pl) that comes bundled with BCFtools in the misc directory.

Kevin

ADD COMMENT
0
Entering edit mode

Thanks @kevin for your kind response..

it's really helpful. however, i have some confusions. The steps that you mentioned, have you tested while analyzing ancient genomes data?

After reading your guidelines i designed my command-line (based on the criteria that i mentioned in question previously) for ancient BAM samples as follows:

STEP1: I am interested in calling all sites variants/non-variant omitting --variants-only option in BCFTOOLS Since i am using single sample BAM omitting --per-sample-mF

samtools mpileup -C50 --redo-BAQ --min-BQ 30 --output-tags DP,AD -f ref.fa \
 --BCF aln.bam | bcftools call --consensus-caller -Ob > Out.bcf

STEP2 and STEP3 as you suggested.

what does this -m-any used for in step 2 and what is --pval-threshold 1.0 in STEP 1 ?

In STEP4 i am still not sure how apply the following filters:

2. Filtered out variant calls with a phred posterior probability lower than 30.
3. Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.
4. Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.

Why you set most of the options 0.05 in vcfutils.pl command? can you please guide me in applying above mentioned filters? or if you have tested your set parameters performed better in ancient DNA analysis, please share your experience.

Thanks a lot!

ADD REPLY
0
Entering edit mode

Hey SOHAIL, this has not been tested on ancient DNA, however, it has been tested rigorously in a clinical genetics laboratory, so, on varying tissue types and of varying qualities. It is tailoured for detecting germline calls, but it can still identify somatic and low frequency variants (like one may expect from circulating free DNA, FFPE DNA, or ancient DNA) when the parameters are altered.

In STEP 2 / 3, -m-any left-aligns indels and splits multi-allelic calls. --pval-threshold is used with --consensus-caller and is, to put it simply, the probabiltiy threshold in order to break the null hypothesis that the alternate allele is the reference allele. When it is set to 1.0, the threshold is relaxed, therefore, more variants will be called (but these will be less reliable calls). Sometimes this is necessary, and I have detected even Sanger-confirmed variants this way.

For STEP4,

  1. Filtered out variant calls with a phred posterior probability lower than 30.

I believe that this is mentioned here: SAMtools/BCFtools specific information (please \read through the various flags)

  1. Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.

-1 FLOAT min P-value for strand bias (given PV4) [0.0001]

  1. Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.

You can just do this with BCFtools (bcftools view, I believe), by filtering on the AF (allele frequency) tag.

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2058 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6