Computing B-Allele Frequency From Bam Or Vcf Related To Exome Samples
1
4
Entering edit mode
11.2 years ago
Stephwen ▴ 160

Hi everyone,

I would like to compute BAF profiles for samples whose exome has been sequenced. I thus have the BAM files, and VCF files containing the SNPs, which were called with a GATK-based pipeline.

At first I thought about using the VCF files, and considering that each position which is not present in the VCF has a BAF of 0, and for each position present in the VCF, the BAF was computed as follows:

BAF = coverage Alt / (coverage Alt + coverage Ref)

where coverage Alt is the coverage of the alternate allele(s), and coverage Ref is the coverage of the reference allele(s)

Of course, by doing so, the BAF has a value of zero at most positions.

I was wondering if this method was correct, and if they were other ways to do it.

Thanks

vcf bam exome allele-frequency • 15k views
ADD COMMENT
0
Entering edit mode

I am trying to perform a similar task. I was just wondering if you got two traces approximately symmetrical around 0.5 in your BAF plot. I am roughly getting a single trace for each chromosome (or segment), and have a bit of difficulty interpreting.

ADD REPLY
13
Entering edit mode
11.2 years ago
Irsan ★ 7.8k

Download a bed-file with snp coordinates (because otherwise like you said most coordinate will have BAF=0):

Go to UCSC tables and choose your preferred snp-table (example) and for output format choose BED.

Then make a script called mpileup2baf.pl:

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(:config pass_through no_ignore_case);

my ($min_reads) = (10);
GetOptions (
        "min-reads:s" => \$min_reads,
);

while (<>) {
        my $line = $_;
        my @columns = split("\t",$line);
        my $chr = $columns[0];
        my $start = $columns[1];
        my $end = $start + 1;
        my $num_reads = $columns[3];
        my $calls = $columns[4];
        my $id = "mpileup_number_" . $.;
        if($num_reads < $min_reads){ # not enough coverage to have good confidence in the call
                next;
        }
        my $num_ref = 0;
        while ($calls =~ /[,.]/g) { $num_ref++ }
        my $num_var = $num_reads - $num_ref;
        my $varAlleleFreq = ($num_var/$num_reads)*100;
        print("$chr\t$start\t$end\t$id\t$varAlleleFreq\n");
}

Then use samools mpileup to get all the base calls at the snp-postitions and calculate BAF with mpileup2baf.pl

samtools mpileup -q 15 -Q20 -f hg19.fa -l snp137.bed yourSample.bam | mpileup2baf.pl --min-reads 20 > yourSample_mpileup_snp137_minReads20_minMapQ_15_minBaseQ_20.bed

This skips bases with base quality lower than 20 and bases in reads with mapping quality lower than 15. Furthermore it ignores snp positions where the amout of reads that are mapped is lower than 20.Then plot with any visualization program you like. I prefer ggplot2

ADD COMMENT
1
Entering edit mode

Won't this procedure also skip all variant sites were a BAF can be calculated in your local samples but are not recorded in a SNP database? We see this fairly routinely in our sequencing work.

ADD REPLY
0
Entering edit mode

Good point. Did not think about that, I did this for the first time yesterday :-S Variant detected not in dbSnp should be in as well. But you have to inspect the distribution of BAFs of those variants to see if you dont overload your profile with low BAF values (because 99.9% is BAF VALUES~ 0) or that they have a BAF lower limit (because snp calller only reports variants above certain threshold)

ADD REPLY
0
Entering edit mode

That's true. It depends on how interested you are in very rare variants in the population I suppose. But setting a lower threshold for BAF for some sort of post-hoc filtering after the BAF calculations would be perfectly reasonable I would think.

ADD REPLY
0
Entering edit mode

Thanks for the thorough answer!

ADD REPLY
0
Entering edit mode

No thanks. Also, if you have any problems with getting the BAFs from the bams or visualization issues let me know :-)

ADD REPLY
0
Entering edit mode

I have problems with getting the BAFs from the bams and visualization issues with ggplot2..can you show me the plot code in r? Thanks

ADD REPLY
0
Entering edit mode

@Irsan, I have some queries which I started as a question thread here. I would be happy if you can give me some useful insights where am getting wrong. I would say I have used these bam files for SNP calling and pretty work fine but for calculating B-Allele frequency they fail when am using mpileup on them. Any thoughts?

ADD REPLY

Login before adding your answer.

Traffic: 2800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6