Question

extracting Allele Read Counts

0

Entering edit mode

7.1 years ago

Bogdan ★ 1.4k

Dear all,

please could you advise : given a (tumor) BAM file and a (germline) VCF file, what tool shall i use in order to extract the Allele Read Counts for each heterozygous SNP (from the vcf file) ?

many thanks,

-- bogdan

SNP Titan CNV • 3.6k views

ADD COMMENT • link updated 7.1 years ago by Pierre Lindenbaum 161k • written 7.1 years ago by Bogdan ★ 1.4k

1

Entering edit mode

Depending on the variant caller you are using. For some of them you will see the following two fields under FORMAT column:

##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">

ADD REPLY • link 7.1 years ago by iraun 6.2k

0

Entering edit mode

The easiest way to do this is to extract the sites using bcftools view -v snps -g het. You can add -m2 -M2 if you are only interested in biallelic site. Then you can use allelecounter.py (https://github.com/secastel/allelecounter) to extract the read counts at ref and alt allele.

ADD REPLY • link 6.7 years ago by prasundutta87 ▴ 660

0

Entering edit mode

thank you for your suggestions ;)

ADD REPLY • link 6.7 years ago by Bogdan ★ 1.4k

score 3 · Accepted Answer · 2017-04-05

a combination of some programs I've written.

Extract the heterozygous sites with vcffilterjs.
use awk to extract the 'CHROM:POS' of those site.
for each position , call FindAllCoverageAtPosition with the specified bam
sort + uniq to get one uniq header.

    java -jar jvarkit-git/dist/vcffilterjs.jar -e 'variant.getGenotype(0).isHet()' input.vcf | grep -v "^#" |\
awk '{printf("%s:%d\n",$1,$2);}' | while read  S; do echo "input.bam" | java -jar /jvarkit-git/dist/findallcoverageatposition.jar -p "${S}" ; done | \
LC_ALL=C sort | uniq | column -t


#File   CHROM      POS  SAMPLE  DEPTH  M    I   D  N  S   H  P  EQ  X  Base(A)  Base(C)  Base(G)  Base(T)  Base(N)  Base(^)  Base(-)
S1.bam  rotavirus  130  S1      328    328  0   0  0  52  0  0  0   0  33       0        0        295      0        0        0
S1.bam  rotavirus  232  S1      363    363  35  0  0  56  0  0  0   0  25       0        0        338      0        35       0
S1.bam  rotavirus  267  S1      315    315  1   1  0  44  0  0  0   0  0        296      19       0        0        1        1
S1.bam  rotavirus  961  S1      327    327  0   0  0  45  0  0  0   0  0        0        29       298      0        0        0

(later) pushed a fresh version on github a few minutes ago. there was a problem in the old findallcoverageatposition (all files were 'cram')