Question: indel preprocessing prior to bam-readcount
gravatar for Jokhe
4.2 years ago by
Jokhe120 wrote:

What kind of preprocessing do you need to use bam-readcount and for indels?

This question has been already asked couple of times but so far I haven't found any clear answer (scripts or other examples) for them. So, I have a set of VCF files generated by VarScan2. These VCF files contain somatic indels occuring in tumor samples. I am trying to filter these indels by using bam-readcount and The format of VCF file is following;

 chr12  46236017    .   CA  C   .   PASS    DP=177;SS=1;SSC=0;GPV=6.0385E-14;SPV=8.3811E-1  GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:66:44:17:27.87%:44,0,17,0 0/1:.:111:80:23:22.33%:80,0,23,0

My script is presented below;

# Prepare input file by selecting data from VCF input file.
sed '/^#/ d' $file | awk '{print $1,$2,$2}' > $file.positions

# Calculate read features by using bam-readcount tool.
$BAM_READCOUNT -w1 -f $REFERENCE ${file%.vcf}.bam -l $file.positions > $file.readcounts
rm $file.positions

# Filter variants by using
perl $FPFILTER --var-file $file --readcount-file $file.readcounts --output-file $file.output
rm $file.readcounts

 # Select variants that pass all filters.
grep "PASS" $file.output | awk '{print $1,$2-2,$2+1}' > $file.pass
rm $file.output

# Filter original VCF file so that only variants passing all filters are included.
vcftools --vcf $file --positions $file.pass --recode --out ${file%.vcf}.filtered.vcf
rm *.pass
rm *.log
rm $file

However, this approad filters all indels from my data. The problem seems to be in readcounts, because the prompt indicates that;

Filtering $file for false-positives...
4 variants
0 had a position near the ends of most supporting reads (position < 0.1)
0 had strandedness < 0.01 (most supporting reads are in the same direction)
0 had var_count < 3 (not enough supporting reads)
0 had depth < 8
0 passed all filters
4 had no readcounts for the variant allele

I would really appreciate tips and example scripts to see how raw variant data from VCF should be extracted and used in bam-readcount and fpfilter.

ADD COMMENTlink modified 4.1 years ago by m.b0 • written 4.2 years ago by Jokhe120

chr12 46236017 . CA C

for bam-readcount, you need to supply chr12 46236018 46236018 as the position

ADD REPLYlink written 3.1 years ago by Ming Tang2.6k

I have the same problem, and apparently there is no reliable answer out there by now. How did you solve the problem for your data?

ADD REPLYlink written 3.2 years ago by ATpoint41k
gravatar for m.b
4.1 years ago by
m.b0 wrote:

I have been facing a similar problem. I see that the main issue for your filtering process is: "no readcounts for the variant allele". Can you please tell me if your 4 variants are all deletions?

I have a larger dataset with insertions and deletions; and all my deletions have not passed the filtering because there is no read count for them. This is a bam-readcount problem.

I went back to the bam-readcount github page and it seems that this is a recognised bug. It does not return readcounts at positions of known deletions. Please see link below:

Unfortunately, I am not sure how to get around this....

ADD COMMENTlink written 4.1 years ago by m.b0

Thank you for your answer. The example case I am presenting in this post contains only deletions so it is very likely that this is caused by the bug you indicated. However, I have similar problem with all of my samples and some of them do contain also insertions. Insertions are also filtered out due to "no readcounts".

ADD REPLYlink written 4.1 years ago by Jokhe120
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1228 users visited in the last hour