Question: How to calculate true positive indels?
0
gravatar for tangmanhd
4.0 years ago by
tangmanhd0
tangmanhd0 wrote:

Hi,

I have a  VCF file (ref.vcf) that states where are exactly the insertions and deletions in my genome. And my indel detection method produces its own VCF file (let's call it test.vcf).

To calculate the True Positives, I need detect the intersection of test.vcf and ref.vcf (I use exact intersection for the sake of simplicity for now). The True Positives, are the features in test.vcf that are also in ref.vcf. It is easy to understand the definition. But how to code it by hand? Or is there some software or package (in R or matlab)can be used to give the the correct results? Thanks.

 

R genome • 1.5k views
ADD COMMENTlink modified 4.0 years ago by Sean Davis25k • written 4.0 years ago by tangmanhd0

You want to calculate the intersection of two lists. R has a function named intersect().

ADD REPLYlink modified 20 minutes ago by RamRS25k • written 4.0 years ago by karl.stamm3.6k

If you want to classify true/false detection rates, you need to use synthetic data.

ADD REPLYlink written 4.0 years ago by Brian Bushnell17k
2
gravatar for Manuel Landesfeind
4.0 years ago by
Göttingen, Germany
Manuel Landesfeind1.2k wrote:
bedtools intersect -f 1.0 -r -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t"} $4==$14 && $5 == $15'

Bedtools intersect

Bedtools allows to control the required overlap size (-f parameter) to adjust your criteria for being a true positive. For example, if you have very long inserts a 90% overlap might be sufficient.

Anyway, look at bcftools norm to make sure that the Indels from both VCF files are represented in the same way. Unfortunately, Indels can often be represented in more than one way, particularly in low-complexity regions. I recently ran into this problem...

EDIT: Change to include REF and ALT comparison between reference and alternative.

landesfeind@bionf-local:/tmp$ cat ref.vcf
##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample1
chr1    10    .    G    GA    10    PASS    .    .    .
chr1    110    .    T    TT    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ cat test.vcf
##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample1
chr1    10    .    G    GCA    10    PASS    .    .    .
chr1    110    .    T    TT    10    PASS    .    .    .
chr1    210    .    C    CA    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ bedtools intersect -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t&"} $4==$14 && $5 == $15'
chr1    110    .    T    TT    10    PASS    .    .    .    chr1    110    .    T    TT    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ bedtools intersect -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t"} $4==$14 && $5 == $15' | wc -l
1
ADD COMMENTlink modified 3 months ago by RamRS25k • written 4.0 years ago by Manuel Landesfeind1.2k
2
gravatar for venu
4.0 years ago by
venu6.3k
Germany
venu6.3k wrote:

If both ref.vcf and test.vcf are in same format, you can use vcftools. In particular you may need vcf-isec

vcf-isec -n +2 ref.vcf.gz test.vcf.gz > TrueIndels.vcf

and before using vcftools, bgzip your vcf files and index with tabix.

ADD COMMENTlink modified 3 months ago by RamRS25k • written 4.0 years ago by venu6.3k
0
gravatar for Tej Sowpati
4.0 years ago by
Tej Sowpati250
India
Tej Sowpati250 wrote:

Hi,

Is it mandatory that you have to use R or MATLAB? If not, have a look at vcftools. You can compare and filter vcf files in several different ways, and one of the most frequent usages is to identify common indels.

Cheers

ADD COMMENTlink written 4.0 years ago by Tej Sowpati250
0
gravatar for Sean Davis
4.0 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

The GRanges and VariantAnnotation Bioconductor packages make this pretty easy in R. Roughly, you will need to:

  1. Use readVcf to read in your two VCF files.
  2. Use findOverlaps(...., type="equal").

Alternatively, simply concatenate the chromosome, start, and end into a string for each variant for each VCF file. Then, use a simple intersect on the two character vectors.

ADD COMMENTlink modified 3 months ago by RamRS25k • written 4.0 years ago by Sean Davis25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1870 users visited in the last hour