How to calculate variant length for each sample
2
0
Entering edit mode
3.1 years ago
lz1903 ▴ 10

Hi,

I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.

Thanks in advance, Leilei

genome • 1.6k views
ADD COMMENT
1
Entering edit mode
3.1 years ago

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

with the file 'biostar.code' (I'm too lazy to do the median depth )

class Stat {
    Integer min;
    Integer max;
    long count;
    double sum;
    };

final Map<String,Stat> hash = new HashMap<>();
for(final String sn: header.getSampleNamesInOrder()) {
    hash.put(sn,new Stat());
}
stream().forEach(V->{
    for(final String sn:hash.keySet()) {
        final Genotype gt = V.getGenotype(sn);
        if(gt.isNoCall() || gt.isHomRef()) continue;
        Stat stat = hash.get(sn);
        final int len = V.getLengthOnReference();
        if(stat.min==null || stat.min.intValue()>len) stat.min = len;
        if(stat.max==null || stat.max.intValue()<len) stat.max = len;
        stat.sum+=len;
        stat.count++;
        }
    });

for(final String sn:hash.keySet()) {
Stat stat = hash.get(sn);
if(stat.count==0L) continue;
println(sn+" "+stat.min+" "+stat.max+" "+(stat.sum/stat.count));
}

usage: (sample / min / max /mean )

java -jar dist/bioalcidaejdk.jar -f biostar.code src/test/resources/test_vcf01.vcf 
S3 1 5 1.1839080459770115
S4 1 5 1.1744186046511629
S5 1 5 1.1975308641975309
S6 1 5 1.1948051948051948
S1 1 3 1.0909090909090908
S2 1 5 1.2083333333333333
ADD COMMENT
0
Entering edit mode
3.1 years ago

I think one way to do this is to list both the REF and ALT, then subtract the lengths of each accordingly. Feels naive a bit, but I think it works just fine:

wget https://raw.githubusercontent.com/everestial/VCF-Simplify/master/exampleInput/input_test.vcf

# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt

# Compute lengths of each REF/ALT pair.
cat alts.txt | python size.py > len.txt

 # Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1

prints:

2.3435582822086 1   1   81

where size.py is a simple python script I just wrote (might be incorrect though since I just typed it all out in a single go, start there if not corrrect):

import sys
for line in sys.stdin:
    ref, alts = line.strip().split()
    for alt in alts.split(","):
        size = abs(len(ref) - len(alt))
        print (size + 1)

the above will count deletions the same way as insertions - the lengths are computed relative to the reference rather than the length of the variant itself)

ADD COMMENT
1
Entering edit mode

I think one way to do this is to list both the REF and ALT

doesn't work when the END is specified in the INFO field (for SV)

ADD REPLY
0
Entering edit mode

ah yes, looks like I've failed to consider that case.

ADD REPLY

Login before adding your answer.

Traffic: 1878 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6