How to calculate variant length for each sample
2
0
Entering edit mode
8 months ago
lz1903 • 0

Hi,

I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.

genome • 585 views
1
Entering edit mode
8 months ago

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

with the file 'biostar.code' (I'm too lazy to do the median depth )

class Stat {
Integer min;
Integer max;
long count;
double sum;
};

final Map<String,Stat> hash = new HashMap<>();
hash.put(sn,new Stat());
}
stream().forEach(V->{
for(final String sn:hash.keySet()) {
final Genotype gt = V.getGenotype(sn);
if(gt.isNoCall() || gt.isHomRef()) continue;
Stat stat = hash.get(sn);
final int len = V.getLengthOnReference();
if(stat.min==null || stat.min.intValue()>len) stat.min = len;
if(stat.max==null || stat.max.intValue()<len) stat.max = len;
stat.sum+=len;
stat.count++;
}
});

for(final String sn:hash.keySet()) {
Stat stat = hash.get(sn);
if(stat.count==0L) continue;
println(sn+" "+stat.min+" "+stat.max+" "+(stat.sum/stat.count));
}


usage: (sample / min / max /mean )

java -jar dist/bioalcidaejdk.jar -f biostar.code src/test/resources/test_vcf01.vcf
S3 1 5 1.1839080459770115
S4 1 5 1.1744186046511629
S5 1 5 1.1975308641975309
S6 1 5 1.1948051948051948
S1 1 3 1.0909090909090908
S2 1 5 1.2083333333333333

0
Entering edit mode
8 months ago

I think one way to do this is to list both the REF and ALT, then subtract the lengths of each accordingly. Feels naive a bit, but I think it works just fine:

wget https://raw.githubusercontent.com/everestial/VCF-Simplify/master/exampleInput/input_test.vcf

# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt

# Compute lengths of each REF/ALT pair.
cat alts.txt | python size.py > len.txt

# Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1


prints:

2.3435582822086 1   1   81


where size.py is a simple python script I just wrote (might be incorrect though since I just typed it all out in a single go, start there if not corrrect):

import sys
for line in sys.stdin:
ref, alts = line.strip().split()
for alt in alts.split(","):
size = abs(len(ref) - len(alt))
print (size + 1)


the above will count deletions the same way as insertions - the lengths are computed relative to the reference rather than the length of the variant itself)

1
Entering edit mode

I think one way to do this is to list both the REF and ALT

doesn't work when the END is specified in the INFO field (for SV)

0
Entering edit mode

ah yes, looks like I've failed to consider that case.