I want to see the max, min, median, and mean lengths of variants for each sample. Is there a tool I can use, or can I just calculate them from the VCF files? I have used RTG Tools vcfstats but only got a histogram of the distribution.
I think one way to do this is to list both the REF and ALT alleles, then take the difference of their lengths. It feels a bit naive, but I think it works just fine:
wget https://raw.githubusercontent.com/everestial/VCF-Simplify/master/exampleInput/input_test.vcf
# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt
# Compute lengths of each REF/ALT pair.
cat alts.txt | python size.py > len.txt
# Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1
prints (mean, median, min, max):
2.3435582822086 1 1 81
where size.py is a simple Python script I just wrote (it might be incorrect since I typed it all out in a single go, so start there if the numbers look wrong):
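import sys

for line in sys.stdin:
    # Each input line holds the REF and ALT columns
    ref, alts = line.strip().split()
    # ALT may list several comma-separated alleles
    for alt in alts.split(","):
        size = abs(len(ref) - len(alt))
        # +1 so that an allele of the same length as REF (e.g. a SNP) counts as 1
        print(size + 1)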
(The above counts deletions the same way as insertions: the lengths are computed relative to the reference rather than from the length of the variant itself.)
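Since the question asks for numbers per sample, here is a rough sketch that extends the same idea to the genotype columns. It is not part of the original answer: the function names `per_sample_lengths` and `summarize` are made up for illustration, and it assumes an uncompressed, tab-separated VCF whose FORMAT column contains GT. Each ALT allele a sample carries is counted once per record, using the same `+1` convention as size.py. Symbolic alleles such as `<DEL>` and the `*` placeholder are skipped, because their string length says nothing about the variant size.

```python
import re
import statistics
from collections import defaultdict


def variant_size(ref, alt):
    # Same convention as size.py: equal-length alleles (SNPs) count as 1
    return abs(len(ref) - len(alt)) + 1


def per_sample_lengths(lines):
    """Map each sample name to the sizes of the ALT alleles it carries."""
    samples = []
    lengths = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # no genotype columns on this record
        ref, alts = fields[3], fields[4].split(",")
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        for name, column in zip(samples, fields[9:]):
            parts = column.split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            # Allele indices carried by this sample, ignoring 0 (REF) and "." (missing)
            carried = {a for a in re.split(r"[/|]", gt) if a.isdigit() and a != "0"}
            for a in carried:
                i = int(a) - 1
                if i >= len(alts):
                    continue
                alt = alts[i]
                if alt.startswith("<") or alt == "*":
                    continue  # assumption: skip symbolic and spanning-deletion alleles
                lengths[name].append(variant_size(ref, alt))
    return lengths


def summarize(values):
    # Raises statistics.StatisticsError if the sample carries no variants
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }
```

Something like `summarize(per_sample_lengths(open("input_test.vcf"))["SAMPLE"])` would then give the four numbers for one sample, where `SAMPLE` is a placeholder for a column name from the header.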
doesn't work when END is specified in the INFO field (for SVs)

ah yes, looks like I've failed to consider that case.
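For the SV case, a possible workaround is to look at the INFO column when the ALT is a symbolic allele. The sketch below is only an illustration under a few assumptions: the helper names `parse_info` and `variant_lengths` are invented here, the records use the standard `SVLEN` and `END` INFO keys, and `END - POS` is taken as the length when `SVLEN` is absent (which suits deletions but gives 0 for insertions that set END equal to POS). Breakend notation and `*` alleles get `None` because no length can be read off them. Note that sequence alleles keep the `+1` convention from size.py while SV lengths are reported as-is, so you may want to harmonize the two before mixing them.

```python
def parse_info(info):
    """Turn 'A=1;B=2;FLAG' into a dict; flags map to an empty string."""
    tags = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        tags[key] = value
    return tags


def variant_lengths(pos, ref, alts, info):
    """Return one length per ALT allele, or None when it cannot be determined."""
    tags = parse_info(info)
    svlens = tags.get("SVLEN", "").split(",")
    end = tags.get("END", "")
    sizes = []
    for i, alt in enumerate(alts.split(",")):
        if alt.startswith("<"):
            # Symbolic allele such as <DEL>: prefer SVLEN, fall back to END - POS
            if i < len(svlens) and svlens[i].lstrip("-").isdigit():
                sizes.append(abs(int(svlens[i])))
            elif end.isdigit():
                sizes.append(int(end) - pos)
            else:
                sizes.append(None)
        elif alt == "*" or "[" in alt or "]" in alt:
            # Spanning deletion placeholder or breakend: no usable length
            sizes.append(None)
        else:
            # Plain sequence allele: same convention as size.py
            sizes.append(abs(len(ref) - len(alt)) + 1)
    return sizes
```

To plug this into the pipeline above you would cut columns 2, 4, 5 and 8 instead of just 4 and 5, and drop the `None` values before computing the statistics.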