Hi,
I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.
Thanks in advance, Leilei
Hi,
I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.
Thanks in advance, Leilei
using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html
with the file 'biostar.code' (I'm too lazy to do the median depth )
class Stat {
Integer min;
Integer max;
long count;
double sum;
};
final Map<String,Stat> hash = new HashMap<>();
for(final String sn: header.getSampleNamesInOrder()) {
hash.put(sn,new Stat());
}
stream().forEach(V->{
for(final String sn:hash.keySet()) {
final Genotype gt = V.getGenotype(sn);
if(gt.isNoCall() || gt.isHomRef()) continue;
Stat stat = hash.get(sn);
final int len = V.getLengthOnReference();
if(stat.min==null || stat.min.intValue()>len) stat.min = len;
if(stat.max==null || stat.max.intValue()<len) stat.max = len;
stat.sum+=len;
stat.count++;
}
});
for(final String sn:hash.keySet()) {
Stat stat = hash.get(sn);
if(stat.count==0L) continue;
println(sn+" "+stat.min+" "+stat.max+" "+(stat.sum/stat.count));
}
usage: (sample / min / max /mean )
java -jar dist/bioalcidaejdk.jar -f biostar.code src/test/resources/test_vcf01.vcf
S3 1 5 1.1839080459770115
S4 1 5 1.1744186046511629
S5 1 5 1.1975308641975309
S6 1 5 1.1948051948051948
S1 1 3 1.0909090909090908
S2 1 5 1.2083333333333333
I think one way to do this is to list both the REF and ALT, then subtract the lengths of each accordingly. Feels naive a bit, but I think it works just fine:
wget https://raw.githubusercontent.com/everestial/VCF-Simplify/master/exampleInput/input_test.vcf
# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt
# Compute lengths of each REF/ALT pair.
cat alts.txt | python size.py > len.txt
# Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1
prints:
2.3435582822086 1 1 81
where size.py is a simple python script I just wrote (might be incorrect though since I just typed it all out in a single go, start there if not corrrect):
import sys
for line in sys.stdin:
ref, alts = line.strip().split()
for alt in alts.split(","):
size = abs(len(ref) - len(alt))
print (size + 1)
the above will count deletions the same way as insertions - the lengths are computed relative to the reference rather than the length of the variant itself)
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
doesn't work when the
ENDis specified in the INFO field (for SV)ah yes, looks like I've failed to consider that case.