filter multisample vcf based on per-sample average read depth
4.7 years ago
agathejouet ▴ 10

Hi,

I have a multisample VCF that I would like to filter based on read depth. More precisely, I would like to modify the VCF so that a SNP called at a position where the read depth is higher than that sample's average read depth is changed to the reference allele. I know how to filter SNPs where the read depth is higher than a fixed value with

vcftools --maxDP


However, the value applied to the maxDP parameter would change depending on the sample.

Is there any tool or trick that could help me do this?

Agathe

vcf filter variant call read depth • 4.0k views

How do you plan to handle samples that are discordant (i.e., the SNP has more reads than the average read depth in one sample but not in another)? Are you just going to change the genotype of each sample individually? What's your ultimate goal here?


That is exactly my problem. The average read depth is different for each sample, so I would need to find a way to modify the genotypes of each sample individually. My ultimate goal is to have a VCF file with all samples, where SNPs have been filtered for a minimum and a maximum read depth. I would use this VCF to build phylogenies, run population-structure analyses, calculate heterozygosity, and so on.


Oof, that's an annoying one. You could write a script that uses bedtools genomecov to get the read depth at each variant position for each sample, but it will be tedious: you'll have to pull that information from each sample's BAM file while iterating through the VCF.

I guess another (really not ideal) alternative would be to go back and call variants on each sample individually, filter with vcftools as you mention above, and then merge all the samples into a single VCF with bcftools merge or GATK's CombineVariants tool.

Is there a particular reason you feel filtering in this way will help your analyses? Most variant callers do a fair amount of filtering or offer separate utilities that can also be used for filtering.


Yes, I thought of redoing the variant calling for each individual separately and merging the resulting VCF files. The problem with this is that samples that were homozygous for the reference allele will end up with missing genotypes wherever other individuals carry a variant. A way around this would be to print all positions (variant and non-variant) to get a gVCF, but you end up with quite large files.

I feel I need to do this filtering because I am mapping reads to loci from a gene family with lots of paralogs. I fear that if paralogs are not in my reference, they will map to the same location/locus, falsely inflating the number of SNPs.

I will try Pierre's solution and will let you know if it works for me! Working on this now.


Ah okay. Yes, Pierre's solution should work for you assuming your FORMAT fields include DP. Good luck.


Could you post some example VCF records?


Not sure this is what you want, but my multisample VCF looks like this at the moment (one record per line):

LOC_Os01g02250_chr1_688792..693013_UTR-0        1422    .       A       G       999     .       .       GT:PL:DP        1/1:245,247,0:82        1/1:217,211,0:70        1/1:242,255,0:144       1/1:234,234,0:84        1/1:238,226,0:75        1/1:255,255,0:99        1/1:238,255,0:94        1/1:242,189,0:69        1/1:243,235,0:78        1/1:248,255,0:95        1/1:241,193,0:64        1/1:248,190,0:63        1/1:255,255,0:121       1/1:247,193,0:64        1/1:249,232,0:77        1/1:211,202,0:68        1/1:229,160,0:53        1/1:230,244,0:81        1/1:248,172,0:57        1/1:198,136,0:56        1/1:234,223,0:74        0/0:0,255,255:248       0/0:0,255,255:248       0/0:0,255,214:237       1/1:255,255,0:93        1/1:255,255,0:213       0/0:0,255,255:249       1/1:200,51,0:17 1/1:251,232,0:79        1/1:219,255,0:86        0/0:0,255,255:235       1/1:255,255,0:106       1/1:246,255,0:101       0/0:0,255,255:235       0/0:0,255,233:239
LOC_Os01g02250_chr1_688792..693013_UTR-0        1522    .       G       T       999     .       .       GT:PL:DP        1/1:158,79,0:32 1/1:147,64,0:26 1/1:174,96,0:32 1/1:160,96,0:37 1/1:120,48,0:16 1/1:158,105,0:35        1/1:136,81,0:27 1/1:141,84,0:28 1/1:138,69,0:23 1/1:146,90,0:30 1/1:115,54,0:18 1/1:133,48,0:16 1/1:141,69,0:36 1/1:150,72,0:24 1/1:157,96,0:32 1/1:162,72,0:24 1/1:142,39,0:13 1/1:134,27,0:26 ./.:93,24,6:7   1/1:166,117,0:39        1/1:122,87,0:29 0/0:0,255,255:247       0/0:0,255,255:249       0/0:0,255,253:243       1/1:207,160,0:53        0/0:0,255,255:224       0/0:0,255,255:231       ./.:0,3,39:1    1/1:182,53,0:22 1/1:120,31,0:14 0/0:0,255,247:227       1/1:182,119,0:45        1/1:173,63,0:21 0/0:0,255,255:241       0/0:0,255,129:234
LOC_Os01g02250_chr1_688792..693013_UTR-0        1524    .       C       T       999     .       .       GT:PL:DP        1/1:171,93,0:31 1/1:152,70,0:28 1/1:172,96,0:32 1/1:165,108,0:36        1/1:130,48,0:16 1/1:157,88,0:35 1/1:151,81,0:27 1/1:137,84,0:28 1/1:144,69,0:23 1/1:151,90,0:30 1/1:138,60,0:20 1/1:135,54,0:18 1/1:155,96,0:37 1/1:125,66,0:22 1/1:155,99,0:33 1/1:166,72,0:24 1/1:129,36,0:12 1/1:150,38,0:26 ./.:117,21,0:7  1/1:172,123,0:41        1/1:131,76,0:30 0/0:0,255,255:247       0/0:0,255,255:245       0/0:0,255,249:245       1/1:209,160,0:53        0/0:0,255,255:231       0/0:0,255,255:233       ./.:0,3,35:1    1/1:184,53,0:22 1/1:123,31,0:14 0/0:0,255,251:229       1/1:154,82,0:45 1/1:174,63,0:21 0/0:0,255,255:240       0/0:0,255,135:234
LOC_Os01g02250_chr1_688792..693013_UTR-0        1532    .       C       T       187     .       .       GT:PL:DP        0/0:0,105,157:35        0/0:0,93,172:31 0/0:0,114,202:39        0/0:0,129,205:43        0/0:0,66,171:22 0/0:0,132,206:44        0/0:0,105,153:35        0/0:0,117,183:39        0/0:0,78,140:26 0/0:0,117,183:39        0/0:0,70,162:28 0/0:0,63,123:21 0/0:0,126,182:42        0/0:0,69,138:23 0/0:0,120,193:40        0/0:0,81,195:27 0/0:0,42,154:14 0/0:0,87,193:29 0/0:0,30,123:10 0/0:0,148,179:50        0/0:0,102,165:34        0/0:0,255,255:245       0/0:0,255,255:248       0/0:0,168,212:244       0/0:0,160,209:53        0/0:0,255,255:230       0/0:0,255,255:241       ./.:0,3,31:1    0/0:0,63,200:21 0/0:0,69,147:23 0/1:49,0,184:228        0/0:0,141,213:47        0/0:0,60,175:20 0/1:194,0,255:240       0/0:0,255,110:236

4.7 years ago

Using vcffilterjdk (http://lindenb.github.io/jvarkit/VcfFilterJdk.html) with the following script, assuming the genotypes have a DP attribute:

final VariantContextBuilder vcb = new VariantContextBuilder(variant);
vcb.rmAttribute("AC");
vcb.rmAttribute("AF");
// average DP over this variant's genotypes that carry a DP value
final double avgdp = variant.getGenotypes().stream().filter(G->G.hasDP()).mapToInt(G->G.getDP()).average().getAsDouble();
vcb.genotypes(variant.getGenotypes().
    stream().
    map(G->{
        if(!G.isCalled()) return G;
        if(!G.hasDP()) return G;
        // keep genotypes whose DP is at or below the per-variant average
        if((double)G.getDP() <= avgdp) return G;
        // DP above the average: reset the genotype to a no-call
        // (add variant.getReference() twice to aL instead for an explicit hom-ref call)
        final List<Allele> aL = new ArrayList<>();
        return new GenotypeBuilder(G).alleles(aL).make();
    }).
    collect(Collectors.toList())
);
return vcb.make();


example:

curl "ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/variation/main/vcf/all/ag1000g.phase2.ar1.2L.vcf.gz" | gunzip -c |\
java -jar  dist/vcffilterjdk.jar -f script.js


For those who want to try the example, please note that the gzipped example file is 142 GB.


Can I have a bit more information on how your script works, Pierre, please? From what I can guess, it changes alternate alleles to reference alleles when the read depth at the SNP is larger than the average read depth across samples. Is that right?

However what I would need is to filter according to the average read depth within each sample.

Thanks!


I understand the problem this way: DP in the INFO field is the average read depth for that SNP (site), while DP in the FORMAT field is per-sample. If a sample's DP (FORMAT) is more than the average DP (INFO) for that SNP, then the ref allele should replace the alternate allele. Is my understanding correct?


The first sentence is correct. However, I would like to compare the sample DP (in the FORMAT field) to the average read depth of that sample overall. What would help is one file with the average read depth for all samples, e.g.:

sample_ID     avg_read_depth
sample 1          100
sample 2          200
sample 3          150


And then, for each variant of sample 1, the ref allele should replace the alternate allele if sample 1's DP (in the FORMAT field) is bigger than, say, 1.5 × 100 = 150. For sample 2, > 1.5 × 200 = 300...

Is this clearer?

Sorry for the misunderstanding, trying my best to explain.
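The per-sample rule described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a production tool: it hand-parses header-less, tab-separated VCF body lines with a GT:PL:DP-style FORMAT (as in the records posted earlier), uses the hypothetical 1.5 multiplier from the example, leaves PL and other fields untouched, and in real use something like pysam or cyvcf2 would be a better choice:

```python
def mean_dps(records):
    """Per-sample mean of the FORMAT/DP values; genotypes with missing DP ('.') are skipped."""
    sums, counts = {}, {}
    for rec in records:
        fields = rec.split("\t")
        fmt = fields[8].split(":")          # e.g. ['GT', 'PL', 'DP']
        dp_idx = fmt.index("DP")
        for i, sample in enumerate(fields[9:]):
            parts = sample.split(":")
            if parts[dp_idx] == ".":
                continue
            sums[i] = sums.get(i, 0) + int(parts[dp_idx])
            counts[i] = counts.get(i, 0) + 1
    return {i: sums[i] / counts[i] for i in sums}

def reset_high_dp(records, factor=1.5):
    """Set GT to 0/0 wherever a sample's DP exceeds factor * that sample's mean DP.

    Note: only the GT subfield is rewritten; PL and any other FORMAT
    subfields are left as-is, which a real implementation should revisit.
    """
    means = mean_dps(records)
    out = []
    for rec in records:
        fields = rec.split("\t")
        fmt = fields[8].split(":")
        gt_idx, dp_idx = fmt.index("GT"), fmt.index("DP")
        for i in range(len(fields) - 9):
            parts = fields[9 + i].split(":")
            if parts[dp_idx] != "." and int(parts[dp_idx]) > factor * means[i]:
                parts[gt_idx] = "0/0"
                fields[9 + i] = ":".join(parts)
        out.append("\t".join(fields))
    return out
```

On records like the ones shown earlier, this would turn a call such as `1/1:245,247,0:82` into `0/0:245,247,0:82` whenever 82 exceeds 1.5 times that sample's mean depth across all records.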


Thanks for the explanation.

Since headers were missing from the examples provided above, I worked with the example VCF provided here: https://github.com/vcflib/vcflib/blob/master/samples/scaffold612.vcf. This sample VCF has 47 samples; note that the column 1 values (sample names) could be cleaned up further, but your sample names may vary.

$ bcftools query -H -f'[%DP\t]\n' sample.vcf | sed 's/\./0/g' | datamash -H mean 1-47 | datamash transpose | datamash --full round 2 | cut -f1,3

First few lines (two columns: first column = sample, second column = mean):

mean(#[1]I205_FC80E1NABXX_L6_PIGtkyRDIDIAAPEI-7:DP)    12
mean([2]I205_FC80E1NABXX_L7_PIGtkyRDJDIAAPEI-11:DP)    9
mean([3]I232_FCB05KDABXX_L4_COLcalRANDIAAPE:DP)        13
mean([4]I232_FCB05KDABXX_L5_COLcalRAMDIAAPE:DP)        13
mean([5]I246_FCB05BAABXX_L5_COLcalRAGDIAAPE:DP)        14
mean([6]I246_FCB05BAABXX_L7_COLcalRASDIAAPE:DP)        13
mean([7]I248_FCB068VABXX_L2_COLcalRBCDIAAPE:DP)        6
mean([8]I248_FCB068VABXX_L3_COLcalRAGDIAAPE:DP)        9
mean([9]I248_FCB068VABXX_L4_COLcalRAHDIAAPE:DP)        12
mean([10]I248_FCB068VABXX_L5_COLcalRAIDIAAPE:DP)       10

Same thing, a little slower, with csvkit:

$ bcftools query -H -f'[%DP\t]\n' sample.vcf | awk -v OFS="\t" '{sub(/ /,"");gsub(/\./,0);$1=$1}1' | csvstat --mean
1. #1[1]I205_FC80E1NABXX_L6_PIGtkyRDIDIAAPEI-7:DP: 12.216
2. [2]I205_FC80E1NABXX_L7_PIGtkyRDJDIAAPEI-11:DP: 9.278
3. [3]I232_FCB05KDABXX_L4_COLcalRANDIAAPE:DP: 12.879
4. [4]I232_FCB05KDABXX_L5_COLcalRAMDIAAPE:DP: 12.602
5. [5]I246_FCB05BAABXX_L5_COLcalRAGDIAAPE:DP: 13.691
6. [6]I246_FCB05BAABXX_L7_COLcalRASDIAAPE:DP: 12.614
7. [7]I248_FCB068VABXX_L2_COLcalRBCDIAAPE:DP: 6.06
8. [8]I248_FCB068VABXX_L3_COLcalRAGDIAAPE:DP: 9.479
9. [9]I248_FCB068VABXX_L4_COLcalRAHDIAAPE:DP: 11.524
10. [10]I248_FCB068VABXX_L5_COLcalRAIDIAAPE:DP: 9.69
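The datamash/csvstat step above is just a column-wise mean. For reference, the same computation in a few lines of plain Python (a sketch: the input lines stand in for the `bcftools query -f'[%DP\t]\n'` output, missing values `.` are counted as 0 to mirror the `sed 's/\./0/g'` step, and the trailing empty field left by the trailing tab is dropped):

```python
def column_means(tsv_lines):
    """Column-wise mean of per-sample DP values from tab-separated lines.

    Missing values '.' count as 0 (mirroring the sed step in the pipeline);
    empty trailing fields (from a trailing tab) are ignored.
    """
    totals, n = None, 0
    for line in tsv_lines:
        vals = [0 if v == "." else int(v) for v in line.rstrip("\n").split("\t") if v != ""]
        if totals is None:
            totals = [0] * len(vals)
        for i, v in enumerate(vals):
            totals[i] += v
        n += 1
    return [t / n for t in totals]
```

For example, two rows `12\t9\t.` and `14\t11\t20` give per-sample means of 13.0, 10.0, and 10.0.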


That is what the script is doing here:

final double avgdp = variant.getGenotypes().stream().filter(G->G.hasDP()).mapToInt(G->G.getDP()).average().getAsDouble();


which can be translated as:

for the current variant, loop over the genotypes having a "DP" field and get the average value of DP.
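In plain Python terms, that stream amounts to the following (a sketch: `genotype_dps` is a hypothetical list of per-genotype DP values for one variant, with `None` standing in for a genotype that lacks DP):

```python
def variant_mean_dp(genotype_dps):
    """Average DP across the genotypes of one variant,
    skipping genotypes without a DP value (None)."""
    dps = [dp for dp in genotype_dps if dp is not None]
    return sum(dps) / len(dps)
```

Note that, like the Java `average().getAsDouble()` without a presence check, this fails if no genotype at the site has DP.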