Question: filter multisample vcf based on per-sample average read depth
0
gravatar for agathejouet
2.9 years ago by
agathejouet0 wrote:

Hi,

I have a multisample vcf that I would like to filter based on read depth. More precisely, I would like to modify the vcf so that a snp that is detected where the read depth is more than the average read depth at the sample will change to the reference allele. I know how to filter snps where read depth is higher than a certain value with

vcftools --maxDP

However, the value applied to the maxDP parameter would change depending on the sample.

Is there any tool or trick that could help me do this?

Many thanks for your help,

Agathe

ADD COMMENTlink modified 2.9 years ago by Pierre Lindenbaum131k • written 2.9 years ago by agathejouet0

How do you plan to handle if the samples are discordant (i.e. the SNP has more reads than the avg read depth of one sample, but not another?) Are you just going to change the genotype of each sample individually? What's your ultimate goal here?

ADD REPLYlink written 2.9 years ago by jared.andrews077.5k

That is exactly my problem. The average read depth is different for each sample so I would need to find a way to modify the genotypes of each sample individually. My ultimate goal is to have a VCF file with all samples, where SNPs have been filtered for a minimum and a maximum read depth. I would use this VCF to build phylogenies, run population structure analyses, calculating heterozygosity...

ADD REPLYlink written 2.9 years ago by agathejouet0

Oof, that's an annoying one. You could write a script that uses bedtools genomecov to get the read-depth at each variant position for each sample, but it's going to be rather annoying as you'll have to get that information from the bam file for each sample while iterating through the VCF.

I guess another (really not ideal) alternative would be to go back and call variants on each sample individually, filter with vcftools as you mention above, and then merge all the samples into a single VCF with bcftools merge or GATK's CombineVariants tools.

Is there a particular reason you feel filtering in this way will help your analyses? Most variant callers do a fair amount of filtering or offer separate utilities that can also be used for filtering.

ADD REPLYlink written 2.9 years ago by jared.andrews077.5k

Yes, I thought of re-doing the variant calling for each individual separately and merging the resulting vcf files. The problem with this is that samples that were homozygous for the reference alleles will end up with missing genotypes if other individuals were variant. A way around this would be to print all positions (variants & non variants) to get a gvcf but you end up with quite large files.

I feel I need to do this filtering because I am mapping reads to loci from a gene family with lots of paralogs. I fear that if paralogs are not in my reference, they will map to the same location/locus, falsely inflating the number of SNPs.

I will try Pierre's solution and will let you know if it works for me! Working on this now.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by agathejouet0

Ah okay. Yes, Pierre's solution should work for you assuming your FORMAT fields include DP. Good luck.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by jared.andrews077.5k

Could you post some example VCF records?

ADD REPLYlink written 2.9 years ago by cpad011214k

Not sure this is what you want but my multisample vcf looks like this at the moment:

LOC_Os01g02250_chr1_688792..693013_UTR-0        1422    .       A       G       999     .       .       GT:PL:DP        1/1:245,247,0:82        1/1:217,211,0:70        1/1:242,255,0:144       1/1:234,234,0:84        1/1:238,226,0:75        1/1:255,255,0:99        1/1:238,255,0:94        1/1:242,189,0:69        1/1:243,235,0:78        1/1:248,255,0:95        1/1:241,193,0:64        1/1:248,190,0:63        1/1:255,255,0:121       1/1:247,193,0:64        1/1:249,232,0:77        1/1:211,202,0:68        1/1:229,160,0:53        1/1:230,244,0:81        1/1:248,172,0:57        1/1:198,136,0:56        1/1:234,223,0:74        0/0:0,255,255:248       0/0:0,255,255:248       0/0:0,255,214:237       1/1:255,255,0:93        1/1:255,255,0:213       0/0:0,255,255:249       1/1:200,51,0:17 1/1:251,232,0:79        1/1:219,255,0:86        0/0:0,255,255:235
   1/1:255,255,0:106       1/1:246,255,0:101       0/0:0,255,255:235       0/0:0,255,233:239
LOC_Os01g02250_chr1_688792..693013_UTR-0        1522    .       G       T       999     .       .       GT:PL:DP        1/1:158,79,0:32 1/1:147,64,0:26 1/1:174,96,0:32 1/1:160,96,0:37 1/1:120,48,0:16 1/1:158,105,0:35        1/1:136,81,0:27 1/1:141,84,0:28 1/1:138,69,0:23 1/1:146,90,0:30 1/1:115,54,0:18 1/1:133,48,0:16 1/1:141,69,0:36 1/1:150,72,0:24 1/1:157,96,0:32 1/1:162,72,0:24 1/1:142,39,0:13 1/1:134,27,0:26 ./.:93,24,6:7   1/1:166,117,0:39        1/1:122,87,0:29 0/0:0,255,255:247       0/0:0,255,255:249       0/0:0,255,253:243       1/1:207,160,0:53        0/0:0,255,255:224       0/0:0,255,255:231       ./.:0,3,39:1    1/1:182,53,0:22 1/1:120,31,0:14 0/0:0,255,247:227
   1/1:182,119,0:45        1/1:173,63,0:21 0/0:0,255,255:241       0/0:0,255,129:234
LOC_Os01g02250_chr1_688792..693013_UTR-0        1524    .       C       T       999     .       .       GT:PL:DP        1/1:171,93,0:31 1/1:152,70,0:28 1/1:172,96,0:32 1/1:165,108,0:36        1/1:130,48,0:16 1/1:157,88,0:35 1/1:151,81,0:27 1/1:137,84,0:28 1/1:144,69,0:23 1/1:151,90,0:30 1/1:138,60,0:20 1/1:135,54,0:18 1/1:155,96,0:37 1/1:125,66,0:22 1/1:155,99,0:33 1/1:166,72,0:24 1/1:129,36,0:12 1/1:150,38,0:26 ./.:117,21,0:7  1/1:172,123,0:41        1/1:131,76,0:30 0/0:0,255,255:247       0/0:0,255,255:245       0/0:0,255,249:245       1/1:209,160,0:53        0/0:0,255,255:231       0/0:0,255,255:233       ./.:0,3,35:1    1/1:184,53,0:22 1/1:123,31,0:14 0/0:0,255,251:229
   1/1:154,82,0:45 1/1:174,63,0:21 0/0:0,255,255:240       0/0:0,255,135:234
LOC_Os01g02250_chr1_688792..693013_UTR-0        1532    .       C       T       187     .       .       GT:PL:DP        0/0:0,105,157:35        0/0:0,93,172:31 0/0:0,114,202:39        0/0:0,129,205:43        0/0:0,66,171:22 0/0:0,132,206:44        0/0:0,105,153:35        0/0:0,117,183:39        0/0:0,78,140:26 0/0:0,117,183:39        0/0:0,70,162:28 0/0:0,63,123:21 0/0:0,126,182:42        0/0:0,69,138:23 0/0:0,120,193:40        0/0:0,81,195:27 0/0:0,42,154:14 0/0:0,87,193:29 0/0:0,30,123:10 0/0:0,148,179:50
    0/0:0,102,165:34        0/0:0,255,255:245       0/0:0,255,255:248       0/0:0,168,212:244       0/0:0,160,209:53        0/0:0,255,255:230       0/0:0,255,255:241
   ./.:0,3,31:1    0/0:0,63,200:21 0/0:0,69,147:23 0/1:49,0,184:228        0/0:0,141,213:47        0/0:0,60,175:20 0/1:194,0,255:240       0/0:0,255,110:236
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by agathejouet0
0
gravatar for Pierre Lindenbaum
2.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Using vcffilterjdk http://lindenb.github.io/jvarkit/VcfFilterJdk.html and the following script, assuming the genotypes have a DP attribute.

VariantContextBuilder vcb=new VariantContextBuilder(variant);
vcb.rmAttribute("AC");
vcb.rmAttribute("AF");
final double avgdp = variant.getGenotypes().stream().filter(G->G.hasDP()).mapToInt(G->G.getDP()).average().getAsDouble();
vcb.genotypes(variant.getGenotypes().
    stream().
    map(G->{
        if(!G.isCalled()) return G;
        if(!G.hasDP()) return G;
        if((double)G.getDP()> avgdp) return G;
        final List<Allele> aL=new ArrayList<>();
        while(aL.size()<G.getPloidy()) aL.add(variant.getReference());
        return new GenotypeBuilder(G).alleles(aL).make();
        }).
    collect(Collectors.toList())
    );
return vcb.make();

example:

curl "ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/variation/main/vcf/all/ag1000g.phase2.ar1.2L.vcf.gz" | gunzip -c |\
java -jar  dist/vcffilterjdk.jar -f script.js
ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Pierre Lindenbaum131k

For those who want to try the example, please note that example gz (gzipped) file is 142GB.

ADD REPLYlink written 2.9 years ago by cpad011214k

Can I have a bit more information on how your script works Pierre, please? From what I can guess, it changes alternate alleles to reference alleles when read depth at the SNP is larger than double the average read depth across samples. Is that right?

However what I would need is to filter according to the average read depth within each sample.

Thanks!

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by agathejouet0

I understand the problem this way: DP in INFO field is average read depth for that SNP (feature). DP in format field is unique to sample. If sample DP (DP in format field) is more than the average DP (in info field) for that SNP, then ref allele should replace alternate allele. Is my understanding correct?

ADD REPLYlink written 2.9 years ago by cpad011214k

First sentence is correct. However, I would like to compare the sample DP (in format field) to the average read depth of the sample overall. What would be good would be to have one file with the average read depth for all samples, e.g.:

sample_ID     avg_read_depth
sample 1          100
sample 2          200
sample 3          150

And then, for each variant of sample 1, ref allele should replace alternate allele if sample 1 DP (in field format) is bigger than, say 1.5100 = 150. For sample 2, > 1.5200 = 300...

Is this clearer?

Sorry for the misunderstanding, trying my best to explain.

ADD REPLYlink written 2.9 years ago by agathejouet0

Thanks for the explanation.

Since headers were missing from examples provided above, I worked with example vcf provided here: (https://github.com/vcflib/vcflib/blob/master/samples/scaffold612.vcf). This sample vcf has 47 samples and note that column 1 values (samples) can be cleaned up further But your sample names may vary.

$ bcftools query  -H -f'[%DP\t]\n' sample.vcf  | sed 's/\./0/g' | datamash -H  mean 1-47 | datamash transpose | datamash --full round 2 | cut -f1,3

First few lines (with two columns- first column= sample, second column=mean):

mean(#[1]I205_FC80E1NABXX_L6_PIGtkyRDIDIAAPEI-7:DP) 12
mean([2]I205_FC80E1NABXX_L7_PIGtkyRDJDIAAPEI-11:DP) 9
mean([3]I232_FCB05KDABXX_L4_COLcalRANDIAAPE:DP) 13
mean([4]I232_FCB05KDABXX_L5_COLcalRAMDIAAPE:DP) 13
mean([5]I246_FCB05BAABXX_L5_COLcalRAGDIAAPE:DP) 14
mean([6]I246_FCB05BAABXX_L7_COLcalRASDIAAPE:DP) 13
mean([7]I248_FCB068VABXX_L2_COLcalRBCDIAAPE:DP) 6
mean([8]I248_FCB068VABXX_L3_COLcalRAGDIAAPE:DP) 9
mean([9]I248_FCB068VABXX_L4_COLcalRAHDIAAPE:DP) 12
mean([10]I248_FCB068VABXX_L5_COLcalRAIDIAAPE:DP)    10

samething: a little bit slow with csvkit:

 $  bcftools query  -H -f'[%DP\t]\n' sample.vcf  |awk -v OFS="\t" '{sub(/ /,//);gsub(/\./,0);{$1=$1}}1' | csvstat --mean
  1. #1[1]I205_FC80E1NABXX_L6_PIGtkyRDIDIAAPEI-7:DP: 12.216
  2. [2]I205_FC80E1NABXX_L7_PIGtkyRDJDIAAPEI-11:DP: 9.278
  3. [3]I232_FCB05KDABXX_L4_COLcalRANDIAAPE:DP: 12.879
  4. [4]I232_FCB05KDABXX_L5_COLcalRAMDIAAPE:DP: 12.602
  5. [5]I246_FCB05BAABXX_L5_COLcalRAGDIAAPE:DP: 13.691
  6. [6]I246_FCB05BAABXX_L7_COLcalRASDIAAPE:DP: 12.614
  7. [7]I248_FCB068VABXX_L2_COLcalRBCDIAAPE:DP: 6.06
  8. [8]I248_FCB068VABXX_L3_COLcalRAGDIAAPE:DP: 9.479
  9. [9]I248_FCB068VABXX_L4_COLcalRAHDIAAPE:DP: 11.524
 10. [10]I248_FCB068VABXX_L5_COLcalRAIDIAAPE:DP: 9.69
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by cpad011214k

that is what the script is doing here:

inal double avgdp = variant.getGenotypes().stream().filter(G->G.hasDP()).mapToInt(G->G.getDP()).average().getAsDouble();

which can be translated as:

for the current variant, loop over the genotypes having a "DP" field and get the average value of DP.

ADD REPLYlink written 2.9 years ago by Pierre Lindenbaum131k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1549 users visited in the last hour