1, filter for your rs ID of interest (must be set in ID field)

Question

Extracting genotypes and filtering from a VCF file containing multiple exomes of families

0

Entering edit mode

5.9 years ago

gradstudentNew ▴ 50

Hi all, I'm currently trying to extract genotypes from a VCF file but I'm not really sure how to approach this. This is how the data line looks like:

12      2224727 .       C       T       6204.77 PASS    0.0010  rs377534958     CACNA1C GT:AD:DP:GQ:PGT:PID:PL:SAC      0/0:5,0:5:12:.:.:0,12,180 0/1:7,3:10:86:.:.:86,0,2
38:0,7,0,3      0/0:44,0:44:99:.:.:0,108,1620   0/0:35,0:35:99:.:.:0,99,1422    0/0:33,0:33:99:.:.:0,99,1262

I'm trying to filter based off GQ and then extract the genotypes of all the individuals with RSID (i.e. rs377534958 0/0 0/1 0/0 0/2) information. Does anyone have any suggestions on how to do this? I've never worked with VCF files containing multiple individuals so this is all very new to me. Thank you so much :)

DNA-seq genotype filter vcf exome • 7.2k views

ADD COMMENT • link updated 5.9 years ago by Pierre Lindenbaum 161k • written 5.9 years ago by gradstudentNew ▴ 50

0

Entering edit mode

Thank you so much for the help! I'm trying to more so get my file to be like the following:

FamilyID IndividualID rs1 rs2 rs3 rs4 rs5 ... rsN fam1 1 0/0 0/1 0/1 0/0 0/0 fam1 2 0/2 0/1 0/1 0/1 0/0

if that makes sense - kind of like your first example but a bit more parallel. The thing is, I have to filter on the GQ before hand to prevent any false positives but since the vcf file contains ~200 individuals and I'd want multiple SNPs, that's what has been really confusing for me. I inherited the data from the previous student and I'm not used to this kind of formatting.

ADD REPLY • link 5.9 years ago by gradstudentNew ▴ 50

0

Entering edit mode

Is Family ID even encoded in your VCF? What exactly are you aiming to do?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

score 1 · Answer 1 · 2018-05-28

Yes, working with multi-sample VCFs can be difficult because most filtering methods only deal with single-sample VCFs. Also, the VCF format is not overly amenable to storing multiple samples. For example, the QUAL field can only hold a single value yet has to represent all of the samples. According to one of my mentors, this is why people then go off and develop custom solutions, as I repeatedly do with VCF data.

I believe what you want may be possible with a program called SnpEff, but here's my own solution:

1, filter for your rs ID of interest (must be set in ID field)

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf 

1   762592  rs71507462  C   G   3684.68 VQSRTrancheSNP99.00to99.90;LowQual;SNP_SPECIFIC_FILTERS DB;Dels=0;FS=0;HaplotypeScore=0;MQ=41.99;MQ0=1;NEGATIVE_TRAIN_SITE;QD=27.98;VQSLOD=-3.612;culprit=MQ;set=FilteredInAll;BaseQRankSum=-0.72;MQRankSum=-0.72;ReadPosRankSum=-0.72;InbreedingCoeff=-0.1805;DP=2013;AF=0.81;MLEAC=34;MLEAF=0.81;AN=594;AC=514    GT:AD:DP:GQ:PL  ./.:.:.:.:. 1/1:1,5:6:15:196,15,0   1/1:0,15:15:30:350,30,0 0/1:4,2:6:47:47,0,124   1/1:0,5:5:12:145,12,0   1/1:0,5:5:12:132,12,0   1/1:0,5:5:12:143,12,0   1/1:0,6:6:15:179,15,0   1/1:0,6:6:12:166,12,0   1/1:0,8:8:18:213,18,0   1/1:0,8:8:15:183,15,0   1/1:4,14:18:27:335,27,0 0/1:3,7:10:79:163,0,79  ./.:.:.:.:. 1/1:0,8:8:15:168,15,0   1/1:1,14:15:33:389,33,0 0/1:6,3:9:54:54,0,149   1/1:3,5:8:12:150,12,0   1/1:0,6:6:15:195,15,0   1/1:1,4:5:6:72,6,0  1/1:0,3:3:6:52,6,0  1/1:0,7:7:18:207,18,0   ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 1/1:1,4:5:6:63,6,0  1/1:1,3:4:6:80,6,0

2, 'melt' the the data into long format and ensure that we output the genotype (GT) and the genotype quality (GQ)

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]'

1:762592:C:G:16422:./.:.
1:762592:C:G:16427:1/1:15
1:762592:C:G:16451:1/1:30
1:762592:C:G:16456:0/1:47
1:762592:C:G:17412:1/1:12
1:762592:C:G:17413:1/1:12
1:762592:C:G:17417:1/1:12
1:762592:C:G:17418:1/1:15
1:762592:C:G:17422:1/1:12
1:762592:C:G:17423:1/1:18
1:762592:C:G:17427:1/1:15
1:762592:C:G:17428:1/1:27
1:762592:C:G:17432:0/1:79
1:762592:C:G:17433:./.:.
1:762592:C:G:17437:1/1:15
1:762592:C:G:17438:1/1:33

3, use a simple `awk` filter on the 7th column (GQ value) to select those GQs > 10, then tabulate the genotypes

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]' | awk -F ":" '$7>10' | cut -f6 -d":" | sort | uniq -c

      4 0/0
     51 0/1
    135 1/1

4, same as #3 but filter for GQ > 30

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]' | awk -F ":" '$7>30' | cut -f6 -d":" | sort | uniq -c

     33 0/1
      5 1/1

-----------------------------------------------

--------------------------------------------

This script (above) is only valid for a single variant lookup. If your output data for more than one rs ID at the same time, you'll have to modify it. For example, you can 'tally up' (count up) the genotypes of each variant with GQ > 30 across all individuals as follows:

bcftools view ProjectMerge.rsID.bcf | head -100 | bcftools query -f'[%ID:%GT:%GQ\n]' | awk -F ":" '$3>30' | cut -f1-2 -d":" | sort | uniq -c

     31 rs140739101:0/0
      1 rs140739101:0/1
     16 rs142004627:0/0
      1 rs142004627:0/1
      1 rs150690004:1/1
     23 rs190717287:0/0
      1 rs190717287:0/1
      1 rs200505207:0/1
      1 rs62639105:0/0
     21 rs62639105:0/1
      4 rs75062661:0/0
      4 rs75062661:0/1
    214 rs75062661:1/1

.

bcftools view ProjectMerge.rsID.bcf | head -100 | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%ID:%GT:%GQ\n]' | awk -F ":" '$7>30' | cut -f1-6 -d":" | sort | uniq -c

      1 1:66162:A:T:rs62639105:0/0
     21 1:66162:A:T:rs62639105:0/1
     31 1:69428:T:G:rs140739101:0/0
      1 1:69428:T:G:rs140739101:0/1
     16 1:69453:G:A:rs142004627:0/0
      1 1:69453:G:A:rs142004627:0/1
      1 1:69496:G:A:rs150690004:1/1
      4 1:69511:A:G:rs75062661:0/0
      4 1:69511:A:G:rs75062661:0/1
    214 1:69511:A:G:rs75062661:1/1
     23 1:69534:T:C:rs190717287:0/0
      1 1:69534:T:C:rs190717287:0/1
      1 1:69761:A:T:rs200505207:0/1

----------------------------

Other scripts of mine that you may find of use:

Kevin

score 0 · Answer 2 · 2018-05-29

0

Entering edit mode

5.9 years ago

Pierre Lindenbaum 161k

using vcffilterjdk to select the called variant (not './.') having a GQ greater than 20. http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().allMatch(G->!G.isCalled() || G.getGQ()>20);' input.vcf.gz

ADD COMMENT • link 5.9 years ago by Pierre Lindenbaum 161k

1, filter for your rs ID of interest (must be set in ID field)

2, 'melt' the the data into long format and ensure that we output the genotype (GT) and the genotype quality (GQ)

3, use a simple awk filter on the 7th column (GQ value) to select those GQs > 10, then tabulate the genotypes

4, same as #3 but filter for GQ > 30

-----------------------------------------------

--------------------------------------------

----------------------------

3, use a simple `awk` filter on the 7th column (GQ value) to select those GQs > 10, then tabulate the genotypes