Question: Extracting genotypes and filtering from a VCF file containing multiple exomes of families
0
gravatar for gradstudentNew
2.8 years ago by
gradstudentNew30 wrote:

Hi all, I'm currently trying to extract genotypes from a VCF file but I'm not really sure how to approach this. This is how the data line looks like:

12      2224727 .       C       T       6204.77 PASS    0.0010  rs377534958     CACNA1C GT:AD:DP:GQ:PGT:PID:PL:SAC      0/0:5,0:5:12:.:.:0,12,180 0/1:7,3:10:86:.:.:86,0,2
38:0,7,0,3      0/0:44,0:44:99:.:.:0,108,1620   0/0:35,0:35:99:.:.:0,99,1422    0/0:33,0:33:99:.:.:0,99,1262

I'm trying to filter based off GQ and then extract the genotypes of all the individuals with RSID (i.e. rs377534958 0/0 0/1 0/0 0/2) information. Does anyone have any suggestions on how to do this? I've never worked with VCF files containing multiple individuals so this is all very new to me. Thank you so much :)

genotype dna-seq exome filter vcf • 2.7k views
ADD COMMENTlink modified 2.8 years ago by Pierre Lindenbaum134k • written 2.8 years ago by gradstudentNew30

Thank you so much for the help! I'm trying to more so get my file to be like the following:

FamilyID IndividualID rs1 rs2 rs3 rs4 rs5 ... rsN fam1 1 0/0 0/1 0/1 0/0 0/0 fam1 2 0/2 0/1 0/1 0/1 0/0

if that makes sense - kind of like your first example but a bit more parallel. The thing is, I have to filter on the GQ before hand to prevent any false positives but since the vcf file contains ~200 individuals and I'd want multiple SNPs, that's what has been really confusing for me. I inherited the data from the previous student and I'm not used to this kind of formatting.

ADD REPLYlink written 2.8 years ago by gradstudentNew30

Is Family ID even encoded in your VCF? What exactly are you aiming to do?

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Kevin Blighe71k
1
gravatar for Kevin Blighe
2.8 years ago by
Kevin Blighe71k
Republic of Ireland
Kevin Blighe71k wrote:

Yes, working with multi-sample VCFs can be difficult because most filtering methods only deal with single-sample VCFs. Also, the VCF format is not overly amenable to storing multiple samples. For example, the QUAL field can only hold a single value yet has to represent all of the samples. According to one of my mentors, this is why people then go off and develop custom solutions, as I repeatedly do with VCF data.

I believe what you want may be possible with a program called SnpEff, but here's my own solution:

1, filter for your rs ID of interest (must be set in ID field)

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf 

1   762592  rs71507462  C   G   3684.68 VQSRTrancheSNP99.00to99.90;LowQual;SNP_SPECIFIC_FILTERS DB;Dels=0;FS=0;HaplotypeScore=0;MQ=41.99;MQ0=1;NEGATIVE_TRAIN_SITE;QD=27.98;VQSLOD=-3.612;culprit=MQ;set=FilteredInAll;BaseQRankSum=-0.72;MQRankSum=-0.72;ReadPosRankSum=-0.72;InbreedingCoeff=-0.1805;DP=2013;AF=0.81;MLEAC=34;MLEAF=0.81;AN=594;AC=514    GT:AD:DP:GQ:PL  ./.:.:.:.:. 1/1:1,5:6:15:196,15,0   1/1:0,15:15:30:350,30,0 0/1:4,2:6:47:47,0,124   1/1:0,5:5:12:145,12,0   1/1:0,5:5:12:132,12,0   1/1:0,5:5:12:143,12,0   1/1:0,6:6:15:179,15,0   1/1:0,6:6:12:166,12,0   1/1:0,8:8:18:213,18,0   1/1:0,8:8:15:183,15,0   1/1:4,14:18:27:335,27,0 0/1:3,7:10:79:163,0,79  ./.:.:.:.:. 1/1:0,8:8:15:168,15,0   1/1:1,14:15:33:389,33,0 0/1:6,3:9:54:54,0,149   1/1:3,5:8:12:150,12,0   1/1:0,6:6:15:195,15,0   1/1:1,4:5:6:72,6,0  1/1:0,3:3:6:52,6,0  1/1:0,7:7:18:207,18,0   ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 1/1:1,4:5:6:63,6,0  1/1:1,3:4:6:80,6,0

2, 'melt' the the data into long format and ensure that we output the genotype (GT) and the genotype quality (GQ)

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]'

1:762592:C:G:16422:./.:.
1:762592:C:G:16427:1/1:15
1:762592:C:G:16451:1/1:30
1:762592:C:G:16456:0/1:47
1:762592:C:G:17412:1/1:12
1:762592:C:G:17413:1/1:12
1:762592:C:G:17417:1/1:12
1:762592:C:G:17418:1/1:15
1:762592:C:G:17422:1/1:12
1:762592:C:G:17423:1/1:18
1:762592:C:G:17427:1/1:15
1:762592:C:G:17428:1/1:27
1:762592:C:G:17432:0/1:79
1:762592:C:G:17433:./.:.
1:762592:C:G:17437:1/1:15
1:762592:C:G:17438:1/1:33

3, use a simple awk filter on the 7th column (GQ value) to select those GQs > 10, then tabulate the genotypes

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]' | awk -F ":" '$7>10' | cut -f6 -d":" | sort | uniq -c

      4 0/0
     51 0/1
    135 1/1

4, same as #3 but filter for GQ > 30

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]' | awk -F ":" '$7>30' | cut -f6 -d":" | sort | uniq -c

     33 0/1
      5 1/1

-----------------------------------------------

--------------------------------------------

This script (above) is only valid for a single variant lookup. If your output data for more than one rs ID at the same time, you'll have to modify it. For example, you can 'tally up' (count up) the genotypes of each variant with GQ > 30 across all individuals as follows:

bcftools view ProjectMerge.rsID.bcf | head -100 | bcftools query -f'[%ID:%GT:%GQ\n]' | awk -F ":" '$3>30' | cut -f1-2 -d":" | sort | uniq -c

     31 rs140739101:0/0
      1 rs140739101:0/1
     16 rs142004627:0/0
      1 rs142004627:0/1
      1 rs150690004:1/1
     23 rs190717287:0/0
      1 rs190717287:0/1
      1 rs200505207:0/1
      1 rs62639105:0/0
     21 rs62639105:0/1
      4 rs75062661:0/0
      4 rs75062661:0/1
    214 rs75062661:1/1

.

bcftools view ProjectMerge.rsID.bcf | head -100 | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%ID:%GT:%GQ\n]' | awk -F ":" '$7>30' | cut -f1-6 -d":" | sort | uniq -c

      1 1:66162:A:T:rs62639105:0/0
     21 1:66162:A:T:rs62639105:0/1
     31 1:69428:T:G:rs140739101:0/0
      1 1:69428:T:G:rs140739101:0/1
     16 1:69453:G:A:rs142004627:0/0
      1 1:69453:G:A:rs142004627:0/1
      1 1:69496:G:A:rs150690004:1/1
      4 1:69511:A:G:rs75062661:0/0
      4 1:69511:A:G:rs75062661:0/1
    214 1:69511:A:G:rs75062661:1/1
     23 1:69534:T:C:rs190717287:0/0
      1 1:69534:T:C:rs190717287:0/1
      1 1:69761:A:T:rs200505207:0/1

----------------------------

Other scripts of mine that you may find of use:

Kevin

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Kevin Blighe71k
0
gravatar for Pierre Lindenbaum
2.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

using vcffilterjdk to select the called variant (not './.') having a GQ greater than 20. http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().allMatch(G->!G.isCalled() || G.getGQ()>20);' input.vcf.gz
ADD COMMENTlink written 2.8 years ago by Pierre Lindenbaum134k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2671 users visited in the last hour
_