Question: Extracting allele, Genotype from VCF file
10 months ago by
sukhjindermultani850 wrote:

How can I extract alleles and genotypes from a VCF file using Python or another language, for files of around 23 GB? I was able to write a script that gets the alleles, but it struggles with large VCF files. What other ways are there to get allele and genotype information?

tool genotype allele vcf • 1.6k views
modified 9 months ago by finswimmer11k • written 10 months ago by sukhjindermultani850

Try bcftools query.

modified 9 months ago • written 9 months ago by cpad011211k

How about VCFtools?

modified 10 months ago • written 10 months ago by FatihSarigol120

Why is this a tool post? A question about tools should be a question-type post, not a tool-type post.

written 10 months ago by RamRS20k

What have you tried?

written 10 months ago by RamRS20k

May help the user (AWK ideas):

Actually, I have a Python script that can parse a VCF, in fact: Filtering VCF with python

modified 10 months ago • written 10 months ago by Kevin Blighe39k

Why have you replied to my comment, Kevin?

written 10 months ago by RamRS20k

I did not want to create yet another (fourth) independent comment.

modified 10 months ago • written 10 months ago by Kevin Blighe39k

You can take a look at these two scripts, written in Python, which split a VCF and select what you want: A: VCF file help and C: parsing vcf file
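
For readers who want a starting point, here is a minimal sketch of a line-by-line VCF reader in plain Python. It is not the code from the linked posts; the function names `open_vcf` and `iter_genotypes` are hypothetical, and it assumes a standard tab-separated VCF whose records carry a FORMAT column containing GT. Because it yields one record at a time, memory use should stay roughly constant regardless of file size.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_genotypes(path):
    """Yield (chrom, pos, ref, alts, {sample: GT}) one record at a time."""
    samples = []
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the FORMAT column
                continue
            if len(fields) < 10:
                continue  # sites-only record: no genotype columns
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            format_keys = fields[8].split(":")
            if "GT" not in format_keys:
                continue  # record has no genotype field
            gt_index = format_keys.index("GT")
            genotypes = {
                name: column.split(":")[gt_index]
                for name, column in zip(samples, fields[9:])
            }
            yield chrom, pos, ref, alt.split(","), genotypes
```

A loop such as `for chrom, pos, ref, alts, gts in iter_genotypes("input.vcf.gz"):` (the file name is a placeholder) can then write out or tally whatever is needed without holding the whole file in memory.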

modified 10 months ago • written 10 months ago by Bastien Hervé3.7k
9 months ago by
arup870
India
arup870 wrote:

Extracting genotype information using R.

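library(vcfR)
vcf_file <- "input.vcf"  # placeholder path; point this at your own VCF
vcf <- read.vcfR(vcf_file, verbose = FALSE)
# Matrix of GT values: one row per variant, one column per sample
gt <- extract.gt(vcf, element = c('GT'), as.numeric = TRUE)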

For Python, take a look at the following article:

http://alimanfoo.github.io/2017/06/14/read-vcf.html

Genotypes can also be extracted with SnpSift.jar from snpEff, using the following command:

java -jar ../snpEff/SnpSift.jar extractFields annotated.vcf   CHROM POS REF ALT  "GEN[*].GT" > output.tsv
modified 4 months ago • written 9 months ago by arup870

It doesn't look like vcfR does streaming reads, so I would not recommend it: building an in-memory object of an entire VCF file is not a great idea. A better strategy would be to use closer-to-bare-metal tools such as bcftools to extract the information, then use R or Python to compute on the extracted information.
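
One way to combine the two steps is to let Python read the output of bcftools query as a stream. The sketch below assumes `bcftools` is on the `PATH`; the names `QUERY_FORMAT`, `parse_query_line` and `stream_genotypes` are hypothetical, and the format string uses tabs instead of spaces as separators purely to make splitting unambiguous.

```python
import subprocess

# Raw string: bcftools itself interprets the \t and \n escapes.
QUERY_FORMAT = r"%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n"


def parse_query_line(line):
    """Split one tab-separated output line into typed fields."""
    chrom, pos, ref, alt, *genotypes = line.rstrip("\n").split("\t")
    return chrom, int(pos), ref, alt.split(","), genotypes


def stream_genotypes(vcf_path):
    """Run bcftools query and yield parsed records as they arrive."""
    cmd = ["bcftools", "query", "-f", QUERY_FORMAT, vcf_path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield parse_query_line(line)
    # The with-block has waited for the process, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"bcftools exited with status {proc.returncode}")
```

Only one line is held in memory at a time, so the heavy lifting stays in bcftools while Python does the per-record computation.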

written 9 months ago by RamRS20k
9 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

See bcftools query.


EDIT: With bcftools query you can print any information you like. In your case, for example:

$ bcftools query -f '%CHROM %POS  %REF  %ALT [ %GT]\n' input.vcf

The output now looks like this:

chr1 10177  ACC  ACCC  0/1
chr1 10327  T  C  0/0
chr1 10352  TAC  TAAC  1/1
chr1 12783  G  A  1/1

fin swimmer

modified 9 months ago • written 9 months ago by finswimmer11k

I think this should be a comment, as it's more of a suggestion than a solution. See, for example, cpad's comment pointing to the same resource.

written 9 months ago by RamRS20k

Hello Ram,

if an "answer" is only meant to be a full copy-and-paste solution, then my post is indeed more of a comment. But I thought that naming the tool and its subcommand, and linking to the good manual, was answer enough.

I have now extended my post into a full answer :)

cpad was faster than me, right. I didn't see his answer because I hadn't reloaded the page.

fin swimmer

modified 9 months ago • written 9 months ago by finswimmer11k

Hi finswimmer! Can I ask what the trick is to convert the output to symbolic genotypes, so that your example looks like this:

chr1 10177  ACC  ACCC  ACC/ACCC
chr1 10327  T  C  T/T
chr1 10352  TAC  TAAC  TAAC/TAAC
chr1 12783  G  A  A/A

I searched for a while, but had no luck.

modified 3 months ago • written 3 months ago by yifangt8610

Hello yifangt86 ,

that's also described in the manual I've linked to:

 $ bcftools query -f '%CHROM %POS  %REF  %ALT [ %TGT]\n' input.vcf
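
To make clear what the translation involves, here is a small Python sketch of the index-to-allele lookup: index 0 stands for REF and indices 1, 2, ... stand for the comma-separated ALT alleles. This is only an illustration with a hypothetical function name, `translate_gt`, not how bcftools implements %TGT; it assumes missing alleles are written as "." and leaves them unchanged.

```python
import re


def translate_gt(ref, alt, gt):
    """Turn an index genotype such as '0/1' into alleles such as 'ACC/ACCC'."""
    alleles = [ref] + alt.split(",")
    # Splitting with a capture group keeps the separators, so phased ('|')
    # and unphased ('/') genotypes keep their original notation.
    parts = re.split(r"([/|])", gt)
    return "".join(
        part if part in ("/", "|", ".") else alleles[int(part)]
        for part in parts
    )


print(translate_gt("ACC", "ACCC", "0/1"))   # ACC/ACCC
print(translate_gt("T", "C", "0/0"))        # T/T
print(translate_gt("TAC", "TAAC", "1/1"))   # TAAC/TAAC
```

The three calls reproduce the symbolic genotypes asked for in the example rows above.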

fin swimmer

written 3 months ago by finswimmer11k