Question: Extracting allele, Genotype from VCF file
0
17 months ago, sukhjindermultani85 (0) wrote:

How can I extract alleles and genotypes from a VCF file using Python or another language when the files are 23 GB? I was able to write a script that gets the alleles, but it struggles with large VCF files. What other ways are there to get allele and genotype information?

tool genotype allele vcf • 2.8k views
modified 16 months ago by finswimmer (12k) • written 17 months ago by sukhjindermultani85 (0)
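For readers who want to stay in pure Python, here is a minimal sketch of a line-by-line reader, so the 23 GB file is never held in memory. The function names (`open_vcf`, `iter_genotypes`) and the demo records are made up for illustration. It assumes a standard tab-delimited VCF with sample columns starting at column 10, and reads `.gz` files through the standard `gzip` module.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int, str, List[str], List[str]]


def open_vcf(path: str):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def iter_genotypes(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (chrom, pos, ref, alts, genotypes) for each data line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.rstrip("\r\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alts = fields[4].split(",")
        genotypes: List[str] = []
        if len(fields) > 9:
            # FORMAT (column 9) names the per-sample sub-fields, e.g. GT:DP
            keys = fields[8].split(":")
            gt_idx = keys.index("GT") if "GT" in keys else None
            for sample in fields[9:]:
                values = sample.split(":")
                if gt_idx is not None and gt_idx < len(values):
                    genotypes.append(values[gt_idx])
                else:
                    genotypes.append(".")  # no GT available for this sample
        yield chrom, pos, ref, alts, genotypes


if __name__ == "__main__":
    # Hypothetical two-sample record, used only to show the call pattern.
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
        "chr1\t10177\t.\tACC\tACCC\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:9\n",
    ]
    for record in iter_genotypes(demo):
        print(record)
```

On a real file the call would look like `with open_vcf("input.vcf.gz") as handle: for rec in iter_genotypes(handle): ...`. Because the generator yields one record at a time, memory use stays flat regardless of file size, though a compiled tool such as bcftools will likely be faster.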
1

Try bcftools query.

modified 16 months ago • written 16 months ago by cpad0112 (12k)

How about VCFtools?

modified 17 months ago • written 17 months ago by FatihSarigol (130)

Why is this a tool post? A question about tools should be a question-type post, not a tool-type post.

written 17 months ago by RamRS (24k)

What have you tried?

written 17 months ago by RamRS (24k)

May help the user (AWK ideas):

Actually, I have a Python script that can parse a VCF, in fact: Filtering VCF with python

modified 17 months ago • written 17 months ago by Kevin Blighe (49k)

Why have you replied to my comment, Kevin?

written 17 months ago by RamRS (24k)

I did not want to create yet another (fourth) independent comment.

modified 17 months ago • written 17 months ago by Kevin Blighe (49k)

You can take a look at these two scripts written in Python to split a VCF and select what you want: A: VCF file help and C: parsing vcf file

modified 17 months ago • written 17 months ago by Bastien Hervé (4.4k)
2
16 months ago, arup (1.8k, India) wrote:

Extracting genotype information using R.

library(vcfR)
# vcf_file holds the path to the input VCF
vcf <- read.vcfR(vcf_file, verbose = FALSE)
# GT values such as "0/1" are strings, so keep the default as.numeric = FALSE
gt <- extract.gt(vcf, element = "GT")

For Python, take a look at the following article.

http://alimanfoo.github.io/2017/06/14/read-vcf.html

Genotypes can also be extracted using SnpSift.jar in snpEff using the following command.

java -jar ../snpEff/SnpSift.jar extractFields annotated.vcf   CHROM POS REF ALT  "GEN[*].GT" > output.tsv
modified 11 months ago • written 16 months ago by arup (1.8k)
1

Doesn't look like vcfR does streaming read, so I would not recommend it as it's not a great idea to build an in-memory object of an entire VCF file. A better strategy would be to use closer-to-bare-metal tools such as bcftools to extract information, then use R or Python to compute on extracted information.

written 16 months ago by RamRS (24k)
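The "extract with bcftools, compute in Python" strategy suggested above could be sketched roughly as follows. This assumes `bcftools` is on the PATH and uses a tab-delimited variant of the query format so the output is easy to split. The helper names (`parse_query_lines`, `stream_bcftools`, `count_genotypes`) are hypothetical, and the demo rows are typed in by hand rather than read from a real file.

```python
import subprocess
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Raw string: bcftools itself expands the \t and \n escapes.
QUERY_FORMAT = r"%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n"

Row = Tuple[str, int, str, str, List[str]]


def parse_query_lines(lines: Iterable[str]) -> Iterator[Row]:
    """Turn tab-delimited `bcftools query` output into tuples."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        chrom, pos, ref, alt, *gts = line.split("\t")
        yield chrom, int(pos), ref, alt, gts


def stream_bcftools(vcf_path: str) -> Iterator[Row]:
    """Run bcftools query and yield rows as they arrive on stdout."""
    cmd = ["bcftools", "query", "-f", QUERY_FORMAT, vcf_path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from parse_query_lines(proc.stdout)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


def count_genotypes(rows: Iterable[Row]) -> Counter:
    """Example computation: tally how often each genotype string occurs."""
    counts: Counter = Counter()
    for _chrom, _pos, _ref, _alt, gts in rows:
        counts.update(gts)
    return counts


if __name__ == "__main__":
    # Hand-written stand-in for bcftools output, so the demo runs anywhere.
    sample_output = [
        "chr1\t10177\tACC\tACCC\t0/1\n",
        "chr1\t10327\tT\tC\t0/0\n",
        "chr1\t10352\tTAC\tTAAC\t1/1\n",
        "chr1\t12783\tG\tA\t1/1\n",
    ]
    print(count_genotypes(parse_query_lines(sample_output)))
```

With a real file, `count_genotypes(stream_bcftools("input.vcf.gz"))` would do the same tally while bcftools handles decompression and parsing, and Python only ever sees one line at a time.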
1
16 months ago, finswimmer (12k, Germany) wrote:

See bcftools query.


EDIT: With bcftools query you can print any information you like. In your case, for example:

$ bcftools query -f '%CHROM %POS  %REF  %ALT [ %GT]\n' input.vcf

The output now looks like this:

chr1 10177  ACC  ACCC  0/1
chr1 10327  T  C  0/0
chr1 10352  TAC  TAAC  1/1
chr1 12783  G  A  1/1

fin swimmer

modified 16 months ago • written 16 months ago by finswimmer (12k)

I think this should be a comment, as it's more of a suggestion than a solution. See, for example, cpad's comment pointing to the same resource.

written 16 months ago by RamRS (24k)

Hello Ram,

if an "answer" is only meant to be a full copy-and-paste solution, then my post is indeed more of a comment. But I thought that naming the tool with its subcommand and linking to the good manual was answer enough.

I have now extended my post to a full answer :)

cpad was faster than me, right. I didn't see his answer because I hadn't reloaded the page.

fin swimmer

modified 16 months ago • written 16 months ago by finswimmer (12k)

Hi finswimmer! Can I ask for the trick to convert the output to symbolic genotypes? For your example:

chr1 10177  ACC  ACCC  ACC/ACCC
chr1 10327  T  C  T/T
chr1 10352  TAC  TAAC  TAAC/TAAC
chr1 12783  G  A  A/A

I searched for a while, but had no luck.

modified 5 months ago • written 10 months ago by yifangt86 (10)
2

Hello yifangt86 ,

that's also described in the manual I've linked to:

 $ bcftools query -f '%CHROM %POS  %REF  %ALT [ %TGT]\n' input.vcf

fin swimmer

written 10 months ago by finswimmer (12k)
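For anyone who already has the numeric GT strings and wants to do the translation in Python, the idea behind the symbolic genotypes asked for above can be sketched like this: index 0 is REF, and indices 1, 2, ... are the comma-separated ALT alleles. `translate_gt` is a hypothetical helper, not part of bcftools, and its handling of missing calls is an assumption that may differ from what `%TGT` prints.

```python
import re


def translate_gt(gt: str, ref: str, alt: str) -> str:
    """Replace allele indices in a GT string with the allele sequences.

    The phasing separator ("/" or "|") is preserved; missing calls (".")
    are passed through unchanged.
    """
    alleles = [ref] + alt.split(",")
    out = []
    # The capturing group keeps the separators in the split result.
    for part in re.split(r"([/|])", gt):
        if part in ("/", "|"):
            out.append(part)
        elif part in (".", ""):
            out.append(".")
        else:
            out.append(alleles[int(part)])
    return "".join(out)


if __name__ == "__main__":
    # Rows taken from the example output in this thread.
    rows = [
        ("chr1", 10177, "ACC", "ACCC", "0/1"),
        ("chr1", 10327, "T", "C", "0/0"),
        ("chr1", 10352, "TAC", "TAAC", "1/1"),
        ("chr1", 12783, "G", "A", "1/1"),
    ]
    for chrom, pos, ref, alt, gt in rows:
        print(chrom, pos, ref, alt, translate_gt(gt, ref, alt))
```

For the 23 GB file in the question, letting bcftools emit `%TGT` directly is probably simpler; the function is mainly useful when the genotypes have already been extracted in numeric form.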