Extracting alleles and genotypes from a VCF file
6.0 years ago

How can I extract allele and genotype information from a VCF file using Python or another language, for files of about 23 GB?

I am able to write a script to get the alleles, but it becomes difficult for large VCF files. What other ways are there to get allele and genotype information?

vcf • 14k views

Try bcftools query.


How about VCFtools?


Why is this a tool post? A question about tools should be a question-type post, not a tool-type post.


What have you tried?


Why have you replied to my comment, Kevin?


I did not want to create yet another (fourth) independent comment.


You can take a look at these two scripts, written in Python, to split a VCF and select what you want: A: VCF file help and C: parsing vcf file

5.9 years ago

See bcftools query.


EDIT: With bcftools query you can print any information you like. So in your case, e.g.:

$ bcftools query -f '%CHROM %POS  %REF  %ALT [ %GT]\n' input.vcf

The output then looks like this:

chr1 10177  ACC  ACCC  0/1
chr1 10327  T  C  0/0
chr1 10352  TAC  TAAC  1/1
chr1 12783  G  A  1/1

fin swimmer
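
Since the question also asks about Python: a minimal sketch of reading the same columns directly, one record at a time, so that a 23 GB file never has to fit in memory. This is not part of the answer above; the function names `open_vcf` and `iter_genotypes` are made up for illustration, and the sketch assumes a tab-separated VCF whose FORMAT column contains GT. It will be slower than bcftools.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_genotypes(path):
    """Yield (chrom, pos, ref, alt, [GT per sample]) for each record.

    Lines are read lazily, so memory use stays flat regardless of file size.
    """
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information lines and the column header
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            gts = []
            if len(fields) > 9:
                keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
                if "GT" in keys:
                    gt_index = keys.index("GT")
                    gts = [sample.split(":")[gt_index] for sample in fields[9:]]
            yield chrom, pos, ref, alt, gts
```

Usage would be a plain loop, for example `for chrom, pos, ref, alt, gts in iter_genotypes("input.vcf"): print(chrom, pos, ref, alt, *gts)`, where `input.vcf` is the same placeholder file name as in the command above.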


I think this should be a comment, as it's more of a suggestion than a solution. See, for example, cpad's comment pointing to the same resource.


Hello Ram,

If an "answer" is only meant for a full copy-and-paste solution, then my post is indeed more of a comment. But I thought that naming the tool with its subcommand and linking to the good manual was answer enough.

I have now extended my post to a full answer :)

cpad was faster than me, right. I didn't see his answer as I hadn't reloaded the page.

fin swimmer


Hi finswimmer! Can I ask for the trick to convert the output to symbolic genotypes, i.e. the alleles themselves instead of the indices? For your example:

chr1 10177  ACC  ACCC  ACC/ACCC
chr1 10327  T  C  T/T
chr1 10352  TAC  TAAC  TAAC/TAAC
chr1 12783  G  A  A/A

I searched for a while, but had no luck.


Hello yifangt86,

that's also described in the manual I've linked to:

$ bcftools query -f '%CHROM %POS  %REF  %ALT [ %TGT]\n' input.vcf

fin swimmer
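
For anyone who already has the index-style output and wants to do the translation themselves, here is a rough Python sketch of the same idea. It is not how bcftools implements %TGT; the function name `translate_gt` is hypothetical, and the sketch assumes ALT alleles are comma-separated and that missing alleles are written as `.`.

```python
import re


def translate_gt(ref, alt, gt):
    """Turn an index genotype such as '0/1' into alleles such as 'ACC/ACCC'."""
    # Index 0 is the REF allele; indices 1..n follow the comma-separated ALT column.
    alleles = [ref] + alt.split(",")
    translated = []
    # Split on the separators but keep them, so phasing ('|') is preserved.
    for part in re.split(r"([/|])", gt):
        if part in ("/", "|", "."):
            translated.append(part)  # separator or missing allele, copied as-is
        else:
            translated.append(alleles[int(part)])
    return "".join(translated)
```

With the rows from the example above, `translate_gt("ACC", "ACCC", "0/1")` returns `"ACC/ACCC"` and `translate_gt("TAC", "TAAC", "1/1")` returns `"TAAC/TAAC"`.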

5.9 years ago

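Extracting genotype information using R:

library(vcfR)
vcf_file <- "input.vcf"  # placeholder: path to your VCF file
vcf <- read.vcfR(vcf_file, verbose = FALSE)
# GT values such as "0/1" are not numbers, so keep them as character strings
gt <- extract.gt(vcf, element = "GT", as.numeric = FALSE)

For Python, take a look at the following article.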

http://alimanfoo.github.io/2017/06/14/read-vcf.html

Genotypes can also be extracted using SnpSift.jar in snpEff using the following command.

java -jar ../snpEff/SnpSift.jar extractFields annotated.vcf   CHROM POS REF ALT  "GEN[*].GT" > output.tsv

It doesn't look like vcfR does a streaming read, so I would not recommend it here: building an in-memory object of an entire VCF file is not a great idea when the file is this large. A better strategy would be to use closer-to-bare-metal tools such as bcftools to extract the information, then use R or Python to compute on the extracted information.
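
A minimal sketch of that "extract with bcftools, compute in Python" strategy, assuming bcftools is installed and on the PATH. The function names and the choice of counting genotype strings are only illustrative; the point is that Python reads the bcftools output as a stream rather than loading the whole VCF.

```python
import subprocess
from collections import Counter

# Raw string on purpose: bcftools itself expands the \t and \n escapes.
QUERY_FORMAT = r"%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n"


def count_genotypes(lines):
    """Count GT strings over all samples in tab-separated query output."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        counts.update(fields[4:])  # columns after ALT hold one GT per sample
    return counts


def count_genotypes_in_vcf(path):
    """Run `bcftools query` and consume its output line by line."""
    cmd = ["bcftools", "query", "-f", QUERY_FORMAT, path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        counts = count_genotypes(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"bcftools exited with status {proc.returncode}")
    return counts
```

Keeping the parsing in `count_genotypes` separate from the subprocess call means the same function also works on an already extracted text file opened with `open()`.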
