Question: Extract gene names from annotated vcf file
0
2.1 years ago by
banerjeeshayantan160 wrote:

I have an annotated VCF file. I want to extract the gene name for each variant. How can I do this? This is the field I am interested in:

ANN=T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000120274|protein_coding
|1/16|c.-169+10295G>T||||||,T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000144543|retained_intron|
 1/7|n.163+10295G>T||||||,T|
intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000137111|retained_intron|1/7|n.343+9828G>T||||||"

I want to extract "Plekhg1" from the above entry.

sequencing next-gen • 1.9k views
modified 11 months ago by bioguy24 (190) • written 2.1 years ago by banerjeeshayantan (160)

I formatted the line to make it easier to read; I am not sure whether all of it is supposed to be on a single line.

Is the gene name always in the 4th field (separator |)?

written 2.1 years ago by genomax (80k)
3
2.1 years ago by
Nandini840 wrote:

How did you generate this annotation?

Was it generated using SnpEff? If so, you can extract gene names using SnpSift:

java -jar SnpSift.jar extractFields file.vcf CHROM POS REF ALT "ANN[*].GENE:"
written 2.1 years ago by Nandini (840)

Thanks for answering. Yes, it was generated using SnpEff. But I don't understand the output of your command: it gives me the entire VCF file as output. Can I get just the list of genes?

written 2.1 years ago by banerjeeshayantan (160)

Please read the SnpSift documentation to understand how to use the commands.

I am not sure if this will work, but you can try:

java -jar SnpSift.jar extractFields file.vcf  "ANN[*].GENE:"
written 2.1 years ago by Nandini (840)

Hi,

Was there an answer to this? I am having the same issue: "ANN[*].GENE:" just outputs all the ANN fields, not the gene name specifically.

Thanks, Anj

written 18 months ago by anjeetjhutty (0)

Try "ANN[*].GENE". This worked for both GENE and GENEID on my vcf.

written 17 months ago by tmajaria (0)
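
For anyone landing here later, a rough sketch of how the colon-free expression might be used to get one de-duplicated gene list per variant. The `-s` and `-e` options (separator for multiple values, placeholder for empty values) are as I recall them from the SnpSift documentation, so check them against your version. `unique_genes` is a hypothetical helper name, and the sketch assumes the gene names end up comma-joined in column 5 and that the first output line is a header.

```shell
# Hypothetical helper: collapse repeated gene names in column 5
# (e.g. "Plekhg1,Plekhg1,Plekhg1" becomes "Plekhg1").
# Assumes tab-separated input whose first line is a header.
unique_genes() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    NR == 1 { print; next }            # pass the header line through
    {
      n = split($5, g, ",")            # one entry per annotation
      split("", seen)                  # reset the per-line lookup table
      out = ""
      for (i = 1; i <= n; i++) {
        if (!(g[i] in seen)) {
          seen[g[i]] = 1
          out = (out == "" ? g[i] : out "," g[i])
        }
      }
      $5 = out
      print
    }'
}

# Intended use (not run here; needs SnpSift.jar and an annotated file.vcf):
# java -jar SnpSift.jar extractFields -s "," -e "." file.vcf \
#     CHROM POS REF ALT "ANN[*].GENE" | unique_genes
```

Doing the de-duplication after `extractFields` keeps the SnpSift call simple; first-seen order of the gene names is preserved.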
0
11 months ago by
Shicheng Guo8.1k wrote:

Usually the fourth field is the gene symbol; try this one:

java -jar SnpSift.jar extractFields file.vcf  "ANN[*].GENE:" | awk -F"|" '{print $4}'

The best choice would be:

java -jar SnpSift.jar extractFields file.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $1,$2,$3,$4,$8}' OFS="\t"
modified 11 months ago • written 11 months ago by Shicheng Guo (8.1k)
0
11 months ago by
Chicago
bioguy24190 wrote:

I have not used SnpEff, but if the gene name is always in the fourth field separated by |, this should work:

awk -F'|' '{print $4}'
written 11 months ago by bioguy24 (190)
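
A plain `awk -F'|'` on the raw VCF only sees the first annotation on each line, so here is a minimal sketch that walks every annotation instead, without SnpSift. It assumes a standard tab-separated VCF with INFO in column 8, ANN entries separated by commas, and the gene name in the fourth |-delimited field of each entry, as in the example in the question. `ann_genes` is a hypothetical function name.

```shell
# Hypothetical helper: print the unique gene names found in the ANN
# field of a SnpEff-annotated VCF read from standard input.
ann_genes() {
  awk -F'\t' '
    /^#/ { next }                          # skip VCF header lines
    {
      n = split($8, info, ";")             # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^ANN=/) {
          sub(/^ANN=/, "", info[i])
          m = split(info[i], ann, ",")     # one entry per annotation
          for (j = 1; j <= m; j++) {
            split(ann[j], f, "|")          # fields within one annotation
            if (f[4] != "") genes[f[4]] = 1
          }
        }
      }
    }
    END { for (g in genes) print g }
  ' | sort
}
```

Something like `ann_genes < file.vcf > genes.txt` would then give one gene name per line. If your annotations put the gene name in a different field, the `f[4]` index needs to change accordingly.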