Question: Extract gene names from annotated vcf file
0
2.6 years ago by
banerjeeshayantan170 wrote:

I have an annotated VCF file and want to extract the gene name for each variant. How can I do this? This is the field I am interested in:

ANN=T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000120274|protein_coding|1/16|c.-169+10295G>T||||||,
T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000144543|retained_intron|1/7|n.163+10295G>T||||||,
T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000137111|retained_intron|1/7|n.343+9828G>T||||||

I want to extract "Plekhg1" from the above entry.

sequencing next-gen • 2.5k views
ADD COMMENTlink modified 17 months ago by bioguy24190 • written 2.6 years ago by banerjeeshayantan170

I formatted the line to make it easier to read. I am not sure whether all of it is supposed to be on a single line.

Is the gene name always in the 4th field (separator |)?

ADD REPLYlink written 2.6 years ago by genomax90k
3
2.6 years ago by
Nandini860
Nandini860 wrote:

How did you get this annotation?

Was it generated using SnpEff? If so, you can extract gene names using SnpSift:

java -jar SnpSift.jar extractFields file.vcf CHROM POS REF ALT "ANN[*].GENE:"
ADD COMMENTlink written 2.6 years ago by Nandini860

Thanks for answering. Yeah it was generated using SNPEFF. But I don't seem to understand the output of your command. It's giving me the entire vcf file as output. Can I get just the list of genes?

ADD REPLYlink written 2.6 years ago by banerjeeshayantan170

Please read the SnpSift documentation to understand how the commands are used.

Not sure if this would work, but you can try:

java -jar SnpSift.jar extractFields file.vcf  "ANN[*].GENE:"
ADD REPLYlink written 2.6 years ago by Nandini860

Hi,

Was there an answer to this? I am having the same issue: "ANN[*].GENE:" just outputs all the ANN fields rather than the gene name specifically.

Thanks, Anj

ADD REPLYlink written 2.0 years ago by anjeetjhutty0

Try "ANN[*].GENE". This worked for both GENE and GENEID on my vcf.

ADD REPLYlink written 23 months ago by tmajaria0

What if I want only a list of all unique genes?

ADD REPLYlink written 5 months ago by atheeth140
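
For a list of unique genes, one possible approach is to bypass SnpSift and parse the ANN entries directly. This is only a minimal sketch: the function name unique_genes is hypothetical, and it assumes a tab-separated VCF on standard input with the SnpEff ANN layout shown in the question, where the gene name is the 4th |-separated subfield of each annotation.

```shell
# Hypothetical helper: print the unique gene names found in the ANN
# entries of a SnpEff-annotated VCF read from standard input.
# Assumption: the gene name is the 4th |-separated subfield of each annotation.
unique_genes() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      n = split($8, info, ";")                     # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^ANN=/) continue           # only the ANN entry
        m = split(substr(info[i], 5), ann, ",")    # one element per annotation
        for (j = 1; j <= m; j++) {
          split(ann[j], f, "|")
          if (f[4] != "") print f[4]
        }
      }
    }
  ' | sort -u
}

# Usage: unique_genes < annotated.vcf
```

Because sort -u runs at the end, a gene that appears in several annotations or several variants is listed once.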
0
17 months ago by
Shicheng Guo8.3k
Shicheng Guo8.3k wrote:

Usually, the fourth field will be the gene symbol, so try this one:

java -jar SnpSift.jar extractFields file.vcf  "ANN[*].GENE:" | awk -F"|" '{print $4}'

The best choice would be:

java -jar SnpSift.jar extractFields file.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $1,$2,$3,$4,$8}' OFS="\t"
ADD COMMENTlink modified 17 months ago • written 17 months ago by Shicheng Guo8.3k
0
17 months ago by
bioguy24190
Chicago
bioguy24190 wrote:

I have not used SnpEff, but if the gene name is always in the fourth field separated by |, you can use:

awk -F'|' '{print $4}'
ADD COMMENTlink written 17 months ago by bioguy24190
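
Note that printing $4 with | as the only separator returns the gene from the first annotation on each line only. A rough sketch of a per-variant version follows. The function name genes_per_variant is hypothetical, and it assumes a tab-separated VCF on standard input with the gene name in the 4th subfield of every ANN annotation.

```shell
# Hypothetical helper: for each variant, print CHROM, POS and the
# de-duplicated gene names from all of its ANN annotations.
# Assumption: the gene name is the 4th |-separated subfield of each annotation.
genes_per_variant() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                                  # skip header lines
    {
      genes = ""
      split("", seen)                              # reset the per-line set
      if (match($8, /(^|;)ANN=[^;]*/)) {           # locate the ANN entry in INFO
        ann = substr($8, RSTART, RLENGTH)
        sub(/^;?ANN=/, "", ann)
        n = split(ann, a, ",")                     # one element per annotation
        for (i = 1; i <= n; i++) {
          split(a[i], f, "|")
          if (f[4] != "" && !(f[4] in seen)) {
            seen[f[4]] = 1
            genes = (genes == "" ? f[4] : genes "," f[4])
          }
        }
      }
      print $1, $2, (genes == "" ? "." : genes)    # "." when no gene was found
    }
  '
}

# Usage: genes_per_variant < annotated.vcf
```

Variants without an ANN entry are kept in the output with a "." placeholder, so the row count still matches the number of variants.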