Question: Annotation with snpEFF
0
gravatar for Nandini
4 months ago by
Nandini700
London
Nandini700 wrote:

I have been using the following commands to annotate my vcf file with Refseq annotation on SnpEff 4.3t, followed by snpsift to extract columns of interest.

java -Xmx4g -jar snpEff.jar hg19 $panel.sorted.vcf > $panel.snpeff.hg19.vcf

java -jar SnpSift.jar extractFields -s "," -e "." $panel.snpeff.hg19.vcf CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" > $panel.snpeff.hg19.HGVS.txt

I can see that some feature IDs have been annotated with non-refseq annotations(eg 4PED, 4IDN etc) and the extraction of HGVS_P is not formatted.

Is there a better way to extract and format RefSeq and HGVS annotations ?

Thank you

CHROM   POS ID  REF ALT ANN[*].FEATUREID    ANN[*].HGVS_C   ANN[*].HGVS_P**
1   227172290   rs12593 C   T   4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4    c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T .,.,.,.,.,p.Phe480Phe
1   227174210   rs3738725   T   C   4PED:A_572-A_630:NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4    c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G   .,p.Ser572Ser,.,.
14  51057727    rs1060197   G   A   4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3   c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A   .,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1   112329551   rs3738298   G   T   NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A   .,.
1   156785617   rs1800601   G   A   NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1  c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T  .,.,.,.,.,.
X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A   .,.,.
ADD COMMENTlink modified 4 months ago by Biostar ♦♦ 20 • written 4 months ago by Nandini700
1

Hello Nandini,

and the extraction of HGVS_P is not formatted

what do you mean by this?

fin swimmer

ADD REPLYlink written 4 months ago by finswimmer6.7k

Hi fin swimmer,

So if you see the result snippet in the last column, the HGVS_P columns, when extracted looks like this

ANN[*].HGVS_P**
.,.,.,.,.,p.Phe480Phe
.,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
.,.,.,.,.,.
.,.,.
ADD REPLYlink written 4 months ago by Nandini700

Yes, but this is correct. Without any paramater SnpEff annotates your variants for every transcript it has in its database.

Let's take this line:

X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A   .,.,.

There are 3 transcript. And so you have 3 values for HGVS_C and 3 values HGVS_P each seperated with a comma. If you just want one transcript per variant you have to provide a list of transcript numbers using -onlyTr <file.txt> or let SnpEff choose the canonical transcript with -canon.

fin swimmer

ADD REPLYlink modified 4 months ago • written 4 months ago by finswimmer6.7k

yup, unfortunately, we need to report all transcripts as we are working under a clinical setting.

ADD REPLYlink written 4 months ago by Nandini700

So if you need all transcripts, what is your question? :) All desired informations are there.

Can you give an example of the output format you like to have?

fin swimmer

ADD REPLYlink written 4 months ago by finswimmer6.7k

My question is 1) Is there a way to avoid non-ref seq annotations such as 4PED:A_480-A_522 4IDN:B_77-B_117

    4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4
 4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3

2) Is there a better way to extract the columns in a formatted way without the commas.

ANN[*].HGVS_P**
.,.,.,.,.,p.Phe480Phe
.,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
.,.,.,.,.,.
.,.,.

Desired out should be like

CHROM   POS ID  REF ALT ANN[*].FEATUREID    ANN[*].HGVS_C   ANN[*].HGVS_P**
1   227172290   rs12593 C   T   NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4 c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T p.Phe480Phe
1   227174210   rs3738725   T   C   NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4 c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G   p.Ser572Ser
14  51057727    rs1060197   G   A   NM_015915.4,NM_001127713.1,NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3   c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A   p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1   112329551   rs3738298   G   T   NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A   
1   156785617   rs1800601   G   A   NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1  c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T  
X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A
ADD REPLYlink modified 4 months ago • written 4 months ago by Nandini700

Could you please provide an unannotated extract of your vcf file, that one can play around with it?

ADD REPLYlink written 4 months ago by finswimmer6.7k

Should I paste it here or upload it on slack as DM ? Its a file with ~20 variants

ADD REPLYlink modified 4 months ago • written 4 months ago by Nandini700

Upload is fine if you can.

ADD REPLYlink written 4 months ago by finswimmer6.7k

I have uploaded the file on slack

ADD REPLYlink written 4 months ago by Nandini700
1

Hello,

I take a look at this. I haven't found an option for snpEff to get rid of those transcripts like 4PED:A_480-A_522. Furthermore I think your desired output is not a good idea. You like to remove empty values for HGVS_P. But doing so you lost the information which value belongs to which HGVS_C and transcript.

I have another suggestion: snpEff provides a script to split the annotation for each variant into one line per transcript. We could then remove those lines where the transcript doesn't start with N (or NM if you also want to remove NR). Your output will than looks like this:

14  78028803    rs2364602   A   G   NM_004863.3 c.786T>C    p.Asn262Asn
14  102508056   rs10129889  C   A   NM_001376.4 c.12087C>A  p.His4029Gln
14  102515015   rs1004903   G   A   NM_001376.4 c.13372+9G>A    .
X   70442845    rs6525485   G   A   NM_000166.5 c.-713G>A   .
X   70442845    rs6525485   G   A   NR_001568.1 n.173-12792C>T  .
X   70442845    rs6525485   G   A   NM_001097642.2  c.-16-697G>A    .

You get this result with

/path/to/snpEff/scripts/vcfEffOnePerLine.pl < input.snpeff.hg19.vcf \ 
| java -jar /path/to/snpEff/SnpSift.jar extractFields -s "," -e "." - CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" \ 
| awk -v OFS="\t" '$6 ~ "^N" {print}' > out.snpeff.hg19.HGVS.txt

fin swimmer

ADD REPLYlink modified 4 months ago • written 4 months ago by finswimmer6.7k

thanks fin swimmer.... this is helpful... I'll give it a go and see if this can be implemented for us!

ADD REPLYlink written 4 months ago by Nandini700

Sorry, just adding on to this, would anyone know if snpeff provides the annotation to encoded proteins (NP_) ids ? Currently, it annotates only to RefSeq transcripts (NM_ and NR_)

ADD REPLYlink written 4 months ago by Nandini700

Hello Nandini,

I'm not aware of such an option. But as NM and NP are bind to each other, you could find out the correspondig NP using BioMart.

fin swimmer

ADD REPLYlink written 4 months ago by finswimmer6.7k

thanks fin swimmer.

It's when I annotate the tool with -hgvsTrId option, it ends up linking the refseq transcript to the protein (HGVS_P). So my results look like this now, which again is technically incorrect - HGVS_P should ideally be linked to a protein NP_ ids

CHROM   POS REF ALT ANN[*].HGVS_C   ANN[*].HGVS_P
5   148386525   T   G   NM_024577.3:c.3594A>C   NM_024577.3:p.Pro1198Pro
5   148407708   A   C   NM_024577.3:c.1587T>G   NM_024577.3:p.Arg529Arg
5   148407893   C   A   NM_024577.3:c.1402G>T   NM_024577.3:p.Ala468Ser
5   148408101   A   G   NM_024577.3:c.1194T>C   NM_024577.3:p.Gly398Gly
ADD REPLYlink modified 4 months ago • written 4 months ago by Nandini700
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1374 users visited in the last hour