Question: Annotation with snpEFF
0
gravatar for Nandini
26 days ago by
Nandini630
London
Nandini630 wrote:

I have been using the following commands to annotate my vcf file with Refseq annotation on SnpEff 4.3t, followed by snpsift to extract columns of interest.

java -Xmx4g -jar snpEff.jar hg19 $panel.sorted.vcf > $panel.snpeff.hg19.vcf

java -jar SnpSift.jar extractFields -s "," -e "." $panel.snpeff.hg19.vcf CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" > $panel.snpeff.hg19.HGVS.txt

I can see that some feature IDs have been annotated with non-refseq annotations(eg 4PED, 4IDN etc) and the extraction of HGVS_P is not formatted.

Is there a better way to extract and format RefSeq and HGVS annotations ?

Thank you

CHROM   POS ID  REF ALT ANN[*].FEATUREID    ANN[*].HGVS_C   ANN[*].HGVS_P**
1   227172290   rs12593 C   T   4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4    c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T .,.,.,.,.,p.Phe480Phe
1   227174210   rs3738725   T   C   4PED:A_572-A_630:NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4    c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G   .,p.Ser572Ser,.,.
14  51057727    rs1060197   G   A   4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3   c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A   .,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1   112329551   rs3738298   G   T   NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A   .,.
1   156785617   rs1800601   G   A   NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1  c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T  .,.,.,.,.,.
X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A   .,.,.
ADD COMMENTlink modified 7 days ago by Biostar ♦♦ 20 • written 26 days ago by Nandini630
1

Hello Nandini,

and the extraction of HGVS_P is not formatted

what do you mean by this?

fin swimmer

ADD REPLYlink written 26 days ago by finswimmer3.6k

Hi fin swimmer,

So if you see the result snippet in the last column, the HGVS_P columns, when extracted looks like this

ANN[*].HGVS_P**
.,.,.,.,.,p.Phe480Phe
.,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
.,.,.,.,.,.
.,.,.
ADD REPLYlink written 26 days ago by Nandini630

Yes, but this is correct. Without any paramater SnpEff annotates your variants for every transcript it has in its database.

Let's take this line:

X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A   .,.,.

There are 3 transcript. And so you have 3 values for HGVS_C and 3 values HGVS_P each seperated with a comma. If you just want one transcript per variant you have to provide a list of transcript numbers using -onlyTr <file.txt> or let SnpEff choose the canonical transcript with -canon.

fin swimmer

ADD REPLYlink modified 26 days ago • written 26 days ago by finswimmer3.6k

yup, unfortunately, we need to report all transcripts as we are working under a clinical setting.

ADD REPLYlink written 26 days ago by Nandini630

So if you need all transcripts, what is your question? :) All desired informations are there.

Can you give an example of the output format you like to have?

fin swimmer

ADD REPLYlink written 26 days ago by finswimmer3.6k

My question is 1) Is there a way to avoid non-ref seq annotations such as 4PED:A_480-A_522 4IDN:B_77-B_117

    4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4
 4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3

2) Is there a better way to extract the columns in a formatted way without the commas.

ANN[*].HGVS_P**
.,.,.,.,.,p.Phe480Phe
.,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
.,.,.,.,.,.
.,.,.

Desired out should be like

CHROM   POS ID  REF ALT ANN[*].FEATUREID    ANN[*].HGVS_C   ANN[*].HGVS_P**
1   227172290   rs12593 C   T   NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4 c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T p.Phe480Phe
1   227174210   rs3738725   T   C   NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4 c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G   p.Ser572Ser
14  51057727    rs1060197   G   A   NM_015915.4,NM_001127713.1,NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3   c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A   p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1   112329551   rs3738298   G   T   NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A   
1   156785617   rs1800601   G   A   NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1  c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T  
X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A
ADD REPLYlink modified 26 days ago • written 26 days ago by Nandini630

Could you please provide an unannotated extract of your vcf file, that one can play around with it?

ADD REPLYlink written 26 days ago by finswimmer3.6k

Should I paste it here or upload it on slack as DM ? Its a file with ~20 variants

ADD REPLYlink modified 26 days ago • written 26 days ago by Nandini630

Upload is fine if you can.

ADD REPLYlink written 26 days ago by finswimmer3.6k

I have uploaded the file on slack

ADD REPLYlink written 26 days ago by Nandini630
1

Hello,

I take a look at this. I haven't found an option for snpEff to get rid of those transcripts like 4PED:A_480-A_522. Furthermore I think your desired output is not a good idea. You like to remove empty values for HGVS_P. But doing so you lost the information which value belongs to which HGVS_C and transcript.

I have another suggestion: snpEff provides a script to split the annotation for each variant into one line per transcript. We could then remove those lines where the transcript doesn't start with N (or NM if you also want to remove NR). Your output will than looks like this:

14  78028803    rs2364602   A   G   NM_004863.3 c.786T>C    p.Asn262Asn
14  102508056   rs10129889  C   A   NM_001376.4 c.12087C>A  p.His4029Gln
14  102515015   rs1004903   G   A   NM_001376.4 c.13372+9G>A    .
X   70442845    rs6525485   G   A   NM_000166.5 c.-713G>A   .
X   70442845    rs6525485   G   A   NR_001568.1 n.173-12792C>T  .
X   70442845    rs6525485   G   A   NM_001097642.2  c.-16-697G>A    .

You get this result with

/path/to/snpEff/scripts/vcfEffOnePerLine.pl < input.snpeff.hg19.vcf \ 
| java -jar /path/to/snpEff/SnpSift.jar extractFields -s "," -e "." - CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" \ 
| awk -v OFS="\t" '$6 ~ "^N" {print}' > out.snpeff.hg19.HGVS.txt

fin swimmer

ADD REPLYlink modified 25 days ago • written 25 days ago by finswimmer3.6k

thanks fin swimmer.... this is helpful... I'll give it a go and see if this can be implemented for us!

ADD REPLYlink written 25 days ago by Nandini630

Sorry, just adding on to this, would anyone know if snpeff provides the annotation to encoded proteins (NP_) ids ? Currently, it annotates only to RefSeq transcripts (NM_ and NR_)

ADD REPLYlink written 8 days ago by Nandini630

Hello Nandini,

I'm not aware of such an option. But as NM and NP are bind to each other, you could find out the correspondig NP using BioMart.

fin swimmer

ADD REPLYlink written 8 days ago by finswimmer3.6k

thanks fin swimmer.

It's when I annotate the tool with -hgvsTrId option, it ends up linking the refseq transcript to the protein (HGVS_P). So my results look like this now, which again is technically incorrect - HGVS_P should ideally be linked to a protein NP_ ids

CHROM   POS REF ALT ANN[*].HGVS_C   ANN[*].HGVS_P
5   148386525   T   G   NM_024577.3:c.3594A>C   NM_024577.3:p.Pro1198Pro
5   148407708   A   C   NM_024577.3:c.1587T>G   NM_024577.3:p.Arg529Arg
5   148407893   C   A   NM_024577.3:c.1402G>T   NM_024577.3:p.Ala468Ser
5   148408101   A   G   NM_024577.3:c.1194T>C   NM_024577.3:p.Gly398Gly
ADD REPLYlink modified 7 days ago • written 8 days ago by Nandini630
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1475 users visited in the last hour