Chr Start End Size Value Comment BookmarkType INFO

Question

annotation of SV (structural variants)

9

Entering edit mode

6.2 years ago

Bogdan ★ 1.4k

Dear all,

which tools would you recommend to use for the annotation of SV (structural variants) with gene information (SV including deletions, insertions, translocations, inversions, duplications). thank you !

-- bogdan

SV annotation • 12k views

ADD COMMENT • link updated 4.3 years ago by Fire • 0 • written 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

You have these stored in VCF format?

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

yes, Kevin, the files are in VCF format.

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

Can you try something like ANNOVAR? - start from the Quick Start-Up Guide. Installing everything can take a while.

Then, to annotate a VCF, you would do something like this:

perl annovar/convert2annovar.pl -format vcf4 -withzyg --allsample -outfile anotation.ann MySV.vcf ;

perl annovar/table_annovar.pl anotation.ann.avinput annovar/humandb/ -buildve hg19 -remove -otherinfo -onetranscript -protocol refGene,cytoBand,esp6500siv2_all,exac03,dbnsfp30a,avsnp147,cosmic70 -operation g,r,f,f,f,f,f -nastring "NA" -csvout ;

ANNOVAR Also allows you to annotate via the Database of Genomic Variants (DGV).

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

SV annotation (with OMIM, DGV, 1000g, haploinsufficiency, TAD, ... and also with your own in-house information) can be easily automated !

You can look at this post describing the annotSV tool: Annotation for SV and CNV

ADD REPLY • link 5.8 years ago by LGMgeo ▴ 100

score 2 · Answer 1 · 2018-01-29

2

Entering edit mode

6.2 years ago

JJ ▴ 670

I recommend snpeff. You can annotate SNPs and SVs together in one go :) And you get a nice report!

ADD COMMENT • link 6.2 years ago by JJ ▴ 670

0

Entering edit mode

Hi, how can you use Snpeff to annotate translocations for example?

ADD REPLY • link 6.2 years ago by alons ▴ 270

1

Entering edit mode

I have tried SnpEff with an output vcf file from DELLY 0.7.5, where the TRANSLOCATIONS are annotated in <tra> format, and not in <bnd> format, and the latest version of SnpEff did not wok on <tra> items.

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

I see, I'll try too, please update if you succeed. I will as well.

ADD REPLY • link 6.2 years ago by alons ▴ 270

0

Entering edit mode

You should update to Delly to 0.7.7 which uses the BND format - then it will work

ADD REPLY • link 6.2 years ago by JJ ▴ 670

score 1 · Answer 2 · 2018-01-28

1

Entering edit mode

6.2 years ago

d-cameron ★ 2.9k

This very much depends on what sort of annotation you want to do.

If you're interested in breakend overlap annotation, then I recommend using my StructuralVariantAnnotation package to convert to a paired breakend GRanges object then using the standard bioconductor packages to annotate stuff like repeat overlaps, gene position and orientation (ie is this a viable fusion gene), and so on.

If you're wanting to do annotations of the functional impact of the variant, then this is very much a different problem and the state of the art in this is well behind that of SNVs and small indels.

ADD COMMENT • link 6.2 years ago by d-cameron ★ 2.9k

1

Entering edit mode

Dear Daniel, thank you for your suggestions, and congratulations for the publication of GRIDSS package.

Regarding the package "StructuralVariantAnnotation", I have started learning how to use it, and if I may ask please 2 questions, I would appreciate having your replies :

-- please would it be possible to share a piece of R code that uses "findBreakpointOverlaps(gr)" in order to infer the overlap of the breakpoints with exons, introns or intergenic regions ? -- after the overlap of the breakpoints with the gene information, is there any function available that infer gene fusions or truncations ? (i used to do it with a perl script, although having everything in R, it would be very helpful).

if you wish, we can also talk via email. My email address is tanasa@gmail.com. Thank you ;)

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

1

Entering edit mode

please would it be possible to share a piece of R code that uses "findBreakpointOverlaps(gr)" in order to infer the overlap of the breakpoints with exons, introns or intergenic regions ?

The nice thing about the breakpoint format I'm using (a normal bioconductor GRanges object with just an extra $partner field to identify the other side) is that you don't have to do anything special for this. Looking at breakend overlaps is just like looking at SNV/small indel overlaps.

You can see an example of this in https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R which is a very simple script that checks for candidate gene fusion. In line 52, you can see that to calculating the gene overlapping a breakend just uses the standard GenomicRanges findOverlaps function. findBreakpointOverlaps is useful when you want find overlaps between breakpoints (e.g. comparing a SV caller to a truth set, or creating an ensemble SV call set), but if you're just comparing breakends to genomic intervals, StructuralVariantAnnotation is only needed for the conversion of VCF to the breakend GRanges format.

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Dear Daniel, many many thanks ! very much appreciate sharing the code and your suggestions. Please may I ask, the code infers the fusions starting only from TRA/BND, or we can use also DEL, DUP, or INV in order to infer the fusions ? Thanks !

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

StructuralVariantAnnotation will convert the output of SV callers into the same GRanges format regardless of whether the caller used TRA/BND or DEL/DUP/INV format to make the call. It works with VCFv4.2 or VCFv4.3 compliant output, but I've added some extra code to handle some of the more popular callers that don't follow the VCF specs correctly.

I've used it successfully on breakdancer, cortex, crest, delly, gridss, hydra, lumpy, manta, pindel, and socrates (although you need to use my conversion script to turn the breakdancer & socrates outputs into VCF format).

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Dear Daniel, thank you. We have used DELLY and LUMPY, and now i am looking also into GRIDSS, as our friends from Stanford were very very excited about using GRIDSS on calling SV in cancer genomes.

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

Hi Daniel, thank you, the R code works very well. If I may add a question please:

when using breakpointRanges(vcf) to transform a VCF into a GR, how are strands + and - especially defined/assigned ? I am using a vcf file from DELLY, where the TRA (and other SV) are encoded as 3to5, 5to3, 3to3, or 5tot5. thanks a lot !

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

indicates that the DNA before the breakpoint is involved in the adjacency, - is the other orientation. For a 500bp chr1 the following GRanges chr1 100 + chr1 200 -

Would indicate a deletion as the DNA leading up to chr1:100 is connected to the DNA at chr1:200 and beyond.

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Dear Daniel, thank you for your reply.

Please if I may add a note below, as I would like to understand much better how the "+" and "-" are assigned please :

we have been using DELLY on a few cancer genomes that we had, and DELLY gives the following SV:

DEL are always 3to5

DUP are always 5to3

INV can be 5to5 or 3to3

TRA can be 5to5, 3to3, 5to3 or 3to5.

How do these SV above translate into a notation with + and - ? thank you very much again !

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

The special case handling for DELLY TRA code can be found https://github.com/PapenfussLab/StructuralVariantAnnotation/blob/master/R/extensions-VCF.R#L336 .

I just reviewed the code and the custom DELLY 5to5 and 3to3 attributes are not used for inversions (I just found this issue myself with manta INV3 and INV5 non-standard attributes). The current version of StructuralVariantAnnotation will ignore 5to5 and 3to3 and report both inversion breakpoints.

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Dear Daniel, thanks a lot ! we wish you a happy and fruitful week !

If I may add a question, this time if about TRA, specifically about 3to5, and 5to3 TRANSLOCATIONS.

When do we define a TRA as 3to5, and when it is a TRA 5to3 ? I believe that these are RECIPROCAL TRA.

thank you !

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

Reciprocal translocations should have two separate translocation events at the same position. Depending on the DNA repair mechanism, the 'same' position are usually slightly different positions.

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Dear Daniel, thank you. May I add and ask : how is a TRA 3to5 different than a TRA 5to3 ? thank you ;) !

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

And talking about INV,

yes, thank you, I am very happy that your R package reports annotations for both 5to5 ad 3to3 INV breakpoints ;)

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

And, if I may add please another question (that is a bit naive as I have not checked the answer yet):

depending on the type of SV and depending on the type of SV direction/orientation (3to5, 5to3, 5to5, 3tot3),

shall we refer to INVERSIONS,

would we need to include in the R script that computes the gene fusions a piece of R code for instance that inverts the INV coordinates, or the function "breakpointRanges" does it internally already ?

I am asking because I would also like to compute the TRUNCATIONS. thank you very much ;)

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

In my gene fusion example code, gene orientation and the breakpoint direction must be consistent with the fusion. An upstream gene in the fusion must have the strand matching the breakpoint orientation (+ strand transcript would require a + breakpoint orientation), and the downstream gene requires the oppposite.

Truncation logic will be quite similar.

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Dear Daniel,

talking about TRUNCATIONS, shall we define the gene TRUNCATIONS, based on your R code that is available at https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R, I believe that we only have to change the lines 70-72 of the R code by "negation" of "couldBeFivePrimeEnd" :

gr <- gr[ (gr$couldBeThreePrimeStart & (! partner(gr)$couldBeFivePrimeEnd)) | ( (! gr$couldBeFivePrimeEnd) & partner(gr)$couldBeThreePrimeStart),]

thank you very much !

ADD REPLY • link 6.2 years ago by Bogdan ★ 1.4k

0

Entering edit mode

Hi Daniel,
I have tried processing delly v0.7.9 output vcf file with breakpointRanges() and all of the BNDs were removed as unpaired, so I was wondering if there was a way around this issue, using your package?

ADD REPLY • link 5.5 years ago by anamaria ▴ 30

0

Entering edit mode

The DELLY authors updated their output to use BND notation (which is good) but they only write one of the two records required for a valid VCF BND breakpoint (which is bad).

Technically my tool is doing the correct thing but that isn't particularly useful for users. I have a few other projects taking my time at the moment but I should have an updated version that handles the DELLY notational non-compliance by the end of this month.

Edit: as a workaround, you can update the DELLY VCF to include the other side of the BND breakpoint.

ADD REPLY • link 5.4 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Thanks a lot, Daniel ! yes, if you could please let us know when we can start using it, it would be very helpful !

ADD REPLY • link 5.4 years ago by Bogdan ★ 1.4k

score 0 · Answer 3 · 2020-01-07

@Kevin Blighe dear Kevin Blighe,

i had the gsa microarray data in which i wanted to identify the cnv's .... the cnv were known through illumina genome studio software which gave txt file like this.

SampleID BookmarkType Chr Start End Size Author CreatedDate Value Comment 203389250004_R01C01 [1] CNV Bin: Min 2.5 To Max 3.5 3 197788369 197871052 82683 12/28/2019 7:31:44 PM 3 CNV Confidence: 58.26665 203389250004_R01C01 [1] CNV Bin: Min 0.5 To Max 1.5 9 105596881 105835972 239091 12/28/2019 7:31:44 PM 1 CNV Confidence: 95.1899

for annotation of this txt file with annovar tool i converted it into vcf format which looked something like this

Chr Start End Size Value Comment BookmarkType INFO

3 197788369 197871052 82683 3 CNV Confidence: 58.26665 CNV Bin: Min 2.5 To Max 3.5 203389250004_R01C01 [1],,12/28/2019 7:31:44 PM 9 105596881 105835972 239091 1 CNV Confidence: 95.1899 CNV Bin: Min 0.5 To Max 1.5 203389250004_R01C01 [1],,12/28/2019 7:31:44 PM

i want to know is this file good for inputting it into annovar and also how to annotate this vcf file using annovar, once i have downloaded all the database defined by you in this post earlier i have followed the commands given by you

perl annovar/convert2annovar.pl -format vcf4 -withzyg --allsample -outfile anotation.ann input.vcf

perl annovar/table_annovar.pl anotation.ann annovar/humandb/ -buildve hg19 -remove -otherinfo -onetranscript -protocol refGene,cytoBand,esp6500siv2_all,exac03,dbnsfp35c,avsnp150,cosmic70 -operation g,r,f,f,f,f,f -nastring "NA" -csvout it generates file something like this

Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,cytoBand,esp6500siv2_all,ExAC_ALL,ExAC_AFR,ExAC_AMR,ExAC_EAS,ExAC_FIN,ExAC_NFE,ExAC_OTH,ExAC_SAS,SIFT_score,SIFT_converted_rankscore,SIFT_pred,LRT_score,LRT_converted_rankscore,LRT_pred,MutationTaster_score,MutationTaster_converted_rankscore,MutationTaster_pred,MutationAssessor_score,MutationAssessor_score_rankscore,MutationAssessor_pred,FATHMM_score,FATHMM_converted_rankscore,FATHMM_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,MetaSVM_score,MetaSVM_rankscore,MetaSVM_pred,MetaLR_score,MetaLR_rankscore,MetaLR_pred,M-CAP_score,M-CAP_rankscore,M-CAP_pred,MutPred_score,MutPred_rankscore,fathmm-MKL_coding_score,fathmm-MKL_coding_rankscore,fathmm-MKL_coding_pred,Eigen_coding_or_noncoding,Eigen-raw,Eigen-PC-raw,GenoCanyon_score,GenoCanyon_score_rankscore,integrated_fitCons_score,integrated_fitCons_score_rankscore,integrated_confidence_value,GERP++_RS,GERP++_RS_rankscore,phyloP100way_vertebrate,phyloP100way_vertebrate_rankscore,phyloP20way_mammalian,phyloP20way_mammalian_rankscore,phastCons100way_vertebrate,phastCons100way_vertebrate_rankscore,phastCons20way_mammalian,phastCons20way_mammalian_rankscore,SiPhy_29way_logOdds,SiPhy_29way_logOdds_rankscore,Interpro_domain,GTEx_V6p_gene,GTEx_V6p_tissue,avsnp150,cosmic70,Otherinfo1,Otherinfo2,Otherinfo3 3,197788369,197788369,0,0,"ncRNA_intronic","ANKRD18DP",NA,NA,NA,"3q29",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,". CNV Confidence: 58.26665

i want to know the rsids of the cnv for this particular position only and so on in my file. Can you please help me out with this and if i am wrong somewhere please correct me.

thankyou,