Question: Annotation Of Genomic Positions
6.1 years ago by
secretjess170 wrote:

I'm working with some human breakpoint data:

Chr.L    Pos.L    Strand.L    Chr.H    Pos.H    Strand.H
18    19092052    +    18    30289323    +

I would like to know where breakpoints are generally occurring, e.g. joining together two exons, introns, UTRs, etc.

I have tried querying Ensembl via Biomart in R using:

attributes = c("transcript_biotype"), filters = c("chromosomal_region")

When I use the first position 18:19092052:19092052:1 it returns some transcripts which are out of range (e.g. 18822203-19035091) but seems to return the correct transcript with transcript start and end values overlapping the input, so I can work with that.

However for the second position 18:30289323:30289323:1 it does not return anything. Does this mean it is noncoding DNA? Is this happening because I am querying Ensembl Genes? I can live with that too but I'd just like it confirmed.

Otherwise, is there a better way I could do this? Perhaps using an SNP tool, like ANNOVAR?

modified 6.1 years ago by Emily_Ensembl18k • written 6.1 years ago by secretjess170
6.1 years ago by
Emily_Ensembl18k wrote:

Hi Jess

The reason you've had a problem with BioMart is that you've specified the strand. The gene affected by the second breakpoint is on the reverse strand, but since you've specified a forward strand region, you haven't got a hit.

You can use the Ensembl VEP to analyse the effects of large-scale variation. You need to convert this into VCF format, for example:


18 19092052 sv1 . <DEL> . . SVTYPE=DEL;END=30289323 .

This will give you a list of all the genes affected by this CNV. Most will be labelled as transcript_ablations, but those affected by the breakpoints will come out as things like stop_lost, coding_sequence_variant, 3_prime_UTR_variant, intron_variant, feature_truncation etc.
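With many breakpoint rows, the conversion to VCF-style lines could be scripted. The following is a minimal sketch, not part of the original answer: the function name `breakpoints_to_vcf` is hypothetical, it assumes the whitespace-separated column layout shown in the question, it assumes the ALT tag and `SVTYPE` should name the same event type, and it only handles rows where both breakpoints are on the same chromosome.

```python
def breakpoints_to_vcf(table_text, svtype="DEL"):
    """Turn whitespace-separated breakpoint rows into VCF-style SV lines.

    Hypothetical helper: expects a header row with the columns
    Chr.L, Pos.L, Strand.L, Chr.H, Pos.H, Strand.H as in the question.
    """
    rows = [line.split() for line in table_text.strip().splitlines()]
    header, data = rows[0], rows[1:]
    lines = []
    for i, row in enumerate(data, start=1):
        rec = dict(zip(header, row))
        if rec["Chr.L"] != rec["Chr.H"]:
            # Assumption: inter-chromosomal joins need a different
            # representation, so they are skipped here.
            continue
        start, end = sorted((int(rec["Pos.L"]), int(rec["Pos.H"])))
        fields = [
            rec["Chr.L"],                     # CHROM
            str(start),                       # POS
            f"sv{i}",                         # ID (invented label)
            ".",                              # REF
            f"<{svtype}>",                    # ALT
            ".",                              # QUAL
            ".",                              # FILTER
            f"SVTYPE={svtype};END={end}",     # INFO
            ".",                              # trailing field, as in the example above
        ]
        lines.append("\t".join(fields))
    return lines


example = """Chr.L Pos.L Strand.L Chr.H Pos.H Strand.H
18 19092052 + 18 30289323 +"""

for line in breakpoints_to_vcf(example):
    print(line)
```

The output is tab-separated, one structural variant per input row, and can be written to a file for upload or for the command-line tool.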

Alternatively, if you just want to know what's at the breakpoints, you can put the two points in as if they were SNPs, inventing alleles for them, e.g.:

18 19092052 19092052 A/G +
18 30289323 30289323 A/G +

This will just give you a table of the genes and transcripts affected.
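The SNP-style input can be generated the same way for a whole file of breakpoints. This is a rough sketch under stated assumptions: `breakpoints_as_snps` is a hypothetical name, each input row is a `(chr_low, pos_low, chr_high, pos_high)` tuple, and the `A/G` alleles are invented placeholders exactly as in the example above.

```python
def breakpoints_as_snps(rows, alleles="A/G"):
    """Emit one SNP-like line per breakpoint end.

    rows: iterable of (chr_low, pos_low, chr_high, pos_high) tuples.
    The alleles are placeholders; only the position matters here.
    """
    out = []
    for chr_l, pos_l, chr_h, pos_h in rows:
        for chrom, pos in ((chr_l, pos_l), (chr_h, pos_h)):
            # Start and end are the same position, as for a SNP.
            out.append(f"{chrom} {pos} {pos} {alleles} +")
    return out


snp_lines = breakpoints_as_snps([("18", 19092052, "18", 30289323)])
print("\n".join(snp_lines))
```

Because each end is treated independently, this form also works for rows where the two breakpoints lie on different chromosomes.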

The VEP is also available as a Perl script, which you can run offline with a cache; this is much quicker than using it online.

To be honest, if you're only looking at two breakpoints, the easiest way is to look at them in the Ensembl browser.


modified 6.1 years ago • written 6.1 years ago by Emily_Ensembl18k

Hi Emily - many thanks for your help! :) I've misled you a bit with my question though, sorry! I have a lot of breakpoint data to analyse; that example was just the first line in one of the files. Also, the forward strand is correct for both breakpoints (it is a large deletion). VEP looks interesting. I've previously been using tabix/vcftools to query 1000 Genomes, but that looks like it could be much easier. The SNPs idea is great too.

written 6.1 years ago by secretjess170

Wasn't sure if you had lots or just one or two. You can analyse the whole lot at once with the VEP. If you've got lots, I recommend the Perl script and downloading a cache:

written 6.1 years ago by Emily_Ensembl18k

A breakpoint surely doesn't have a strand, as both strands are broken. If you specify a strand to BioMart it will only look for features on one strand, but if you don't specify a strand, it will look for features on both strands, which is what you want.
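As an illustration of the difference, the region strings for the `chromosomal_region` filter could be built with the strand left off. This is only a sketch: `region_filter` is a hypothetical helper, and it assumes the filter accepts `chr:start:end` without the fourth strand field, as the comment above describes.

```python
def region_filter(chrom, pos, strand=None):
    """Build a chr:start:end[:strand] string for a single position.

    Leaving strand as None omits the strand field, which (per the
    comment above) should make the query consider both strands.
    """
    parts = [str(chrom), str(pos), str(pos)]
    if strand is not None:
        parts.append(str(strand))
    return ":".join(parts)


both_strands = [region_filter("18", p) for p in (19092052, 30289323)]
forward_only = region_filter("18", 30289323, strand=1)
print(both_strands)
print(forward_only)
```

The strand-less strings are the ones to pass as filter values; the forward-only form reproduces the query from the question that returned nothing for the second position.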

modified 6.1 years ago • written 6.1 years ago by Emily_Ensembl18k