Question

Annotation Of Genomic Positions

1

11.1 years ago

secretjess ▴ 210

I'm working with some human breakpoint data:

Chr.L    Pos.L    Strand.L    Chr.H    Pos.H    Strand.H
18    19092052    +    18    30289323    +

I would like to know where the breakpoints generally occur, e.g. whether they join together two exons, introns, UTRs, etc.

I have tried querying Ensembl via BioMart in R using:

attributes = c("transcript_biotype"), filters = c("chromosomal_region")

When I use the first position 18:19092052:19092052:1 it returns some transcripts which are out of range (e.g. 18822203-19035091) but seems to return the correct transcript with transcript start and end values overlapping the input, so I can work with that.

However for the second position 18:30289323:30289323:1 it does not return anything. Does this mean it is noncoding DNA? Is this happening because I am querying Ensembl Genes? I can live with that too but I'd just like it confirmed.

Otherwise, is there a better way I could do this? Perhaps using an SNP tool, like ANNOVAR?

genomic chromosome coordinates cds utr • 3.1k views

ADD COMMENT • link updated 11.1 years ago by Emily 23k • written 11.1 years ago by secretjess ▴ 210

score 1 · Answer 1 · 2013-04-08

1

11.1 years ago

Emily 23k

Hi Jess

The reason you've had a problem with BioMart is that you've specified the strand. The gene affected by the second breakpoint is on the reverse strand, but since you've specified a forward strand region, you haven't got a hit.

You can use the Ensembl VEP (http://www.ensembl.org/tools.html) to analyse the effects of large-scale variation. You need to convert your data into VCF format (http://www.ensembl.org/info/docs/variation/vep/vep_formats.html#vcf), for example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

18 19092052 sv1 . <DEL> . . SVTYPE=DEL;END=30289323 .

This will give you a list of all the genes affected by this CNV. Most will be labelled as transcript_ablations, but those affected by the breakpoints will come out as things like stop_lost, coding_sequence_variant, 3_prime_UTR_variant, intron_variant, feature_truncation etc.
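For a file with many breakpoint pairs, the VCF lines can be generated rather than typed by hand. The following is a minimal Python sketch, not an official Ensembl tool: the function name `breakpoint_to_vcf_sv` and the `sv1`, `sv2`, ... IDs are made up for illustration, every event is assumed to be a deletion (change `sv_type` otherwise), and pairs on different chromosomes are skipped because they cannot be written as a single start/END interval.

```python
def breakpoint_to_vcf_sv(chrom, start, end, sv_id, sv_type="DEL"):
    """Build one tab-separated VCF line for a symbolic structural variant."""
    if end < start:
        # Make sure POS is the lower coordinate and END the higher one.
        start, end = end, start
    fields = [
        str(chrom),                        # CHROM
        str(start),                        # POS
        sv_id,                             # ID
        ".",                               # REF (unknown)
        f"<{sv_type}>",                    # ALT as a symbolic allele
        ".",                               # QUAL
        ".",                               # FILTER
        f"SVTYPE={sv_type};END={end}",     # INFO
        ".",                               # FORMAT
    ]
    return "\t".join(fields)


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"

# (Chr.L, Pos.L, Chr.H, Pos.H) taken from the question's example row.
rows = [("18", 19092052, "18", 30289323)]

lines = [header]
for i, (chr_l, pos_l, chr_h, pos_h) in enumerate(rows, start=1):
    if chr_l != chr_h:
        continue  # assumption: only same-chromosome events are converted
    lines.append(breakpoint_to_vcf_sv(chr_l, pos_l, pos_h, f"sv{i}"))

vcf_text = "\n".join(lines)
print(vcf_text)
```

Real VCF files are tab-separated, which is why the sketch joins fields with tabs rather than the spaces shown in the example above.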

Alternatively, if you just want to know what's at the breakpoint, you can put the two points in as if they were SNPs, inventing alleles for them, e.g.:

18 19092052 19092052 A/G +
18 30289323 30289323 A/G +

This will just give you a table of the genes and transcripts affected.
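The same trick can be scripted for a whole breakpoint file. This is a rough Python sketch under stated assumptions: the helper name `breakpoints_to_vep_lines` is hypothetical, the `A/G` alleles are invented placeholders exactly as in the example above (they carry no biological meaning), and the strand is always written as `+` because only the position matters here.

```python
def breakpoints_to_vep_lines(rows, placeholder_alleles="A/G"):
    """Turn breakpoint pairs into SNP-like lines: chrom start end alleles strand."""
    lines = []
    for chr_l, pos_l, _strand_l, chr_h, pos_h, _strand_h in rows:
        # Each breakpoint pair yields two single-base "variants".
        for chrom, pos in ((chr_l, pos_l), (chr_h, pos_h)):
            lines.append(f"{chrom} {pos} {pos} {placeholder_alleles} +")
    return lines


# One row in the column order of the question's table:
# Chr.L, Pos.L, Strand.L, Chr.H, Pos.H, Strand.H
rows = [("18", 19092052, "+", "18", 30289323, "+")]

vep_lines = breakpoints_to_vep_lines(rows)
print("\n".join(vep_lines))
```

Because the alleles are made up, any consequence that depends on the actual base change (for example a missense call) should be ignored; only the overlapping gene, transcript and region type are informative.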

The VEP is also available as a Perl script, which you can run offline with a cache; this is much quicker than the online tool.

To be honest, if you're only looking at two breakpoints, the easiest way is to look at them in the Ensembl browser.

Emily

ADD COMMENT • link 11.1 years ago by Emily 23k

0

Hi Emily - many thanks for your help! :) I've misled you a bit with my question though, sorry! I have a lot of breakpoint data to analyse; that example was just the first line in one of the files. Also, the forward strand is correct for both breakpoints (it is a large deletion). VEP looks interesting. I've previously been using tabix/vcftools to query 1000 Genomes, but VEP looks like it could be much easier. The SNPs idea is great too.

written 11.1 years ago by secretjess ▴ 210

0

Wasn't sure if you had lots or just one or two. You can analyse the whole lot at once with the VEP. If you've got lots, I recommend the Perl script and downloading a cache:

http://www.ensembl.org/info/docs/variation/vep/vep_script.html

written 11.1 years ago by Emily 23k

0

A breakpoint surely doesn't have a strand, as both strands are broken. If you specify a strand to BioMart it will only look for features on one strand, but if you don't specify a strand, it will look for features on both strands, which is what you want.
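To make the strand point concrete, here is a small offline Python sketch. It does not call BioMart; the gene `hypothetical_gene_rev` and its coordinates are invented purely for illustration, and the strand-free `chr:start:end` form of the region string is an assumption worth checking against the BioMart documentation.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: int  # 1 = forward, -1 = reverse


def overlapping(features, chrom, pos, strand: Optional[int] = None):
    """Names of features covering pos; strand=None means 'either strand'."""
    return [
        f.name
        for f in features
        if f.chrom == chrom
        and f.start <= pos <= f.end
        and (strand is None or f.strand == strand)
    ]


def region_filter(chrom, pos, strand: Optional[int] = None):
    """Build a chromosomal_region-style string, with or without a strand."""
    parts = [str(chrom), str(pos), str(pos)]
    if strand is not None:
        parts.append(str(strand))
    return ":".join(parts)


# Invented reverse-strand gene spanning the second breakpoint.
features = [Feature("hypothetical_gene_rev", "18", 30250000, 30350000, -1)]

with_strand = overlapping(features, "18", 30289323, strand=1)
without_strand = overlapping(features, "18", 30289323)

print(region_filter("18", 30289323, 1), "->", with_strand)
print(region_filter("18", 30289323), "->", without_strand)
```

With the forward strand specified, the reverse-strand gene is filtered out and the result is empty; with no strand, the same position finds it.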

written 11.1 years ago by Emily 23k