Annotation Of Genomic Positions
8.2 years ago
secretjess ▴ 180

I'm working with some human breakpoint data:

Chr.L    Pos.L    Strand.L    Chr.H    Pos.H    Strand.H
18    19092052    +    18    30289323    +


I would like to know where the breakpoints generally occur, e.g. whether they join two exons, introns, UTRs, etc.

I have tried querying Ensembl via Biomart in R using:

attributes = c("transcript_biotype"), filters = c("chromosomal_region")

When I use the first position 18:19092052:19092052:1 it returns some transcripts which are out of range (e.g. 18822203-19035091) but seems to return the correct transcript with transcript start and end values overlapping the input, so I can work with that.

However, for the second position 18:30289323:30289323:1 it does not return anything. Does this mean it is noncoding DNA? Or is this happening because I am querying Ensembl Genes? I can live with that too, but I'd like it confirmed.

Otherwise, is there a better way I could do this? Perhaps using an SNP tool, like ANNOVAR?

Tags: genomic, chromosome, coordinates, cds, utr
8.2 years ago

Hi Jess

The reason you've had a problem with BioMart is that you've specified the strand. The gene affected by the second breakpoint is on the reverse strand, but since you've specified a forward strand region, you haven't got a hit.

You can use the Ensembl VEP (http://www.ensembl.org/tools.html) to analyse the effects of large scale variation. You need to convert this into VCF format (http://www.ensembl.org/info/docs/variation/vep/vep_formats.html#vcf), for example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
18 19092052 sv1 . <DEL> . . SVTYPE=DEL;END=30289323 .

This will give you a list of all the genes affected by this CNV. Most will be labelled as transcript_ablations, but those affected by the breakpoints will come out as things like stop_lost, coding_sequence_variant, 3_prime_UTR_variant, intron_variant, feature_truncation etc.
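If you have many rows, the conversion can be scripted. Here is a minimal Python sketch, assuming both breakpoints of a row sit on the same chromosome and that every event is treated as a deletion. The function name `breakpoints_to_vcf_line`, the `sv1` ID and the `sv_type` default are made up for illustration; change `sv_type` if your events are duplications or something else.

```python
def breakpoints_to_vcf_line(chrom_l, pos_l, chrom_h, pos_h, sv_id, sv_type="DEL"):
    """Build one tab-separated VCF record spanning both breakpoints.

    Assumption: both breakpoints are on the same chromosome, so the
    event can be written as a single interval with an END tag.
    """
    if str(chrom_l) != str(chrom_h):
        raise ValueError("inter-chromosomal events cannot be written as one interval")
    # Make sure the smaller coordinate goes into POS and the larger into END.
    start, end = sorted((int(pos_l), int(pos_h)))
    fields = [
        str(chrom_l),                      # CHROM
        str(start),                        # POS
        sv_id,                             # ID
        ".",                               # REF (unknown)
        f"<{sv_type}>",                    # ALT as a symbolic allele
        ".",                               # QUAL
        ".",                               # FILTER
        f"SVTYPE={sv_type};END={end}",     # INFO
        ".",                               # FORMAT
    ]
    return "\t".join(fields)


# The example row from the question: Chr.L Pos.L Strand.L Chr.H Pos.H Strand.H
row = "18    19092052    +    18    30289323    +".split()
chr_l, pos_l, _strand_l, chr_h, pos_h, _strand_h = row
vcf_line = breakpoints_to_vcf_line(chr_l, pos_l, chr_h, pos_h, "sv1")
print(vcf_line)
```

Loop this over every row of your file, giving each record its own ID, and write the lines under the header line to get a file you can upload.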

Alternatively, if you just want to know what's at the breakpoint, you can put the two points in as if they were SNPs, inventing alleles for them, e.g.:

18 19092052 19092052 A/G +
18 30289323 30289323 A/G +

This will just give you a table of the genes and transcripts affected.
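For a whole file of breakpoints, the fake-SNP lines can be generated rather than typed. This is only a sketch: `breakpoint_to_snp_line` is a made-up helper, and the `A/G` alleles are invented placeholders, exactly as in the example, so ignore any consequence that depends on the actual base change.

```python
def breakpoint_to_snp_line(chrom, pos, alleles="A/G", strand="+"):
    """Format one breakpoint as a single-base 'variant' line.

    The alleles are placeholders; only the position is real data.
    """
    return f"{chrom} {pos} {pos} {alleles} {strand}"


# Rows as in the question: Chr.L Pos.L Strand.L Chr.H Pos.H Strand.H
rows = ["18    19092052    +    18    30289323    +"]

snp_lines = []
for row in rows:
    chr_l, pos_l, _strand_l, chr_h, pos_h, _strand_h = row.split()
    # Each row has two breakpoints, so it yields two lines.
    snp_lines.append(breakpoint_to_snp_line(chr_l, pos_l))
    snp_lines.append(breakpoint_to_snp_line(chr_h, pos_h))

print("\n".join(snp_lines))
```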

The VEP is also available as a Perl script, which you can run offline against a cache; this is much quicker than using it online.

To be honest, if you're only looking at two breakpoints, the easiest way is to look at them in the Ensembl browser.

Emily


Hi Emily - many thanks for your help! :) I've misled you a bit with my question though, sorry! I have a lot of breakpoint data to analyse; that example was just the first line in one of the files. Also, the forward strand is correct for both breakpoints (it is a large deletion). VEP looks interesting. I've previously been using tabix/vcftools to query 1000 Genomes, but this looks like it could be much easier. The SNPs idea is great too.


Wasn't sure if you had lots or just one or two. You can analyse the whole lot at once with the VEP. If you've got lots, I recommend the Perl script and downloading a cache:

http://www.ensembl.org/info/docs/variation/vep/vep_script.html


A breakpoint surely doesn't have a strand, as both strands are broken. If you specify a strand to BioMart it will only look for features on one strand, but if you don't specify a strand, it will look for features on both strands, which is what you want.
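As a rough illustration, the region strings for the chromosomal_region filter could be built like this. It assumes BioMart accepts a three-field chromosome:start:end value when the strand is left off, which is what the behaviour described above implies; `region_filter` is a made-up helper name, so check the result against your own query.

```python
def region_filter(chrom, pos, strand=None):
    """Build a chromosomal_region value for a single position.

    With strand=None the strand field is omitted entirely, so the
    query is not restricted to one strand.
    """
    parts = [str(chrom), str(pos), str(pos)]
    if strand is not None:
        parts.append(str(strand))
    return ":".join(parts)


# Both breakpoints from the question, without a strand restriction.
regions = [region_filter(18, 19092052), region_filter(18, 30289323)]
print(regions)
```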