Question: How to find genomic geographic commonalities between sequences
sebastianzeki0 (United Kingdom) wrote, 4.5 years ago:

OK, so this may be broad, but I have a lot of sequencing data associated with a clinical outcome I'm interested in. I would like to find a way of investigating what these sequences have in common in terms of where they are in the genome. Specifically, I would like to look at things like: distance to centromere, distance to telomere, proximity to GC- and AT-rich regions, association with tertiary DNA structures, and association with fragile sites. Are there tools that can give me a 'your sequence is associated with these genomic areas' type of answer, or do I have to do each separately?

Alex Reynolds (Seattle, WA USA) wrote, 4.5 years ago:

This might be a broad question. Here is one answer to your question about distance to centromeres that uses operations with BEDOPS tools and UCSC-formatted BED files, which might help you think about your set of questions in more general terms.

First, assuming you are working with human data (hg19), generate a file called centromeres.bed that contains genomic ranges for centromeres. The result is piped through sort-bed because BEDOPS tools expect sorted input:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz | gunzip -c | grep acen | sort-bed - > centromeres.bed

Next, say you have a file containing sequencing reads called reads.bam. We convert it to a sorted BED file using convert2bed:

$ convert2bed -i bam < reads.bam > reads.bed

Finally, calculate the signed distance between each read and its nearest centromere with closest-features:

$ closest-features --closest --dist reads.bed centromeres.bed > distances_of_reads_to_closest_centromeres.bed
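
As a sketch of what you might do with that output: closest-features writes each read, its nearest centromere band, and the signed distance as pipe-delimited fields, with NA where a chromosome has no centromere entry. Assuming that layout, the following awk wrapper keeps the read coordinates and the absolute distance. The function name abs_distances, the file name sample_distances.txt, and the sample rows are all made up for illustration:

```shell
# Hypothetical sample rows in the layout: read|nearest centromere band|signed distance.
# The coordinates and distances are illustrative, not real results.
printf 'chr1\t1000\t1050\tr1|chr1\t121500000\t125000000\tp11.1\tacen|121498951\n'  > sample_distances.txt
printf 'chr1\t123000000\t123000050\tr3|chr1\t121500000\t125000000\tp11.1\tacen|0\n' >> sample_distances.txt
printf 'chr1\t125000100\t125000150\tr2|chr1\t121500000\t125000000\tp11.1\tacen|-101\n' >> sample_distances.txt
printf 'chrM\t10\t60\tr4|NA|NA\n' >> sample_distances.txt

# Print chrom, start, end of each read plus its absolute distance to the
# nearest centromere; rows with no nearest element (NA) are skipped.
abs_distances() {
  awk -F'|' '
    $NF == "NA" { next }                 # no centromere found for this read
    {
      d = $NF + 0                        # signed distance is the last field
      if (d < 0) d = -d                  # drop the sign
      split($1, read, "\t")              # first field is the read itself
      printf "%s\t%s\t%s\t%d\n", read[1], read[2], read[3], d
    }
  ' "$1"
}

abs_distances sample_distances.txt
```

On your own data you would call abs_distances on distances_of_reads_to_closest_centromeres.bed and feed the fourth column into whatever statistics you want to compare between outcome groups.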

To answer your other questions, given your reads (now in sorted BED form), you can use UCSC and other data sources to generate BED files that contain telomere regions, GC- and AT-enriched regions of interest, genomic regions that associate with tertiary DNA structure (e.g., ChIP-seq regions or motif binding sites), and regions associated with fragile sites. You could then do set and statistical operations with your reads against these regions using bedops, bedmap, closest-features and other tools in BEDOPS to help answer these and similar questions.
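
For example, telomere regions could be pulled from the UCSC gap table in much the same way as the centromeres. The sketch below assumes the gap table layout is bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge, and it uses a few illustrative sample rows rather than a download. gap_to_bed is a hypothetical helper name, and plain sort stands in for sort-bed so that the snippet runs without BEDOPS installed:

```shell
# Illustrative rows in the assumed UCSC gap table layout (tab-delimited):
# bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge
printf '2486\tchr1\t249240621\t249250621\t1298\tN\t10000\ttelomere\tno\n'  > sample_gap.txt
printf '23\tchr1\t121535434\t124535434\t1270\tN\t3000000\tcentromere\tno\n' >> sample_gap.txt
printf '585\tchr1\t0\t10000\t1\tN\t10000\ttelomere\tno\n' >> sample_gap.txt

# Emit a BED file (chrom, start, end, type) for one gap type,
# sorted by chromosome name and then numerically by start and end.
gap_to_bed() {
  # $1 = gap type to keep (e.g. telomere), $2 = gap table file
  awk -F'\t' -v OFS='\t' -v type="$1" '$8 == type { print $2, $3, $4, $8 }' "$2" \
    | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

gap_to_bed telomere sample_gap.txt > telomeres.bed
cat telomeres.bed

# With BEDOPS installed, the same distance calculation as for centromeres
# would then be (not run here):
#   closest-features --closest --dist reads.bed telomeres.bed > distances_of_reads_to_closest_telomeres.bed
```

The same pattern (filter an annotation table down to BED, sort it, then run closest-features or bedmap against your reads) should carry over to the other region types, provided you can find a suitable annotation source for each.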
