Question

Retrieve Sequences Flanking Ends Of A List Of Genes

1

13.5 years ago

Abaldwin ▴ 10

I have a list of say ~100 genes and want to retrieve the 50 nucleotides that are just downstream of the ends of these genes. I do not have the chromosomal locations of these genes readily available. How do I automate the process of finding the position where the gene ends and get the sequence downstream of it?

I have no experience using bioinformatics tools and would appreciate any pointers.

For example, take the TRBV10-1 gene. I was able to locate it using BLAST, and I can then extend the displayed sequence region by changing the coordinates in the link below. But I'm sure there's a better way.

http://www.ncbi.nlm.nih.gov/nucleotide/338858162?report=gbwithparts&from=563592&to=564041&RID=H9AFMH6C016

Thanks!

gene • 4.2k views

ADD COMMENT • link updated 13.5 years ago by Treylathe ▴ 950 • written 13.5 years ago by Abaldwin ▴ 10

score 3 · Answer 1 · 2012-01-17

3

13.5 years ago

Neilfws 49k

The solution to this kind of problem is almost always to use BioMart. Briefly:

1. Go to MartView
2. Choose database Ensembl Genes 65; dataset Homo sapiens genes
3. Click "Filters" (left menu); expand "GENE"
4. Check "ID list limit" and choose "HGNC symbol(s)" in drop-down menu
5. Paste gene symbols or upload file, 1 per line (e.g. TRBV10-1)
6. Click "Attributes" (left menu); check "Sequences"; expand "SEQUENCES"
7. Check "Flank (gene)"; check "Downstream flank"; enter "50" as value
8. Click "Results", menu top-left

Result for TRBV10-1:

>ENSG00000211717|ENST00000390364
CACAGTGCTGCACAGCTGCCTCCTCTCTGCACATAAAGGGCAGTTAGAAT

Repeat, refine and download as required.

Note that there are several options for "flank"; select the one most appropriate for you. Note also that you can search using other identifiers; I assumed from your example that HGNC symbols work best for you.
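
If you would rather script this than click through MartView, the same request can be sketched as a BioMart XML query sent over HTTP. This is only a rough sketch under assumptions that are not from the steps above: the service URL, the dataset name `hsapiens_gene_ensembl`, the filter names `hgnc_symbol` and `downstream_flank`, and the attribute names `gene_flank`, `ensembl_gene_id` and `hgnc_symbol` are what I would expect from the Ensembl mart, so check them against the XML that MartView generates for your own query. The helper names `build_flank_query` and `fetch_flanks` are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
from xml.sax.saxutils import quoteattr

# Assumed service endpoint; verify against the current Ensembl BioMart site.
MART_URL = "http://www.ensembl.org/biomart/martservice"


def build_flank_query(symbols, flank=50):
    """Build BioMart XML asking for `flank` nt downstream of each gene.

    Dataset, filter and attribute names are assumptions (see the text above).
    """
    ids = ",".join(symbols)  # BioMart takes an ID list as one comma-separated value
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="FASTA" header="0" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f"<Filter name=\"hgnc_symbol\" value={quoteattr(ids)}/>"
        f'<Filter name="downstream_flank" value="{int(flank)}"/>'
        '<Attribute name="gene_flank"/>'
        '<Attribute name="ensembl_gene_id"/>'
        '<Attribute name="hgnc_symbol"/>'
        "</Dataset></Query>"
    )


def fetch_flanks(symbols, flank=50):
    """Send the query and return the FASTA text (requires network access)."""
    data = urlencode({"query": build_flank_query(symbols, flank)}).encode()
    with urlopen(MART_URL, data=data) as response:
        return response.read().decode()
```

Calling `fetch_flanks(["TRBV10-1"])` should then return FASTA text much like the result shown above, provided the assumed names match the mart you are querying.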

0

Thanks! You saved me a lot of work!

ADD REPLY • link 13.5 years ago by Abaldwin ▴ 10

0

If some of the genes in the list fail to be found (for unknown reasons), is it possible to know which ones failed? The results seem to be simply a list of sequences, shorter than the list of queries, with no reference to the corresponding query. I would like to match each result to the query. Thanks again.

ADD REPLY • link 13.5 years ago by Abaldwin ▴ 10

0

There's an option "Header" at step 7 which should allow you to include the HGNC symbol as part of the fasta sequence header. Another option would be to use different attributes (not sequences) which should return a table with blank entries for queries that did not retrieve data.
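
Once the HGNC symbol is in the FASTA header, finding the failed queries is a small scripting job. Here is a minimal sketch, assuming the header fields are pipe-separated (as in the TRBV10-1 result) and that the symbol appears as one of those fields. The function name `missing_queries`, the gene name `NOTAGENE` and the three-field header in the example are hypothetical.

```python
def missing_queries(queries, fasta_text):
    """Return the query symbols that do not appear in any FASTA header."""
    found = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header fields are assumed pipe-separated, e.g. >ENSG...|ENST...|SYMBOL
            found.update(field.strip() for field in line[1:].split("|"))
    # Keep the original query order so the output is easy to compare by eye
    return [q for q in queries if q not in found]


# Hypothetical example: one query returned a sequence, one did not.
example_fasta = (
    ">ENSG00000211717|ENST00000390364|TRBV10-1\n"
    "CACAGTGCTGCACAGCTGCCTCCTCTCTGCACATAAAGGGCAGTTAGAAT\n"
)
print(missing_queries(["TRBV10-1", "NOTAGENE"], example_fasta))  # ['NOTAGENE']
```

Matching on whole header fields rather than substrings avoids counting a symbol as found just because it is a prefix of another one.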

ADD REPLY • link 13.5 years ago by Neilfws 49k

score 0 · Answer 2 · 2012-01-17

0

13.5 years ago

Treylathe ▴ 950

You can also do this in the UCSC Table Browser and in Galaxy. We have a quick post on how to do this in Galaxy here: http://blog.openhelix.eu/?p=9808F