Retrieve Sequences Flanking Ends Of A List Of Genes
12.3 years ago
Abaldwin ▴ 10

I have a list of about 100 genes and want to retrieve the 50 nucleotides immediately downstream of the end of each gene. I do not have the chromosomal locations of these genes readily available. How do I automate finding the position where each gene ends and getting the sequence downstream of it?

I have no experience using bioinformatics tools and would appreciate any pointers.

For example, take the TRBV10-1 gene. I was able to locate it using BLAST and then can extend the sequence region displayed by changing the number in the link below. But I'm sure there's a better way.

http://www.ncbi.nlm.nih.gov/nucleotide/338858162?report=gbwithparts&from=563592&to=564041&RID=H9AFMH6C016

Thanks!

gene • 3.7k views
12.3 years ago
Neilfws 49k

The solution to this kind of problem is almost always to use BioMart. Briefly:

  1. Go to MartView
  2. Choose database Ensembl Genes 65; dataset Homo sapiens genes
  3. Click "Filters" (left menu); expand "GENE"
  4. Check "ID list limit" and choose "HGNC symbol(s)" in drop-down menu
  5. Paste gene symbols or upload file, 1 per line (e.g. TRBV10-1)
  6. Click "Attributes" (left menu); check "Sequences"; expand "SEQUENCES"
  7. Check "Flank (gene)"; check "Downstream flank"; enter "50" as value
  8. Click "Results", menu top-left

Result for TRBV10-1:

>ENSG00000211717|ENST00000390364
CACAGTGCTGCACAGCTGCCTCCTCTCTGCACATAAAGGGCAGTTAGAAT

Repeat, refine and download as required.

Note that there are several options for "flank"; select the one most appropriate for you. Note also that you can search using other identifiers; I assumed from your example that HGNC symbols work best for you.
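The MartView steps above can also be scripted. What follows is only a rough sketch, assuming the BioMart XML query interface: the endpoint URL, the dataset name `hsapiens_gene_ensembl`, the filter names `hgnc_symbol` and `downstream_flank`, and the attribute name `gene_flank` are assumptions based on common BioMart usage and may differ between Ensembl releases, so check them against the XML that MartView generates for your own query.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import quoteattr

# Assumed endpoint; verify it against the current Ensembl BioMart site.
MART_URL = "http://www.ensembl.org/biomart/martservice"


def build_flank_query(symbols, flank=50):
    """Build a BioMart XML query asking for downstream gene flanks in FASTA.

    Dataset, filter and attribute names are assumptions (see text above).
    """
    symbol_list = ",".join(symbols)  # BioMart takes comma-separated ID lists
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="FASTA" '
        'header="0" uniqueRows="1" datasetConfigVersion="0.6">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f"<Filter name=\"hgnc_symbol\" value={quoteattr(symbol_list)}/>"
        f'<Filter name="downstream_flank" value="{int(flank)}"/>'
        '<Attribute name="gene_flank"/>'
        '<Attribute name="hgnc_symbol"/>'
        "</Dataset></Query>"
    )


def fetch_flanks(symbols, flank=50):
    """Send the query to the (assumed) BioMart endpoint and return FASTA text."""
    query = build_flank_query(symbols, flank)
    url = MART_URL + "?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as response:
        return response.read().decode()
```

Calling `fetch_flanks(["TRBV10-1"], flank=50)` should then return FASTA text comparable to the MartView result, provided the assumed names are valid for the release you query. Including `hgnc_symbol` as an attribute is meant to put the gene symbol into each header.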

Thanks! You saved me a lot of work!

If some of the genes in the list fail to be found (for unknown reasons), is it possible to know which ones failed? The results seem to be simply a list of sequences, shorter than the list of queries, with no reference to the corresponding query. I would like to match each result to the query. Thanks again.

There's a "Header" option at step 7 which should allow you to include the HGNC symbol in the FASTA sequence header. Alternatively, choose different attributes (not sequences), which should return a table with blank entries for queries that did not retrieve data.
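Once the symbol is in the header, matching results back to the query list is easy to script. Here is a minimal sketch, assuming the symbol appears as one of the `|`-separated header fields; the header layout shown and the `NOT_A_GENE` symbol are made up for illustration.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA text (headers without '>')."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


def match_queries(queries, fasta_text):
    """Split queries into those found in a header field and those missing."""
    found = {}
    for header, seq in parse_fasta(fasta_text).items():
        fields = header.split("|")
        for symbol in queries:
            if symbol in fields:
                found.setdefault(symbol, []).append(seq)
    missing = [symbol for symbol in queries if symbol not in found]
    return found, missing


# Hypothetical header layout: gene ID | transcript ID | HGNC symbol.
example = (
    ">ENSG00000211717|ENST00000390364|TRBV10-1\n"
    "CACAGTGCTGCACAGCTGCCTCCTCTCTGCACATAAAGGGCAGTTAGAAT\n"
)
found, missing = match_queries(["TRBV10-1", "NOT_A_GENE"], example)
print("found:", sorted(found))
print("missing:", missing)
```

A gene with several transcripts may come back more than once, which is why each symbol maps to a list of sequences.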

12.3 years ago
Treylathe ▴ 950

You can also do this in the UCSC Table Browser or in Galaxy. We have a quick post on how to do this in Galaxy here: http://blog.openhelix.eu/?p=9808F
