Question: Retrieve upstream sequences from a gene of interest for many species
1
gravatar for Polychaete
3.5 years ago by
Polychaete10
United States
Polychaete10 wrote:

Let's say I want to find the upstream sequence for ACTA1 for every species of, say, birds that have such data available.  There are 45 whole bird genomes along with annotations available here: http://gigadb.org/dataset/101000.  In total, there are around 50 bird species with genome assemblies available, but they don't seem to be all in one place.  Anyway, How can one get the upstream sequence from some gene, say ACTA1, for all 45 species available on gigadb, or for all the birds available on some other database (e.g. NCBI has 28 species: go to http://www.ncbi.nlm.nih.gov/gene and search (ortholog_gene_58[group]) AND "birds"[porgn:__txid8782])

I prefer answers that use the R programming language, but other approaches are welcome.

R genome • 1.6k views
ADD COMMENTlink modified 3.4 years ago by Chris S.270 • written 3.5 years ago by Polychaete10

Genomes are available here: http://phybirds.genomics.org.cn/genome.jsp

ADD REPLYlink written 3.5 years ago by onuralp180

Thanks for pointing out another place where many (though not all) bird genomes are available!  

ADD REPLYlink written 3.5 years ago by Polychaete10
2
gravatar for Chris S.
3.5 years ago by
Chris S.270
United States
Chris S.270 wrote:

You can use the location data on the Entrez Gene results page to format a E-fetch query in R.  For the first record with NC_006090.3 (39337939..39339803, complement), you can get the gene sequence using

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_006090&strand=2&seq_start=39337939&seq_stop=39339803&rettype=fasta&retmode=text"

#readLines(url) OR with Biostrings package
readDNAStringSet(url)

  A DNAStringSet instance of length 1
    width seq                                                                                                                               names               
[1]  1865 ATGTGTGACGAGGACGAGACCACCGCGCTCGTGTGCGACAACGGCTCCGGCCTGGTGAAGGCT...TGGATCACAAAGCAGGAGTACGATGAAGCCGGCCCATCCATTGTCCACCGTAAATGCTTCTAA gi|358485509:c393...

To get the upstream region on the minus strand,  use end to end + 100 (or however many upstream bases)

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_006090&strand=2&seq_start=39339803&seq_stop=39339903&rettype=fasta&retmode=text"

The second record from chimney swifts does not have location data listed in the results table, but its on the HTML page as NW_009954494 (369738..371606)

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_009954494&strand=1&seq_start=369738&seq_stop=371606&rettype=fasta&retmode=text"

On the plus strand, use start-100 to start

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_009954494&strand=1&seq_start=369638&seq_stop=369738&rettype=fasta&retmode=text"

 

ADD COMMENTlink modified 3.4 years ago • written 3.5 years ago by Chris S.270

 

This is nice.  Are you extracting the location data by manually looking through the HTML page, or are you accessing all that information (record id, start, stop, strand) by scraping the page?

ADD REPLYlink written 3.5 years ago by Polychaete10
1
gravatar for Chris S.
3.4 years ago by
Chris S.270
United States
Chris S.270 wrote:

You can get the Location data from the esummary results (all except tinamou).  Since you have to use the history server to pass results from esearch to esummary, you can use an R package like rentrez or call the E-utilities command line tools  below

doc <- xmlParse(system('esearch -query "(ortholog_gene_58[group]) AND birds[porgn:__txid8782])" -db gene | esummary', intern=TRUE))

 xmlToDataFrame(doc["//LocationHistType"])
  AnnotationRelease  AssemblyAccVer      ChrAccVer ChrStart  ChrStop
1                102 GCF_000002315.3    NC_006090.3 39339802 39337938
2                100 GCF_000696875.1 NW_009926046.1    48075    61716
3                100 GCF_000747805.1 NW_009954494.1   369737   371605
4                100 GCF_000692075.1 NW_009902071.1   425342   423455

 

ADD COMMENTlink written 3.4 years ago by Chris S.270
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1160 users visited in the last hour