Question: Extract Sequence From The Genome?
4
Sam90 wrote:

Hello,

I have a report generated by some analysis tools that end up giving me chromosome start and end locations. Is there any tool out there that can quickly take the start/end locations and provide me with the sequence from the human genome?

Thanks

sequence retrieval • 14k views
ADD COMMENTlink modified 11 months ago by klues0090 • written 8.1 years ago by Sam90

duplicate of http://biostar.stackexchange.com/questions/56

ADD REPLYlink written 8.1 years ago by Pierre Lindenbaum115k

That question is now titled "How To Get The Sequence Of A Genomic Region From Ucsc?".

ADD REPLYlink written 4 weeks ago by hsiaoyi050440
6
Pascal130 wrote:

Hi,

If you need sequences from many positions, I would recommend setting up Biopieces. It requires downloading and indexing an entire genome, but after that you can extract many sequences very quickly.

http://code.google.com/p/biopieces/
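
For illustration only, here is a minimal Python sketch of why indexing a downloaded genome makes repeated lookups fast. It is not how Biopieces is implemented. The function names build_index and fetch are made up for this example, and it assumes a plain FASTA file in which every sequence line of a record has the same width (except the last one) and coordinates that are 1-based and inclusive.

```python
def build_index(fasta_path):
    """Scan a FASTA file once and record where each sequence's bases start.

    Returns {name: (length, offset, line_bases, line_width)}. Assumes every
    line of a record except the last one has the same number of bases.
    """
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The first base sits right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                entry = index[name]
                if entry[2] == 0:
                    entry[2] = bases      # bases per line
                    entry[3] = len(line)  # bytes per line, newline included
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) of sequence `name`."""
    length, offset, line_bases, line_width = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError("coordinates fall outside the sequence")
    first, last = start - 1, end - 1  # switch to 0-based positions
    # Byte position = offset + (full lines before the base) * line_width + column.
    begin = offset + (first // line_bases) * line_width + first % line_bases
    stop = offset + (last // line_bases) * line_width + last % line_bases + 1
    with open(fasta_path, "rb") as handle:
        handle.seek(begin)
        chunk = handle.read(stop - begin)
    # Remove the line breaks that were read along with the bases.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

The file is scanned once to build the index. After that, each lookup is a single seek and read, which is why this kind of setup pays off when you have many positions to extract.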

ADD COMMENTlink written 8.1 years ago by Pascal130

Thanks for bringing this up, I was not aware of it. It reminded me of the old SEALS package.

ADD REPLYlink written 8.1 years ago by Alastair Kerr5.2k

www.biopieces.org

ADD REPLYlink written 7.6 years ago by Martin A Hansen3.0k
6
The University of Edinburgh, UK
Alastair Kerr5.2k wrote:

The Extract Genomic DNA tool under the 'fetch sequences' menu in Galaxy will do this. Remember to set the correct human assembly build when you upload your data, and it will work automatically.

Galaxy is a great tool for working on coordinate based data and well worth learning.
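
If your report needs reshaping before upload, a small Python sketch like the one below could write the regions as BED lines. It rests on assumptions that are not stated in this thread: that the report's coordinates are 1-based and inclusive, that you upload in BED format (0-based start, exclusive end), and that the chosen build uses UCSC-style names such as chr1. The function name to_bed_lines is made up for this example.

```python
def to_bed_lines(regions):
    """Convert (chromosome, start, end) tuples to tab-separated BED lines.

    Assumes 1-based inclusive input coordinates. BED uses a 0-based start
    and an exclusive end, so only the start is shifted by one.
    """
    lines = []
    for chrom, start, end in regions:
        if start < 1 or end < start:
            raise ValueError(f"bad region: {chrom}:{start}-{end}")
        chrom = str(chrom)
        # Assumption: the target build uses UCSC-style names such as "chr1".
        name = chrom if chrom.startswith("chr") else "chr" + chrom
        lines.append(f"{name}\t{start - 1}\t{end}")
    return lines


if __name__ == "__main__":
    for line in to_bed_lines([("1", 100000, 110000), ("chrY", 1, 90)]):
        print(line)
```

If your analysis tool already reports 0-based starts, drop the subtraction, otherwise every extracted sequence will be shifted by one base.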

ADD COMMENTlink written 8.1 years ago by Alastair Kerr5.2k
1

Yeah, if you are new to Galaxy, you can see our introductory tutorial on that (http://www.openhelix.com/galaxy). There are also some exercises that could get you started on using it.

ADD REPLYlink written 8.1 years ago by Mary11k

damn someone beat me to recommending Galaxy ;)

ADD REPLYlink written 8.1 years ago by Will4.5k
3
Danville, PA
Rm7.8k wrote:

You can use the Entrez Programming Utilities (E-utilities).

For example, to retrieve "Homo sapiens chromosome Y" from nucleotide 1 to 90 on the reverse strand:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AC_000156&rettype=fasta&seq_start=1&seq_stop=90&strand=2

The URL is made up of the following parts:

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
Database: db=nucleotide
Sequence or chromosome ID: id=AC_000156
Format: rettype=fasta
Sequence start: seq_start=1
Sequence end: seq_stop=90
Strand, forward (1) or reverse (2): strand=2

The result is returned in FASTA format:

>gnl|ASM:GCF_000000025|Y:c90-1 Homo sapiens chromosome Y, alternate assembly HuRef, whole genome shotgun sequence
CACCTGTAATCCCAGCACTTTGGGACACCGAGGTGGACAGATCACCTGAGGTCAGGAGTTCGAGACCAGC
CTGGCCAACTTGGTGAAACC

EFetch: Retrieves records in the requested format from a list of one or more unique identifiers. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
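
To run this from a script, something like the following Python sketch could build the same EFetch request and read the FASTA reply. The query parameters are exactly the ones listed above. The function names (efetch_url, parse_fasta, fetch_region) are made up for this example, and using https instead of http is an assumption on my part.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: same endpoint as in the answer, but requested over https.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, start, stop, strand=1):
    """Build an EFetch URL for a region; strand is 1 (forward) or 2 (reverse)."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "fasta",
        "seq_start": start,
        "seq_stop": stop,
        "strand": strand,
    }
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)


def fetch_region(accession, start, stop, strand=1):
    """Download one region and return (header, sequence). Needs network access."""
    with urlopen(efetch_url(accession, start, stop, strand)) as response:
        return parse_fasta(response.read().decode())


if __name__ == "__main__":
    print(efetch_url("AC_000156", 1, 90, strand=2))
```

Calling fetch_region("AC_000156", 1, 90, strand=2) would send the request from the example. If you loop over many regions, add a pause between calls so you do not hammer the server.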

ADD COMMENTlink written 8.1 years ago by Rm7.8k
2
San Francisco, California
Joachim2.8k wrote:

If you would like to retrieve the sequences automatically via a script or program, you can also use Ensembl's DAS server. Note that coordinates differ between assemblies, and Ensembl currently uses GRCh37. However, you can access Ensembl's archives to query older versions of the genome.

Anyway, you can retrieve sequences by fetching a URL like:

http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,110000

This will give you the sequence from base pairs 100000 to 110000 on chromosome 1. The abbreviated output is formatted as follows:

<DASSEQUENCE> 
<SEQUENCE id="1" start="100000" stop="110000" version="1.0"> 
cactaagcacacagagaataatgtctagaatctgagtgccatgttatcaaattgtactga
gactcttgcagtcacacaggctgacatgtaagcatcgccatgcctagtacagactctccc
...
</SEQUENCE> 
</DASSEQUENCE>
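
A rough Python sketch for turning a response like the one above into a plain sequence string is given below. The URL pattern is copied from this answer and is not checked against the live server here. The function names das_url and parse_das_sequence are made up for this example, and the parser assumes the reply has the DASSEQUENCE/SEQUENCE layout shown above.

```python
import xml.etree.ElementTree as ET


def das_url(chromosome, start, stop, source="Homo_sapiens.GRCh37.reference"):
    """Build a DAS sequence URL following the pattern used in the answer."""
    return (
        f"http://www.ensembl.org/das/{source}/sequence"
        f"?segment={chromosome}:{start},{stop}"
    )


def parse_das_sequence(xml_text):
    """Extract id, start, stop and the bases from a DASSEQUENCE document."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == "SEQUENCE" else root.find("SEQUENCE")
    if node is None:
        raise ValueError("no SEQUENCE element found")
    # Collect all text inside SEQUENCE and drop line breaks and spaces.
    bases = "".join("".join(node.itertext()).split())
    return {
        "id": node.get("id"),
        "start": int(node.get("start")),
        "stop": int(node.get("stop")),
        "sequence": bases,
    }


if __name__ == "__main__":
    print(das_url("1", 100000, 110000))
```

The text of the reply can be downloaded with any HTTP client and then passed to parse_das_sequence.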
ADD COMMENTlink written 8.1 years ago by Joachim2.8k

Thanks for all the feedback! Galaxy seems to be just what I've been looking for

ADD REPLYlink written 8.1 years ago by Sam90
1
Scott10 wrote:

The next update of NCBI2R (http://ncbi2r.wordpress.com) will have that feature as an R function called GetSequence. However, that update won't be released until next week. It works by downloading the sequence for an accession number, and it can also handle chromosome and position based queries against the current build of the genome.

Disclaimer: it's my package. Caveat: that version isn't released just yet. I'm hoping to release it next week, along with some other new functions, in an upgrade of the NCBI2R package.

ADD COMMENTlink written 8.1 years ago by Scott10
0
klues0090 wrote:

Alternatively, I have run into issues while doing this in R with the package biomaRt, so here's a workaround function for Ensembl:

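# Fetch a genomic region as FASTA text from the Ensembl export page and
# return it as a Biostrings::DNAString. Vectorized over all arguments.
getSeq_ensembl =
  Vectorize(
    function(chromosome, start, end, strand, species = "Homo_sapiens"){
      # Build the export URL for the requested region and strand.
      url = paste0("https://useast.ensembl.org/", species, "/Export/Output/Location?db=core;flank3_display=0;flank5_display=0;output=fasta;r=",
                   chromosome, ":", start, "-", end, ";strand=", strand,
                   ";utr5=yes;cdna=yes;intron=yes;utr3=yes;peptide=yes;coding=yes;genomic=unmasked;exon=yes;_format=Text")
      # read.csv() treats the FASTA header as the column name, so [1,1] is the
      # first line after the header. Note: if the server wraps the sequence
      # over several lines, only the first of those lines is kept.
      Biostrings::DNAString(read.csv(url)[1,1])
    },
    vectorize.args = c("chromosome", "start", "end", "strand", "species")
  )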
ADD COMMENTlink written 11 months ago by klues0090