Extracting An Up Stream Or Downstream Sequence From Given Position
Entering edit mode
13.7 years ago
Janake ▴ 170

I need to extract an upstream and a downstream sequences flanking putative small RNA regions. Does anyone know a python script to this? Thanks,

small data genomics sequence • 8.6k views
Entering edit mode

Unclear: what is a "small RNA region"? A segment of chromosome encoding putative small RNA transcript?

Entering edit mode

I am trying to find conserved small RNA in using ESTs. I have blast known miRNAs to the EST collection. I have the locations where miRNA match to each EST. Now I need to get upstream and downstream regions from these ESTs and see if they fold like miRNA precursors. Hope this is clear.

Entering edit mode

I am working a similar project. Please message me if you still need help. I'm a student at CWRU - Biology Dept.

Entering edit mode
13.7 years ago
Mary 11k

Is it internal/unpublished sequence or from one of the existing genomes? If it is a sequenced genome, this is possible from either BioMart or UCSC Genome Browser style browser. If it is an internal project, the script jockeys can help you I'm sure.

BioMart (or the specific Mart for your species if it's out there):

  1. Go to BioMart.org
  2. Click Martview
  3. Select the database and data set.
  4. In Filters, set the region. Use this item to enter your list: Multiple Chromosomal Regions (Chr:Start:End:Strand)
  5. In Attributes, choose Sequences. You also need to pick a nucleotide box* here, but I'm not sure which for this application.
  6. Set and upstream and downstream flank.
  7. Click results.

*Although now that I look at Sequences I'm not sure which radio to choose for this purpose. I'm going to have to ask them that. I think I'd choose exon for now.

UCSC: http://genome.ucsc.edu/

  1. Go to the Table Browser
  2. Set appropriate species/assembly stuff.
  3. On Region line, set items in Define Region button.
  4. Choose output format = sequence
  5. click Get Output.
  6. On next page say Genomic
  7. Set upstream/downstream length
  8. Get Sequence button

Or you can do all of either one from Galaxy (http://www.usegalaxy.org), and you can store that as a workflow you can always go back to later if you need it again. There is also a "Get Flanks" in the Galaxy menu "Operate on Genomic Intervals".

Entering edit mode

Thanks Mary. However, this is not a sequenced genome, but I will keep in mind your method for future use. Thanks a lot again.Kiriya

Entering edit mode
13.7 years ago
brentp 24k

here's a "script-jockey" python solution (untested, but should get you close).

if, in your blast the est is the subject, then you can substitute query with subject and q(start|end) with s(start|end)

import sys
from pyfasta import Fasta

est_fasta_file = sys.argv[1]

# i'll assume it's tab-delimited...
# and est is the query.
est_mirna_blast = sys.argv[2]

# take 100bp up/down-stream.
xstream = 100

est_fasta = Fasta(est_fasta_file)

for line in open(est_mirna_blast):
    # convert to int and 0-based coords.
    qstart, qstop, sstart, sstop = [int(x) - 1 for x in line.split("\t")[6:10]]
    query, subject = line.split("\t")[:2]
    up = min(0, qstart - xstream)
    down = qstop + xstream + 1

    est_upstream = est_fasta[query][up:qstart]
    est_dowstream = est_fasta[query][qstop:down]

    print ">%s_up" % query
    print est_upstream
    print ">%s_down" % query
    print est_dowstream
Entering edit mode

I'm running ubuntu and am having trouble running pyfasta... any suggestions? I am unfamiliar with python, so I am not sure how to "make" "install".

Entering edit mode

@Moss just run "sudo python setup.py install" or if you have setuptools: "sudo easy_install pyfasta"

if you have any troubles. send me an email bpederse[?]gmail.com

Entering edit mode

Hello brentp I would like to do same thing & I tried the above script you wrote. it gives me the following error message while I am running it

python seq_ext_coordinate.py 
Traceback (most recent call last):
  File "seq_ext_coordinate.py", line 5, in <module>
    EST_030516_lnr_tab.fa = sys.argv[1]
IndexError: list index out of range

Could you please tell me what I did wrong?


Login before adding your answer.

Traffic: 1114 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6