Question: retrieve 1000 bp upstream sequences
0
spaul8505 (20), United States, wrote 4.9 years ago:

How can I retrieve the 100 bp upstream sequences from a reference genome such as human? For instance, the region upstream of the 5' UTR?

rna-seq • 3.1k views
modified 4.9 years ago by Krisr (460) • written 4.9 years ago by spaul8505 (20)
2

Upstream from what? The TSS of each gene?

written 4.9 years ago by James Ashmore (3.0k)
5
James Ashmore (3.0k), UK/Edinburgh/MRC Centre for Regenerative Medicine, wrote 4.9 years ago:

Here's my approach, although please note there are much easier ways to do this using the programming language R and the associated packages. If you feel like trying that, have a look at this question for a general guide.

Download coordinates of 5' UTR exons

Go to the UCSC website and click on 'Table Browser' in the left-hand column. This will take you to a page where you can download various genome annotations. Enter your genome of interest, what assembly you require, and what track you prefer. Change output format to 'BED - browser extensible data' and then click on the 'get output' button. A new page will load and from here you can choose to create one BED record per feature - in your case click the '5' UTR exons' button. Then click the 'get BED' button to download a file of all the 5' UTR exons.
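
Before moving on, it may be worth checking that the downloaded file really carries strand information, since which side of a feature counts as upstream depends on the strand in column 6. A minimal sketch, assuming a tab-separated BED file with at least six columns (strand_counts is a hypothetical helper name, not a bedtools command):

```shell
# Hypothetical helper: tally features per strand from a BED stream on stdin.
# Assumes tab-separated BED with the strand in column 6.
strand_counts() {
  awk -F'\t' '{ count[$6]++ } END { for (s in count) print s "\t" count[s] }' |
    LC_ALL=C sort
}
```

For example, `strand_counts < exons.bed` prints one line per strand value found in the file, each followed by the number of features.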

Get coordinates of 1000 bp region upstream of 5' UTR exons

You now need the coordinates of the 1000 bp upstream of each exon. I advise using bedtools, a command-line toolkit for working with BED files. It needs a file listing the size of each chromosome in your genome, which you can generate with the fetchChromSizes script from UCSC. Two things to watch in the flank command: recent bedtools versions require -r whenever -l is given, and -s makes -l and -r strand-aware. Without -s, features on the minus strand get the flank on their downstream side.

fetchChromSizes hg38 > hg38.txt
bedtools flank -l 1000 -r 0 -s -i exons.bed -g hg38.txt > upstream.bed
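
To make the strand logic explicit, here is a rough awk sketch of what a strand-aware upstream flank computes. It is not how bedtools is implemented, just an illustration under these assumptions: tab-separated BED6 input with the strand in column 6, and a two-column chromosome sizes file like the one produced by fetchChromSizes. upstream_flank is a hypothetical helper name.

```shell
# Hypothetical helper: strand-aware upstream flanks from a BED6 stream.
# Usage: upstream_flank SIZE CHROM_SIZES_FILE < exons.bed
upstream_flank() {
  awk -v n="$1" 'BEGIN { FS = OFS = "\t" }
    NR == FNR { size[$1] = $2; next }        # first file: chromosome sizes
    !($1 in size) { next }                   # skip chromosomes missing from the sizes file
    {
      if ($6 == "-") { s = $3; e = $3 + n }  # minus strand: upstream lies past the end
      else           { s = $2 - n; e = $2 }  # plus strand: upstream lies before the start
      if (s < 0) s = 0                       # clip to the chromosome bounds
      if (e > size[$1]) e = size[$1]
      if (e > s) { $2 = s; $3 = e; print }   # drop flanks that clip to zero length
    }' "$2" -
}
```

With a flank size of 1000, a plus-strand exon at 2000-2100 yields 1000-2000, while a minus-strand exon at 5000-5100 yields 5100-6100. It could be called as `upstream_flank 1000 hg38.txt < exons.bed > upstream.bed`.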

Get sequences of upstream regions

Now download the reference genome you selected on the UCSC website (I normally get mine from Illumina's iGenomes page). With that done, use bedtools getfasta to retrieve the sequences. Adding -s reverse-complements regions on the minus strand, so every sequence reads 5' to 3' relative to its gene:

bedtools getfasta -s -fi hg38.fasta -bed upstream.bed -fo sequences.fasta
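
For regions on the minus strand, the upstream sequence as read 5' to 3' along the gene is the reverse complement of what the reference FASTA stores. As an illustration only (revcomp is a hypothetical helper, and it handles just A, C, G, T and N), that operation can be sketched as:

```shell
# Hypothetical helper: reverse-complement DNA, one sequence per line on stdin.
# Only A, C, G, T and N (either case) are complemented; other characters pass through.
revcomp() {
  awk '{
    out = ""
    for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)  # reverse the line
    print out
  }' | tr 'ACGTNacgtn' 'TGCANtgcan'                                # complement each base
}
```

For example, `printf 'AACGT\n' | revcomp` prints ACGTT.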

Current versions of bedtools require -r whenever -l is given:

bedtools flank -l 1000 -r 0 -i exons.bed -g hg38.txt > upstream.bed
modified 16 days ago • written 16 days ago by shenwei356 (5.7k)
1
kapil.joshi036 (80), Student, School of Life Sciences, Manipal University, Manipal, India, wrote 4.9 years ago:

Go to UCSC, then Tools -> Table Browser. Select your parameters, change the output format to 'sequence', and click 'get output'. On the next page, add 100 bp upstream and keep everything else as it is.


This solution does not make sense.

written 4.9 years ago by jotan (1.2k)
0
Krisr (460), United States, wrote 4.9 years ago:

If you have many sequences to retrieve, you can use the slice tool in the Ensembl API tools.

http://www.ensembl.org/info/docs/api/core/core_tutorial.html#coordinates
