Retrieve 3'UTR Sequences for a (Complete) List of Genes Using BioMart/UCSC
12.0 years ago
Agatha ▴ 350

I am trying to retrieve the 3'UTRs for the list of gene ids from this dataset:

http://cbio.mskcc.org/saturation/NBT-RA20689B/NBT-RA20689B_SuppTable1.xls

I have managed to get the sequences for about 1,300 genes (out of ~20,000) with BioMart, using "DBASS5 Gene Name" as the filter.

I have also used the Table Browser from UCSC. Some of the IDs (~3,000, I am not sure which) are not compatible with the RefSeq gene IDs in their repository. The IDs returned are in the following format:

hg19_refGene_NM_032291 range=chr1:67208779-67210768 5'pad=0 3'pad=0 strand=+ repeatMasking=none

Since I do not know which IDs are invalid, I cannot map the results back to my gene set.

How can I get the complete list of 3'UTRs for this gene list?
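A minimal sketch of how headers in the format shown above could be mapped back to a gene list. It assumes that an accession-to-symbol table is available (for example the `name` and `name2` columns of UCSC's refGene table); the function names and the `GENE_A`/`GENE_B` symbols are hypothetical placeholders, not part of any library.

```python
import re

# Matches UCSC Table Browser FASTA headers such as
#   hg19_refGene_NM_032291 range=chr1:67208779-67210768 ... strand=+ ...
# The leading ">" and the underscores around "refGene" are optional,
# because they are sometimes lost when headers are pasted into text.
HEADER_RE = re.compile(
    r"^>?\S*?refGene_?(?P<acc>[A-Z]{2}_\d+)\s+"
    r"range=(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)"
    r".*?strand=(?P<strand>[+-])"
)


def parse_ucsc_header(line):
    """Return accession and coordinates from one header, or None if it does not match."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return {
        "accession": m.group("acc"),
        "chrom": m.group("chrom"),
        "start": int(m.group("start")),
        "end": int(m.group("end")),
        "strand": m.group("strand"),
    }


def symbols_missing_from_ucsc(headers, accession_to_symbol, wanted_symbols):
    """List the wanted gene symbols for which no header could be mapped back."""
    found = set()
    for header in headers:
        rec = parse_ucsc_header(header)
        if rec is not None and rec["accession"] in accession_to_symbol:
            found.add(accession_to_symbol[rec["accession"]])
    return sorted(set(wanted_symbols) - found)


if __name__ == "__main__":
    header = ("hg19_refGene_NM_032291 range=chr1:67208779-67210768 "
              "5'pad=0 3'pad=0 strand=+ repeatMasking=none")
    # Hypothetical mapping; in practice it would be loaded from a refGene export.
    mapping = {"NM_032291": "GENE_A"}
    print(parse_ucsc_header(header))
    print(symbols_missing_from_ucsc([header], mapping, ["GENE_A", "GENE_B"]))
```

The symbols reported as missing are the ones that would still need to be looked up through another identifier type.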

biomart ucsc ensembl • 14k views

A little confused as to how you retrieved UTRs for 1300 genes using only one gene name? Perhaps describe more exactly what you did in BioMart?


I uploaded a file containing the gene names. The filter is the identifier type chosen from the drop-down list under Filters / "ID list limit". Unfortunately, the data contains some gene IDs that are not recognized by that filter. I also uploaded them to the Genome Browser, and apparently around 3,000 of them are not recognized as RefSeq gene IDs.


Are you interested in pulling out the sequences of the actual 3'UTRs, or would it suffice to retrieve a fixed length of sequence after every stop codon in the coding region? If it's the latter, I can suggest a way to do it in Galaxy.


I need the sequences of the actual UTRs, for motif finding.


I am interested in retrieving the 3'UTR regions of all reported buffalo genes. How can I do this?

12.0 years ago
Joachim ★ 2.9k

Hi!

If I understand correctly, the supplemental data comes from the Nature Biotechnology article "Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs", in which the authors carried out experiments on human cell lines, so I assume that the gene names are HGNC symbols (I cannot access the full text right now). Below I describe what I did to get 18,457 3'UTRs from Ensembl's BioMart installation:

  1. Go to http://www.ensembl.org/biomart/martview
  2. Choose "Ensembl Genes 66" as database
  3. Choose "Homo sapiens genes" as dataset
  4. Click on "Attributes", select "Sequences"
  5. In the tabs below select "3'UTR", "Ensembl Gene ID" and "Associated Gene Name"
  6. Click on "Filters", in the "Gene" tab set the "ID list limit" filter to "HGNC symbol(s)"
  7. Upload the file with the gene names (as you did beforehand)
  8. Click on "Results"
  9. Select "Export all results to" "File", "FASTA" and "Unique results only", press "Go"
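The same selection can in principle be expressed as a BioMart XML query instead of clicking through the web interface. The sketch below only builds the query text. The dataset, filter and attribute names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `external_gene_name`, `3utr`) are assumptions based on Ensembl's usual naming and may differ between releases, so check them against the XML that the BioMart interface generates for your own query. `build_biomart_query` is a hypothetical helper, and the gene symbols are only illustrative input.

```python
from xml.sax.saxutils import quoteattr


def build_biomart_query(symbols, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query asking for 3'UTR sequences in FASTA format.

    All dataset, filter and attribute names are assumptions (see text above).
    """
    # Drop empty entries and surrounding whitespace, then join with commas,
    # which is how BioMart expects a list of filter values.
    value = ",".join(s.strip() for s in symbols if s.strip())
    attributes = ["ensembl_gene_id", "external_gene_name", "3utr"]
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE Query>",
        '<Query virtualSchemaName="default" formatter="FASTA" header="0" '
        'uniqueRows="1" datasetConfigVersion="0.6">',
        f'  <Dataset name={quoteattr(dataset)} interface="default">',
        f'    <Filter name="hgnc_symbol" value={quoteattr(value)}/>',
    ]
    lines += [f"    <Attribute name={quoteattr(a)}/>" for a in attributes]
    lines += ["  </Dataset>", "</Query>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_biomart_query(["TP53", "BRCA1"]))
```

For a list of ~20,000 symbols it is probably safer to split the input into smaller batches and submit one query per batch, rather than sending everything at once.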

    The downloaded file was named "mart_export.txt". I counted the returned entries as follows:

    grep \> mart_export.txt | sort | uniq | wc -l

    The number of associated gene names in the FASTA file came down to 17,375, which I counted via:

    grep \> mart_export.txt | cut -d '|' -f 2 | sort | uniq | wc -l

    Unfortunately, the sequence download in BioMart 0.7 is not very reliable, so I suggest downloading the same information several times until you have a couple of files of the same size. The file I got was 55,288,206 bytes in size and contained 18,457 entries, whereas the gene name list in the Excel file contains 20,401 gene names.

    Hope this helps.
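To see which gene names from the input list came back without a 3'UTR, the two counts above can be reproduced in a short script and compared with the original list. This is only a sketch: it assumes the header layout implied by the `cut` command above (gene name in the second `|`-separated field) and that BioMart writes the text "Sequence unavailable" where it has no sequence. The function names and the `GENE_*` and `ENSG_EXAMPLE_*` values are hypothetical placeholders.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def summarize(fasta_lines, wanted_symbols):
    """Count unique headers and gene names, and list the symbols without a usable 3'UTR."""
    headers, names, names_with_seq = set(), set(), set()
    for header, seq in read_fasta(fasta_lines):
        headers.add(header)
        fields = header.split("|")
        if len(fields) < 2:
            continue  # header does not have the assumed "ID|gene name" layout
        names.add(fields[1])
        # Assumption: BioMart marks missing sequences with this exact text.
        if seq and seq != "Sequence unavailable":
            names_with_seq.add(fields[1])
    return {
        "unique_headers": len(headers),
        "unique_gene_names": len(names),
        "missing": sorted(set(wanted_symbols) - names_with_seq),
    }


if __name__ == "__main__":
    example = [
        ">ENSG_EXAMPLE_1|GENE_A", "ACGT", "TTAA",
        ">ENSG_EXAMPLE_2|GENE_B", "Sequence unavailable",
    ]
    print(summarize(example, ["GENE_A", "GENE_B", "GENE_C"]))
```

With a real download, the lines would come from `open("mart_export.txt")` and the wanted symbols from the gene name column of the supplementary table. Note that the comparison is case-sensitive.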

Joachim


Hi Joachim,

Thank you for your detailed post. I could see that BioMart 0.7 is not very reliable: I tried different types of filters and did get sequences every time, but anywhere from 20,000 to 60,000 of them. This time the number looks about right :-) So, thank you.
