Question: Retrieve 3'UTR Sequences For A (Complete) List Of Genes Using BioMart/UCSC
Agatha340 wrote:

I am trying to retrieve the 3'UTRs for the list of gene IDs from this dataset:

http://cbio.mskcc.org/saturation/NBT-RA20689B/NBT-RA20689B_SuppTable1.xls

I have managed to get the sequences for about 1,300 of the ~20,000 genes with BioMart, using "DBASS5 Gene Name" as the filter.

I have also used the Table Browser from UCSC. Some of the IDs (~3,000, I am not sure which) are not compatible with the RefSeq gene IDs in their repository. The IDs returned are in the following format:

hg19_refGene_NM_032291 range=chr1:67208779-67210768 5'pad=0 3'pad=0 strand=+ repeatMasking=none

Since I do not know which IDs are not valid, I cannot map the returned sequences back to my gene set.
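One way to find out which IDs were dropped is to pull the accession out of each returned header and compare it with the submitted list. The following is a minimal sketch, assuming the submitted IDs are RefSeq transcript accessions (`NM_...` or `NR_...`) and that each header line in the downloaded FASTA file starts with `>`. The function names and the `NM_000000` placeholder are hypothetical and only serve the illustration.

```python
import re

# RefSeq transcript accession such as NM_032291; a version suffix
# (".3") is not part of the captured group.
ACCESSION = re.compile(r"(N[MR]_\d+)")


def accessions_from_fasta(lines):
    """Collect the RefSeq accessions found in FASTA header lines."""
    found = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue
        # Only the first token carries the ID; the rest is range/strand info.
        match = ACCESSION.search(tokens[0])
        if match:
            found.add(match.group(1))
    return found


def unmatched_ids(submitted, fasta_lines):
    """Return the submitted accessions that have no header in the FASTA output."""
    returned = accessions_from_fasta(fasta_lines)
    # Strip whitespace and any version suffix before comparing.
    wanted = {s.strip().split(".")[0] for s in submitted if s.strip()}
    return sorted(wanted - returned)


example = [
    ">hg19_refGene_NM_032291 range=chr1:67208779-67210768 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    "ACGT",
]
# NM_000000 is a made-up accession standing in for an unrecognized ID.
print(unmatched_ids(["NM_032291", "NM_000000"], example))
```

For a real run, the header lines would come from the Table Browser download and the submitted list from the ID column of the spreadsheet.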

How can I get the complete set of 3'UTRs for this gene list?

ensembl biomart ucsc • 9.2k views
modified 7.0 years ago by Joachim2.8k • written 7.0 years ago by Agatha340

I am a little confused as to how you retrieved UTRs for 1,300 genes using only one gene name. Perhaps describe more exactly what you did in BioMart?

written 7.0 years ago by Neilfws48k

I uploaded a file containing the gene names. That is the filter type I selected for the genes in the drop-down list under Filters / ID list limit. Unfortunately, the data contains some gene IDs that are not recognized by that filter. I have also uploaded them in the Genome Browser, and apparently around 3,000 of them are not recognized as RefSeq gene IDs.

modified 7.0 years ago • written 7.0 years ago by Agatha340

Are you interested in pulling out the sequences of the actual 3'UTRs, or would it suffice to retrieve a fixed length of sequence after every stop codon in the coding region? If it's the latter, I can suggest a way to do it in Galaxy.

written 7.0 years ago by Jason880

I need the sequences of the actual UTRs for motif finding.

written 7.0 years ago by Agatha340

I am interested in retrieving the 3'UTR regions of all the reported buffalo genes. How can I do this?

written 4.1 years ago by harpreetmanku040
San Francisco, California
Joachim2.8k wrote:

Hi!

If I understand correctly, the supplemental data comes from the Nature Biotechnology article "Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs", in which the authors carried out experiments on human cell lines, so I assume that the gene names are HGNC symbols (I cannot access the full text right now). Below I describe what I did to get 18,457 3'UTRs from Ensembl's BioMart installation:

  1. Go to http://www.ensembl.org/biomart/martview
  2. Choose "Ensembl Genes 66" as database
  3. Choose "Homo sapiens genes" as dataset
  4. Click on "Attributes", select "Sequences"
  5. In the tabs below select "3'UTR", "Ensembl Gene ID" and "Associated Gene Name"
  6. Click on "Filters", in the "Gene" tab set the "ID list limit" filter to "HGNC symbol(s)"
  7. Upload the file with the gene names (as you did beforehand)
  8. Click on "Results"
  9. Select "Export all results to" "File", "FASTA" and "Unique results only", press "Go"
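The steps above can in principle also be expressed as an XML query sent to BioMart's `martservice` endpoint instead of clicking through the web interface. Below is a minimal sketch that only builds the query and the request URL. The dataset, filter and attribute names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `external_gene_name`, `3utr`) are assumptions based on current Ensembl releases, not taken from this post, and may differ in Ensembl Genes 66. The helper names are hypothetical.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed endpoint of the Ensembl BioMart web service.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(symbols):
    """Return BioMart query XML requesting 3'UTRs for the given HGNC symbols."""
    # quoteattr escapes the value and wraps it in quotes for use as an attribute.
    value = quoteattr(",".join(symbols))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="FASTA" header="0" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f'<Filter name="hgnc_symbol" value={value}/>'
        '<Attribute name="ensembl_gene_id"/>'
        '<Attribute name="external_gene_name"/>'
        '<Attribute name="3utr"/>'
        "</Dataset>"
        "</Query>"
    )


def build_url(symbols):
    """Return a GET URL carrying the URL-encoded query."""
    return MARTSERVICE + "?" + urlencode({"query": build_query(symbols)})


print(build_url(["TP53", "BRCA1"]))
```

The sketch deliberately stops before any network call. For a list of about 20,000 names a single GET URL would be far too long, so splitting the list into smaller batches is probably necessary.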

    The downloaded file was named "mart_export.txt", where I counted the returned results as follows:

    grep \> mart_export.txt | sort | uniq | wc -l

    The number of associated gene names in the FASTA file came down to 17,375, which I counted via:

    grep \> mart_export.txt | cut -d '|' -f 2 | sort | uniq | wc -l

    Unfortunately, the sequence download in BioMart 0.7 is not very reliable, so I suggest downloading the same information several times until you have a couple of files of the same size. The file I got was 55,288,206 bytes in size and contained 18,457 entries, whereas the gene name list from the Excel file contains 20,401 gene names.
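To see which of the requested gene names are missing from the download, the two counts above can be reproduced and compared against the input list. This is a minimal sketch, assuming each FASTA header has the form `>GENE_ID|GENE_NAME` (the layout implied by the `cut -d '|' -f 2` command). The function names and the `FAKE1` symbol are hypothetical.

```python
def fasta_headers(lines):
    """Return the distinct header lines, without the leading '>'."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def gene_names(headers):
    """Return the second '|'-separated field of each header.

    Unlike `cut`, headers without a '|' are skipped rather than kept whole.
    """
    return {h.split("|")[1] for h in headers if "|" in h}


def missing_names(requested, headers):
    """Return the requested gene names that never occur in the headers."""
    wanted = {name.strip() for name in requested if name.strip()}
    return sorted(wanted - gene_names(headers))


example = [
    ">ENSG00000141510|TP53", "ACGT",
    ">ENSG00000141510|TP53", "GGCC",   # same header, different UTR sequence
    ">ENSG00000012048|BRCA1", "TTAA",
]
headers = fasta_headers(example)
print(len(headers), len(gene_names(headers)))
# FAKE1 is a made-up symbol standing in for a name BioMart did not return.
print(missing_names(["TP53", "BRCA1", "FAKE1"], headers))
```

On real data, `example` would be replaced by the lines of `mart_export.txt` and the requested names by the gene name column of the spreadsheet. Running the same check on each repeated download also shows whether the files agree in content and not only in size.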

    Hope this helps.

Joachim

written 7.0 years ago by Joachim2.8k

Hi Joachim,

Thank you for your detailed post. I could see that BioMart 0.7 is not very reliable: I tried different types of filters and got sequences every time, but anywhere from 20,000 to 60,000 of them. This time the number looks about right :-) so, thank you.

written 7.0 years ago by Agatha340