Filter out all known genes & regulatory elements for a given genome in a local blast search
5.7 years ago
Proteus00 • 0

I am developing a script that will count the number of times a short nucleotide sequence hits non-coding regions of the human genome. Based on Google searches, BLAST+ appears to be the tool to use. They have a few cookbook recipes about masking a database with a FASTA file, which I want to leverage.

I want to know if there is a way to pull all known transcripts for the human genome, put a 50-100 bp buffer on the 5' and 3' ends (to avoid potential regulatory elements), and write those sequences to a file. I did not see anything on NCBI suggesting BLAST could do this task.

Does anyone have a suggestion on how to accomplish this task?

Thanks in advance.

genome sequence • 1.0k views

You can download all cDNA sequences from Ensembl. I'm not sure what you mean by the buffer sequences, though.

ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/cdna/README

##################
Fasta cDNA dumps
#################

These files hold the cDNA sequences corresponding to Ensembl gene 
predictions. cDNA consists of transcript sequences for actual and possible
genes, including pseudogenes, NMD and the like. See the file names 
explanation below for different subsets of both known and predicted 
transcripts.

Sej already pointed out that you can download cDNA sequences directly. Still, I do not see any biological basis for this "buffer". Do you mean untranslated regions, or gene promoters? Please leave a comment with some more details.


Yes, extending the sequence beyond the stated gene is desirable to subsume any regulatory elements in the UTR for my mask file. I am trying to create a local BLAST database that represents benign DNA, where any alterations would be presumed silent. My strategy is to go through known genes and add additional bp to both ends to also block regulatory elements that may be nearby. Also, cDNA is undesirable since I would like to avoid all introns as well.
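
The padding strategy described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tested pipeline: it assumes gene coordinates have already been exported as BED-style `(chrom, start, end)` tuples (0-based, half-open), and the names `pad_and_merge`, `flank`, and `chrom_sizes` are hypothetical. Working from whole-gene coordinates rather than cDNA keeps the introns inside the masked intervals.

```python
from collections import defaultdict


def pad_and_merge(genes, flank=100, chrom_sizes=None):
    """Extend each gene interval by `flank` bp on both sides and merge overlaps.

    genes: iterable of (chrom, start, end), 0-based half-open (BED-style).
    chrom_sizes: optional {chrom: length} used to clamp the right edge
                 (hypothetical input, e.g. read from a .fai index).
    Returns {chrom: [(start, end), ...]} sorted and non-overlapping.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        lo = max(0, start - flank)  # never run off the left end
        hi = end + flank
        if chrom_sizes and chrom in chrom_sizes:
            hi = min(hi, chrom_sizes[chrom])  # never run off the right end
        by_chrom[chrom].append((lo, hi))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for lo, hi in intervals[1:]:
            last_lo, last_hi = out[-1]
            if lo <= last_hi:  # overlapping or touching: extend the last interval
                out[-1] = (last_lo, max(last_hi, hi))
            else:
                out.append((lo, hi))
        merged[chrom] = out
    return merged


# Toy coordinates, invented purely for illustration.
example_genes = [
    ("chr1", 150, 300),
    ("chr1", 350, 500),
    ("chr1", 2000, 2500),
    ("chr2", 900, 990),
]
example_mask = pad_and_merge(
    example_genes, flank=100, chrom_sizes={"chr1": 10000, "chr2": 1000}
)
```

The two nearby chr1 genes collapse into a single masked interval once padded, which is why merging matters before the intervals are turned into a mask. If bedtools is available, `bedtools slop` followed by `bedtools merge` should give an equivalent result.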

5.7 years ago
GenoMax 141k

I want to know if there is a way to pull all known transcripts for the human genome and put a 50-100bp buffer on the 5' and 3' ends (to avoid potential regulatory elements) and write those sequences to a file.

You can do that using BioMart (click on the BioMart link at the top of the page). A video tutorial is available here.

There are other ways of getting this information, including the UCSC Table Browser.
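
Once gene coordinates from either source have been padded into a set of masked intervals, the counting step from the question could look roughly like the sketch below. It assumes BLAST+ was run with tabular output (`-outfmt 6`, where column 2 is the subject ID and columns 9 and 10 are the subject start and end), that subject IDs match the chromosome names used in the mask, and that the mask is a dictionary of sorted, non-overlapping, 0-based half-open intervals. The function name `count_unmasked_hits` is hypothetical.

```python
from bisect import bisect_right


def count_unmasked_hits(blast_lines, mask):
    """Count tabular BLAST hits that do not overlap any masked interval.

    blast_lines: iterable of `-outfmt 6` lines (tab-separated, 12 columns).
    mask: {chrom: [(start, end), ...]} sorted, non-overlapping, 0-based half-open.
    """
    # Precompute interval start positions per chromosome for binary search.
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in mask.items()}
    count = 0
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # BLAST coordinates are 1-based inclusive; sstart > send on minus-strand hits.
        lo = min(sstart, send) - 1
        hi = max(sstart, send)
        intervals = mask.get(chrom, [])
        # Index of the last masked interval that starts before the hit ends.
        i = bisect_right(starts.get(chrom, []), hi - 1) - 1
        if i >= 0 and intervals[i][1] > lo:
            continue  # hit overlaps a masked (genic or flanking) region
        count += 1
    return count


# Toy mask and hits, invented purely for illustration.
toy_mask = {"chr1": [(50, 600), (1900, 2600)]}
toy_hits = [
    "q1\tchr1\t100.000\t20\t0\t0\t1\t20\t700\t719\t1e-05\t40.1",   # outside the mask
    "q1\tchr1\t100.000\t20\t0\t0\t1\t20\t590\t609\t1e-05\t40.1",   # overlaps first interval
    "q1\tchr1\t100.000\t20\t0\t0\t1\t20\t1919\t1900\t1e-05\t40.1", # minus strand, masked
    "q1\tchr2\t100.000\t20\t0\t0\t1\t20\t10\t29\t1e-05\t40.1",     # chromosome not in mask
]
n_unmasked = count_unmasked_hits(toy_hits, toy_mask)
```

Filtering hits after the search is an alternative to masking the database itself. It keeps the BLAST database unchanged, so the flank size can be adjusted later without rebuilding anything.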
