Question: Finding TATAWAA in sequence
0
gravatar for baski
3.6 years ago by
baski0
baski0 wrote:

Hello,

I have human gene BMP1 form BMP family.
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000168487;r=8:22164736-22212326. I'm trying to find TATAWAA sequence near promotor -1000bp . I tried to get -1000bp from the gene and use blast to find TATAWAA sequence, but nothing working.

I'm sure it should be any TATAWAA sequence there(but maybe i'm wrong). Could anyone help me or lead me how to find it ?

Thank you guys.

fasta gene • 1.3k views
ADD COMMENTlink modified 3.6 years ago by shenwei3565.2k • written 3.6 years ago by baski0

Maybe I miss something crucial, but if I understood correctly you are looking for exact matches of TATATAA or TATAAAA in a short sequence, what about just getting the piece of DNA and cntrl-F using your favorite text editor?

ADD REPLYlink written 3.6 years ago by WouterDeCoster43k

Yeah you are right. I do that so, but i thought i missed something. Thank you.

ADD REPLYlink written 3.6 years ago by baski0
1
gravatar for dariober
3.6 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

I wrote a command-line genome browser, ASCIIGenome, that may be handy in cases like this to find sequence motifs in a reference fasta file and show them together with an annotation file.

First load the reference sequence and an annotation file (this can be a remote file. Fasta file must be indexed):

ASCIIGenome -fa /path/to/genome.fa \
    ftp://ftp.ensembl.org/pub/release-86/gff3/homo_sapiens/Homo_sapiens.GRCh38.86.chromosome.8.gff3.gz

Then at the command prompt issue these commands:

find ENSG00000168487
grep -i \tgene\t.*ENSG00000168487 gff3
seqRegex -iupac TATAWAA
zo 8
print seqRegex
seqRegex > matches.bed
save matches.png

Explained: Find the gene ENSG00000168487, for clarity only show the "gene" feature (grep ...). Then search the motif TATAWAA as iupac code, zoom out x times (e.g. 8 times) to see some matches in the sequence.

The matches can be show on screen (print seqRegex) and saved to file (print seqRegex). Finally save a picture as png.

The resulting image should like similar to this:

enter image description here

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by dariober11k
1
gravatar for shenwei356
3.6 years ago by
shenwei3565.2k
China
shenwei3565.2k wrote:

Just use the cross-platform and ultrafast FASTA/Q tookit SeqKit

  1. Geting GTF (time: 5s, memory: <1M)

    $ zcat Homo_sapiens.GRCh38.84.gtf.gz | grep BMP1 | awk '$3=="gene"' > BMP1.gtf
    

    Three records extracted:

    $ cat BMP1.gtf 
    2   ensembl_havana  gene    68865481    68871517    .   -   .   gene_id "ENSG00000163217"; gene_version "1"; gene_name "BMP10"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000129573"; havana_gene_version "1";
    X   ensembl_havana  gene    50910784    50916607    .   +   .   gene_id "ENSG00000130385"; gene_version "5"; gene_name "BMP15"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000021525"; havana_gene_version "2";
    8   ensembl_havana  gene    22164736    22212326    .   +   .   gene_id "ENSG00000168487"; gene_version "17"; gene_name "BMP1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000097761"; havana_gene_version "6";
    
  2. Extracting upstram sequence (FASTA index pre-built. time: 2.6s, memory: 983M)

    $ seqkit subseq --only-flank --up-stream 1000 --gtf BMP1.gtf hsa.fa  > BMP1.usf_1k.fa
    

    Gene locations were stored in sequence header. For exmaple, 2_68865481-68871517:-_usf:1000 means 1000bp upstream flanking sequence of region 68865481-68871517 in negative strand of chromosome 2.

    $ seqkit seq --name BMP1.usf_1k.fa 
    2_68865481-68871517:-_usf:1000 ENSG00000163217
    X_50910784-50916607:+_usf:1000 ENSG00000130385
    8_22164736-22212326:+_usf:1000 ENSG00000168487
    
  3. Locating pattern (time: 0.1s, memory: 0)

    $ seqkit locate --ignore-case --degenerate --pattern TATAWAA BMP1.usf_1k.fa  | column -t 
    seqID                           patternName  pattern  strand  start  end  matched
    2_68865481-68871517:-_usf:1000  TATAWAA      TATAWAA  +       578    584  TATATAA
    2_68865481-68871517:-_usf:1000  TATAWAA      TATAWAA  -       577    583  TATATAA
    
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by shenwei3565.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 850 users visited in the last hour