Finding TATAWAA in sequence
Entering edit mode
4.7 years ago
baski • 0


I have human gene BMP1 form BMP family.;g=ENSG00000168487;r=8:22164736-22212326. I'm trying to find TATAWAA sequence near promotor -1000bp . I tried to get -1000bp from the gene and use blast to find TATAWAA sequence, but nothing working.

I'm sure it should be any TATAWAA sequence there(but maybe i'm wrong). Could anyone help me or lead me how to find it ?

Thank you guys.

gene fasta • 1.6k views
Entering edit mode

Maybe I miss something crucial, but if I understood correctly you are looking for exact matches of TATATAA or TATAAAA in a short sequence, what about just getting the piece of DNA and cntrl-F using your favorite text editor?

Entering edit mode

Yeah you are right. I do that so, but i thought i missed something. Thank you.

Entering edit mode
4.7 years ago

I wrote a command-line genome browser, ASCIIGenome, that may be handy in cases like this to find sequence motifs in a reference fasta file and show them together with an annotation file.

First load the reference sequence and an annotation file (this can be a remote file. Fasta file must be indexed):

ASCIIGenome -fa /path/to/genome.fa \

Then at the command prompt issue these commands:

find ENSG00000168487
grep -i \tgene\t.*ENSG00000168487 gff3
seqRegex -iupac TATAWAA
zo 8
print seqRegex
seqRegex > matches.bed
save matches.png

Explained: Find the gene ENSG00000168487, for clarity only show the "gene" feature (grep ...). Then search the motif TATAWAA as iupac code, zoom out x times (e.g. 8 times) to see some matches in the sequence.

The matches can be show on screen (print seqRegex) and saved to file (print seqRegex). Finally save a picture as png.

The resulting image should like similar to this:

enter image description here

Entering edit mode
4.7 years ago

Just use the cross-platform and ultrafast FASTA/Q tookit SeqKit

  1. Geting GTF (time: 5s, memory: <1M)

    $ zcat Homo_sapiens.GRCh38.84.gtf.gz | grep BMP1 | awk '$3=="gene"' > BMP1.gtf

    Three records extracted:

    $ cat BMP1.gtf 
    2   ensembl_havana  gene    68865481    68871517    .   -   .   gene_id "ENSG00000163217"; gene_version "1"; gene_name "BMP10"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000129573"; havana_gene_version "1";
    X   ensembl_havana  gene    50910784    50916607    .   +   .   gene_id "ENSG00000130385"; gene_version "5"; gene_name "BMP15"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000021525"; havana_gene_version "2";
    8   ensembl_havana  gene    22164736    22212326    .   +   .   gene_id "ENSG00000168487"; gene_version "17"; gene_name "BMP1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000097761"; havana_gene_version "6";
  2. Extracting upstram sequence (FASTA index pre-built. time: 2.6s, memory: 983M)

    $ seqkit subseq --only-flank --up-stream 1000 --gtf BMP1.gtf hsa.fa  > BMP1.usf_1k.fa

    Gene locations were stored in sequence header. For exmaple, 2_68865481-68871517:-_usf:1000 means 1000bp upstream flanking sequence of region 68865481-68871517 in negative strand of chromosome 2.

    $ seqkit seq --name BMP1.usf_1k.fa 
    2_68865481-68871517:-_usf:1000 ENSG00000163217
    X_50910784-50916607:+_usf:1000 ENSG00000130385
    8_22164736-22212326:+_usf:1000 ENSG00000168487
  3. Locating pattern (time: 0.1s, memory: 0)

    $ seqkit locate --ignore-case --degenerate --pattern TATAWAA BMP1.usf_1k.fa  | column -t 
    seqID                           patternName  pattern  strand  start  end  matched
    2_68865481-68871517:-_usf:1000  TATAWAA      TATAWAA  +       578    584  TATATAA
    2_68865481-68871517:-_usf:1000  TATAWAA      TATAWAA  -       577    583  TATATAA

Login before adding your answer.

Traffic: 1781 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6