Question: Find binding site start and end positions in a genome using Biopython
PaSua wrote:

Hello everyone, Python newbie here.

I'm currently working with transcription factor binding sites, and I have several binding-site sequences whose positions in the genome I don't know. For instance, I have the sequence "TGTAAACCTTTTCA", which belongs to NC_009004.1, and I was wondering if there is a way to find its position (start and end) using Biopython or any other approach.

Sorry if this is a simple or naive question, but I've been trying to solve it by myself checking some videos, books and cookbooks and so far I've got nothing.

Thank you in advance.

written 6 months ago by PaSua (modified 6 months ago by Biostar)

If your genome is not too long, you can try pairwise alignment with Biopython's pairwise2 module: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Otherwise you'll have to align this sequence using dedicated alignment software such as BWA or Bowtie2.

Edit: NC_009004.1 is Lactococcus lactis, about 2.5 Mb.

You can give Biopython a try and check the running time.
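
Before running an alignment, it may be worth checking for exact occurrences first, since a plain string search over a genome of this size is cheap. The sketch below is not pairwise alignment; it is a simple exact search on both strands using only the standard library. The toy sequence is made up for illustration. For the real genome, one option (assuming a local FASTA file, here hypothetically named NC_009004.1.fasta) would be to load it with Biopython as `str(SeqIO.read("NC_009004.1.fasta", "fasta").seq)` and pass that string in.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def find_exact(genome, site):
    """Yield (start, end, strand) for every exact occurrence of site.

    Coordinates are 1-based and inclusive, given on the forward strand.
    A "-" hit means the reverse complement of site was found there.
    """
    genome = genome.upper()
    site = site.upper()
    for strand, query in (("+", site), ("-", reverse_complement(site))):
        pos = genome.find(query)
        while pos != -1:
            yield pos + 1, pos + len(query), strand
            # Advance by one base so overlapping occurrences are not skipped.
            pos = genome.find(query, pos + 1)


# Toy sequence standing in for the real genome (hypothetical data):
# the site on the forward strand, then its reverse complement.
toy = "GG" + "TGTAAACCTTTTCA" + "GG" + "TGAAAAGGTTTACA" + "GG"
hits = list(find_exact(toy, "TGTAAACCTTTTCA"))
for start, end, strand in hits:
    print(start, end, strand, sep="\t")
```

If this returns nothing for a site you expect to be present, that is a hint that the site differs from the genome by a base or two, and an alignment or motif-based tool is the next step.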

written 6 months ago by Bastien Hervé (modified 6 months ago)

Transcription factors bind to motifs, which can vary in nucleotide composition. It is better to use a dedicated tool such as FIMO from the MEME Suite for this. Your pattern-matching approach would require 100% sequence identity, which is simply not how transcription factor binding works.
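
To make the point about sequence identity concrete, here is a minimal sketch of a search that tolerates a few mismatches. It is not what FIMO does (FIMO scores windows against a position weight matrix); it only counts mismatches against a single consensus sequence, on the forward strand, and the toy sequence and the `max_mismatches` parameter are assumptions made for illustration.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approximate(sequence, site, max_mismatches=1):
    """Return (start, end, mismatches, matched) for each window of
    len(site) that differs from site at no more than max_mismatches
    positions. Coordinates are 1-based and inclusive, forward strand only.
    """
    sequence = sequence.upper()
    site = site.upper()
    width = len(site)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        distance = hamming(window, site)
        if distance <= max_mismatches:
            hits.append((i + 1, i + width, distance, window))
    return hits


# Toy sequence (hypothetical data): one exact copy of the site and one
# copy with a single substitution (C to G at the eighth base).
toy = "AA" + "TGTAAACCTTTTCA" + "CC" + "TGTAAACGTTTTCA" + "AA"
hits = find_approximate(toy, "TGTAAACCTTTTCA", max_mismatches=1)
for hit in hits:
    print(*hit, sep="\t")
```

With `max_mismatches=0` only the exact copy is reported, which is the situation an exact pattern match leaves you in. This scan is quadratic in the worst case and treats every position of the site as equally important, so for real motif work a weight-matrix tool remains the better choice.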

written 6 months ago by ATpoint

BLAST would be the obvious choice, but expect lots of hits, so you'll have to do some filtering afterwards.

(You needn't use BLAST via python, but you could if you wanted to).

written 6 months ago by Joe (modified 6 months ago)

@PaSua, you can use seqkit locate.

Example input:

$ cat test.fa 
>a
TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA
>b
ATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA

Output:

$ seqkit locate -idp "TGTAAACCTTTTCA" test.fa

seqID  patternName     pattern         strand  start  end  matched
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       1      14   TGTAAACCTTTTCA
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       26     39   TGTAAACCTTTTCA
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       49     62   TGTAAACCTTTTCA
b      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       12     25   TGTAAACCTTTTCA
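
If you would rather stay in Python, the same kind of lookup can be sketched with the standard library. This is not how seqkit is implemented; it is a rough illustration that translates an IUPAC pattern into a regular expression, matches case-insensitively, and reports 1-based positions on the forward strand only (seqkit searches both strands by default). The helper names `read_fasta` and `locate` are made up for this example.

```python
import io
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record ID.
            name = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def locate(handle, pattern):
    """Yield (record_id, start, end, matched), 1-based inclusive.

    Raises KeyError if pattern contains a non-IUPAC character.
    """
    body = "".join(IUPAC[base] for base in pattern.upper())
    # The lookahead keeps each match zero-width, so overlapping hits are found.
    regex = re.compile("(?=(" + body + "))", re.IGNORECASE)
    for name, seq in read_fasta(handle):
        for match in regex.finditer(seq):
            yield name, match.start(1) + 1, match.end(1), match.group(1)


# The example input from above, held in memory instead of test.fa.
fasta = io.StringIO(
    ">a\nTGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA\n"
    ">b\nATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA\n"
)
rows = list(locate(fasta, "TGTAAACCTTTTCA"))
for row in rows:
    print(*row, sep="\t")
```

On this input the sketch reports the four rows listed above plus a second match in record b at positions 36 to 49. For a file on disk, pass `open("test.fa")` in place of the in-memory handle.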
written 6 months ago by cpad0112 (modified 6 months ago)