Question: Trouble locating a known motif on (-) strand: any hints for a beginner?
0
2.6 years ago by
Christian0 wrote:

Hello everyone.

This question may be rather dumb, but at this point I got really confused and a little stuck. I am a master's student in molecular biology and relatively inexperienced when it comes to computational problems. However, I wanted to give it a try to increase my range of skill.

My problem is as follows: I have a set of PSSMs and used the tool PoSSuMSearch to find matches for my genome of interest. I am interested in the neighborhood of the matches I found, so I wanted to use the positional coordinates given from the PSSM search to extract my region of interest from a genome file. This works fine if the motif I found is located on the forward strand. Troubles arises somewhere in my handling of the reverse strand: According to the manual file from PoSSumSearch (see figure on page 17), the position for each motif is located on 'left-most position of the motif'. It should look something like this:

``````#      | motif_on_plus
#    - - - - - - - - - - ->  (+) strand
#    0 1 2 3 4 5 6 7 8 9     position in genome
#   <- - - - - - - - - -     (-) strand
#          | motif_on_minus
``````

In that case my motif on the (+) strand would range from position 1 to 1+length(motif). I expect my motif on the (-) strand to range from 3 to 3+length(motif). I use some Python to check if that is valid:

``````from Bio import SeqIO

# Examplary lines of my data file after some sorting and filtering steps
example_lines = ["GCGACGGC;8036;0.002681536;CGTGCCG;8058;0.003927515;+",
"CGGCAGCC;28770;0.002681536;GAAGGGC;28749;0.003927515;-"]

for line in example_lines:
seq, pos, _, _, _, _, strand = line.split(sep=";")
pos = int(pos)

print(strand)
print(seq)

if "+" in strand: # Prints corresponding region on (+) strand
print(genome.seq[pos:pos+len(seq)])
elif "-" in strand: # Prints corresponding region on (-) strand
print(genome.seq[pos:pos+len(seq)].reverse_complement())

print()
``````

My genome file is downloaded from public data bases (chromosome of Sinorhizobium meliloti). I expected to get matching sequences for both strands (or some sequences with slightly shifted frames due to different indexing), but from what I saw so far is that the sequences for the (-) strand are just going wild. The output for this specific case is as follows:

``````+
GCGACGGC
GCGACGGC

-
CGGCAGCC
CCGACGGC
``````

I have looked into whether the problem is with the reverse complement, e.g. that my motif match is given in forward orientation. But this also did not yield in the solution of my problem. Additionally, I checked a broader neighbor region for my motif on the (-) strand to see whether my motif was at least somewhere near. This also was not the case (except on apparently random occasions with short motifs rich in GC since my genome has GC content around 70%).

My question to you is: Did I make an obvious mistake? Is there something I completely overlooked? I am grateful for any hints you might have for me!

motif search python pssm genome • 618 views
modified 2.6 years ago by Bastien Hervé4.9k • written 2.6 years ago by Christian0

Try to investigate but you just have the reverse not the complement.

The reverse of `CGGCAGCC` is actually `CCGACGGC`

0
2.6 years ago by
Bastien Hervé4.9k
Karolinska Institutet, Sweden
Bastien Hervé4.9k wrote:

I think I got something :

I downloaded your genbank file and I look at position 28770 of the sequence, the sequence is actually `gccgtcgg` not `CGGCAGCC` like in your `example_lines` variable.

Instead of `print(seq)` write `print(genome.seq[pos:pos+len(seq)])` and you will see the problem

The minus strand line in your data file are already reverse complement compared to the reference sequence in your genbank file