Python formatting when visualizing Primer3-py dimers
13 months ago
bhumm ▴ 140

I am currently creating a program to analyze primers prior to multiplexing. I am interested in visualizing homo/heterodimers and hairpin structures. To do this I am using Primer3-py bindings. I am able to get a nice visual representation of dimers like so:

import primer3
# Keep only the second tab-separated field of each structure line
z = primer3.bindings.calcHeterodimer('TGACACCGCCAAGGTGAATTT', 'CCGCTCCGTGGTTGGTCCGGTGGCGAGCGG', output_structure = True).ascii_structure_lines
print([i.split('\t')[1] for i in z])

With an output like:

['     TGA     ------   A  T AATTT',
 '        CACCG      CCA GG G     ',
 '        GTGGC      GGT CC C     ',
 'GGCGAGCG     CTGGTT   G  T GCC--']

When I perform this on a larger set of primers and try to incorporate this into a dataframe and export to a csv, I (unsurprisingly) lose the format. Here is an example:

df = pd.concat(map(pd.Series, [z]), axis=1)


test1,  ['     TGA     ------   A  T AATTT', '        CACCG      CCA GG G     ', '        GTGGC      GGT CC C     ', 'GGCGAGCG     CTGGTT   G  T GCC--']

Is there any way to retain this format within a dataframe exported to a csv?

dictionary primer3-py python • 1.6k views

You should include all the code needed to produce your toy example so that others can pick up from where you are. Then you are more likely to get back something close to what you had, and you'll understand it better.

Specifically, I cannot reproduce the output you show, so I cannot even begin to show you how to get what I think you may be asking for.
Here's what I cobbled together based on your post:

%pip install primer3-py
import primer3
t_str_noms = ["test1","test2","test3","test4"]
d = {}
for t in t_str_noms:
    d[t] = [i.split('\t')[1] for i in primer3.bindings.calc_heterodimer('TGACACCGCCAAGGTGAATTT', 'CCGCTCCGTGGTTGGTCCGGTGGCGAGCGG', output_structure = True).ascii_structure_lines]
import pandas as pd
df = pd.concat(map(pd.Series, [d]), axis=1)
print(df.to_string())

(Ignore the first line if you aren't using Jupyter. I'm running that code in the temporary session that comes up after clicking on the 'launch binder' badge here.)

(By the way, the Python code block above was just my guess at how you may have gone from your earlier code block to the block where df = pd.concat(map(pd.Series, [d]), axis=1). I'm in no way endorsing what I cobbled together as a viable way to do those steps. I'm only attempting to reverse engineer what you may have done.)

What I get is close to yours but not quite:

                                                                                                                                  0
test1  [     TGA     ------   A  T AATTT,         CACCG      CCA GG G,         GTGGC      GGT CC C, GGCGAGCG     CTGGTT   G  T GCC--]
test2  [     TGA     ------   A  T AATTT,         CACCG      CCA GG G,         GTGGC      GGT CC C, GGCGAGCG     CTGGTT   G  T GCC--]
test3  [     TGA     ------   A  T AATTT,         CACCG      CCA GG G,         GTGGC      GGT CC C, GGCGAGCG     CTGGTT   G  T GCC--]
test4  [     TGA     ------   A  T AATTT,         CACCG      CCA GG G,         GTGGC      GGT CC C, GGCGAGCG     CTGGTT   G  T GCC--]

That brings me to the main issue...

What are you actually expecting from your question?

"Is there any way to retain this format within a dataframe exported to a csv?"

Is "this format" what you show just above it, or what you show further up? Both seem to have the same elements in them. What exactly is different that you want?

Minor: You are using and posting deprecated code in the way you call primer3.bindings.calcHeterodimer(). Please look into such warnings when you see them and use the updated syntax so that you have an easier time down the line. Eventually, in later versions of the module, the deprecated versions will cease to work and there won't be a warning. Specifically, in this case the developers have decided to stop supporting camel case function names in future versions; see here, here, and "NOTE. camelCase methods are deprecated" at the bottom of this section of the documentation here.


Hi Wayne,

Thanks for the response. I have updated the code to be more representative of the toy example I have provided - apologies for any confusion.

In practice I will have a substantial list of primers to be used in a multiplex where I want to ensure no heterodimers will form.

The format I am referring to is in the first output shown. Here it is again:

['     TGA     ------   A  T AATTT',
'        CACCG      CCA GG G     ',
'        GTGGC      GGT CC C     ',
'GGCGAGCG     CTGGTT   G  T GCC--']

As you stated, all the elements are retained in the dataframe. However, this representation is ideal for the less 'coding-minded' folks with whom I intend to share this information, which is why I want to keep the format shown above. For clarity, the format I want, if possible, is the 'alignment' of the hybridized nucleotides as seen above. I've tried methods such as the example below, which did not work.

['     TGA     ------   A  T AATTT', \n
'        CACCG      CCA GG G     ', \n
'        GTGGC      GGT CC C     ', \n
'GGCGAGCG     CTGGTT   G  T GCC--']

So, more clearly, I am trying to retain the formatting of the base alignment for the predicted dimers when the list is integrated into a dataframe and subsequently exported to a csv.

Finally, I am aware that I am using deprecated bindings. The broader group I am working with on other elements of the project are still on the previous version so we will have to update all of our code away from the camelCase methods in the future. Thanks for recognizing this.

Thanks again!


Starting to understand what you want and why.

My next question is whether you actually want the brackets. They seem to shift the lines relative to one another. Shouldn't it be something like:

'     TGA     ------   A  T AATTT',
'        CACCG      CCA GG G     ',
'        GTGGC      GGT CC C     ',
'GGCGAGCG     CTGGTT   G  T GCC--'

That way you can see the corresponding lines better?
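A minimal sketch of the difference, using the four structure lines from the question typed in as plain strings so that primer3-py is not needed: printing the list itself shows the brackets, quotes, and commas on a single line, while joining on a newline prints only the aligned text.

```python
# The four structure lines from the question, hard-coded for illustration
lines = [
    "     TGA     ------   A  T AATTT",
    "        CACCG      CCA GG G     ",
    "        GTGGC      GGT CC C     ",
    "GGCGAGCG     CTGGTT   G  T GCC--",
]

# Printing the list gives its repr: brackets, quotes, commas, one long line
print(lines)

# Joining on newlines gives one string with each structure line on its own row
text = "\n".join(lines)
print(text)
```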

How do you want the names, like 'test1', handled in this, or do you want no names? I'm trying to think about scaling since you say "a substantial list of primers". More importantly, it sounds like you want this retained in a CSV file as text in the end? Is that how it will go to the less coding-minded folks?

I'll try to even specify the version then since you are stuck with that.


Here's a start on that, built on what you had originally. Your approach makes the dataframe actually hold a Python list in one column, which Pandas normally doesn't want to do, i.e., instead of the typical df = pd.DataFrame.from_dict(d, orient='index'):

%pip install primer3-py==0.6.1
import primer3
t_str_noms = ["test1","test2","test3","test4"]
d = {}
for t in t_str_noms:
    d[t] = [i.split('\t')[1] for i in primer3.bindings.calcHeterodimer('TGACACCGCCAAGGTGAATTT', 'CCGCTCCGTGGTTGGTCCGGTGGCGAGCGG', output_structure = True).ascii_structure_lines]
import pandas as pd
df = pd.concat(map(pd.Series, [d]), axis=1)
df = df.explode(0) # based on https://stackoverflow.com/a/66732712/8508004
print(df.to_string())

That gives:

                                      0
test1       TGA     ------   A  T AATTT
test1          CACCG      CCA GG G     
test1          GTGGC      GGT CC C     
test1  GGCGAGCG     CTGGTT   G  T GCC--
test2       TGA     ------   A  T AATTT
test2          CACCG      CCA GG G     
test2          GTGGC      GGT CC C     
test2  GGCGAGCG     CTGGTT   G  T GCC--
test3       TGA     ------   A  T AATTT
test3          CACCG      CCA GG G     
test3          GTGGC      GGT CC C     
test3  GGCGAGCG     CTGGTT   G  T GCC--
test4       TGA     ------   A  T AATTT
test4          CACCG      CCA GG G     
test4          GTGGC      GGT CC C     
test4  GGCGAGCG     CTGGTT   G  T GCC--

Glad I am making this more clear. The 'test1' in practice will be the 'primer name' to allow for tracking of other metrics (Tm, hairpin structure, etc.) and will be used as the index of the dataframe to aggregate all the data. The primer names and sequences are stored in a dictionary that I iterate through and perform all the possible heterodimer calculations for each combination. This dataframe will then be exported as a csv and serve as a report that can be shared with others.
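The pairwise iteration described above might be sketched like this. The primer names, the third sequence, and the structure_lines() helper are all hypothetical stand-ins: in real use the helper would wrap the primer3 heterodimer call and return its structure lines.

```python
from itertools import combinations

# Hypothetical primer dictionary: names are invented, the first two sequences
# are the ones from the question, the third is a made-up placeholder
primers = {
    "primer_A": "TGACACCGCCAAGGTGAATTT",
    "primer_B": "CCGCTCCGTGGTTGGTCCGGTGGCGAGCGG",
    "primer_C": "ACGTACGTACGTACGTACGT",
}


def structure_lines(seq1, seq2):
    """Hypothetical stand-in for the primer3 call; returns a list of text lines."""
    return [seq1, seq2]


# One entry per unordered pair of primers, keyed by both primer names
report = {}
for (name1, seq1), (name2, seq2) in combinations(primers.items(), 2):
    report[f"{name1} x {name2}"] = "\n".join(structure_lines(seq1, seq2))

for pair_name, text in report.items():
    print(pair_name)
    print(text)
```

combinations() yields each pair once and never pairs a primer with itself, so homodimers would need a separate pass.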

Yes, eliminating the brackets is completely fine; they do indeed perturb the alignment!

I am using primer3-py version 0.6.1.

13 months ago
Wayne ★ 1.9k

Try this. Printing the dataframe with tabulate respects the line breaks separating the strings, so you can view the alignment from Python. You'll also find that Excel respects the line breaks in each entry when it opens the CSV file saved from the dataframe:

%pip install primer3-py==0.6.1
%pip install tabulate
import primer3
t_str_noms = ["test1","test2","test3","test4"]
d = {}
for t in t_str_noms:
    item_list = [i.split('\t')[1] for i in primer3.bindings.calcHeterodimer('TGACACCGCCAAGGTGAATTT', 'CCGCTCCGTGGTTGGTCCGGTGGCGAGCGG', output_structure = True).ascii_structure_lines]
    d[t] ='\n'.join(item_list)
import pandas as pd
df = pd.DataFrame.from_dict(d, orient='index')
#print(df.to_string())
from tabulate import tabulate
print(tabulate(df)) # use based on https://stackoverflow.com/a/49739927/8508004 and https://github.com/astanin/python-tabulate#multiline-cells
df.to_csv("test.csv")

Printing in Python gives the following text:

-----  --------------------------------
test1  TGA     ------   A  T AATTT
               CACCG      CCA GG G
               GTGGC      GGT CC C
       GGCGAGCG     CTGGTT   G  T GCC--
test2  TGA     ------   A  T AATTT
               CACCG      CCA GG G
               GTGGC      GGT CC C
       GGCGAGCG     CTGGTT   G  T GCC--
test3  TGA     ------   A  T AATTT
               CACCG      CCA GG G
               GTGGC      GGT CC C
       GGCGAGCG     CTGGTT   G  T GCC--
test4  TGA     ------   A  T AATTT
               CACCG      CCA GG G
               GTGGC      GGT CC C
       GGCGAGCG     CTGGTT   G  T GCC--
-----  --------------------------------

And when that CSV file is opened in Excel and the font set to the Courier font and the second column border dragged to the right a little, it looks like:

from_tabulate_route

What is going on behind the scenes here, in case you want to control it more directly?

Examining the .csv file produced from the Python code in your favorite text editor can help gain insight:

[Screenshot: the .csv file opened in a text editor]

Looking at the .csv file produced, you get a sense that the double quotes around the string are important.
As shown here, the trick is getting the string with line breaks saved in the .csv file with double quotes surrounding it. That came from a link in an answer to 'Python: Add Line breaks into Excel cells while exporting the DataFrame'.
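To see the quoting without pandas or Excel in the way, here is a small sketch using only Python's standard csv module. It assumes default (minimal) quoting, which is also the pandas to_csv default, and uses the structure lines from the question as plain strings.

```python
import csv
import io

# The four structure lines from the question, joined into one multi-line cell
lines = [
    "     TGA     ------   A  T AATTT",
    "        CACCG      CCA GG G     ",
    "        GTGGC      GGT CC C     ",
    "GGCGAGCG     CTGGTT   G  T GCC--",
]
cell = "\n".join(lines)

# Write one row to an in-memory "file"; minimal quoting is the default,
# so a field containing a line break gets wrapped in double quotes
buffer = io.StringIO()
csv.writer(buffer).writerow(["test1", cell])
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: the quoted field is returned as a single multi-line string
row = next(csv.reader(io.StringIO(csv_text, newline="")))
print(row[1] == cell)
```

The double quotes are what tell a CSV reader that the line breaks belong inside the cell rather than marking the start of a new row.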





A purely Pandas way using pandas.DataFrame.explode:

%pip install primer3-py==0.6.1
import primer3
t_str_noms = ["test1","test2","test3","test4"]
d = {}
for t in t_str_noms:
    d[t] = [i.split('\t')[1] for i in primer3.bindings.calcHeterodimer('TGACACCGCCAAGGTGAATTT', 'CCGCTCCGTGGTTGGTCCGGTGGCGAGCGG', output_structure = True).ascii_structure_lines]
import pandas as pd
df = pd.concat(map(pd.Series, [d]), axis=1)
df = df.explode(0) # based on https://stackoverflow.com/a/66732712/8508004
print(df.to_string())
df.to_csv("test_from_explode.csv")

That gives:

                                      0
test1       TGA     ------   A  T AATTT
test1          CACCG      CCA GG G     
test1          GTGGC      GGT CC C     
test1  GGCGAGCG     CTGGTT   G  T GCC--
test2       TGA     ------   A  T AATTT
test2          CACCG      CCA GG G     
test2          GTGGC      GGT CC C     
test2  GGCGAGCG     CTGGTT   G  T GCC--
test3       TGA     ------   A  T AATTT
test3          CACCG      CCA GG G     
test3          GTGGC      GGT CC C     
test3  GGCGAGCG     CTGGTT   G  T GCC--
test4       TGA     ------   A  T AATTT
test4          CACCG      CCA GG G     
test4          GTGGC      GGT CC C     
test4  GGCGAGCG     CTGGTT   G  T GCC--

And when that CSV file is opened in Excel and the font set to Courier font, it looks like:

from_pure_pandas_route

The pure Pandas one is based on what you had originally since it was making the dataframe actually have a Python list in a single column, which Pandas normally doesn't want to do when you use typical Pandas routes to making a dataframe from a dictionary, i.e., df = pd.DataFrame.from_dict(d, orient='index',). In the typical way, Pandas will normally put each list item in a separate column itself.
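A small sketch of that difference, assuming pandas is installed and using short placeholder strings instead of real primer3 output:

```python
import pandas as pd

# Placeholder lines standing in for the four structure lines
lines = ["line1", "line2", "line3", "line4"]
d = {"test1": lines, "test2": lines}

# Typical route: each list item lands in its own column
wide = pd.DataFrame.from_dict(d, orient="index")
print(wide.shape)

# Route from the question: each list is kept whole in a single column named 0
nested = pd.concat(map(pd.Series, [d]), axis=1)
print(nested.shape)

# explode() then turns each list item into its own row, repeating the index
long = nested.explode(0)
print(long.shape)
print(long.to_string())
```

With two names and four lines each, the wide frame has four columns, while the exploded frame has one column and eight rows.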





If anyone wants to test these code blocks, they'll work right in your browser, without needing to install anything on your computer or to log in or register, by using remote temporary machines served via the MyBinder.org service:
Try running either code block in the Jupyter notebook file (or create a new one) that comes up in the temporary session after clicking on the 'launch binder' badge here.





NOTE: Anyone using this code with current versions of primer3-py should use primer3.bindings.calc_heterodimer( in place of primer3.bindings.calcHeterodimer(. This is because, following version 0.6.1, the developers deprecated camel case function names in primer3-py, as noted at the bottom of this section of the documentation here.
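For a group that has to run the same script on both old and new installs, one possible workaround is to look the function up by name at run time. This is only a sketch: pick_heterodimer_func() is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for primer3.bindings so the example runs without primer3-py installed.

```python
from types import SimpleNamespace


def pick_heterodimer_func(bindings):
    """Return the snake_case function if the module has it, else the camelCase one."""
    func = getattr(bindings, "calc_heterodimer", None)
    if func is None:
        func = getattr(bindings, "calcHeterodimer")
    return func


# Stand-ins for an old-style and a new-style bindings module (not real primer3)
old_style = SimpleNamespace(calcHeterodimer=lambda s1, s2, **kw: "camelCase call")
new_style = SimpleNamespace(calc_heterodimer=lambda s1, s2, **kw: "snake_case call")

print(pick_heterodimer_func(old_style)("A", "T"))
print(pick_heterodimer_func(new_style)("A", "T"))
```

In real use you would pass primer3.bindings to the helper and call the returned function exactly as in the code blocks above.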


Wayne, both of these methods work beautifully. I really appreciate the detail you provided, thank you so much for your help!
