Advice about building a computational project to investigate porphyrin’s roles in cancer survival for a newbie in bioinformatics
5 months ago
Thuan • 0

TL;DR An inexperienced biology major needing some advice about building a computational project on deciphering porphyrin’s roles over the summer and the first steps to take.

Hi everyone,

I really need advice on starting a computational project. First, some context: I recently found out about Bioinformatics, I am passionate about it, and I want to apply to a graduate program related to it.

The point is that I am about to enter my junior year, and I feel like I need to do something. I am not really good at Bioinformatics or coding (I am a biology-related major), but I am willing to spend this summer learning. I cold-emailed a professor, and she was very welcoming and said she wanted me to try working independently on a computational project over the summer. Basically, she suggested that I come up with a computational project that uses data mining to decipher the roles of porphyrins. She also provided some papers and background and said that her team hypothesizes that porphyrins have an undefined yet essential role in cancer survival. I think she knows I am not an expert, so I assume she wants me to brainstorm and come up with a method for tackling the problem before actually carrying it out.

As I stated, I am kind of a newbie. The only things I have are some background in Python and plenty of time in the summer. I honestly don’t want to be spoon-fed the whole project idea, and I do want to struggle through it myself in order to learn, if that makes sense, but I am genuinely lost here and do not know where to begin.

Is anyone familiar with data mining and how to approach a problem like this? Is there anything you would suggest I look into first, or any first steps I need to take? What does a project look like if the goal is to decipher and analyze a biological compound’s functions? What machine learning skills are needed for this project?

Or is this problem really hard for a newbie like me, and do you think I could still do it in around two and a half months over the summer? Maybe she misunderstood and thought I was really good at data science/machine learning or programming when she gave me this, but I don’t really know.

Thank you!!

machine-learning python data-mining • 634 views

Look into the TCGA dataset. There is lots of publicly available data there (e.g. RNA-seq). As a good starting point, you can focus specifically on porphyrin metabolism and see whether the expression of any of the enzyme genes is correlated with clinical outcomes, expression of certain oncogenes, certain mutations, cancer types/subtypes, etc.
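To make that concrete, here is a minimal sketch of the "is gene X correlated with Y" step in plain Python. It assumes you have already pulled per-sample expression values into a dictionary; the numbers, the `toy_expression`/`toy_target` names, and the helper functions are all made up for illustration and are not real TCGA data. On a real cohort you would more likely load the table with pandas and use a library's Spearman implementation, but the logic is the same.

```python
import math

# Heme (porphyrin) biosynthesis enzymes, listed by gene symbol.
HEME_GENES = ["ALAS1", "ALAS2", "ALAD", "HMBS", "UROS", "UROD", "CPOX", "PPOX", "FECH"]


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlate_with_target(expression, target, genes=HEME_GENES):
    """Spearman rho between each available pathway gene and a per-sample target."""
    return {g: spearman(expression[g], target) for g in genes if g in expression}


# Synthetic values for six hypothetical samples (NOT real data).
toy_expression = {
    "FECH": [5.1, 6.3, 7.0, 8.2, 9.1, 9.9],
    "ALAD": [9.0, 8.1, 7.7, 6.0, 5.2, 4.4],
    "CPOX": [3.0, 7.5, 2.2, 6.1, 4.0, 5.5],
}
# Hypothetical per-sample target, e.g. expression of an oncogene of interest.
toy_target = [1.0, 2.0, 3.5, 4.1, 5.0, 6.2]

for gene, rho in correlate_with_target(toy_expression, toy_target).items():
    print(f"{gene}: rho = {rho:+.2f}")
```

Spearman is used here rather than Pearson only because expression values are often skewed; that is a choice for the sketch, not a recommendation for every dataset.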

Look into papers too and inspect their datasets. Omics datasets are rich with information, and papers almost never extract everything there is to learn from a dataset, so explore, ask questions, and generate hypotheses; you can probably cook up a good research project with some good findings.

To start off, simple things like calculating correlations, running statistical tests, doing differential expression, making plots and heatmaps in Python, clustering samples, etc. can take you a long way.
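As a rough illustration of the "statistical test" and "differential expression" items for a single gene, here is a small sketch that computes a log2 fold change and an exact permutation p-value between two groups of samples. The function names, the pseudocount, and the example values are assumptions made up for this sketch; real differential expression on count data is normally done with dedicated tools, and the exact test below is only feasible for small groups because it enumerates every relabeling.

```python
import math
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean(group_a) + pseudocount) / (mean(group_b) + pseudocount))


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Enumerates every way of relabeling the pooled samples and returns the
    fraction of relabelings at least as extreme as the observed difference.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance so floating-point noise does not drop exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Synthetic expression of one gene in two hypothetical groups (NOT real data).
tumor = [8.2, 9.1, 7.9, 8.8, 9.4]
normal = [5.0, 5.6, 6.1, 4.8, 5.3]

print(f"log2FC = {log2_fold_change(tumor, normal):.2f}")
print(f"permutation p = {exact_permutation_test(tumor, normal):.4f}")
```

If you run this over many genes, remember to correct the p-values for multiple testing before drawing any conclusions.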

Maybe in 2.5 months, you can put together a poster or abstract.

With a B.S. in biology and some wet lab training, I studied the role of oncogenes in cancer before grad school. My best, proudest independent work to date has been asking the question of why an oncogene causes cancer no matter what cell type it's activated in, formulating a hypothesis that there must be something in common related to genes involved in tissue lineage, and telling a story + publishing findings (in a respected journal) that explain a "gap" in the current literature. 90% of the project was computational. See

A biology background helped me a lot with asking the right questions (so attend lab meetings, participate in your lab's journal clubs, etc.). A lot of the papers in the highest impact journals are simple data science done on the right questions on the right data.

Now a lot of my work is different (related to data structures, algorithms, statistical models, developing sequencing data processing software, fundamental questions about gene structure, etc.), but I still do biology data science here and there, and that past experience was still incredibly valuable.

Porphyrins are usually functionalized with something: I'm pretty sure they know that, and maybe you do too. I think you might start with a literature search on Scholar and PubMed, looking for articles involving treatment of cancer cell lines with porphyrins (take this for example, where they used Pyp with Manganese). Look for the one or two most common functionalizations with some application in cancer research. If you find any, check whether there's a study that uses some -omics approach (RNA-seq, ChIP-seq, metabolomics): pick one dataset, then use the many public resources available (even the BioStars handbook) to learn how to analyze this kind of data and, once you feel confident, try to reproduce the authors' results.

Don't expect to come up with enough to write a paper in a month. You have to show motivation and the ability to explore a scientific question, nothing more.

