Forum: MS Bioinformatics Project
18 months ago by Moneeb Bajwa (0), Delaware, USA, wrote:

Hi, I know others have asked similar questions, but the answers were not quite what I was looking for. I am a Master's Bioinformatics student with a BA in Biology. I need a project idea that would last 3 months (my Professor did not help). I was thinking of something along the lines of getting sequencing data, trimming it, aligning it, and doing a differential expression analysis, but I don't know where to get the data for this (perhaps NCBI?). If you have other ideas for projects as well, that would be good too. I was also thinking about a project having to do with GWAS. My only experience in programming is what I learned in my Master's program.

Thanks for all your help

modified 18 months ago by Constantine (240) • written 18 months ago by Moneeb Bajwa (0)

Do you have any experience with Bash scripting? It might be helpful to specify what programming you learned in your Master's program. It would also help if we had some more info about the goal of your project. Are you supposed to learn something about a specific tool, or contribute something unique to the community (etc.)?

written 18 months ago by Andrew_MacGregor (30)

Yes, I have experience with Bash scripting, Perl, Python, and R, and I wish to use these in the project. I have statistical analysis experience with R as well.

written 18 months ago by Moneeb Bajwa (0)

I would just do something 'simple' like:

  1. Obtain RNA-seq FASTQ cancer cell-line data from the SRA, such as MCF7 breast cancer cell lines with and without treatment
  2. Trim the reads by looking here: illumina quality trimming - FASTQC
  3. Then determine read count abundances over your samples with Kallisto
  4. Then read your Kallisto counts into DESeq2 by following Michael Love's great tutorial from Here or Here
  5. Then do a simple differential expression analysis with DESeq2

Do that and then Bob will be your uncle

Step 1 is done in a web browser; steps 2 and 3 in the shell (Bash); steps 4 and 5 in the R programming language.
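
The shell side of the steps above can be sketched as a small dry-run script that only prints the commands it would run. This is a rough outline under stated assumptions, not a tested pipeline: the run accessions, file names, and directory names are hypothetical placeholders, paired-end reads are assumed, and the download step uses `fasterq-dump` from the SRA Toolkit rather than the web browser. No trimming command is included because the thread does not name a specific trimmer; `fastqc` only reports read quality.

```python
import shlex

# Hypothetical run accessions and file names; substitute real ones from the SRA.
SAMPLES = ["SRR0000001", "SRR0000002"]
INDEX = "transcripts.idx"
TRANSCRIPTOME = "transcripts.fa"


def build_commands(sample, index=INDEX, bootstraps=100):
    """Return the shell commands for one paired-end sample, without running them."""
    r1 = f"fastq/{sample}_1.fastq"
    r2 = f"fastq/{sample}_2.fastq"
    return [
        # Step 1: download the run and split it into paired FASTQ files.
        ["fasterq-dump", sample, "--split-files", "-O", "fastq"],
        # Step 2 (QC only): report read quality; the 'qc' directory must already exist.
        ["fastqc", "-o", "qc", r1, r2],
        # Step 3: pseudo-align and quantify against the transcriptome index.
        ["kallisto", "quant", "-i", index, "-o", f"kallisto/{sample}",
         "-b", str(bootstraps), r1, r2],
    ]


def main():
    # The index is built once and reused for every sample.
    print(shlex.join(["kallisto", "index", "-i", INDEX, TRANSCRIPTOME]))
    for sample in SAMPLES:
        for cmd in build_commands(sample):
            print(shlex.join(cmd))


if __name__ == "__main__":
    main()
```

Printing the commands first makes it easy to check paths before anything large is downloaded; the same lists could later be passed to `subprocess.run` on the server.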

Of course, please come back here for help if needed.


modified 18 months ago • written 18 months ago by Kevin Blighe (49k)

Thank you!! I was thinking of doing the steps you mentioned above multiple times for various datasets to test some sort of hypothesis (because I need this project to last the 3 months mentioned in the question). Is there some sort of hypothesis I could test by comparing multiple datasets in this fashion?

modified 18 months ago • written 18 months ago by Moneeb Bajwa (0)

Let me think. Is this your major project for your MSc? How would you rate its importance? From my perspective, a Master's student should not necessarily have to add anything new to the literature; thus, merely doing a re-analysis should be sufficient.

written 18 months ago by Kevin Blighe (49k)

Yeah it's not an official paper or anything like that. But I just need to figure out how to extend it for an entire semester. This is really just something that I'm doing because I could not find an internship/co-op, so my Professor let me choose this option.

written 18 months ago by Moneeb Bajwa (0)

Well, you could do something like download multiple cancer datasets and aim to come up with a 'pan-cancer' panel of markers (including non-coding RNAs). After you do your standard differential expression analyses to identify differentially expressed genes (DEGs), you could then do something 'cool' like refining the panel signature using lasso regression. I put some code here, which may help: A: How to exclude some of breast cancer subtypes just by looking at gene expressio
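
To give a feel for what the lasso refinement does, here is a minimal pure-Python sketch of lasso regression by coordinate descent. It is an illustration under assumptions, not the code from the linked post: it fits a plain linear lasso on a continuous response with no intercept, the two "genes" and their values are made-up toy data, and in R the usual tool for this would be the glmnet package.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; anything within the penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimise (1/2n) * ||y - Xb||^2 + penalty * ||b||_1, one coefficient at a time."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            coef[j] = soft_threshold(rho, penalty) / z if z > 0 else 0.0
    return coef


# Toy, centred data (hypothetical): the response depends on GENE_A only.
GENES = ["GENE_A", "GENE_B"]
X = [[-1.0, 1.0], [-0.5, -1.0], [0.5, -1.0], [1.0, 1.0]]
y = [-2.0, -1.0, 1.0, 2.0]

coef = lasso_coordinate_descent(X, y, penalty=0.1)
panel = [gene for gene, b in zip(GENES, coef) if b != 0.0]
print(dict(zip(GENES, coef)))
print("Genes kept in the panel:", panel)
```

The point of the penalty is that uninformative genes are driven to a coefficient of exactly zero, so the genes with non-zero coefficients form the refined panel. With real expression data the penalty would normally be chosen by cross-validation.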

Here are some other posts of mine, which may help to give you further ideas:

modified 18 months ago • written 18 months ago by Kevin Blighe (49k)

Thanks!! But considering that the read files are so large (500MB or more), wouldn't they be too big for expression analysis in R on my laptop? I have access to my school server for trimming and abundance estimates, but could the R portion (the differential expression analysis) be done completely on the server, or would some of it need to be done in RStudio?

modified 18 months ago • written 18 months ago by Moneeb Bajwa (0)

500MB is peanuts these days... we now deal in at least gigabytes. The large projects deal with petabytes (ICGC, TCGA, 1000 Genomes).

If you have a relatively new laptop (?), 500MB should not be a problem. R should also be installed in your compute cluster environment, but that question needs to be directed to your local IT person, or directly to central IT. Installing a version of R for global use is not difficult - just depends on who the System Admin is. I've done it before whilst managing large clusters.

modified 18 months ago • written 18 months ago by Kevin Blighe (49k)

Thanks again! I am actually working on my laptop from home and connecting to the school server remotely. Would I be able to do everything for the R part in vim while on the server? Maybe I could save the graphs that I need as PDFs? Do I really need to use the R software interface?

written 18 months ago by Moneeb Bajwa (0)

Are you worried about the plotting window not showing when you use R on the server? To get that part working, you need to set up X Window (X11) forwarding.

Are you using Windows? If 'yes', then install a program called Xming and leave it running in the background. Then, when you log in to the server using [hopefully] PuTTY, go to: Connection > SSH > X11 and check the 'Enable X11 forwarding' checkbox, and also put the following into the text box: localhost:0

That will then transmit plotting windows from the server to your laptop.
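
As an alternative to X11 forwarding, plots can be written straight to PDF files on the server, as the question above suggests, so no plotting window is needed. The sketch below writes a small R script and prints the command to run it non-interactively. The script name, PDF name, and placeholder plot are hypothetical; `pdf()`, `dev.off()`, and `Rscript` are standard parts of R, and it is assumed that `Rscript` is on the server's PATH.

```python
from pathlib import Path

# R code kept as a template; {pdf_name} is filled in by write_plot_script().
R_SCRIPT = """\
# Open a PDF graphics device instead of an on-screen plotting window.
pdf("{pdf_name}", width = 7, height = 5)
plot(rnorm(100), rnorm(100), main = "Placeholder plot")
# Close the device so the file is written to disk.
dev.off()
"""


def write_plot_script(script_path, pdf_name="example_plot.pdf"):
    """Write the R script to disk and return the command that would run it."""
    script_path = Path(script_path)
    script_path.write_text(R_SCRIPT.format(pdf_name=pdf_name))
    return ["Rscript", str(script_path)]


if __name__ == "__main__":
    command = write_plot_script("make_plot.R")
    print("Run on the server with:", " ".join(command))
```

The same R script could just as well be written by hand in vim and run with `Rscript make_plot.R`; the resulting PDF can then be copied back to the laptop for viewing.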

written 18 months ago by Kevin Blighe (49k)
18 months ago by Constantine (240) wrote:

Sorry, but if your Professor cannot help you find a project, I recommend you find another lab in which to pursue your Master's thesis. As a Master's student you should expect to be given full guidance, as it is an essential step before heading towards a PhD program. Asking for project ideas in a forum is not going to help. Nevertheless, if you decide to stick with the current lab and shape your own project, then the GEO database or ENCODE is what you are looking for. Both databases have a huge amount of sequencing data.

Good luck!

written 18 months ago by Constantine (240)