Question: What "questions" could be answered with the RNAseq data that I have? I need to make my data fit for a certain experiment format.
I am a high-schooler who did a bioinformatics internship over the summer. Since I am pretty young, this was more of an introductory experience for me to learn how to use software to detect alternative splicing in RNAseq data. I accumulated a lot of data, but didn't necessarily have a very specific goal or research question that I was trying to answer. Now, I need to write a paper for a high school science competition using this data. The paper format requires me to have a very specific "purpose" with clear control and experimental groups. For example, the purpose might be to find the effect of caffeine on running speed: the control group is people running without caffeine, and the experimental group is people running after drinking caffeine. I am not sure how to fit my data into this format or what the control/experimental groups should be.

Here is the data that I have:

  • I used MISO software to find alternatively spliced isoforms in iPSC-differentiated dopaminergic (DA) and cortical neurons at various stages (iPS, day10, d20, ...). The output includes values like Percent Spliced In (PSI) and Bayes factors, with comparisons between different day stages as well as between cortical and DA neurons of the same day. I focused on an SNP in a gene to see if the SNP was creating an extra isoform (I ended up finding 1 read of this isoform). I also looked at the abundance of all the isoforms of the gene. The iPS cell lines I used, though, were just general DA and cortical neurons, and didn't specifically have the risk allele of the SNP edited in.
  • I then used HOMER software to produce some genome-wide statistics. I was provided with a huge list of genes that are differentially expressed (DE) between stages of neuronal development, and I used my data to find the percentage of overlap between DE and alternatively spliced (AS) genes at each stage.
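
For readers unfamiliar with the overlap statistic in the second bullet, here is a minimal sketch of how a per-stage DE/AS overlap percentage could be computed. It is not the HOMER workflow itself, only plain set arithmetic. The function name `overlap_percentages`, the stage labels, and the gene names (`GENE_A`, ...) are hypothetical placeholders, not real data.

```python
def overlap_percentages(de_genes, as_genes):
    """Return the shared genes and the overlap as a percentage of each list."""
    de, spliced = set(de_genes), set(as_genes)
    shared = de & spliced
    # Guard against empty lists so we never divide by zero.
    pct_of_de = 100.0 * len(shared) / len(de) if de else 0.0
    pct_of_as = 100.0 * len(shared) / len(spliced) if spliced else 0.0
    return {
        "shared": sorted(shared),
        "pct_of_de": pct_of_de,
        "pct_of_as": pct_of_as,
    }


# Hypothetical per-stage gene lists (placeholders, not real results).
stages = {
    "day10": {"de": ["GENE_A", "GENE_B", "GENE_C"], "as": ["GENE_B", "GENE_D"]},
    "d20": {"de": ["GENE_A", "GENE_D"], "as": ["GENE_A", "GENE_D", "GENE_E"]},
}

for stage, lists in stages.items():
    result = overlap_percentages(lists["de"], lists["as"])
    print(
        f"{stage}: {len(result['shared'])} shared genes, "
        f"{result['pct_of_de']:.1f}% of DE, {result['pct_of_as']:.1f}% of AS"
    )
```

Note that the percentage depends on which list is used as the denominator, so it is worth stating explicitly whether the overlap is reported relative to the DE genes or to the AS genes.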

Without data for iPSC-differentiated cell lines with the SNP genome-edited in, I can't write my paper on the effect of the SNP on alternative splicing in the specific gene, because I wouldn't have an experimental group. I am thinking of writing on the effect of neuron type on the SNP, since the SNP-caused isoform was more abundant in DA than in cortical neurons. It seems that the cortical neuron data would need to be my control group, while the DA neurons would be my experimental group (although I'm not even sure what the significance of this would be). The place where I did this internship focuses on genetic brain disorders like schizophrenia that could be affected by dopaminergic receptors, which is why I was doing work on DA neurons. What are some ways I can make my data fit the format I need?

Side note: The title would give data scientists nightmares. "I need to make my data fit" is a huge sin in that you're cherry-picking facts to fit theories :)

written 2.9 years ago by RamRS (19k)

One does experiments to try to answer some question(s), not the other way around. Getting data first and then fishing for answers to random questions is a recipe for disaster. That said, your internship work was probably done as part of a larger project addressing some specific question(s), so why not look into this a bit more? Can't your former supervisor give you some advice?

written 2.9 years ago by Jean-Karim Heriche (16k)
