Hello everyone,
I am new to bioinformatics. I have several tasks to do, but I am really confused about how to do them.
The task, as given by my professor, is:
Given a set of gene expression data (let's use RNAseq-data for fly to keep the memory and CPU efforts down), you map them to the genome. This gives, for each sample/data set, a single signal "expression value", i.e., coverage f(x) as a function of the genomic coordinate x.
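To make "coverage" concrete for myself: as far as I understand it, f(x) is simply the number of mapped reads overlapping position x. Below is a minimal sketch of that idea in plain Python. The function name `coverage_from_reads` and the toy read intervals are made up for illustration; with real data the intervals would come from the alignment file, and tools such as `samtools depth` or `bedtools genomecov` compute the same quantity directly from a BAM file.

```python
from itertools import accumulate

def coverage_from_reads(reads, genome_length):
    """Per-base coverage f(x) from half-open read intervals [start, end).

    `reads` is a hypothetical list of (start, end) alignment coordinates
    on a single chromosome of length `genome_length`.
    """
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1   # a read starts covering at `start`
        diff[end] -= 1     # and stops covering at `end`
    # The running sum of the +1/-1 marks is the coverage at each position.
    return list(accumulate(diff))[:genome_length]

# Toy example: three reads on a 10 bp "genome".
reads = [(0, 4), (2, 6), (5, 8)]
print(coverage_from_reads(reads, 10))
# [1, 1, 2, 2, 1, 2, 1, 1, 0, 0]
```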
Now the task is to compute segmentations of this signal, i.e., find a set of intervals on which f is approximately constant.
First do this for every dataset separately.
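My current understanding of "segmentation" for one data set, as a minimal sketch: choose breakpoints so that the squared deviation of f from its mean within each interval, plus a fixed penalty per interval, is as small as possible. This is only one of the simple approaches I have seen for time-series data, not necessarily the one intended here. The function name `segment` and the `penalty` value are my own assumptions, and the O(n^2) dynamic program would be far too slow for a whole chromosome without binning.

```python
def segment(signal, penalty):
    """Piecewise-constant segmentation by penalized least squares.

    Returns half-open intervals [(start, end), ...] minimizing
    (sum of within-interval squared error) + penalty * (number of intervals).
    """
    n = len(signal)
    # Prefix sums of values and squared values give O(1) interval costs.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(signal):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a, b):
        # Squared error of signal[a:b] around its own mean.
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] + [float("inf")] * n   # best[b]: optimal cost of signal[:b]
    prev = [0] * (n + 1)                # prev[b]: start of the last interval
    for b in range(1, n + 1):
        for a in range(b):
            c = best[a] + cost(a, b) + penalty
            if c < best[b]:
                best[b], prev[b] = c, a

    # Walk back through the stored interval starts.
    segments = []
    b = n
    while b > 0:
        segments.append((prev[b], b))
        b = prev[b]
    return segments[::-1]

f = [1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2]
print(segment(f, penalty=1.0))
# [(0, 4), (4, 8), (8, 11)]
```

A larger `penalty` gives fewer, longer intervals; a very small one splits the signal at every change.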
Now we have a more difficult problem. Given the f_i(x) for each data set i, find a segmentation so that EACH f_i is approximately constant on each interval.
Of course, you want segmentations that have as few intervals as possible.
I would suggest doing the following:
(1) find a set of about 12 different RNAseq data sets from the fruitfly and map them to the genome.
(2) re-implement the simplest segmentation algorithms for time series-like data and test them.
(3) check how consistent the results are.
(4) work out how to combine the different signals f_i to define a single criterion for segmenting the signals jointly.
The point now is that, of course, we want the number of segments we define to grow only slowly with I and eventually saturate, since otherwise you just wind up with every genomic position being its own interval, which is of course a useless segmentation.
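For point (4), one possible joint criterion I can think of (purely my own guess, not something given in the task) is to score an interval by the squared error of every f_i on it, averaged over the data sets, and again charge a fixed penalty per interval. Averaging instead of summing is an assumption on my part: it keeps the penalty on the same scale no matter how many data sets are added. The names `joint_segment` and `penalty` are hypothetical, and the sketch assumes all signals have the same length and are comparably scaled.

```python
def joint_segment(signals, penalty):
    """Shared segmentation for several equal-length signals f_i.

    Interval cost = mean over data sets of the within-interval squared
    error; a fixed `penalty` is added for every interval.
    """
    n = len(signals[0])
    k = len(signals)
    # Per-data-set prefix sums of values and squared values.
    sums = []
    for sig in signals:
        s1, s2 = [0.0], [0.0]
        for v in sig:
            s1.append(s1[-1] + v)
            s2.append(s2[-1] + v * v)
        sums.append((s1, s2))

    def cost(a, b):
        total = 0.0
        for s1, s2 in sums:
            t = s1[b] - s1[a]
            total += (s2[b] - s2[a]) - t * t / (b - a)
        return total / k   # average over data sets (assumed normalization)

    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for b in range(1, n + 1):
        for a in range(b):
            c = best[a] + cost(a, b) + penalty
            if c < best[b]:
                best[b], prev[b] = c, a

    segments = []
    b = n
    while b > 0:
        segments.append((prev[b], b))
        b = prev[b]
    return segments[::-1]

signals = [
    [0, 0, 0, 3, 3, 3],
    [2, 2, 2, 2, 2, 2],   # flat everywhere, so it adds no breakpoint
    [1, 1, 1, 4, 4, 4],
]
print(joint_segment(signals, penalty=1.0))
# [(0, 3), (3, 6)]
```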
Can anyone explain what these tasks mean exactly?
- Where can I get the RNAseq data (GEO, SRA, FlyBase, ...)?
- What is a "single signal expression value", and what is coverage?
- The sequences in these databases do not seem to include coverage. Should I calculate the coverage myself? If so, where do I get the number of reads?
- Which segmentation algorithms should be used?
Check How To Ask Good Questions On Technical And Scientific Forums for some guidelines for posting questions on technical and scientific forums. One general recommendation is "do not post homework questions".