Hello everyone,

I am new to bioinformatics. I have several tasks to do, but I am really confused about how to approach them. This is what my professor asked for:

Given a set of gene expression data (let's use RNA-seq data for fly to keep the memory and CPU requirements down), you map the reads to the genome. This gives, for each sample/data set, a single signal, the "expression value", i.e., the coverage f(x) as a function of the genomic coordinate x. The task is then to compute segmentations of this signal, i.e., to find a set of intervals on which f is approximately constant. First do this for every data set separately.

Now we have a more difficult problem: given the f_i(x) for each data set i, find a segmentation such that EACH f_i is approximately constant on each interval. Of course, you want segmentations that have as few intervals as possible. I would suggest doing the following:

(1) Find a set of about 12 different RNA-seq data sets from the fruit fly and map them to the genome.
(2) Re-implement the simplest segmentation algorithms for time-series-like data and test them.
(3) Check how consistent the results are.
(4) Work out how we can combine the different signals f_i to define a single criterion for segmenting the signal jointly.

The point is that we want the number of segments to grow only slowly with i and eventually saturate; otherwise you just wind up with every genomic position being its own interval, which is of course a useless segmentation.
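To make the task concrete, here is a minimal sketch of how I currently picture such a segmentation. It is only my assumption, not the method my professor has in mind: a dynamic program that minimises the squared deviation from the segment mean, summed over all signals, plus a fixed cost per segment. The names `segment` and `penalty` and the toy signals are made up for illustration.

```python
def segment(signals, penalty):
    """Penalised least-squares segmentation shared by all signals.

    signals: list of equal-length lists, one coverage track f_i per data set.
    penalty: cost paid per segment (assumed tuning knob); larger = fewer segments.
    Returns a list of half-open intervals (start, end).
    """
    n = len(signals[0])
    # Prefix sums of v and v^2 per signal, so a segment cost is an O(1) lookup.
    s1 = [[0.0] * (n + 1) for _ in signals]
    s2 = [[0.0] * (n + 1) for _ in signals]
    for k, f in enumerate(signals):
        for x, v in enumerate(f):
            s1[k][x + 1] = s1[k][x] + v
            s2[k][x + 1] = s2[k][x] + v * v

    def cost(a, b):
        # Sum over signals of squared deviations from the mean on [a, b).
        total = 0.0
        for k in range(len(signals)):
            s = s1[k][b] - s1[k][a]
            q = s2[k][b] - s2[k][a]
            total += q - s * s / (b - a)
        return total

    # best[b] = minimal total cost of segmenting positions [0, b).
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for b in range(1, n + 1):
        for a in range(b):
            c = best[a] + cost(a, b) + penalty
            if c < best[b]:
                best[b], prev[b] = c, a

    # Walk back through the stored breakpoints.
    intervals = []
    b = n
    while b > 0:
        intervals.append((prev[b], b))
        b = prev[b]
    return intervals[::-1]


f1 = [1, 1, 1, 5, 5, 5]   # toy coverage, data set 1
f2 = [2, 2, 2, 2, 9, 9]   # toy coverage, data set 2
print(segment([f1], 1.0))      # [(0, 3), (3, 6)]
print(segment([f2], 1.0))      # [(0, 4), (4, 6)]
print(segment([f1, f2], 1.0))  # [(0, 3), (3, 4), (4, 6)]
```

In this toy case the joint segmentation has three intervals while each separate one has two, which is the growth in segment count that the task description warns about. The double loop is quadratic in the signal length, so it would only be usable on short regions, not on a whole chromosome.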
Can anyone explain what these tasks mean exactly?
1- Where can I get the RNA-seq data (GEO, SRA, FlyBase...)?
2- What is a "single signal expression value", and what is coverage?
3- The sequences from the databases do not contain coverage, do they? Should I calculate the coverage myself? If so, where do I get the number of reads from?
4- Which segmentation algorithms should be used?
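
Regarding questions 2 and 3, my current guess is that the coverage f(x) is simply the number of mapped reads overlapping position x, so it would have to be computed after mapping rather than downloaded. Below is a small sketch of that understanding. The function name `coverage` and the read positions are made up, and I am assuming each mapped read is given as a half-open interval (start, end) on a single chromosome.

```python
def coverage(reads, genome_length):
    """Per-base coverage from mapped reads given as half-open (start, end)."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    cov, depth = [], 0
    for x in range(genome_length):
        depth += diff[x]
        cov.append(depth)
    return cov


# Hypothetical mapped reads on a toy genome of length 8.
reads = [(0, 4), (2, 6), (3, 5)]
print(coverage(reads, 8))  # [1, 1, 2, 3, 2, 1, 0, 0]
```

Is this the right idea, and is the resulting list what is meant by the signal f(x) for one data set?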