I am a physicist by training, and I am very comfortable programming and working in a UNIX environment. I recently took a programming / algorithm development job in biology / genetics / bioinformatics, fields that are very new to me. I have a set of mRNA data (in *.fastq format) produced by a sequencer, and I am developing a pipeline to look at relative gene expression.
To test my pipeline, I am trying to reanalyze an old data set. The problem is that the input data is several GB in size and a single run takes 7+ hours. This makes it challenging to learn how to use tophat and work with the data.
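One way to shorten the test cycle would be to run the pipeline on a small slice of the input first. The following is a minimal sketch, not part of the existing pipeline: the function name `take_first_reads` and its parameters are hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record (no wrapped sequence lines).

```python
from itertools import islice


def take_first_reads(src_path, dst_path, n_reads):
    """Copy the first n_reads FASTQ records to dst_path.

    Assumes uncompressed FASTQ with exactly 4 lines per record.
    Returns the number of complete records actually written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        while written < n_reads:
            record = list(islice(src, 4))  # header, sequence, '+', qualities
            if len(record) < 4:
                break  # end of file (or a truncated final record)
            dst.writelines(record)
            written += 1
    return written
```

The first reads in a file are not a random sample, so a subset like this is only suitable for checking that each pipeline step runs, not for drawing conclusions about expression levels.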
After aligning the reads with tophat v1.4.1 against UCSC mm10 (downloaded from https://support.illumina.com/sequencing/sequencing_software/igenome.html) and comparing the result to the previously analyzed data, roughly 15% of the lines in my output differ from the original output file. I did this comparison by using samtools v0.1.18 to write the BAM files (output from tophat) to text files and then comparing those with UNIX diff. Unfortunately, the person who did the original analysis is incommunicado.
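A line-by-line diff is sensitive to record order and to every column, including optional tags, so it may overstate how much the alignments themselves differ. A minimal sketch of an order-insensitive comparison on a few core SAM columns is given here. The function names are hypothetical, the choice of columns (QNAME, FLAG, RNAME, POS, CIGAR) is an assumption about what counts as "the same alignment", and the inputs are assumed to be the tab-separated text files written by samtools.

```python
def read_core_fields(sam_path):
    """Return the set of (QNAME, FLAG, RNAME, POS, CIGAR) tuples in a SAM text file."""
    records = set()
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@") or not line.strip():
                continue  # skip header lines and blank lines
            fields = line.rstrip("\n").split("\t")
            # SAM columns: 1 QNAME, 2 FLAG, 3 RNAME, 4 POS, 5 MAPQ, 6 CIGAR
            records.add((fields[0], fields[1], fields[2], fields[3], fields[5]))
    return records


def summarize_difference(old_path, new_path):
    """Count alignments shared by both files and those unique to each."""
    old = read_core_fields(old_path)
    new = read_core_fields(new_path)
    return {
        "shared": len(old & new),
        "only_old": len(old - new),
        "only_new": len(new - old),
    }
```

If the "only_old" and "only_new" counts are much smaller than the 15% seen with diff, the discrepancy is mostly ordering or tag differences rather than different alignments. Holding every record in a set uses a lot of memory on multi-GB files, so this is better suited to a reduced test data set.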
QUESTION: Are there any good tutorials on using tophat for sequence alignment, and on analyzing sequencing data for relative gene expression? Toy problems would be great.