I'm working on a project and could use some advice on best practices and tool suggestions! (I'm new to this!) Here's the data I have:
- 2 technical replicates of RNA-seq data for ~50 strains of a pathogenic fungus with a highly variable, highly repetitive genome
- 5 genomes in FastA format
- 1 published genome with FastA and GFF3
- 1 transcriptome (has 1 FastA file, and 1 GFF3 file) *more on this shortly
Progress thus far
At first I aligned my reads to the transcriptome (published by a former lab member) using TopHat and found that on average only 75% of reads aligned, though the range ran from 25% to 85%. Even for an organism with high genetic variability this seemed strange, so I built a local BLAST database from the 5 genome FastA files and queried it with the originally unaligned reads from all 50 strains. A substantial number of those reads matched the 5 genomes with high specificity, so I concluded that the transcriptome was missing many (~30%) genes.
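For anyone wanting to do a similar sanity check: here is a minimal sketch (not my exact script) of tallying how many previously unaligned reads get a confident BLAST hit, assuming tabular output (`-outfmt 6`) and a hypothetical 95% identity cutoff; read and contig names are made up.

```python
import csv
from io import StringIO

# A few hypothetical -outfmt 6 rows: qseqid sseqid pident length mismatch
# gapopen qstart qend sstart send evalue bitscore
blast_tab = StringIO(
    "read1\tgenome3_ctg7\t98.5\t100\t1\t0\t1\t100\t500\t599\t1e-45\t180\n"
    "read1\tgenome1_ctg2\t91.0\t100\t9\t0\t1\t100\t800\t899\t1e-30\t130\n"
    "read2\tgenome2_ctg1\t99.0\t95\t1\t0\t1\t95\t10\t104\t1e-42\t170\n"
)

total_unaligned = 4  # total reads that failed to map to the transcriptome

# Count distinct reads with at least one confident hit (>= 95% identity)
hits = set()
for row in csv.reader(blast_tab, delimiter="\t"):
    qseqid, pident = row[0], float(row[2])
    if pident >= 95.0:
        hits.add(qseqid)

print(f"{len(hits)}/{total_unaligned} unaligned reads hit the genomes "
      f"({100 * len(hits) / total_unaligned:.0f}%)")
```

On real data you'd point `csv.reader` at the BLAST output file instead of the inline example.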
At this point I went back to the beginning and aligned each of the 50 strains to each of the 5 genomes (TopHat; 250 alignments), assembled transcripts from those alignments (Cufflinks), and merged the transcript files per genome into 5 transcriptome assemblies (CuffMerge).
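To give a sense of the bookkeeping, here is a sketch of the driver loop generating the 250 strain-by-genome TopHat runs plus the matching Cufflinks calls; strain names, index names, and output paths are placeholders, and the commands are only printed, not executed.

```python
from itertools import product

strains = [f"strain{i:02d}" for i in range(1, 51)]   # 50 strains
genomes = [f"genome{j}" for j in range(1, 6)]        # 5 reference genomes

commands = []
for genome, strain in product(genomes, strains):
    outdir = f"tophat_out/{genome}/{strain}"
    # Two technical replicates per strain, comma-separated for TopHat
    reads = f"{strain}_rep1.fastq,{strain}_rep2.fastq"
    commands.append(f"tophat -o {outdir} {genome}_index {reads}")
    commands.append(
        f"cufflinks -o cufflinks_out/{genome}/{strain} {outdir}/accepted_hits.bam"
    )

print(len(commands) // 2, "strain-genome alignments")  # 250 pairs
print(commands[0])
```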
I then began running CuffDiff to analyze differential gene expression, which meant the original 50 read sets were being analyzed 5 times, once against each new transcriptome.
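One detail worth noting for anyone reproducing this: CuffDiff treats comma-separated BAMs as replicates of one condition and space-separated groups as separate conditions. A sketch of assembling such a call (strain and file names are hypothetical, and the command is only printed):

```python
# Compare two hypothetical strains, grouping each strain's two technical
# replicates with commas so CuffDiff treats them as replicates.
merged_gtf = "merged_asm/merged.gtf"  # reference transcriptome from CuffMerge
cond_a = ["strain01_rep1.bam", "strain01_rep2.bam"]
cond_b = ["strain02_rep1.bam", "strain02_rep2.bam"]

cmd = (
    f"cuffdiff -o cuffdiff_out -L strain01,strain02 {merged_gtf} "
    f"{','.join(cond_a)} {','.join(cond_b)}"
)
print(cmd)
```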
My mentor suggested killing that process and instead merging the 5 new transcriptomes together with the original, incomplete transcriptome. My plan is to use CuffMerge again, feeding it a list of my 5 transcriptome GTF files along with the original transcriptome's annotation (I'd have to convert GFF3 -> GTF) to make one master reference.
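Concretely, the plan above would look something like this sketch: gffread (which ships with the Cufflinks suite) can do the GFF3 -> GTF conversion via its `-T` flag, and CuffMerge takes a plain-text manifest listing one GTF path per line. All file names here are placeholders, and the tool commands are only shown as comments/strings, not run.

```python
from pathlib import Path

# The five per-genome CuffMerge transcriptomes plus the converted original
# annotation. The conversion itself would be something like:
#   gffread original_annotation.gff3 -T -o original_annotation.gtf
gtf_files = [f"merged_asm_genome{j}/merged.gtf" for j in range(1, 6)]
gtf_files.append("original_annotation.gtf")

# CuffMerge expects a text file with one GTF path per line
manifest = Path("assembly_list.txt")
manifest.write_text("\n".join(gtf_files) + "\n")

print("cuffmerge -o master_asm assembly_list.txt")
```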
Does this seem OK? I would greatly appreciate any advice that anyone has to offer! Thanks!