RNA-Seq DE advice please!
8.4 years ago
mss6148 ▴ 10

I'm working on a project and could use some advice on best practices and tool suggestions. (I'm new to this!)

The Setup

I have:

  1. 2 technical replicates of RNA-seq data for ~50 strains of a pathogenic fungus with a highly variable, highly repetitive genome.
  2. 5 genomes in FASTA format
  3. 1 published genome with FASTA and GFF3
  4. 1 transcriptome (1 FASTA file and 1 GFF3 file) *more on this shortly

Progress thus far

At first I aligned my reads to the transcriptome (published by a former lab member) using Tophat and found that on average only ~75% of reads aligned, with a range of 25% to 85% across strains. Even for an organism with high genetic variability this seemed strange, so I built a local BLAST database from the 5 genome FASTAs and queried it with the originally unaligned reads (for all 50 strains). A significant fraction of those reads matched the 5 genomes with high specificity, so I concluded that the transcriptome was missing many (~30%) genes.
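For anyone wanting to reproduce that check, the BLAST step can be sketched roughly as below. All file names are placeholders, and the commands are only printed as a dry-run plan, not executed:

```shell
# Sketch only -- every file name is a placeholder for the poster's data.
# Build one nucleotide BLAST database from the five genome FASTAs, then
# query it with the reads Tophat left unaligned (-outfmt 6 is tabular;
# column 12 is the bit score). Printed as a dry run; run the commands
# by hand once the paths are real.
cmds=(
  "cat genome1.fa genome2.fa genome3.fa genome4.fa genome5.fa > all_genomes.fa"
  "makeblastdb -in all_genomes.fa -dbtype nucl -out fungus_db"
  "blastn -db fungus_db -query unmapped_reads.fa -outfmt 6 -evalue 1e-10 -out unmapped_vs_genomes.tsv"
)
printf '%s\n' "${cmds[@]}"
```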

At this point I went back to the beginning and aligned each of the 50 strains to each of the 5 genomes with Tophat (250 alignments), assembled transcripts from each alignment with Cufflinks, and finally merged the transcript files (per genome) into 5 transcriptome assemblies with CuffMerge.
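The 50-strain × 5-genome loop described above can be sketched as a dry run (directory layout and index names are hypothetical: reads in reads/strainNN.fq, one Bowtie index per genome named genome1..genome5):

```shell
# Dry-run sketch of the 250-alignment Tophat -> Cufflinks -> CuffMerge
# pipeline. Nothing is executed; each command is only printed.
plan_assemblies() {
  for genome in genome1 genome2 genome3 genome4 genome5; do
    for strain in $(seq -w 1 50); do
      echo "tophat -o th_${genome}_s${strain} ${genome} reads/strain${strain}.fq"
      echo "cufflinks -o cl_${genome}_s${strain} th_${genome}_s${strain}/accepted_hits.bam"
    done
    # one merged assembly per genome, from that genome's 50 Cufflinks GTFs
    echo "ls cl_${genome}_s*/transcripts.gtf > ${genome}_gtfs.txt"
    echo "cuffmerge -o merged_${genome} ${genome}_gtfs.txt"
  done
}
plan_assemblies
```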

I then began running CuffDiff to analyze differential gene expression, processing the original 50 read files 5 times, once against each new transcriptome.

My mentor suggested it would be better to kill that process and instead merge the 5 new transcriptomes together with the original (incomplete) transcriptome. My plan is to use CuffMerge again, feeding it a list of my 5 transcriptome GTF files along with the original transcriptome's GTF (I'd have to convert its GFF3 to GTF) to make one master reference.
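That master-reference merge could look something like the sketch below (file names are placeholders; gffread, which ships with the Cufflinks suite, does the GFF3-to-GTF conversion via its -T flag):

```shell
# Dry-run sketch of building one master reference from the 5 merged
# assemblies plus the original transcriptome. All paths are hypothetical.
steps=(
  "gffread original_transcriptome.gff3 -T -o original_transcriptome.gtf"
  "ls merged_genome*/merged.gtf > all_assemblies.txt"
  "echo original_transcriptome.gtf >> all_assemblies.txt"
  "cuffmerge -o master_ref -s published_genome.fa all_assemblies.txt"
)
printf '%s\n' "${steps[@]}"
```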

Question

Does this seem OK? I would greatly appreciate any advice that anyone has to offer! Thanks!

RNA-Seq • 2.2k views

Yeah, I think that's the point of CuffMerge: you can make one big transcriptome in which some transcripts will be absent from some samples, but all samples can be aligned and quantified against the same reference. Then run it through RSEM to get TPM for each transcript in each sample. Later you can see which transcripts are present in which samples, and at what level.
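The RSEM step suggested here can be sketched as follows (names are hypothetical: the merged GTF from CuffMerge, one FASTQ per strain). Printed as a dry run:

```shell
# Sketch: build one RSEM reference from the merged annotation, then
# quantify each of the 50 strains against it. Commands are only printed.
plan_rsem() {
  echo "rsem-prepare-reference --gtf master_ref/merged.gtf published_genome.fa rsem_ref"
  for strain in $(seq -w 1 50); do
    echo "rsem-calculate-expression --bowtie2 reads/strain${strain}.fq rsem_ref strain${strain}"
  done
}
plan_rsem
```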

8.4 years ago
mss6148 ▴ 10

Thank you Karl!

I have another question if you're still around.

When I initially aligned the reads for each sample to the transcriptome, on average only ~75% of reads actually aligned. I built a local BLAST database from the 5 published genomes, searched it with my unaligned reads, and found that a very significant fraction hit with very good scores. This led me to believe that the transcriptome is missing many genes.

My question is what is the best way to add these missing genes to the current transcriptome?

Take the BLAST hits that exceed a certain score, convert them to FASTA, run them through Tophat (against one of the published genomes) to produce assembled transcripts, and then use CuffMerge to merge this new assembly with the old transcriptome?
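A rough sketch of that proposed rescue step, for concreteness (the bit-score cutoff of 100 is an arbitrary placeholder, file names are hypothetical, and seqtk is just one common tool for pulling reads by ID):

```shell
# Dry-run sketch of the proposed rescue. Column 12 of BLAST -outfmt 6
# is the bit score; 100 is a placeholder threshold, not a recommendation.
plan=(
  "awk '\$12 >= 100 {print \$1}' unmapped_vs_genomes.tsv | sort -u > hit_ids.txt"
  "seqtk subseq unmapped_reads.fa hit_ids.txt > rescued_reads.fa"
  "tophat -o th_rescued genome1 rescued_reads.fa"
  "cufflinks -o cl_rescued th_rescued/accepted_hits.bam"
  "printf '%s\n' cl_rescued/transcripts.gtf original_transcriptome.gtf > rescue_gtfs.txt"
  "cuffmerge -o merged_plus_rescued rescue_gtfs.txt"
)
printf '%s\n' "${plan[@]}"
```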

Thanks in advance!


Why are you BLASTing reads? Map your reads to the genome with Tophat, rather than to transcriptomes, so you can see how many reads should be mapping and roughly what percentage you are missing. You can also predict transcripts de novo from that alignment and BLAST the predicted transcripts against the existing ones to see what was missing initially. There is a lot of digging around in this data that you will need to do before understanding it at a very good level.
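The genome-first approach recommended here can be sketched as a dry run (index and file names are hypothetical; samtools flagstat reports the mapped-read fraction, and gffread -w extracts transcript FASTA from a GTF plus genome):

```shell
# Sketch: map to the genome, check the mapping rate, assemble transcripts
# de novo, and BLAST them against the old transcriptome. Printed only.
plan_genome_first() {
  echo "tophat -o th_genome genome1 reads/strain01.fq"
  echo "samtools flagstat th_genome/accepted_hits.bam"
  echo "cufflinks -o denovo th_genome/accepted_hits.bam"
  echo "gffread -w denovo_transcripts.fa -g genome1.fa denovo/transcripts.gtf"
  echo "blastn -query denovo_transcripts.fa -subject old_transcriptome.fa -outfmt 6 -out denovo_vs_old.tsv"
}
plan_genome_first
```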
