Run RNA-seq analysis with selected targets
4.2 years ago
dodausp ▴ 180

I have obtained a considerably large amount of data (~750 samples) from a collaboration study to perform RNA-seq analysis. The data are not raw, and by inspecting the files and some metadata, here is what I gathered:

  • the file sizes range from a few MB to a couple of GB;
  • all files have been aligned with Bowtie;
  • all files have been mapped with TopHat2; and
  • all files have been assembled on hg19.

I tried running a couple of samples through Cufflinks, but it took around 4 hours to complete just one sample. Here is the command used:

cufflinks -b hg19.fa -g hg19.refGene.gtf -u [sample.bam]

Well, of course I eventually want to do a thorough analysis of all these samples, but for the time being we are interested in assessing the expression level of a set of 40 targets (transcribed genes). That being said, I would like to ask two questions:

1. Is there a way to run Cufflinks quantification on all samples for only this set of genes (in order to make it faster)? If so, how?

2. Because the samples have already been processed with Bowtie and TopHat2, do they necessarily need to be run through Cufflinks for RNA-seq quantification? If not, would Cufflinks still be the best option, or are there better ones?

Any help is greatly appreciated, given that taking 4 hours to finish each sample's quantification would consume an unreasonable amount of time.

Thanks in advance!

RNA-Seq alignment cufflinks
ATpoint wrote:

I strongly recommend starting from the raw data. Is that possible? You have no control over what others did with these files. TopHat/Cufflinks are old tools and are by now considered deprecated. An analysis can have many pitfalls, so be sure to have your own pipeline and do not rely on the work of others; you are the analyst, so you decide how to process the data.

If possible, get the raw sequencing data, quantify with a tool such as salmon, which is fast and memory-efficient, and then aggregate the transcript-level estimates it produces to the gene level with tximport. From there, proceed with the downstream analysis. Always include all genes in the analysis and filter afterwards: most statistical frameworks assume many genes in the dataset for their models to work properly, for example in the normalization and dispersion estimation steps.
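
For reference, a minimal sketch of that salmon workflow, assuming paired-end FASTQ files are available (the transcriptome FASTA, sample names, output paths, and thread count are placeholders):

# build a salmon index from a reference transcriptome (done once)
salmon index -t hg19.transcripts.fa -i salmon_index

# quantify one sample from its raw FASTQ files
salmon quant -i salmon_index -l A -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -p 8 -o quants/sample1

# quants/sample1/quant.sf then holds the transcript-level estimates,
# which can be aggregated to the gene level with tximport in R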

dodausp (original poster) replied:

Thank you @ATpoint. I agree with you; having the raw data gives one more flexibility and control over the data. The issue, however, is that the data was retrieved from a repository with restricted access, and the only data we have access to are those BAM files, already pre-processed with Bowtie and TopHat2. So, unfortunately, we cannot change that. And you are also right about TopHat/Cufflinks being old; their last release was in 2014(!).

So, would Cufflinks still be the only, or the most appropriate, solution in this case?

In any case, thanks a lot!

Reply:

If you have BAM files, you can still use other tools to produce the count matrix. I would personally use salmon in its alignment-based mode (see the salmon docs), which will produce transcript abundance estimates. These you can then aggregate to the gene level with tximport, which gives you the raw count data to start your actual analysis.
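
As a rough sketch, with placeholder file names, and assuming the BAM contains alignments to the transcriptome (salmon's alignment-based mode needs transcript-coordinate alignments, not genome-coordinate ones):

# quantify from an existing transcriptome BAM; transcripts.fa must be the
# same transcript set the reads were aligned against
salmon quant -t transcripts.fa -l A -a sample.transcriptome.bam -p 8 -o quants/sample1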

Reply:

By the way, if the BAM files contain all reads (i.e., they have not been filtered), you can always convert them back to FASTQ and run any quantifier you want.
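
A minimal sketch with samtools, assuming paired-end reads and placeholder file names:

# group reads by name, then write the pairs back to FASTQ
samtools collate -u -O sample.bam | samtools fastq -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -0 /dev/null -s /dev/null -n -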
