Question: Sanity Check on RNA-Seq featureCounts / DESeq2
gravatar for gwebste7
15 months ago by
gwebste70 wrote:

Hello. My colleague and I are trying to run a differential gene analysis to compare protein activity in cutthroat trout under normal and heat stressed condition. We are following in the footsteps of an earlier article that did pretty much the same thing in rainbow trout. While we wait until our data is available, we are trying to duplicate the steps followed by the rainbow trout researchers, who published their data sets as SRA files. Here's what we've done so far.

Acquire Data Files 1) Download the O.mykiss (rainbow trout) complete genome from NCBI in FASTA format.
2) Download the RNA-Seq data files DRR001887 and DRR001888 for the published paper. We have these in FASTQ format.
3) Download the GFF file from NCBI for the genome.

Initially, we tried to do the work in Galaxy 18, but we could never get it to use more than one CPU, so we ran the following commands manually. We thought we would use hisat2 for alignment and DESeq2 for differential genome expression. My guess is that the problem is somewhere near the end.

Phase 1 - use hisat2 to align the published SRA data against the reference O.mykiss genome. The goal is to create a SAM file, then convert to a BAM file. I also use samtools create a BAI file to aid visualization in IGV (which seems to work fine) and sort the BAM file. I use the the optionto run 4 threads.

1) sudo  hisat2-build -p 4 dataset_6.dat newgome
2) sudo ln -f -s dataset_5.dat
3) sudo hisat2 -p 4 -x 'newgenome' -U input_f.fastq.gz
4) sudo samtools view -bS -@ 4 first.sam > first.bam
5) sudo samtools index first.bam first.bai -@ 4
6) sudo samtools sort first.bam first_sorted.bam -@ 4

At this point, I have a sorted BAM file. My plan is to use this BAM file to generate gene expression counts with featureCounts. That is supposed to create a counts.txt file that can be run through DESeq2.

7) Install R 3.5 and related libraries, following directions on Bioconductor. After correcting some issues related to Ubuntu 18.04 LTS and incorrect repository, this step seems to go without incident.

At this point, I realize that I need the GFF genome annotation for O.mykiss. I download this from NCBI and gunzip it. 8) wget ... 9) gunzip -c GFC*.gz > O.mykiss.gff

I download and install featureCounts with sudo apt install . 10) featureCounts -T 4 -F 'GFF' -a O.mykiss.gff -o counts.txt first_sorted.bam

At this point, the command seems to run for a few minutes and does produce the counts.txt and counts.txt.summary files. When I run the featureCounts command from the console, I get a warning file about my GFF file not being in the correct format. It appears to run anyway. The output shows that it processed 860000 features and 48.3% of the assigned reads. (See attached screen shot.)

My concern is that when I look at the counts.txt file (29.5 MB), there are thousands of rows of data, but I don't see anything resembling gene names or counts. The files from other examples do not resemble mine. I feel like my counts.txt is coming back with nonsense because my GFF file is somehow incompatible or not in sync with my alignment BAM file. Since I am pulling data from multiple researchers, perhaps there is some disconnect.

I tried converting the GFF to GTF file with gffread, which worked, but did not yield anything different.

I also try a few command line switches in featureCounts for -g and -T, but this seems to yield nothing different.

Here are links to my counts.txt and GFF. Again, the GFF is not my own but from NCBI's page on the rainbow trout genome. The DRR SRA files are also from third party researchers. (29 MB)

O.mykiss Genome in FASTA
O.mykiss GFF file
DRR SRA file #1
DRR SRA file #2

If any kind soul has any feedback on what I am doing wrong, I would greatly appreciate it. I am new to bioinformatics, just trying to get up to speed. I am happy to reconfigure my Linux VM or download other alternative tools, as needed. I have a VMware VM running with plenty of RAM and storage.
I'm happy to sweeten the deal with Amazon AWS credits or cash if anyone would show me the best practices for how to complete this exercise. I would even sponsor someone to visit us in Denver for a weekend if they felt like doing a little hand holding on this project. We think hisat2 / DESeq2 is the best approach based on reading through various exercises and tutorials, but we don't know enough to make an informed determination. We need to figure this out before we send off our pure RNA for sequencing this summer.

Thanks for all your help. George

rna-seq deseq featurecounts • 715 views
ADD COMMENTlink modified 15 months ago by Michael Dondrup47k • written 15 months ago by gwebste70

Hi, I think your counts look ok. The ids used are transcripts as expected. Only settings I would add is strand options in featureCounts. Counts are in the last column with the file name as header. Cheers

ADD REPLYlink modified 15 months ago • written 15 months ago by Michael Dondrup47k

Another thing, why are you running your commands via sudo? That is not recommended.

ADD REPLYlink written 15 months ago by Michael Dondrup47k
gravatar for Michael Dondrup
15 months ago by
Bergen, Norway
Michael Dondrup47k wrote:

With respect to the output, everything is as expected when using the recorded parameters, which are a bit different from what you gave in your protocol:

# Program:featureCounts v1.6.0; Command:"featureCounts" "-T" "4" "-t" "exon" "-g" "transcript_id" "-F" "GTF" "-a" "O.mykiss.gtf" "-o" "counts.txt" "first_sorted.bam"

The warning might be caused by featureCounts expecting a GTF file while you provided a GFF3 formatted file, but that should not cause problems

The output format is documented in the column headers:

Geneid  Chr Start   End Strand  Length  first_sorted.bam
*rna-NC_001717.1:1004..1071*    NC_001717.1 1004    1071    +   68  **18**

I marked the transcript id * * which is in the Geneid column and the counts * *

With respect to differential expression analysis, this dataset is not suitable for such analysis, unfortunately.

Have a look at the Bioproject:

There are only 2 samples and no replication, therefore, there is no meaningful way to conduct DE analysis between the transcriptomes. I hope that your experiment design contains replication.

There are also some specific aspects about your specific fish that are very relevant to the analysis: Oncorhynchus clarkii

  • O. clarkii, unlike rainbow trout (Oncorhynchus mykiss), has no public reference genome. Unless you have a non-public version in your lab, you have to use a transcriptome assembly to quantify. Therefore, the workflow presented does not apply directly.
  • On the other hand, possibly the transcripts could be alignable across species, however this needs to be established first, and is unlikely to work well due to the next point.
  • O. clarkii is a salmonid. Salmonids have an additional round of whole genome duplication in comparison to other teleosts, yielding genomes with large blocks of high sequence similarity (Lien et al 2016, Figure 1) This might pose additional difficulties in transcript quantification, and might opt against using pseudo alignments.
ADD COMMENTlink modified 15 months ago • written 15 months ago by Michael Dondrup47k

Hi, Michael. Thanks for helping as we stumble around like this.

  1. Regarding the data from this project, how would someone like us replicate the prior researcher's workflow and confirm their conclusion? If we have only the two samples, which I was thinking about as the control vs. the experimental samples, is it a question of robustness of results, or is the lack of replication in control vs. experimental more fundamental than that? It is not too late for us to rework our experimental design at this point, so better to know now than after we've spent the money.

  2. We do not have a private reference genome for O. clarkii. When the experiment was conceived, I think it was hoped that we could do an alignment against O. mykiss's reference genome, but that hope may not hold up, and I don't want it to be a showstopper for my colleague. What do you recommend for assembly given limited resources?

ADD REPLYlink written 15 months ago by gwebste70


Firstly, I would always use an experiment that is published in a good journal to reproduce the steps. I cannot find a publication attached to the Bioproject you are using. Try this query in SRA:, then look if you find something interesting. For example, try this one: (Fish & Shellfish Immunology is a good journal in that field, but it's Elsevier, so if you can get access...) I wouldn't exaxtly replicate their computational methods (TopHat + Htseq + DEseq(1?)) but you might want to try your pipeline and if you can beat their mapping rates (53%). As you can see, already with the same organism, mapping rates are not always that great.

Secondly, with respect to not having replicates, this case has been discussed here very often, and I am not going to repeat it in depth. In short, you don't get to do a valid statistical analysis based on a single sample per group, as it is impossible to estimate variability. There are some options in EdgeR to deal with non-replicated experiments but I assume the authors already regret having implemented that.

Regarding the experiment design, hope is not a specifically great adviser ;) You can however test the mappability of reads by using one of the published datasets from the organism on SRA. If you cannot go down that road, you can assemble the transcriptome using your own data or better some independent published datasets in Trinity, etc. Then map back the reads to the assembly and quantify.

ADD REPLYlink modified 15 months ago • written 15 months ago by Michael Dondrup47k
gravatar for ATpoint
15 months ago by
ATpoint36k wrote:

I would go for the salmon-tximport-DESeq2 pipeline. It is computationally inexpensive and also most up-to-date in terms of GC bias correction and handling of multimapping reads. Should be possible to run on a laptop. If you need fastq files from NCBI, consider downloading them in fastq format directly from the ENA ( Fast download of FASTQ files from the European Nucleotide Archive (ENA) ). Simply follow the linked workflow.

Some general things: Do not run commands via sudo. This is not necessary if tools are set up properly. Also do better not invest time in posting download links. You seem trustworthy but no sane user will ever klick download links from an unknown source. ALso please use the code option to highlight code and data examples. If there are any questions feel free to ask.

enter image description here

ADD COMMENTlink modified 15 months ago • written 15 months ago by ATpoint36k

Thanks for the tip on the salmon-tximport-DESeq2 pipeline. I will check into it.

ADD REPLYlink written 15 months ago by gwebste70

I don't think DESeq will handle 1 replicate/condition at all. EdgeR will, though what you get is of limited value.

ADD REPLYlink written 15 months ago by swbarnes28.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 659 users visited in the last hour