Question

Sanity Check on RNA-Seq featureCounts / DESeq2

5.0 years ago

gwebste7 • 0

Hello. My colleague and I are trying to run a differential gene expression analysis to compare protein activity in cutthroat trout under normal and heat-stressed conditions. We are following in the footsteps of an earlier article that did pretty much the same thing in rainbow trout. While we wait for our own data to become available, we are trying to reproduce the steps followed by the rainbow trout researchers, who published their data sets as SRA files. Here's what we've done so far.

Acquire Data Files
1) Download the O.mykiss (rainbow trout) complete genome from NCBI in FASTA format.
2) Download the RNA-Seq data files DRR001887 and DRR001888 for the published paper. We have these in FASTQ format.
3) Download the GFF file from NCBI for the genome.

Initially, we tried to do the work in Galaxy 18, but we could never get it to use more than one CPU, so we ran the following commands manually. We thought we would use hisat2 for alignment and DESeq2 for differential gene expression. My guess is that the problem is somewhere near the end.

Phase 1 - use hisat2 to align the published SRA data against the reference O.mykiss genome. The goal is to create a SAM file, then convert it to a BAM file. I also use samtools to create a BAI file to aid visualization in IGV (which seems to work fine) and to sort the BAM file. I use the threading options (-p 4 for hisat2, -@ 4 for samtools) to run 4 threads.

1) sudo hisat2-build -p 4 dataset_6.dat newgenome
2) sudo ln -f -s dataset_5.dat input_f.fastq.gz
3) sudo hisat2 -p 4 -x 'newgenome' -U input_f.fastq.gz
4) sudo samtools view -bS -@ 4 first.sam > first.bam
5) sudo samtools index first.bam first.bai -@ 4
6) sudo samtools sort first.bam first_sorted.bam -@ 4
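
For comparison, here is a minimal sketch of how steps 1-6 are commonly written with HISAT2 and samtools 1.3 or later, run as a normal user. It differs from the list above in three ways: the SAM output is named explicitly with -S, the alignments are sorted before they are indexed (samtools index expects a coordinate-sorted BAM), and the sorted output is given with -o. The names genome.fa and reads.fastq.gz are placeholders for dataset_6.dat and dataset_5.dat, and the run helper is a hypothetical dry-run wrapper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

THREADS=4
GENOME=genome.fa        # placeholder for the genome FASTA (dataset_6.dat)
READS=reads.fastq.gz    # placeholder for the single-end reads (dataset_5.dat)
INDEX=newgenome         # HISAT2 index prefix

align_plan() {
    # Build the HISAT2 index from the genome FASTA.
    run hisat2-build -p "$THREADS" "$GENOME" "$INDEX"
    # Align single-end reads and write the SAM file explicitly with -S.
    run hisat2 -p "$THREADS" -x "$INDEX" -U "$READS" -S first.sam
    # Sort by coordinate straight from SAM to BAM; -o names the output file.
    run samtools sort -@ "$THREADS" -o first_sorted.bam first.sam
    # Index the sorted BAM (creates first_sorted.bam.bai) for IGV.
    run samtools index first_sorted.bam
}

plan="$(align_plan)"
echo "$plan"
```

Running it as-is only prints the four commands; setting DRY_RUN=0 would execute them, assuming hisat2 and samtools are on the PATH and the placeholder files exist.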

At this point, I have a sorted BAM file. My plan is to use this BAM file to generate gene expression counts with featureCounts. That is supposed to create a counts.txt file that can be run through DESeq2.

7) Install R 3.5 and related libraries, following the directions on Bioconductor. After correcting some issues related to Ubuntu 18.04 LTS and an incorrect repository, this step seems to go without incident.

At this point, I realize that I need the GFF genome annotation for O.mykiss. I download this from NCBI and gunzip it.
8) wget ...
9) gunzip -c GFC*.gz > O.mykiss.gff

I download and install featureCounts with sudo apt install .
10) featureCounts -T 4 -F 'GFF' -a O.mykiss.gff -o counts.txt first_sorted.bam

At this point, the command runs for a few minutes and produces the counts.txt and counts.txt.summary files. When I run the featureCounts command from the console, I get a warning about my GFF file not being in the correct format. It appears to run anyway. The output shows that it processed 860000 features and assigned 48.3% of the reads. (See attached screen shot.)
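
The assignment rate can also be recomputed from counts.txt.summary, which is a tab-separated table with one status per row. A minimal sketch, assuming that layout; the file built here is an illustrative stand-in with made-up numbers, not the real summary:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative stand-in for counts.txt.summary (tab-separated; numbers are made up).
summary="$(mktemp)"
printf 'Status\tfirst_sorted.bam\n'      >  "$summary"
printf 'Assigned\t483\n'                 >> "$summary"
printf 'Unassigned_NoFeatures\t400\n'    >> "$summary"
printf 'Unassigned_MultiMapping\t100\n'  >> "$summary"
printf 'Unassigned_Ambiguity\t17\n'      >> "$summary"

# Sum every status row of the first sample column and report the assigned share.
rate="$(awk -F'\t' 'NR > 1 { total += $2; if ($1 == "Assigned") assigned = $2 }
                    END { printf "%.1f", 100 * assigned / total }' "$summary")"
echo "Assigned: ${rate}% of reads"
rm -f "$summary"
```

The Unassigned_* rows show where the remaining reads went (no overlapping feature, multi-mapping, ambiguity), which is usually more informative than the single percentage.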

My concern is that when I look at the counts.txt file (29.5 MB), there are thousands of rows of data, but I don't see anything resembling gene names or counts. The files from other examples do not resemble mine. I feel like my counts.txt is coming back with nonsense because my GFF file is somehow incompatible or not in sync with my alignment BAM file. Since I am pulling data from multiple researchers, perhaps there is some disconnect.

I tried converting the GFF to a GTF file with gffread, which worked, but did not yield anything different.

I also tried a few command line switches in featureCounts for -g and -T, but this yielded nothing different.

Here are links to my counts.txt and GFF. Again, the GFF is not my own but from NCBI's page on the rainbow trout genome. The DRR SRA files are also from third party researchers.

https://crypticresponse.s3.amazonaws.com/static/rnaseq/counts.txt (29 MB)
https://crypticresponse.s3.amazonaws.com/static/rnaseq/counts.txt.summary

O.mykiss Genome in FASTA
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/omykiss_genome.fasta
O.mykiss GFF file
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/O.mykiss.gff
DRR SRA file #1
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/DRR001887.sra
DRR SRA file #2
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/DRR001888.sra

If any kind soul has any feedback on what I am doing wrong, I would greatly appreciate it. I am new to bioinformatics, just trying to get up to speed. I am happy to reconfigure my Linux VM or download other alternative tools, as needed. I have a VMware VM running with plenty of RAM and storage.
I'm happy to sweeten the deal with Amazon AWS credits or cash if anyone would show me the best practices for how to complete this exercise. I would even sponsor someone to visit us in Denver for a weekend if they felt like doing a little hand holding on this project. We think hisat2 / DESeq2 is the best approach based on reading through various exercises and tutorials, but we don't know enough to make an informed determination. We need to figure this out before we send off our pure RNA for sequencing this summer.

Thanks for all your help. George

RNA-Seq DESeq featureCounts • 2.3k views

ADD COMMENT • link updated 5.0 years ago by Michael 54k • written 5.0 years ago by gwebste7 • 0

Hi, I think your counts look OK. The IDs used are transcripts, as expected. The only setting I would add is the strand option in featureCounts. Counts are in the last column, with the file name as header. Cheers

ADD REPLY • link 5.0 years ago by Michael 54k

Another thing, why are you running your commands via sudo? That is not recommended.

ADD REPLY • link 5.0 years ago by Michael 54k

score 0 · Answer 1 · 2019-04-23

With respect to the output, everything is as expected when using the recorded parameters, which are a bit different from what you gave in your protocol:

# Program:featureCounts v1.6.0; Command:"featureCounts" "-T" "4" "-t" "exon" "-g" "transcript_id" "-F" "GTF" "-a" "O.mykiss.gtf" "-o" "counts.txt" "first_sorted.bam"

The warning might be caused by featureCounts expecting a GTF file while you provided a GFF3-formatted file, but that should not cause problems.
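
Since featureCounts selects features by type with -t (column 3 of the annotation) and groups them by the attribute named with -g (column 9), it can help to check which feature types and attribute keys the annotation actually contains before choosing those values. A minimal sketch; the four GFF3 records are invented for illustration and only mimic the NCBI style:

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # stable sort order

# Invented GFF3 records in NCBI style (tab-separated), for illustration only.
gff="$(mktemp)"
printf '##gff-version 3\n' > "$gff"
printf 'chr1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene-abc1;Name=abc1\n' >> "$gff"
printf 'chr1\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-XM_1;Parent=gene-abc1;gene=abc1\n' >> "$gff"
printf 'chr1\tRefSeq\texon\t100\t300\t.\t+\t.\tID=exon-1;Parent=rna-XM_1;gene=abc1\n' >> "$gff"
printf 'chr1\tRefSeq\texon\t500\t900\t.\t+\t.\tID=exon-2;Parent=rna-XM_1;gene=abc1\n' >> "$gff"

# Feature types in column 3: candidates for the -t option.
feature_types="$(grep -v '^#' "$gff" | cut -f3 | sort | uniq -c \
    | awk '{ print $2 "=" $1 }' | tr '\n' ' ')"
echo "feature types: $feature_types"

# Attribute keys present on exon rows: candidates for the -g option.
exon_keys="$(awk -F'\t' '$3 == "exon" {
        n = split($9, kv, ";")
        for (i = 1; i <= n; i++) { split(kv[i], p, "="); print p[1] }
    }' "$gff" | sort -u | tr '\n' ' ')"
echo "exon attribute keys: $exon_keys"
rm -f "$gff"
```

On a GTF file the attributes are written as key "value" pairs (gene_id, transcript_id) rather than key=value, so the split on "=" would need adjusting there.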

The output format is documented in the column headers:

Geneid  Chr Start   End Strand  Length  first_sorted.bam
*rna-NC_001717.1:1004..1071*    NC_001717.1 1004    1071    +   68  **18**

I marked the transcript ID, which is in the Geneid column, with single asterisks (*...*) and the count, which is in the last column, with double asterisks (**...**).
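
To see just the IDs and counts, the first and last columns can be pulled out with awk. A minimal sketch on a two-row stand-in file: the first data row is the one shown above, the second (rna-example) is a hypothetical zero-count row added for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for counts.txt: comment line, header, and two tab-separated rows.
counts="$(mktemp)"
printf '# Program:featureCounts v1.6.0\n' > "$counts"
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tfirst_sorted.bam\n' >> "$counts"
printf 'rna-NC_001717.1:1004..1071\tNC_001717.1\t1004\t1071\t+\t68\t18\n' >> "$counts"
printf 'rna-example\tNC_001717.1\t2000\t2100\t-\t101\t0\n' >> "$counts"

# Skip the comment and header lines, then print ID (first column) and count (last column).
table="$(awk -F'\t' '!/^#/ && $1 != "Geneid" { print $1 "\t" $NF }' "$counts")"
echo "$table"

# Number of features with at least one assigned read.
nonzero="$(awk -F'\t' '!/^#/ && $1 != "Geneid" && $NF > 0 { n++ } END { print n + 0 }' "$counts")"
echo "features with reads: $nonzero"
rm -f "$counts"
```

Many rows in a real counts.txt will be zero, so filtering on the last column makes it easier to confirm that the file holds real counts.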

With respect to differential expression analysis, this dataset is not suitable for such analysis, unfortunately.

Have a look at the Bioproject: https://www.ncbi.nlm.nih.gov/bioproject/237593

There are only 2 samples and no replication, therefore, there is no meaningful way to conduct DE analysis between the transcriptomes. I hope that your experiment design contains replication.

There are also some aspects of your particular fish, Oncorhynchus clarkii, that are very relevant to the analysis:

O. clarkii, unlike rainbow trout (Oncorhynchus mykiss), has no public reference genome. Unless you have a non-public version in your lab, you have to use a transcriptome assembly to quantify. Therefore, the workflow presented does not apply directly.
On the other hand, possibly the transcripts could be alignable across species, however this needs to be established first, and is unlikely to work well due to the next point.
O. clarkii is a salmonid. Salmonids have undergone an additional round of whole genome duplication in comparison to other teleosts, yielding genomes with large blocks of high sequence similarity (Lien et al 2016, Figure 1). This might pose additional difficulties in transcript quantification, and might argue against using pseudo-alignment.

score 0 · Answer 2 · 2019-04-23

5.0 years ago

ATpoint 82k

I would go for the salmon-tximport-DESeq2 pipeline. It is computationally inexpensive and also most up-to-date in terms of GC bias correction and handling of multimapping reads. Should be possible to run on a laptop. If you need fastq files from NCBI, consider downloading them in fastq format directly from the ENA ( Fast download of FASTQ files from the European Nucleotide Archive (ENA) ). Simply follow the linked workflow.
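
The quantification half of that pipeline might look roughly like the sketch below. It assumes single-end reads (the original alignment used -U), a transcript FASTA called transcripts.fa (a hypothetical file name), and FASTQ files named after the two run accessions. The run helper is a hypothetical dry-run wrapper that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

THREADS=4
TRANSCRIPTS=transcripts.fa   # hypothetical transcript FASTA
INDEX=salmon_index           # output directory for the salmon index

quant_plan() {
    # Build the salmon index once from the transcript sequences.
    run salmon index -t "$TRANSCRIPTS" -i "$INDEX"
    # Quantify each single-end sample; -l A lets salmon infer the library type.
    for sample in DRR001887 DRR001888; do
        run salmon quant -i "$INDEX" -l A -r "${sample}.fastq.gz" \
            --gcBias -p "$THREADS" -o "quant/${sample}"
    done
}

plan="$(quant_plan)"
echo "$plan"
```

Each quant/<sample>/quant.sf file would then be read into R with tximport and passed on to DESeq2; that part is R code and is not shown here.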

Some general things: Do not run commands via sudo. This is not necessary if the tools are set up properly. Also, better not to invest time in posting download links. You seem trustworthy, but no sane user will ever click download links from an unknown source. Also, please use the code option to highlight code and data examples. If there are any questions, feel free to ask.

ADD COMMENT • link 5.0 years ago by ATpoint 82k

Thanks for the tip on the salmon-tximport-DESeq2 pipeline. I will check into it.

ADD REPLY • link 5.0 years ago by gwebste7 • 0

I don't think DESeq will handle 1 replicate per condition at all. edgeR will, though what you get is of limited value.

ADD REPLY • link 5.0 years ago by swbarnes2 14k