Question

How to analyze Illumina RNASeq data

2

Entering edit mode

3.7 years ago

Kumar ▴ 170

Hi, I have got large RNASeq data sets from Illumina platform. I am looking to analyze these data in order to find differential gene expression, pathways analysis etc. Please suggest pipelines and workflow.

RNA-Seq sequencing next-gen • 4.3k views

ADD COMMENT • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Thank you all for your informative help.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Do I need to RUN any pipeline to remove adaptors before analyzing the data? I have a total 110 paired-end fastq files generated from Illumina.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Technically no. Most aligners will soft-clip adapters. One reason you may want to do it is to check if some samples are more affected than others. Taking a look at the FastQC reports (use multiQC to collate all FastQC reports) should give you a good idea.

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

Entering edit mode

Hi, I have run STAR with following commands and got the output files. Please let me know which file should be used for featureCounts.

$./STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /DataAnalysis/Manoj-data/test/ --genomeFastaFiles /DataAnalysis/Manoj-data/test/Homo_sapiens.GRCh38.dna.toplevel.fa --sjdbGTFfile /DataAnalysis/Manoj-data/test/Homo_sapiens.GRCh38.100.gtf --sjdbOverhang 100

$./STAR --runMode alignReads --outSAMtype BAM Unsorted --genomeDir /DataAnalysis/Manoj-data/STAR/pipelines/ --outFileNamePrefix /DataAnalysis/Manoj-data/STAR/pipelines/prefix  --readFilesIn  /DataAnalysis/Manoj-data/STAR/pipelines/2020_278C_R1.fastq /DataAnalysis/Manoj-data/STAR/pipelines/2020_278C_R2.fastq

ADD REPLY • link updated 3.7 years ago by Ram 43k • written 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Use alignment BAM file.

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

Entering edit mode

Since, I have used gtf file in STAR index command. Should I use gtf file again in featureCount RUN command?

Here is the command for featureCount that I am using.

./featureCounts -T 5 -t exon -g gene_id -a ../home/kumarm/RNA-Seq/chr.gtf -o ../home/kumarm/RNA-Seq/counts.txt ../home/kumarm/RNA-Seq/SRR.bam

ADD REPLY • link updated 3.7 years ago by Ram 43k • written 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

You post below that you have 180 fastq files. I honestly don't know why you're using STAR because that's going to take an incredibly long time (and lots of RAM) to process all of them. Use super-fast pseudoalignment methods (e.g. kallisto) instead -- benchmarks show that pseudoalignment methods (kallisto and salmon) outperform featureCounts in accuracy.

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Previously, I used HISAT2 for less number of RNASeq data. However, I am not sure about STAR. If it is not usefull, please recommend any other to perform a large number of data. Yes! I have 180 very large files of RNASeq.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

STAR and HISAT2 are good aligners but I recommend using pseudoalignment, especially when you have that much data.

Here's the manual for kallisto: https://pachterlab.github.io/kallisto/starting

Afterwards, for differential gene expression analysis, you can use DESeq2 or sleuth or limma etc. (there are many methods!).

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Hi, I am not very familiar about kallisto. As I am taking a quick look at its manual, it requires transcriptome file. Therefore, where I can get the transcriptome file. I am also not sure about "transcripts.fasta.gz" in the following command. I have paired end reads instead.

Here is the command: kallisto index -i transcripts.idx transcripts.fasta.gz

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Great question! This information is described in the manual. Briefly, you can download the ENSEMBL cDNA FASTA files (should look something like *.cdna.all.fa.gz) -- this will be your "transcripts.fasta.gz" file in the command you put in your comment. Or you can download a prebuilt index and skip the "kallisto index" step altogether.

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Hi, I have RUN kallisto and got the output result for a sample. Finally, I have file called "abundance.tsv". Do I need to run featureCount. It seems .tsv file has already useful information, please let me know which column I need to consider for further use to identify differential gene expression analysis.

    Here is the header of columns.
    target_id   length  eff_length  est_counts  tpm
    ENST00000631435.1   12  13  0   0
    ENST00000415118.1   8   9   0   0

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Yup! You're all good then -- you already have those counts (in est_counts). You're all good to go with differential gene expression analysis (see the differential gene expression analysis programs I linked to above).

Note: if you want to use DESeq2 or limma for your differential gene expression analysis, you should first import that file with tximport (see the kallisto section).

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

After running kallisto, I have got separate result files (abundance.tsv and abundance.h5) for each of my samples. I have just RUN for 6 samples so I have 6 .tsv and .h5 files.

However, I am not sure how to open all results files at tximport. Since I go through the kallisto section in tximport manual, in their command it illustrated one "abundance.h5" as input. How to import all files with tximport and marge into single file to use DESeq2.

Here is the R script that is in the manual: However, after running, it is showing default data. Could you please let me know, what is "samples.txt"?

samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
names(files) <- paste0("sample", 1:6)
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
head(txi.kallisto$counts)

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

The tximport manual is showing how to work with multiple abundance.tsv (or abundance.h5) files. The files variable is actually a list of multiple abundance.tsv files (each abundance.tsv file is in a separate directory).

Just populate that files list in R with the paths to all your abundance.tsv files and after running tximport, you should get a genes x samples matrix.

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

You don't need samples.txt -- just replace samples$run with a list of the names of your samples (the names of your samples are just the directory names where the abundance files are located).

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Hi, I am trying following command, however it is showing error. Please correct me, I am pretty new on this platform.

files <- file.path(dir, "kallisto_boot", "/DataAnalysis/data/kallisto_test/data2/sample1/abundance.h5",
               "/DataAnalysis/data/kallisto_test/data2/sample2/abundance.h5")
names(files) <- paste0("sample", 1:1)
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
head(txi.kallisto$counts)

ERROR:

txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE) Error in tximport(files, type = "kallisto", txOut = TRUE) : all(file.exists(files)) is not TRUE

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

No worries! OK, try the following:

files <- c("/DataAnalysis/data/kallisto_test/data2/sample1/abundance.h5", "/DataAnalysis/data/kallisto_test/data2/sample2/abundance.h5")
names(files) <- paste0("sample", 1:length(files))
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
head(txi.kallisto$counts)

If that doesn't work, double-check and make sure that those two files actually exist on your filesystem.

ADD REPLY • link updated 3.7 years ago by Ram 43k • written 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Now, it worked. I changed slightly in the coding.

files <- c("/DataAnalysis/Manoj-data/kallisto_test/data2/sample1/abundance.h5", 
       "/DataAnalysis/Manoj-data/kallisto_test/data2/sample2/abundance.h5")
 names(files) <- paste0("sample", 1:2)
 txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
 head(txi.kallisto$counts)

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

However, I am not sure what should I do next. Do I need to make a metadata file for DESeq2? I am lost in DeSeq2 manual, could you please suggest some initial steps where should I start in the manual of DeSeq2.

Also, it shows gene ID in its column 1, how these ids will change genes name. It shows the following matrix after running the command (head(txi.kallisto$counts).

sample1 sample2
ENST00000631435.1       0       0
ENST00000415118.1       1       6
ENST00000434970.2       0       0
ENST00000448914.1       0       0

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Yup! You're doing great; you've figured out how to create a genes x samples matrix.

From here, just follow the DESeq2 steps (create a sampleTable then use DESeqDataSetFromTximport) in the manual.

Also, by the way, you're using transcript IDs (ENST...) -- you need to change them into gene IDs (the tutorial explains how to make a tx2gene data frame to get you gene IDs from the transcript IDs). Once you have the gene IDs, you also have to convert them into gene names. It's a fairly involved process to go from transcript IDs -> gene IDs -> gene names, but many people have asked these questions in the past and you can find an answer by doing a Google search.

You're almost there but you've got a few more steps to do before you can complete your analysis. This chain of comments has gotten pretty long so I recommend posting new questions on Biostars so that more of the community can see and respond. For example, if you can't figure out how to make a sampleTable for DESeqDataSetFromTximport, post that as a new question and mention where you're getting stuck. If you can't figure out how to get from transcript IDs to gene IDs or from gene IDs to gene names after Google searching, you can post that as another question on Biostars.

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Sure! Thank you for your help.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Hi, please let me know how to RUN kallisto for multiple files. I am trying this loop, but I think it is not correct.

for file in *.fastq.gz; do kallisto quant -i transcripts.idx -b 100 -o "${file}-aligned" "${file}"; done

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Your reads are paired-end (i.e. you have two fastq files per sample).

Let's say each sample has a _R1.fastq.gz and an _R2.fastq.gz file associated. You'd want to do this (basically, extracting all the sample names by removing the _R1.fastq.gz and _R2.fastq.gz via cut):

for file in $(ls *.fastq.gz|rev|cut -c 13-|uniq|rev); do kallisto quant -i transcripts.idx -b 100 -o "${file}-aligned" "${file}"_R1.fastq.gz "${file}"_R2.fastq.gz; done

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

FYI, if you want to speed things up, you can use more threads for kallisto via the -t flag. For example, if you want to use 8 parallel threads, you can do kallisto quant -t 8 (followed by the rest of the kallisto command).

ADD REPLY • link 3.7 years ago by dsull ★ 5.8k

0

Entering edit mode

Hi dsull, I have some more follow-up questions regarding RNA-Seq data analysis using Kallisto. Could you please answer my post here https://www.biostars.org/p/469769/?

ADD REPLY • link 3.5 years ago by Kumar ▴ 170

0

Entering edit mode

Thanks, I'm following that thread -- it's a DESeq2-related question that ATPoint has just responded to. Let us know if any further issues arise.

ADD REPLY • link 3.5 years ago by dsull ★ 5.8k

0

Entering edit mode

Hi dull, I did not get my answer in the post. My initial question is how to import these two factors types of data with different numbers. I will use tx2gene after importing the data.

ADD REPLY • link 3.5 years ago by Kumar ▴ 170

0

Entering edit mode

If you have 180 BAM files and you want to construct an alignment matrix from them you will need to use something similar to the command below:

./featureCounts -T 5 -t exon -g gene_id -a ../home/kumarm/RNA-Seq/chr.gtf -o ../home/kumarm/RNA-Seq/counts.txt BAM1 BAM2 BAM3 BAM4 ... BAM180

This will create a matrix of counts with genes in rows and the sample in columns.

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

Entering edit mode

you can follow salmon-deseq2 and salmon-wasabi-sleuth pipeline for RNAseq here: https://digibio.blogspot.com/2018/05/rna-seq-data-analysis-using-salmon.html

ADD REPLY • link 3.7 years ago by cpad0112 21k

score 8 · Answer 1 · 2020-07-26

8

Entering edit mode

3.7 years ago

GenoMax 141k

Do basic QC (fastqc).
Align using an appropriate aligner (use splice aware aligners like STAR, bbmap, HISAT2 if needed for your organism) or use salmon (LINK) to quantify against a transcriptome.
Count using featureCounts
Use DESeq2 to do DE analysis.

ADD COMMENT • link 3.7 years ago by GenoMax 141k

score 4 · Answer 2 · 2020-07-26

4

Entering edit mode

3.7 years ago

Mark ★ 1.5k

IMO the edgeR package is amazing, has great documentation and is highly validated. A really good tutorial for differential expression is this one:

https://f1000research.com/articles/5-1438

For complete tutorial theres: https://github.com/griffithlab/rnaseq_tutorial

and: https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_dge_basic/rna_seq_basic_tutorial/

There's a massive amount of info online, google is your friend

ADD COMMENT • link 3.7 years ago by Mark ★ 1.5k

0

Entering edit mode

Thank you! It is really helpful.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

score 4 · Answer 3 · 2020-07-27

4

Entering edit mode

3.7 years ago

brunobsouzaa ▴ 830

1 - FastQC;

2 - STAR alignment;

3 - Get ReadCounts using Rsubreads;

4 - Differential expression using DEseq2 and EdgeR;

5 - Pathway analysis (Panther, GSEA, etc...);

ADD COMMENT • link 3.7 years ago by brunobsouzaa ▴ 830

0

Entering edit mode

Is it Ok to use Hisat2 instead of STAR for aligning?

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

I have used a couple of different genome aligners. Among all liked STAR the most and provide good and consistent results. Take a look at the following thread: HISAT2 V.S. STAR

STAR is better in general (but again depends on what you really want to do) for genome alignment and it is easy to use too! If your goal eventually is to do differential gene expression analysis and pathway analysis, I would probably use STAR for alignment.

If you are confirmable in using Hisat2, I don't see a problem using that either.

ADD REPLY • link 3.7 years ago by jkkbuddika ▴ 190

0

Entering edit mode

Hi, will I need to write a script or prepare bash file to RUN STAR with multiple fastq.gz files or STAR has their own command to RUN all fastq files with the single command? I have a total 180 fastq.gz files.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

You can write a simple for loop to run all of your samples.

ADD REPLY • link 3.7 years ago by brunobsouzaa ▴ 830

score 3 · Answer 4 · 2020-07-26

3

Entering edit mode

3.7 years ago

jkkbuddika ▴ 190

Simply, (1) Check the quality of raw sequences (FastQC), (2) Remove adapters and low quality reads (this is OPTIONAL and can use a tool like Cutadapt), (3) map to a reference genome of interest (STAR), (4) prepare count matrices (Subread featureCounts), (5) differential gene expression analysis using R package DESeq2 or edgeR and (6) pathway analysis using a program like clusterProfiler.

I have written a complete pipeline that I use for my RNA-seq data analyses: https://github.com/jkkbuddika/RNA-Seq-Data-Analyzer (This pipeline takes care of steps 1-4)

The repository also contain a detailed step-by-step user guide that would walk you through getting things started. The Python pipeline eventually generates count tables using featureCounts and additionally bedgraphs for IGV visualization using DeepTools. These final count tables can be used as an input for DESeq2 or edgeR to do the differential gene expression analysis. Let me know if you need any help.

ADD COMMENT • link 3.7 years ago by jkkbuddika ▴ 190

0

Entering edit mode

Thank you for your help! I will take a look and get back to you if I need help.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Hi, Does the pipeline able to perform differential gene analysis? Do I need to download all Softwares described in the README file? Can you tell me how to RUN this pipeline? I have 110 paired reads samples.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Yes! Take a look at the User guide: https://github.com/jkkbuddika/RNA-Seq-Data-Analyzer/blob/master/USERGUIDE.md

User guide describes 4 simple steps to get you going. Yes! You have to have all the programs installed and have sufficient memory to run the pipeline. I usually install all programs using miniconda (see Step 1 of the user guide).

The pipeline analyze both single and paired end sequencing files. You just have to specify which mode you want to analyze. Again, let me know if you need any more assistance.

The pipeline itself will not perform the differential gene expression analysis. You have to use an R package such as DEseq2 or edgeR to do that. The pipeline generates count tables that are the input for these packages. Let me know if you need any help with this too.

Cheers!

ADD REPLY • link 3.7 years ago by jkkbuddika ▴ 190

0

Entering edit mode

Hi, I have reported an issue at Github, please go through it.

ADD REPLY • link 3.7 years ago by Kumar ▴ 170

0

Entering edit mode

Responded! Take a look.

ADD REPLY • link 3.7 years ago by jkkbuddika ▴ 190