Question: How to analyze Illumina RNASeq data
1
gravatar for Manoj
19 days ago by
Manoj60
United States
Manoj60 wrote:

Hi, I have got large RNASeq data sets from Illumina platform. I am looking to analyze these data in order to find differential gene expression, pathways analysis etc. Please suggest pipelines and workflow.

sequencing rna-seq next-gen • 458 views
ADD COMMENTlink modified 9 days ago • written 19 days ago by Manoj60

Thank you all for your informative help.

ADD REPLYlink modified 17 days ago • written 17 days ago by Manoj60

Do I need to RUN any pipeline to remove adaptors before analyzing the data? I have a total 110 paired-end fastq files generated from Illumina.

ADD REPLYlink written 17 days ago by Manoj60

Technically no. Most aligners will soft-clip adapters. One reason you may want to do it is to check if some samples are more affected than others. Taking a look at the FastQC reports (use multiQC to collate all FastQC reports) should give you a good idea.

ADD REPLYlink written 17 days ago by genomax87k

Hi, I have run STAR with following commands and got the output files. Please let me know which file should be used for featureCounts.

$./STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /DataAnalysis/Manoj-data/test/ --genomeFastaFiles /DataAnalysis/Manoj-data/test/Homo_sapiens.GRCh38.dna.toplevel.fa --sjdbGTFfile /DataAnalysis/Manoj-data/test/Homo_sapiens.GRCh38.100.gtf --sjdbOverhang 100

$./STAR --runMode alignReads --outSAMtype BAM Unsorted --genomeDir /DataAnalysis/Manoj-data/STAR/pipelines/ --outFileNamePrefix /DataAnalysis/Manoj-data/STAR/pipelines/prefix  --readFilesIn  /DataAnalysis/Manoj-data/STAR/pipelines/2020_278C_R1.fastq /DataAnalysis/Manoj-data/STAR/pipelines/2020_278C_R2.fastq
ADD REPLYlink modified 1 day ago by RamRS28k • written 9 days ago by Manoj60

Use alignment BAM file.

ADD REPLYlink written 9 days ago by genomax87k

Since, I have used gtf file in STAR index command. Should I use gtf file again in featureCount RUN command?

Here is the command for featureCount that I am using.

./featureCounts -T 5 -t exon -g gene_id -a ../home/kumarm/RNA-Seq/chr.gtf -o ../home/kumarm/RNA-Seq/counts.txt ../home/kumarm/RNA-Seq/SRR.bam
ADD REPLYlink modified 1 day ago by RamRS28k • written 9 days ago by Manoj60

You post below that you have 180 fastq files. I honestly don't know why you're using STAR because that's going to take an incredibly long time (and lots of RAM) to process all of them. Use super-fast pseudoalignment methods (e.g. kallisto) instead -- benchmarks show that pseudoalignment methods (kallisto and salmon) outperform featureCounts in accuracy.

ADD REPLYlink written 9 days ago by dsull1.5k

Previously, I used HISAT2 for less number of RNASeq data. However, I am not sure about STAR. If it is not usefull, please recommend any other to perform a large number of data. Yes! I have 180 very large files of RNASeq.

ADD REPLYlink written 9 days ago by Manoj60

STAR and HISAT2 are good aligners but I recommend using pseudoalignment, especially when you have that much data.

Here's the manual for kallisto: https://pachterlab.github.io/kallisto/starting

Afterwards, for differential gene expression analysis, you can use DESeq2 or sleuth or limma etc. (there are many methods!).

ADD REPLYlink written 9 days ago by dsull1.5k

Hi, I am not very familiar about kallisto. As I am taking a quick look at its manual, it requires transcriptome file. Therefore, where I can get the transcriptome file. I am also not sure about "transcripts.fasta.gz" in the following command. I have paired end reads instead.

Here is the command: kallisto index -i transcripts.idx transcripts.fasta.gz

ADD REPLYlink written 9 days ago by Manoj60

Great question! This information is described in the manual. Briefly, you can download the ENSEMBL cDNA FASTA files (should look something like *.cdna.all.fa.gz) -- this will be your "transcripts.fasta.gz" file in the command you put in your comment. Or you can download a prebuilt index and skip the "kallisto index" step altogether.

ADD REPLYlink written 9 days ago by dsull1.5k

Hi, I have RUN kallisto and got the output result for a sample. Finally, I have file called "abundance.tsv". Do I need to run featureCount. It seems .tsv file has already useful information, please let me know which column I need to consider for further use to identify differential gene expression analysis.

    Here is the header of columns.
    target_id   length  eff_length  est_counts  tpm
    ENST00000631435.1   12  13  0   0
    ENST00000415118.1   8   9   0   0
ADD REPLYlink modified 9 days ago • written 9 days ago by Manoj60

Yup! You're all good then -- you already have those counts (in est_counts). You're all good to go with differential gene expression analysis (see the differential gene expression analysis programs I linked to above).

Note: if you want to use DESeq2 or limma for your differential gene expression analysis, you should first import that file with tximport (see the kallisto section).

ADD REPLYlink written 8 days ago by dsull1.5k

After running kallisto, I have got separate result files (abundance.tsv and abundance.h5) for each of my samples. I have just RUN for 6 samples so I have 6 .tsv and .h5 files.

However, I am not sure how to open all results files at tximport. Since I go through the kallisto section in tximport manual, in their command it illustrated one "abundance.h5" as input. How to import all files with tximport and marge into single file to use DESeq2.

Here is the R script that is in the manual: However, after running, it is showing default data. Could you please let me know, what is "samples.txt"?

samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
names(files) <- paste0("sample", 1:6)
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
head(txi.kallisto$counts)
ADD REPLYlink modified 7 days ago • written 7 days ago by Manoj60

The tximport manual is showing how to work with multiple abundance.tsv (or abundance.h5) files. The files variable is actually a list of multiple abundance.tsv files (each abundance.tsv file is in a separate directory).

Just populate that files list in R with the paths to all your abundance.tsv files and after running tximport, you should get a genes x samples matrix.

ADD REPLYlink written 7 days ago by dsull1.5k

You don't need samples.txt -- just replace samples$run with a list of the names of your samples (the names of your samples are just the directory names where the abundance files are located).

ADD REPLYlink written 7 days ago by dsull1.5k

Hi, I am trying following command, however it is showing error. Please correct me, I am pretty new on this platform.

files <- file.path(dir, "kallisto_boot", "/DataAnalysis/data/kallisto_test/data2/sample1/abundance.h5",
               "/DataAnalysis/data/kallisto_test/data2/sample2/abundance.h5")
names(files) <- paste0("sample", 1:1)
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
head(txi.kallisto$counts)

ERROR:

txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE) Error in tximport(files, type = "kallisto", txOut = TRUE) : all(file.exists(files)) is not TRUE

ADD REPLYlink modified 6 days ago • written 6 days ago by Manoj60

No worries! OK, try the following:

files <- c("/DataAnalysis/data/kallisto_test/data2/sample1/abundance.h5", "/DataAnalysis/data/kallisto_test/data2/sample2/abundance.h5")
names(files) <- paste0("sample", 1:length(files))
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
head(txi.kallisto$counts)

If that doesn't work, double-check and make sure that those two files actually exist on your filesystem.

ADD REPLYlink modified 1 day ago by RamRS28k • written 6 days ago by dsull1.5k

Now, it worked. I changed slightly in the coding.

files <- c("/DataAnalysis/Manoj-data/kallisto_test/data2/sample1/abundance.h5", 
       "/DataAnalysis/Manoj-data/kallisto_test/data2/sample2/abundance.h5")
 names(files) <- paste0("sample", 1:2)
 txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
 head(txi.kallisto$counts)
ADD REPLYlink modified 6 days ago • written 6 days ago by Manoj60

However, I am not sure what should I do next. Do I need to make a metadata file for DESeq2? I am lost in DeSeq2 manual, could you please suggest some initial steps where should I start in the manual of DeSeq2.

Also, it shows gene ID in its column 1, how these ids will change genes name. It shows the following matrix after running the command (head(txi.kallisto$counts).

sample1 sample2
ENST00000631435.1       0       0
ENST00000415118.1       1       6
ENST00000434970.2       0       0
ENST00000448914.1       0       0
ADD REPLYlink modified 6 days ago • written 6 days ago by Manoj60

Yup! You're doing great; you've figured out how to create a genes x samples matrix.

From here, just follow the DESeq2 steps (create a sampleTable then use DESeqDataSetFromTximport) in the manual.

Also, by the way, you're using transcript IDs (ENST...) -- you need to change them into gene IDs (the tutorial explains how to make a tx2gene data frame to get you gene IDs from the transcript IDs). Once you have the gene IDs, you also have to convert them into gene names. It's a fairly involved process to go from transcript IDs -> gene IDs -> gene names, but many people have asked these questions in the past and you can find an answer by doing a Google search.

You're almost there but you've got a few more steps to do before you can complete your analysis. This chain of comments has gotten pretty long so I recommend posting new questions on Biostars so that more of the community can see and respond. For example, if you can't figure out how to make a sampleTable for DESeqDataSetFromTximport, post that as a new question and mention where you're getting stuck. If you can't figure out how to get from transcript IDs to gene IDs or from gene IDs to gene names after Google searching, you can post that as another question on Biostars.

ADD REPLYlink written 6 days ago by dsull1.5k

Sure! Thank you for your help.

ADD REPLYlink written 6 days ago by Manoj60

Hi, please let me know how to RUN kallisto for multiple files. I am trying this loop, but I think it is not correct.

for file in *.fastq.gz; do kallisto quant -i transcripts.idx -b 100 -o "${file}-aligned" "${file}"; done

ADD REPLYlink written 7 days ago by Manoj60

Your reads are paired-end (i.e. you have two fastq files per sample).

Let's say each sample has a _R1.fastq.gz and an _R2.fastq.gz file associated. You'd want to do this (basically, extracting all the sample names by removing the _R1.fastq.gz and _R2.fastq.gz via cut):

for file in $(ls *.fastq.gz|rev|cut -c 13-|uniq|rev); do kallisto quant -i transcripts.idx -b 100 -o "${file}-aligned" "${file}"_R1.fastq.gz "${file}"_R2.fastq.gz; done

ADD REPLYlink written 7 days ago by dsull1.5k

FYI, if you want to speed things up, you can use more threads for kallisto via the -t flag. For example, if you want to use 8 parallel threads, you can do kallisto quant -t 8 (followed by the rest of the kallisto command).

ADD REPLYlink written 7 days ago by dsull1.5k

If you have 180 BAM files and you want to construct an alignment matrix from them you will need to use something similar to the command below:

./featureCounts -T 5 -t exon -g gene_id -a ../home/kumarm/RNA-Seq/chr.gtf -o ../home/kumarm/RNA-Seq/counts.txt BAM1 BAM2 BAM3 BAM4 ... BAM180

This will create a matrix of counts with genes in rows and the sample in columns.

ADD REPLYlink modified 9 days ago • written 9 days ago by genomax87k

you can follow salmon-deseq2 and salmon-wasabi-sleuth pipeline for RNAseq here: https://digibio.blogspot.com/2018/05/rna-seq-data-analysis-using-salmon.html

ADD REPLYlink written 7 days ago by cpad011213k
8
gravatar for genomax
19 days ago by
genomax87k
United States
genomax87k wrote:
  1. Do basic QC (fastqc).
  2. Align using an appropriate aligner (use splice aware aligners like STAR, bbmap, HISAT2 if needed for your organism) or use salmon (LINK) to quantify against a transcriptome.
  3. Count using featureCounts
  4. Use DESeq2 to do DE analysis.
ADD COMMENTlink modified 19 days ago • written 19 days ago by genomax87k
4
gravatar for Mark
19 days ago by
Mark800
Mark800 wrote:

IMO the edgeR package is amazing, has great documentation and is highly validated. A really good tutorial for differential expression is this one:

https://f1000research.com/articles/5-1438

For complete tutorial theres: https://github.com/griffithlab/rnaseq_tutorial

and: https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_dge_basic/rna_seq_basic_tutorial/

There's a massive amount of info online, google is your friend

ADD COMMENTlink written 19 days ago by Mark800

Thank you! It is really helpful.

ADD REPLYlink written 17 days ago by Manoj60
4
gravatar for brunobsouzaa
17 days ago by
brunobsouzaa230
Brazil
brunobsouzaa230 wrote:

1 - FastQC;

2 - STAR alignment;

3 - Get ReadCounts using Rsubreads;

4 - Differential expression using DEseq2 and EdgeR;

5 - Pathway analysis (Panther, GSEA, etc...);

ADD COMMENTlink modified 16 days ago • written 17 days ago by brunobsouzaa230

Is it Ok to use Hisat2 instead of STAR for aligning?

ADD REPLYlink written 16 days ago by Manoj60

I have used a couple of different genome aligners. Among all liked STAR the most and provide good and consistent results. Take a look at the following thread: HISAT2 V.S. STAR

STAR is better in general (but again depends on what you really want to do) for genome alignment and it is easy to use too! If your goal eventually is to do differential gene expression analysis and pathway analysis, I would probably use STAR for alignment.

If you are confirmable in using Hisat2, I don't see a problem using that either.

ADD REPLYlink modified 16 days ago • written 16 days ago by jkkbuddika70

Hi, will I need to write a script or prepare bash file to RUN STAR with multiple fastq.gz files or STAR has their own command to RUN all fastq files with the single command? I have a total 180 fastq.gz files.

ADD REPLYlink written 15 days ago by Manoj60

You can write a simple for loop to run all of your samples.

ADD REPLYlink written 7 days ago by brunobsouzaa230
3
gravatar for jkkbuddika
18 days ago by
jkkbuddika70
Bloomington, USA
jkkbuddika70 wrote:

Simply, (1) Check the quality of raw sequences (FastQC), (2) Remove adapters and low quality reads (this is OPTIONAL and can use a tool like Cutadapt), (3) map to a reference genome of interest (STAR), (4) prepare count matrices (Subread featureCounts), (5) differential gene expression analysis using R package DESeq2 or edgeR and (6) pathway analysis using a program like clusterProfiler.

I have written a complete pipeline that I use for my RNA-seq data analyses: https://github.com/jkkbuddika/RNA-Seq-Data-Analyzer (This pipeline takes care of steps 1-4)

The repository also contain a detailed step-by-step user guide that would walk you through getting things started. The Python pipeline eventually generates count tables using featureCounts and additionally bedgraphs for IGV visualization using DeepTools. These final count tables can be used as an input for DESeq2 or edgeR to do the differential gene expression analysis. Let me know if you need any help.

ADD COMMENTlink modified 17 days ago • written 18 days ago by jkkbuddika70

Thank you for your help! I will take a look and get back to you if I need help.

ADD REPLYlink written 17 days ago by Manoj60

Hi, Does the pipeline able to perform differential gene analysis? Do I need to download all Softwares described in the README file? Can you tell me how to RUN this pipeline? I have 110 paired reads samples.

ADD REPLYlink written 17 days ago by Manoj60

Yes! Take a look at the User guide: https://github.com/jkkbuddika/RNA-Seq-Data-Analyzer/blob/master/USERGUIDE.md

User guide describes 4 simple steps to get you going. Yes! You have to have all the programs installed and have sufficient memory to run the pipeline. I usually install all programs using miniconda (see Step 1 of the user guide).

The pipeline analyze both single and paired end sequencing files. You just have to specify which mode you want to analyze. Again, let me know if you need any more assistance.

The pipeline itself will not perform the differential gene expression analysis. You have to use an R package such as DEseq2 or edgeR to do that. The pipeline generates count tables that are the input for these packages. Let me know if you need any help with this too.

Cheers!

ADD REPLYlink modified 17 days ago • written 17 days ago by jkkbuddika70

Hi, I have reported an issue at Github, please go through it.

ADD REPLYlink written 14 days ago by Manoj60

Responded! Take a look.

ADD REPLYlink written 14 days ago by jkkbuddika70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1505 users visited in the last hour