Tutorial: (Closed) InsideDNA: RNA-Seq de novo assembly using Trinity
gravatar for shilparaopradeep
4.5 years ago by
shilparaopradeep190 wrote:

RNA-seq de novo assembly is one the most frequent type of sequence analysis in biology and bioinformatics. However, just as a complete genome assembly, RNA-seq assembly is not trivial and often requires large amount of RAM and CPUs. In this tutorial, we explain how to use one the most popular RNA-seq assemblers - Trinity. For the assembly pipeline and results interpretation we use transcripts from a ganglion of medicinal leech from this research. In the original study, RNA-seq data helped to understand how gene expression changes along the central nervous system of the species and to affiliate location with gene behavior.

In this tutorial we are going to explore several steps: first you will learn how to automatically obtain data from NCBI with fastq-dump utility. then you will explore how to use Terminal and an interface to run Trinity. Once the RNA-seq are assembled, we will show several ways to assess the quality of transcriptomic assembly to get you started. Now, let's dive into RNA-seq analysis.

  1. Download raw data for Hirudo medicinalis To start a new project, login to InsideDNA, navigate to Files tab and create a new folder called “Hirudo”. enter image description here Now open Terminal. To upload raw RNA-seq data into this folder we will use a special utility, called fastq-dump, which allows to directly transfer the data from the NCBI archive into your storage space on InsideDNA. Type the following command in Terminal window:

idna_submit.py -t fastqdump -c 8 -r 7.2 -e "idna_fastq-dump -F -O /data/userXXX/Hirudo/ SRR799260"

enter image description here SRR799260 is a code of a sample which we will download from the Sequence Read Archive. Such codes could be found in a Methods section of research papers. /data/userXXX/Hirudo/ is a name of a folder, where data will be uploaded and fastq-dump is a name of a tool which will find file in a database, download it and transform into fastq format. You can monitor the progress of your task in Tasks Tab. When the task is completed, you will find the file with reads in your Hirudo folder. enter image description here

  1. Run Trinity to assemble transcripts from RNA-seq reads To assemble reads into transcripts we will use Trinity tool. We will demostrate how to run Trinity via terminal and via Tool interface.

2.1. Running Trinity via an interface First, let look at the Trinity interface. Go to Files, type in a search box "trinity" and hit search button. You see that search returned a Trinity tool. Press Run. enter image description here

On tool setting page, specify Task name and input fastq files. We are going to use singe reads, therefore we activate option "Input file(s) with single reads ( --single)", click browse button and navigate to our Hirudo folder and select fastq file. enter image description here

Now we need to specify few more parameters specific to our data. First, we have to set the memory requirements: Trinity is one of the most RAM-hungry tools out there, so we will use InsideDNA cloud capacity in its full extent - put 60G there. You must put G suffix (!), if you don't - the program will fail. Also, we will use 16 cores, so specify 16 in the next parameter field.

Next, we use fastq files, so we keep next parameter unchanged.

Click Show secondary parameter and scroll down until you see parameter "Run in silico normalization of reads" and select from a dropdown menu --normalize_reads. Also, change parameter "Retain all intermediate input files" from dropdown menu --no_cleanup

enter image description here

Now we are ready to submit the task. We still have to specify RAM and CPU for the task. Select (by dragging to the right a slidebar) 16 CPU and then, in a similar fashion, drag RAM till 60.

Prewiew the task and click Submit. You can monitor task progress in Tasks. 2.2. Running Trinity via a terminal To run Trinity from the terminal, we need to create a small bash script. This script will create temporary folders to speed up Trinity work. If you don’t want to write script yourself, you can download it using this link, modify its content on your local machine and then upload to InsideDNA. We will use Vi editor in a Terminal to write script. To create an empty file and to run Vi, type the following command in Terminal: vi /data/userXXX/Hirudo/trinity.sh Press button with I letter (capital i) on your keyboard to enter edit mode in Vi. Then type (or copy-paste) the following lines in Terminal:

! /bin/bash

work_dir=/tmp/trinity_in/ mkdir $work_dir cp /data/userXXX/Hirudo/SRR799260.fastq $work_dir /srv/dna_tools/trinityrnaseq-2.1.1/Trinity --seqType fq --max_memory 60G --single ${work_dir}SRR799260.fastq --CPU 16 --no_cleanup --normalize_reads --output /tmp/trinity_out_dir/ tar -zcvf /tmp/trinity.tar.gz /tmp/trinity_out_dir mv /tmp/trinity.tar.gz /data/userXXX/trinity_tmp/

enter image description here enter image description here

enter image description here

enter image description here

Don’t forget to replace XXX with your own userID. Press Esc button to quit editing mode, and then type wq to save your edits and quit Vi editor. Now you have a bash script called trinity.sh in Hirudo folder.

enter image description here

The last thing you need to do before you run Trinity, is to create a new folder called trinity_tmp in your root folder.

enter image description here

To run Trinity using your script, type the following command in Terminal:

idna_submit.py -t trinity_sh -c 16 -r 60 -e "bash /data/userXXX/Hirudo/trinity.sh"

-c 16 and –r 60 here means that you allocate 16 cores CPU and 60 Gb RAM to run you task. With these resources it will take Trinity about one hour to assemble transcripts from reads. 3. Analyzing Trinity output

When the task is completed, you will discover archive called trinity.tar.gz in trinity_tmp folder. To unzip it, run the following command in Terminal:

cd Hirudo

tar –zxvf trinity.tar.gz

When archive is unpacked, you will see, that it contains plenty of files.

enter image description here

The most interesting file, containing assembled transcripts, is called Trinity.fasta. You can explore it using less command:

less /data/userXXX/trinity_tmp/tmp/trinity_out_dir/Trinity.fasta

enter image description here

Assembled transcripts, or clusters of reads, have special headers in this file:

TRINITY_DN2_c0_g1_i1 len=408 path=[801:0-377 802:378-407] [-1, 801, 802, -2]

Here TRINITY_DN2_c0 is a read cluster, g1 is a gene and i1 is an isoform. These codes have nothing to do with gene IDs from ENSEMBL or other databases, and serve just to describe relationships between clusters of reads.

How do we assess the quality of transcripts assembly? First thing we can do is to estimate N50 parameter, characterizing length of contigs.

Type the following command in console:

idna_submit.py -t trinity_stats -c 8 -r 7.2 -e "/srv/dna_tools/trinityrnaseq-2.1.1/util/TrinityStats.pl /data/userXXX/trinity_tmp/tmp/trinity_out_dir/Trinity.fasta > /data/userXXX/Hirudo/stats.txt"

When this task is finished, you can view output file with statistics:

less /data/userXXX/Hirudo/stats.txt

enter image description here

The other thing we can do is to estimate what proportion of reads contributed to the assembly. For this purpose we can align source reads on assembled transcripts.

First, we need to create an index files for Trinity.fasta:

idna_submit.py -t bowtie2build -c 4 -r 3.6 -e "idna_bowtie2-build /data/userXXX/trinity_tmp/tmp/trinity_out_dir/Trinity.fasta /data/userXXX/trinity_tmp/tmp/trinity_out_dir/index"

When this task is finished, we are going to run Bowtie tool to make read mapping. You can find a detailed tutorial about Bowtie and read mapping in our previous tutorial Mapping sequencing reads to a reference genome with Bowtie2: a step-by-step guide

idna_submit.py -t bowtie -c 4 -r 3.6 -e "idna_bowtie2 -x /data/userXXX/trinity_tmp/tmp/trinity_out_dir/index -U /data/userXXX/Hirudo/SRR799260.fastq -S /data/userXXX/Hirudo/trinity.sam 2>/data/user410/Hirudo/bowtie_stats.txt"

After running this task you will obtain an alignment called trinity.sam. Yet the most interesting part for us is the statistics of this alignment in file bowtie_stats.txt. You can view this file, typing such command in Terminal:

less /data/userXXX/Hirudo/bowtie_stats.txt

enter image description here

As you can see, about 88% or reads aligned to assembled transcripts, which means, that a very large part of our source data contributed to the Trinity assembly (this is a good sign!)

Congratulations, now you mastered the first steps of transcriptome research! Next steps, which could include identification of coding regions, their functional annotation, search of homologues or analysis of differential expression will be covered in future Tutorials. Stay tuned!

tutorial bioinformatics genome • 2.6k views
ADD COMMENTlink written 4.5 years ago by shilparaopradeep190

Hello shilparaopradeep!

We believe that this post does not fit the main topic of this site.

The post is misleading, not useful wrt the original application of a too, and a verbatim copy of external material. OP has never responded to any request to clarify authorship.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


ADD REPLYlink written 4.0 years ago by Michael Dondrup48k
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1122 users visited in the last hour