Question: How to do technical analysis of sequencing reads and biological quality analysis for an RNASeq dataset?
nalandaatmi (United States) wrote, 4.1 years ago:

Hi all,

I have recently started working on RNASeq analysis. I need to do the following two analyses first, before running the TopHat pipeline. I have already performed the demultiplexing step and generated the fastq files from the HiSeq base calls.

Can you explain why these analyses are important to do first, and how I should proceed?

 

A. Technical analysis of the sequencing reads: I have to perform a genome-wide alignment using the RNA-seq data sets from lanes 1 to 6, and report the following technical metrics for the reads:

1. Read duplication analysis;

2. Contamination analysis for the Illumina adaptor sequences;

3. GC content analysis.

B. Biological quality analysis: using the mapping results above, I also need to report the following biological quality metrics for the data sets:

1. The percentage of sequencing reads derived from the rRNA genes;

2. The percentage of sequencing reads derived from the globin genes;

3. Because this is a strand-specific RNA-seq experiment, I have to include the sense and antisense information for the corresponding genes.
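
Editor's sketch for item B.3: one way to tally sense and antisense reads per gene once each read has been assigned to a gene. This is a minimal illustration, not a full pipeline. The `library_type` labels ("forward", "reverse") and the input tuples are hypothetical; "reverse" assumes the read is the reverse complement of the RNA, which is commonly the case for read 1 in dUTP-based stranded protocols, so check the protocol actually used.

```python
from collections import Counter


def transcript_strand(read_strand, library_type):
    """Infer the strand of the originating RNA from the aligned read strand.

    library_type is a hypothetical label used only in this sketch:
      "forward": the read has the same orientation as the RNA
      "reverse": the read is the reverse complement of the RNA
    """
    if read_strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {read_strand!r}")
    if library_type == "forward":
        return read_strand
    if library_type == "reverse":
        return "-" if read_strand == "+" else "+"
    raise ValueError(f"unknown library type: {library_type!r}")


def classify(read_strand, gene_strand, library_type):
    """Return "sense" if the inferred RNA strand matches the gene strand."""
    if transcript_strand(read_strand, library_type) == gene_strand:
        return "sense"
    return "antisense"


def tally(assignments, library_type):
    """Count sense/antisense reads per gene.

    assignments: iterable of (gene_id, gene_strand, read_strand) tuples.
    """
    counts = {}
    for gene_id, gene_strand, read_strand in assignments:
        label = classify(read_strand, gene_strand, library_type)
        counts.setdefault(gene_id, Counter())[label] += 1
    return counts


if __name__ == "__main__":
    demo = [
        ("geneA", "+", "-"),
        ("geneA", "+", "-"),
        ("geneA", "+", "+"),
        ("geneB", "-", "+"),
    ]
    for gene, c in tally(demo, "reverse").items():
        print(gene, c["sense"], c["antisense"])
```

A high antisense fraction across most genes usually suggests the wrong library type was assumed rather than genuine antisense transcription.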

rna-seq alignment next-gen • 1.7k views

Why can't the person tasking you also explain the rationale behind these orders?

Reply written 4.1 years ago by Ido Tamir

The person tasking you with these really shouldn't. Aside from (A), which can be done entirely with FastQC, there are often nuances with how things should be implemented and you would need to be quite comfortable with RNAseq data before dealing with this.

Also, use a different sequencing facility next time. Needing to demultiplex things yourself is absolutely absurd.

Reply written 4.1 years ago by Devon Ryan

We sometimes do our own demultiplexing because we use barcode setups that the core doesn't like, especially in development of new in-line barcoding products. They'll set up a new demultiplexing for us but we don't ask until the thing is done.

Reply written 4.1 years ago (modified 4.1 years ago) by Michele Busby

Sure, but it doesn't sound like nalandaatmi is working on a new method.

Reply written 4.1 years ago by Devon Ryan

Dear Devon, 

Can you explain the nuances of RNAseq analysis, or point me to some links where I can read about them?

Why do you say that having to demultiplex things is absurd?

Reply written 4.1 years ago by nalandaatmi

Making end users demultiplex standard data is absurd because that's a lot of extra work to get things set up when the sequencing facility could just do it as part of a standard pipeline. I've used a number of core facilities and companies over the years and have never needed to demultiplex things as a customer (I do now, but I'm not the customer any more :) ).

Regarding RNAseq, that's a long discussion. You'd be well advised to work together with someone locally the first time you do a new type of analysis like this (at least until you get a fair bit of experience under your belt).

Reply written 4.1 years ago by Devon Ryan

Devon, I am learning NGS analysis in a sequencing facility. The person in charge of sequencing gave me the files directly from the HiSeq sequencing machine. I am interested in learning from the very first step of handling NGS reads. That's why I mentioned that I performed the demultiplexing step and generated fastq files from the base call files. I am trying to understand all the steps involved before downstream analysis.

Since you mentioned that you now do NGS demultiplexing yourself, I would like to ask you this. Using the (bcltofastq) program, I converted the base call files to fastq files. The generated fastq files follow a naming convention like this:

- WES01_AGTCCA_L001_R1_001.fastq, WES01_AGTCCA_L001_R1_002.fastq, WES01_AGTCCA_L001_R1_003.fastq,........WES01_AGTCCA_L001_R1_010.fastq

- WES01_AGTCCA_L001_R2_001.fastq, WES01_AGTCCA_L001_R2_002.fastq, WES01_AGTCCA_L001_R2_003.fastq,........WES01_AGTCCA_L001_R2_010.fastq

WES01 is the sample name, AGTCCA is the barcode or index, L001 is lane 1, R1 is the forward reads, and R2 is the reverse reads. What are 001, 002, 003 to 010 after R1 and R2?

Reply written 4.1 years ago by nalandaatmi

@nalandaatmi in my experience, the ' 001.fastq, 002.fastq, 003.fastq, ... ' you are referring to usually means that the fastq file was split into smaller parts. So if you merged the files together end-to-end, you would get all the reads.
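
Editor's sketch of the merge described above, assuming the chunks are plain-text (uncompressed) FASTQ files that follow the `SAMPLE_INDEX_L00N_R1_00M.fastq` pattern from the question. The function names and the output naming are made up for illustration. Chunks are concatenated in numeric order, separately for R1 and R2, so the paired files stay in sync.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Pattern for names such as WES01_AGTCCA_L001_R1_001.fastq
CHUNK_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<index>[ACGTN]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq$"
)


def group_chunks(filenames):
    """Group chunk files by (sample, index, lane, read), sorted by chunk number."""
    groups = defaultdict(list)
    for name in filenames:
        match = CHUNK_NAME.match(Path(name).name)
        if match is None:
            continue  # skip files that do not follow the naming pattern
        key = (match["sample"], match["index"], match["lane"], match["read"])
        groups[key].append((int(match["chunk"]), name))
    return {key: [n for _, n in sorted(chunks)] for key, chunks in groups.items()}


def merge_chunks(filenames, out_dir):
    """Concatenate each group of chunks into a single FASTQ file in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = []
    for (sample, index, lane, read), chunks in group_chunks(filenames).items():
        target = out_dir / f"{sample}_{index}_L{lane}_R{read}.fastq"
        with open(target, "wb") as out:
            for chunk in chunks:
                with open(chunk, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged.append(target)
    return merged
```

For example, `merge_chunks(Path("fastq_dir").glob("*.fastq"), "merged")` would write one file per sample, lane, and read direction into a `merged` directory (both directory names are placeholders).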

Reply written 3.5 years ago by steve

Alternative wrote, 4.1 years ago:

This is a whole analysis pipeline that you need. For "A", fastqc http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ can help. For "B", you have to deal with combinations of bash/python/perl scripting and you can get what you need.
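
Editor's sketch: to make the three metrics in "A" concrete, here is a rough pure-Python illustration of what is being measured. It is not how FastQC computes its modules, and its numbers will not match FastQC's exactly. The adapter string is an assumption (the shared prefix of common Illumina TruSeq adapters); substitute the adapter for your library kit. Duplicates are counted here as exact sequence repeats, and all sequences are held in memory, so this only suits small files or subsamples.

```python
from collections import Counter

# Assumption: common prefix of Illumina TruSeq adapters; check your kit.
ADAPTER = "AGATCGGAAGAGC"


def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def summarize(sequences, adapter=ADAPTER):
    """Return GC %, exact-duplicate %, and adapter-containing % for the reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    bases = sum(len(seq) * n for seq, n in counts.items())
    if total == 0 or bases == 0:
        raise ValueError("no sequence data found")
    gc = sum((seq.count("G") + seq.count("C")) * n for seq, n in counts.items())
    with_adapter = sum(n for seq, n in counts.items() if adapter in seq)
    return {
        "reads": total,
        "gc_percent": 100.0 * gc / bases,
        # reads beyond the first copy of each distinct sequence
        "duplicate_percent": 100.0 * (total - len(counts)) / total,
        "adapter_percent": 100.0 * with_adapter / total,
    }


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as fastq:
        for key, value in summarize(read_fastq_sequences(fastq)).items():
            print(key, value)
```

Keep in mind that high duplication is expected in RNA-seq for highly expressed genes, so this number is harder to interpret than it is for genomic DNA.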

Why is "A" important: this is obvious, you need to check the quality of your sequences before deriving any biological hypothesis based on them.

Why is "B" important: I guess the person seems to "know" what to expect and he/she needs to have some "controls". Why not.

nalandaatmi (United States) wrote, 4.1 years ago:

Thanks Tamir and Ryan for your suggestions.

Dear Pierre,

Thanks for your explanation. Yes, I have run the FastQC analysis. For section A, I am looking at these sections of the fastqc.html report:

1) GC content

2) Sequence duplication levels

3) Overrepresented sequences

I am still working on section B.

nalandaatmi (United States) wrote, 4.1 years ago:

Hi All,

For section B, I am planning to take the sequences of the ribosomal RNA genes and align my sample reads against them using bowtie2. I assume the overall alignment rate that bowtie2 reports will be the percentage of reads matching ribosomal RNA genes. Am I correct?
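
Editor's sketch: if this approach is used, the rate can be pulled out of the alignment summary that bowtie2 prints to standard error, whose last line has the form `NN.NN% overall alignment rate`. The function name and the sample numbers below are illustrative only. Whether the resulting figure is a fair estimate of rRNA content depends on how complete the rRNA reference is and on the alignment settings.

```python
import re

# Matches the final line of bowtie2's alignment summary.
RATE_LINE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)% overall alignment rate\s*$", re.MULTILINE
)


def overall_alignment_rate(summary_text):
    """Return the overall alignment rate (in percent) from a bowtie2 summary."""
    match = RATE_LINE.search(summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))


if __name__ == "__main__":
    # Illustrative summary text; the numbers are made up.
    example = (
        "10000 reads; of these:\n"
        "  10000 (100.00%) were unpaired; of these:\n"
        "    9404 (94.04%) aligned 0 times\n"
        "    596 (5.96%) aligned exactly 1 time\n"
        "    0 (0.00%) aligned >1 times\n"
        "5.96% overall alignment rate\n"
    )
    print(overall_alignment_rate(example))
```

The summary can be captured by redirecting bowtie2's standard error to a file and passing that file's contents to the function.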

 


Open a new question. This is not a discussion forum but a Q&A site: 1 Question + Answers, not 1 Question + Comments + Answers + more Questions.

Reply written 4.1 years ago by Ido Tamir

Apologies, I thought I was still following up on section B of my original question. From now on, I will post follow-ups as separate questions. Thanks for letting me know.

Reply written 4.1 years ago by nalandaatmi