Question: How to calculate differentially expressed genes using PacBio full-length cDNA sequencing and Illumina sequencing?
3.4 years ago by
xzpgocxx0 wrote:

I have data from PacBio full-length cDNA sequencing and from Illumina sequencing, and I want to calculate gene expression levels (FPKM). There is no genome available for my species. With Illumina data alone I could do this using Trinity, but I don't know how to do it once the PacBio full-length cDNA reads are added. Thanks.

rna-seq • 2.1k views
modified 3.1 years ago by tjduncan270 • written 3.4 years ago by xzpgocxx0

As far as I know, if you have PacBio full-length cDNA, you already have the transcripts. Most of the PacBio analysis tools generate a GTF file at the end. I would generate the gene models from the transcript models and then quantify the genes using the Illumina reads to get gene expression levels. It's a bit tricky, so I will leave the details for you to work out.
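One way to picture the "gene models from transcript models" step is as a roll-up of transcript-level counts into gene-level counts. The snippet below is only a minimal sketch: the function name, the transcript IDs, and the transcript-to-gene table are all hypothetical, and it assumes you already have per-transcript read counts from whichever short-read quantifier you use.

```python
from collections import defaultdict


def gene_counts(transcript_counts, transcript_to_gene):
    """Sum transcript-level read counts into gene-level counts.

    Both arguments are plain dicts. The mapping is hypothetical here; in
    practice it would come from your own gene models.
    """
    totals = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        # A transcript without a gene assignment is treated as its own gene.
        gene_id = transcript_to_gene.get(transcript_id, transcript_id)
        totals[gene_id] += count
    return dict(totals)


# Made-up example values, for illustration only.
counts = {"tx1": 120.0, "tx2": 30.0, "tx3": 50.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(gene_counts(counts, tx2gene))
```

Summing raw counts per gene keeps the numbers usable for count-based differential expression tools; length-normalized values such as FPKM can be derived afterwards.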

modified 3.4 years ago • written 3.4 years ago by geek_y11k
3.1 years ago by
Indianapolis, IN
tjduncan270 wrote:

Since you don't have a reference genome or transcriptome, you would need to generate sample-specific gene models (i.e., build your own reference transcriptome from your PacBio data). You can then quantify expression from your Illumina data against this newly generated PacBio reference transcriptome, using your preferred short-read gene expression pipeline.

To get started on making your own reference transcriptome from your PacBio data, I would use their Iso-Seq pipeline. The link below outlines the major steps of the pipeline, its dependencies, and other tertiary PacBio Iso-Seq data analysis tools that may be useful.

-Iso-Seq Command Line Module from SMRT Link v4

"It includes three major steps:

  1. CCS: Getting CCS (circular consensus sequence) reads out of subreads BAM file.
  2. Classify: Identifying full-length CCS reads based on cDNA primers and polyA tail signal.
  3. Cluster: Isoform-level clustering and polishing to generate high-quality, full-length, transcript isoform sequences."

Once you have completed step 3, you should have the output file hq_isoforms.fastq.

This output includes the transcript sequences that are:

  • "Full-length (as indicated by presence of cDNA primers)
  • High-quality (predicted accuracy by default is >= 99%)
  • Supported by 2 or more FL reads (unless you changed the default or are using older versions)."
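Many downstream tools (clustering programs, short-read aligners and quantifiers) expect FASTA rather than FASTQ, so a small conversion step is often useful at this point. The following is a minimal sketch using only the Python standard library. It assumes the simple four-lines-per-record FASTQ layout, and the function names and the output file name are hypothetical.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) pairs from a FASTQ stream.

    Assumes exactly four lines per record (header, sequence, '+', qualities).
    """
    handle = iter(handle)
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header = record[0].strip()
        sequence = record[1].strip()
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        yield header[1:], sequence


def fastq_to_fasta(fastq_path, fasta_path):
    """Write every FASTQ record as FASTA and return the number of records."""
    written = 0
    with open(fastq_path) as fin, open(fasta_path, "w") as fout:
        for name, sequence in read_fastq(fin):
            fout.write(f">{name}\n{sequence}\n")
            written += 1
    return written


# Example call (output name is an assumption):
# fastq_to_fasta("hq_isoforms.fastq", "hq_isoforms.fasta")
```

The returned record count is a quick sanity check on how many high-quality isoforms the cluster step produced.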

From there you have several options. An overview can be seen:

For your goal of combining PacBio and Illumina data to calculate gene expression levels without a reference, you would likely want to collapse the high-quality isoforms into a single set of unique isoforms. To do that you could use Cogent or CD-HIT.
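If you go the CD-HIT route, the nucleotide program is cd-hit-est, which takes a FASTA input (-i), an output path (-o), and a sequence identity threshold (-c). The wrapper below is a rough sketch, not a recommended setting: the 0.99 threshold, the function names, and the file names are assumptions, and lower thresholds may also need a different word size, so check the CD-HIT documentation before relying on it.

```python
import shutil
import subprocess


def cdhit_est_command(in_fasta, out_fasta, identity=0.99):
    """Build a cd-hit-est command line as a list of arguments.

    identity is passed to -c; the 0.99 default is an illustrative choice.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return ["cd-hit-est", "-i", in_fasta, "-o", out_fasta, "-c", str(identity)]


def run_cdhit_est(in_fasta, out_fasta, identity=0.99):
    """Run cd-hit-est if it is installed; raise if it is not on PATH."""
    command = cdhit_est_command(in_fasta, out_fasta, identity)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("cd-hit-est was not found on PATH")
    subprocess.run(command, check=True)
    return command


# Example call (file names are assumptions):
# run_cdhit_est("hq_isoforms.fasta", "hq_isoforms.nr.fasta")
```

Building the command separately from running it makes it easy to log or inspect exactly what was executed.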

From there you should have a reference transcriptome against which you can run your preferred Illumina data pipeline.
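Since the question asks for FPKM, here is the standard definition written out as a small function: fragments per kilobase of transcript per million mapped fragments. This is just a sketch of the arithmetic; the function name and the example numbers are made up, and most quantification pipelines report FPKM or TPM for you.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (length_in_bp * total_mapped_fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up example: 500 fragments on a 2,000 bp transcript,
# with 20 million mapped fragments in the library.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

For a gene-level value, the transcript-level FPKMs of a gene's isoforms can be summed.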

written 3.1 years ago by tjduncan270

Thanks, this is really the best guide for solving this problem. Let me try it.

written 3.1 years ago by xzpgocxx0