How to calculate the different expression gene used PacBio full-length cDNA sequencing and Illumina sequencing?
Entering edit mode
5.3 years ago
xzpgocxx ▴ 20

Now, I have some data from PacBio full-length cDNA sequencing and Illumina sequencing. And I want to calculate the gene expression leave (FPKM). There is no available genome for my species. I can do it used Trinity if only have Illumina data, but now I don't know how to do it when adding PacBio full-length cDNA reads? Thanks.

RNA-Seq • 3.0k views
Entering edit mode

As far as I know, if you have the PacBio full-length cDNA , you already have the transcripts. Most of the PacBio analysis tools generates a GTF file at the end. I would generate the gene models from the transcript models and quantify the genes using illumina reads to get the expression levels of genes. Its bit tricky, I will leave it for you to think.

Entering edit mode
4.9 years ago
tjduncan ▴ 270

Since you don't have a reference genome/transcriptome you would need to generate your sample specific genemodels (ie: generate your own reference transcriptome from your pacbio data) then quantify the expression levels of your illumina data using your newly generated pacbio reference transcriptome with your preferred short-read gene expression pipeline.

To get started on making your own reference transcriptome from you pacbio data I would use their iso-seq pipeline. The link below that outlines the major steps of the pipeline, dependencies, and other tertiary pacbio iso-seq data analysis tools that may be useful.

-Iso-Seq Command Line Module from SMRT Link v4

"It includes includes three major steps:

  1. CCS: Getting CCS (circular consensus sequence) reads out of subreads BAM file.
  2. Classify: Identifying full-length CCS reads based on cDNA primers and polyA tail signal.
  3. Cluster: Isoform-level clustering and polishing to generate high-quality, full-length, transcript isoform sequences."

Once you have completed step 3 you should have an the output: hq_isoforms.fastq

This output includes the transcript sequences that are:

  • "Full-length (as indicated by presence of cDNA primers)
  • High-quality (predicted accuracy by default is >= 99%)
  • Supported by 2 or more FL reads (unless you changed the default or are using older versions)."

From there you have several options. An overview can be seen:

For your goal of adding pacbio to illumina data to calculate gene expression levels without a reference you would likely want to collapse the high quality isoforms into a single set of unique isoforms. To do that you could use Cogent or CD-Hit:

From there you should have a reference trascriptome that you could preform your preferred illumina data pipeline on.

Entering edit mode

Thanks, It's really the best guide to solve this question. Let me try.


Login before adding your answer.

Traffic: 2016 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6