Question: transcripts sequence preparation for RNAseq DE analysis
0
gravatar for yifangt86
2.1 years ago by
yifangt8610
USA
yifangt8610 wrote:

I am confused with the technical part to discriminate the transcript-level and gene-level in RNA-seq analysis. The biggest part is to get transcripts sequences as reference for indexing step (kallisto/salmon) from the gff3 file.

In following example partial gff3 file, the exon1 is the same as utr5p1, and exon2 is the same as utr5p2 by the coordinates.

chr1A   Acsv1.0_UTR gene    11085   42819   31  +   .   ID=Acs1A01G010LC;name=Acs1A01G010LC;primconf=LC
 chr1A  Acsv1.0_UTR mRNA    11085   42819   .   +   .   ID=Acs1A01G010LC.1;Parent=Acs1A01G010LC;Name=Acs1A01G010LC.1;Note=Acs1A01G010LC;primconf=LC;secconf=LC2
 chr1A  Acsv1.0_UTR exon    11085   11501   .   +   .   ID=Acs1A01G010LC.1.exon1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  11085   11501   .   +   .   ID=Acs1A01G010LC.1.utr5p1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR exon    23483   23914   .   +   .   ID=Acs1A01G010LC.1.exon2;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  23483   23914   .   +   .   ID=Acs1A01G010LC.1.utr5p2;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  40929   41201   .   +   .   ID=Acs1A01G010LC.1.utr5p3;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR exon    40929   42819   .   +   .   ID=Acs1A01G010LC.1.exon3;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR CDS     41202   41522   .   +   0   ID=Acs1A01G010LC.1.cds1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 3'_UTR  41523   42819   .   +   .   ID=Acs1A01G010LC.1.utr3p1;Parent=Acs1A01G010LC.1

My questions are:

  1. Should the redundant exon/5'_UTR entries be excluded for the transcripts sequence? as I am not sure the redundant sequence will cause the reads counts doubled eventually for gene-level expression if I use kallisto/sleuth for DE analysis, for example.
  2. Similarly, should the CDS be excluded from this gff3 file to extract the transcripts for the reference sequence because I already have a separate fasta file for CDS sequences?

Thanks a lot!

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by yifangt8610

It is generally better to align to the genome and then count as needed. You could also look at Salmon/Kallisto which do alignment free estimations of transcript abundance.

ADD REPLYlink written 2.1 years ago by genomax71k

Thanks!

Kallisto manual indicates to use transcripts for index. Could you please share your idea to get transcript-level / gene-level from the indexed whole genome, if possible? Thanks again.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by yifangt8610

Once you do the alignments to genome you can use a program like featureCounts (or HTseq-count) to summarize your data at different levels. Take a look at the featureCounts manual to discover the options in chapter 6 (page 31).

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax71k
1
gravatar for h.mon
2.1 years ago by
h.mon27k
Brazil
h.mon27k wrote:

If you intend to use kallisto or Salmon, there is no need to you should not remove the redundant regions shared between different transcripts of the same gene, as these programs use an expectation-maximization algorithm to best estimate how to apportion shared counts to transcripts. So you should have a fasta with one sequence per transcript, with its whole transcribed region.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by h.mon27k
1
gravatar for yifangt86
2.1 years ago by
yifangt8610
USA
yifangt8610 wrote:

Thanks genomax and h.mon!

I found out the solution for my problem with gffread, and I post it here for those who have the same problem.

   gffread $gff3_file -g $genome.fasta -w transcripts.fasta
    -w  write a fasta file with spliced exons for each GFF transcript   #So that the UTRs are included
    -x  write a fasta file with spliced CDS for each GFF transcript

The -w option resolved my issue, so that I can have one sequence including the UTR regions for each transcript for kallisto.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by yifangt8610
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1151 users visited in the last hour