Question

transcripts sequence preparation for RNAseq DE analysis

0

Entering edit mode

6.7 years ago

yifangt86 ▴ 60

I am confused with the technical part to discriminate the transcript-level and gene-level in RNA-seq analysis. The biggest part is to get transcripts sequences as reference for indexing step (kallisto/salmon) from the gff3 file.

In following example partial gff3 file, the exon1 is the same as utr5p1, and exon2 is the same as utr5p2 by the coordinates.

chr1A   Acsv1.0_UTR gene    11085   42819   31  +   .   ID=Acs1A01G010LC;name=Acs1A01G010LC;primconf=LC
 chr1A  Acsv1.0_UTR mRNA    11085   42819   .   +   .   ID=Acs1A01G010LC.1;Parent=Acs1A01G010LC;Name=Acs1A01G010LC.1;Note=Acs1A01G010LC;primconf=LC;secconf=LC2
 chr1A  Acsv1.0_UTR exon    11085   11501   .   +   .   ID=Acs1A01G010LC.1.exon1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  11085   11501   .   +   .   ID=Acs1A01G010LC.1.utr5p1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR exon    23483   23914   .   +   .   ID=Acs1A01G010LC.1.exon2;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  23483   23914   .   +   .   ID=Acs1A01G010LC.1.utr5p2;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  40929   41201   .   +   .   ID=Acs1A01G010LC.1.utr5p3;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR exon    40929   42819   .   +   .   ID=Acs1A01G010LC.1.exon3;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR CDS     41202   41522   .   +   0   ID=Acs1A01G010LC.1.cds1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 3'_UTR  41523   42819   .   +   .   ID=Acs1A01G010LC.1.utr3p1;Parent=Acs1A01G010LC.1

My questions are:

Should the redundant exon/5'_UTR entries be excluded for the transcripts sequence? as I am not sure the redundant sequence will cause the reads counts doubled eventually for gene-level expression if I use kallisto/sleuth for DE analysis, for example.
Similarly, should the CDS be excluded from this gff3 file to extract the transcripts for the reference sequence because I already have a separate fasta file for CDS sequences?

Thanks a lot!

rna-seq transcript-leve gene-level transcripts • 2.6k views

ADD COMMENT • link 6.7 years ago by yifangt86 ▴ 60

0

Entering edit mode

It is generally better to align to the genome and then count as needed. You could also look at Salmon/Kallisto which do alignment free estimations of transcript abundance.

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

Thanks!

Kallisto manual indicates to use transcripts for index. Could you please share your idea to get transcript-level / gene-level from the indexed whole genome, if possible? Thanks again.

ADD REPLY • link 6.7 years ago by yifangt86 ▴ 60

0

Entering edit mode

Once you do the alignments to genome you can use a program like featureCounts (or HTseq-count) to summarize your data at different levels. Take a look at the featureCounts manual to discover the options in chapter 6 (page 31).

ADD REPLY • link 6.7 years ago by GenoMax 141k

score 1 · Answer 1 · 2017-08-16

If you intend to use kallisto or Salmon, ~~there is no need to~~ you should not remove the redundant regions shared between different transcripts of the same gene, as these programs use an expectation-maximization algorithm to best estimate how to apportion shared counts to transcripts. So you should have a fasta with one sequence per transcript, with its whole transcribed region.

score 1 · Answer 2 · 2017-08-17

Thanks genomax and h.mon!

I found out the solution for my problem with gffread, and I post it here for those who have the same problem.

   gffread $gff3_file -g $genome.fasta -w transcripts.fasta
    -w  write a fasta file with spliced exons for each GFF transcript   #So that the UTRs are included
    -x  write a fasta file with spliced CDS for each GFF transcript

The -w option resolved my issue, so that I can have one sequence including the UTR regions for each transcript for kallisto.