Count Tables For Deseq [After Alignreads.Pl (Trinity)]
4
2
Entering edit mode
9.8 years ago
lzsph ▴ 70

Hey guys,

We sequenced a plant without a reference genome. We got 100bp paired-end reads using HiSeq 2000 (two replicates each condition), put all sequencing reads together and did de novo assembly using Trinity to get a "reference" transcriptome.

I aligned the individual reads back to the transcriptome, using the alignReads.pl script within Trinity package. The command I used was:

\$/PATH/TO/SCRIPTS/alignReads.pl --left Sample1_R1.fastq --right Sample1_R2.fastq --seqType fq --target REFERENCE_transcriptome.fasta --aligner bowtie

As a result, I got many results' names and sizes listed below, for one sample.

bowtie_out.coordSorted.bam 4.56GB

bowtie_out.coordSorted.bam.bai 12.7MB

bowtie_out.nameSorted.bam 3.86GB

bowtie_out.nameSorted.PropmapPairsForRSEM.bam 3.38GB

others are 0 bytes in size

My questions is, I want to get differential expressed genes (rather than transcripts) using DESeq, I should get count tables first. But I don't know which data listed above should be used and how to generate count tables? Such as using HTSeq or other programs or scripts.

Simon said, "If you must align against the transcriptome, make sure that you count for genes, not transcripts, and remove reads mapping to transcripts from more than one gene." And I just want to count for genes.

If I am wrong or not using softwares properly with statements above, critics and suggestions are greatly appreciated.

Thank you very much!

Yours sincerely, -lzsph

trinity deseq • 6.0k views
1
Entering edit mode
9.8 years ago

I have never used Trinity so I can't give you the exact answer. Normally when you align your RNAseq reads against a reference genome, the program will give you a bam file and a tab file that will have the counts and RPKM value at the gene or transcript level for all the genes or transcripts you had provided to the program in the form of a gtf/gff3 file. DESeq, edgeR or DEGseq then take the counts or RPKM files as an input (counts file is more preferred). Count refers to the number of short reads aligned against a gene. RPKM is the "Count" value normalized by the length of the gene and the total number of mapped reads for the sample. I would use bowtie_out.nameSorted.PropmapPairsForRSEM.bam file and use "htseq-count" script that takes gtf file and the bam file as an input and will produce the count file which is required by DEGseq, DESeq or edgeR. Though I have never used Trinity but I think it should produce the RPKM or "Count" file for you. Are you supposed to provide gtf/gff3 file to Trinity ?

0
Entering edit mode

Hi ashuto,

Thanks for your reply. Since we don't have a reference genome available, we put all sequencing reads together to assemble a transcriptome as reference. Count tables generated by "htseq-count" script are easy for those species with a reference genome, which we don't have. I don't have any gtf/gff3 files. And I want to count for genes rather than transcripts. Have you get any idea of it?

Regards,

lzsph

0
Entering edit mode
9.8 years ago

I would use RSEM as outlined in the tutorial where it also says you will get counts on both gene and isoform levels. You could then use ebSeq (which directly supports RSEM output) or round your counts to the closest integer and use DESeq.

0
Entering edit mode

Thank you Mikael,

I will give it a try.

Regards,

lzsph