Question

questions on rna-seq data analysis

0

Entering edit mode

9.3 years ago

poisonAlien ★ 3.2k

Hi all I have some doubts on rna-seq analysis. I have rna-seq data (paired end using Truseq) from patients - 4 tumor and one normal (not matched normal). I am trying to find differentially expressed genes. So far so good - aligned with tophat2, estimated expression with cufflinks, got counts with ht-seq count, did differential expression using DESeq2. But here are my concerns -

While counting with htseq-count I am getting very low counts assigned to features. On an average only 30% of the fragments are assigned to features (refseq gtf). Maximum of the remaining 70% fragments are ambiguous or no feature. Is it common to find such low percentage of counts? I read one could get around 60%. I tried with all modes in htseq (reverse union, stranded union, unstranded union and with intersection nonempty) and the best result was with unstranded union mode.

For some of the features ht-seq count assigned zero counts. For example between region chr20:3776385-3786768 htseq gives -

3919 SAM alignment pairs processed.
NM_001287516    0
NM_001287517    0
NM_001287518    0
NM_001287519    0
NM_001287520    0
NM_001287522    0
NM_001287524    5
NM_004358    0
NM_021872    0
NM_021873    0
__no_feature    82
__ambiguous    3683
__too_low_aQual    0
__not_aligned    0
__alignment_not_unique    149

Of ~3900 pairs 3683 are ambiguous and only 5 are assigned to one of the feature. I get this, since there are so many overlapping genes within 10kb, there is no way to tell which gene a fragment originated from, hence so many ambiguous. But when I used cufflinks for the same (some of) genes I get high FPKM values -

NM_001287516    1.69492e-71
NM_001287517    9.62116e-75
NM_001287518    1.05537e-103
NM_001287519    0.000182958
NM_001287520    31.2244
NM_001287522    6.82458
NM_001287524    1.72693
NM_004358    1.47296
NM_021872    0.017841
NM_021873    17.0096

Does this mean cufflinks does not account for ambiguous reads while calculating FPKM?

Some of the threads I saw, say that for trueseq libraries we should use fr-firststrand as library type during tophat2 alignment. But here it is metioned to usefr-unstranded (I guess it should also be considered while using htseq). Which is the right way?
Since I do not have replicates/samples for normal tissue, using DESeq2 is advisable or should I be better off with fold changes from FPKM?

Thanks

deseq2 htseq-count cufflinks fpkm RNA-Seq • 4.7k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by poisonAlien ★ 3.2k

Ram · Accepted Answer · 2014-12-31

2

Entering edit mode

9.3 years ago

michael.ante ★ 3.8k

Hi PoisonAlien,

I fear you are counting reads on transcript-level, but htseq-count is rather designed for gene-counting. It counts reads falling unambiguously to defined regions (exons). Since most transcripts of a given gene share some exons, the reads of such shared exons cannot be assigned to a distinct transcript and are therefore counted as "ambiguous".
Cufflinks on the other hand tries to resolve the transcripts' abundances per gene with a coverage-based approach. A single read can thereby assigned partially to different transcripts.
Use RSeQC to get the correct library type (see here)
It is always hard without replicates/samples, and I'd guess even harder with different sample sizes in your case. I would go for cuffdiff first. Maybe you find similar data (organ, sex, library prep., sequencer, etc.) in public available databases like GEO, SRA, or GXA in order to have equal sample sizes.

Cheers,
Michael

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by michael.ante ★ 3.8k

0

Entering edit mode

Hi Michael,

Thank you for the reply.

I am using htseq-count in default mode (gene level). I checked annotation for above transcripts (NM_001287516 to 24) they all map to gene CDC25B but they are all duplicates. ucsc refflat gtf has same gene_id (CD25B) for all above transcripts but ucsc_refseq gtf has different gene_id (as above) for all transcripts. This is sort of confusing and results different outputs. What is the best gtf to use in this case? Ensembl also has different tx_id and gene_id similar to refseq.

EDIT: Sorry, Ensembl gtf works fine!

EDIT2: Using Ensembl gtf increased percentage counts to 65%. I guess it using proper gtf matters a lot !
This makes sense.
I used infer_experiment.py from the suggested tool, I guess my library is unstranded ( I got 0.49 0.49 for paired end). Thanks again.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by poisonAlien ★ 3.2k

0

Entering edit mode

Stick with Ensembl, the UCSC annotations are a mess.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Devon Ryan 104k

0

Entering edit mode

Yup! Found it hard way.

Could you also comment on using DESeq in this case ? I have only one sample for normal and 4 for tumor. I tried both DESeq as well cuffdiff.. Both produces different results. Here it's mentioned cuffdiff seems to be a better option if one has no replicates.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by poisonAlien ★ 3.2k

0

Entering edit mode

Depends a bit on the exact comparison you want to do, though there's generally no good way to handle unreplicated data.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Devon Ryan 104k