Question: questions on rna-seq data analysis
0
gravatar for poisonAlien
4.5 years ago by
poisonAlien2.8k
Asgard
poisonAlien2.8k wrote:

Hi all I have some doubts on rna-seq analysis. I have rna-seq data (paired end using trueseq) from patients - 4 tumor and one normal (not matched normal). I am trying to find differentially expressed genes. So far so good - aligned with tophat2, estimated expression with cufflinks, got counts with ht-seq count, did differential expression using DESeq2. But here are my concerns -

1. While counting with htseq-count I am getting very low counts assigned to features. On an average only 30% of the fragments are assigned to features (refseq gtf). Maximum of the remaining 70% fragments are ambiguous or no feature.  Is it common to find such low percentage of counts ? I read one could get around 60%. I tried with all modes in htseq (reverse union, stranded union, unstranded union and with intersection nonempty) and the best result was with unstranded union mode.

2. For some of the features ht-seq count assigned zero counts. For example between region chr20:3776385-3786768 htseq gives -

3919 SAM alignment pairs processed.
NM_001287516    0
NM_001287517    0
NM_001287518    0
NM_001287519    0
NM_001287520    0
NM_001287522    0
NM_001287524    5
NM_004358    0
NM_021872    0
NM_021873    0
__no_feature    82
__ambiguous    3683
__too_low_aQual    0
__not_aligned    0
__alignment_not_unique    149

Of ~3900 pairs 3683 are ambiguous and only 5 are assigned to one of the feature. I get this, since there are so many overlapping genes within 10kb, there is no way to tell which gene a fragment originated from, hence so many ambiguous. But when I used cufflinks for the same (some of) genes I get high FPKM values -

NM_001287516    1.69492e-71
NM_001287517    9.62116e-75
NM_001287518    1.05537e-103
NM_001287519    0.000182958
NM_001287520    31.2244
NM_001287522    6.82458
NM_001287524    1.72693
NM_004358    1.47296
NM_021872    0.017841
NM_021873    17.0096

Does this mean cufflinks does not account for ambiguous reads while calculating FPKM ?

3. Some of the threads I saw, say that for trueseq libraries we should use fr-firststrand as library type during tophat2 alignment. But here it is metioned to use fr-unstranded (I guess it should also be considered while using htseq). Which is the right way ?

4. Since I do not have replicates/samples for normal tissue, using DESeq2 is advisable or should I be better off with fold changes from FPKM?

Thanks.

ADD COMMENTlink modified 4.5 years ago by michael.ante3.3k • written 4.5 years ago by poisonAlien2.8k
2
gravatar for michael.ante
4.5 years ago by
michael.ante3.3k
Austria/Vienna
michael.ante3.3k wrote:

Hi PoisonAlien,

1) I fear you are counting reads on transcript-level, but htseq-count is rather designed for gene-counting. It counts reads falling unambiguously to defined regions (exons). Since most transcripts of a given gene share some exons, the reads of such shared exons cannot be assigned to a distinct transcript and are therefore counted as "ambiguous".

2) Cufflinks on the other hand tries to resolve the transcripts' abundances per gene with a coverage-based approach. A single read can thereby assigned partially to different transcripts.

3) Use RSeQC to get the correct library type (see A: tophat --library-type option)

4) It is always hard without replicates/samples, and I'd guess even harder with different sample sizes in your case. I would go for cuffdiff first. Maybe you find similar data (organ, sex, library prep., sequencer, etc.)  in public available databases like GEOSRA, or GXA in order to have equal sample sizes.

Cheers,

Michael

ADD COMMENTlink written 4.5 years ago by michael.ante3.3k

Hi Michael,

Thank you for the reply.

1. I am using htseq-count in default mode (gene level). I checked annotation for above transcripts (NM_001287516 to 24) they all map to gene CDC25B but they are all duplicates. ucsc refflat gtf has same gene_id (CD25B) for all above transcripts but ucsc_refseq gtf has different gene_id (as above) for all transcripts. This is sort of confusing and results different outputs. What is the best gtf to use in this case ? ensembl also has differnet tx_id and gene_id similar to refseq.

EDIT: Sorry, Ensebl gtf works fine!

EDIT2: Using Enseble gtf increased percentage counts to 65%. I guess it using proper gtf matters a lot !

2. This makes sense.

3. I used infer_experiment.py from the suggested tool, I guess my library is unstranded ( i got 0.49 0.49 for paired end). Thanks again.

 

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by poisonAlien2.8k

Stick with Ensembl, the UCSC annotations are a mess.

ADD REPLYlink written 4.5 years ago by Devon Ryan91k

Yup ! Found it hard way.

Could you also comment on using DESeq in this case ? I have only one sample for normal and 4 for tumor. I tried both DESeq as well cuffdiff.. Both produces different results. Here its mentioned cuffdiff seems to be a better option if one has no replicates.

ADD REPLYlink written 4.5 years ago by poisonAlien2.8k

Depends a bit on the exact comparison you want to do, though there's generally no good way to handle unreplicated data.

ADD REPLYlink written 4.5 years ago by Devon Ryan91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1053 users visited in the last hour