Question

DEXSeq aggregates two different genes together

0

Entering edit mode

4.0 years ago

rohitsatyam102 ▴ 850

Hi All!

I am relatively new to DEXSeq. I was trying to follow a tutorial given at Bioconductor here

chr16 dexseq_prepare_annotation.py aggregate_gene 69320140 69408571 . - . gene_id "ENSG00000259900.5+ENSG00000272617.3+ENSG00000260371.1+ENSG00000132604.11+ENSG00000258429.2+ENSG00000157315.5+ENSG00000213380.15" chr16 dexseq_prepare_annotation.py exonic_part 69320140 69321046 . - . transcripts "ENST00000564419.1"; exonic_part_number "001"; gene_id "ENSG00000259900.5+ENSG00000272617.3+ENSG00000260371.1+ENSG00000132604.11+ENSG00000258429.2+ENSG00000157315.5+ENSG00000213380.15" It's weird to see TERF2 (Chromosome 16: 69,355,567-69,408,571) aggregate with RP11-343C2.9 because of partial overlap of one exon, as shown here. Wouldn't this bias the exon usage if we are interested in studying only exon usage of one gene. How is the analysis affected if I consider overlapping/shared exon separately using "-r no" option?

Also, in DEXSeq, I am unable to perform gene subsetting using dxd = dxd[geneIDs( dxd ) %in% genesForSubset,] I am not sure how to make the "geneIDsinsubset.txt" file properly. I have a .csv file with a single gene ensemble ID: ENSG00000132604.11. However running the above command, DEXSeq throws the following error

Error in `$<-.data.frame`(`*tmp*`, "dispersion", value = NA) :  replacement has 1 row, data has 0

Please help...

dexseq RNA-Seq rna-seq • 1.9k views

ADD COMMENT • link updated 4.0 years ago by Kevin Blighe 87k • written 4.0 years ago by rohitsatyam102 ▴ 850

2

Entering edit mode

How is the analysis affected if I consider overlapping/shared exon separately using "-r no" option?

I think that you can try that and then report back here, if you have time. Thanks.

Regarding "geneIDsinsubset.txt", it should just be a single-column file of gene IDs that will be used to filter your data.

Kevin

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k

score 1 · Accepted Answer · 2020-05-26

1

Entering edit mode

3.9 years ago

rohitsatyam102 ▴ 850

Hi Kevin. First of all, thanks for responding. I did the analysis and realized that the -r option is throwing 1/4th of my reads into the "empty" category which I think is not considered while performing Differential Exon Usage. I have raised a question here regarding the same. However, not using -r functions groups genes in a very undesirable way. Say Gene D and E share an exon present at 5' end of E. And A and B gene also shares an exon. If B shares an exon with C and C shares an exon with D, DEXSeq combines all 5 genes in form A+B+C+D+E, which is kinda troubling and increases the exonic bins (in my case there were 75 exonic bins, that got reduced to 35 on using the option "-r no"). Also, I realized that in such a case where the genes are grouped, simply using Ensemble Gene ID won't help. Rather, the user needs to go and check the .gff he/she created in the previous step to see, if their gene of interest is grouped or not, and if grouped they have to use the entire "A+B+C+D+E" as a single entity and mention in the geneIDsinsubset.txt.

Thus using ENSG00000259900.5+ENSG00000272617.3+ENSG00000260371.1+ENSG00000132604.11+ENSG00000258429.2+ENSG00000157315.5+ENSG00000213380.15 in geneIDsinsubset.txt helped me.

ADD COMMENT • link 3.9 years ago by rohitsatyam102 ▴ 850

1

Entering edit mode

Hey Rohit, thanks for coming back to explain - means a lot. Unfortunately, I don't have much to add due to the fact that the last time that I used DEXSeq was in 2016. I imagine that it is very difficult (for any program) to account for every possible splice-configuration, though. For what it's worth, the DEXSeq developer hangs out on the Bioconductor forum, so, you may get a better answer there: https://support.bioconductor.org/t/Latest/

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Rohit,

I'm having the same problem with Error in $<-.data.frame(*tmp*, "dispersion", value = NA) : replacement has 1 row, data has 0 being returned from dxd = dxd[geneIDs( dxd ) %in% genesForSubset,].

I have checked my gff file and it does look like the geneID is grouped with two others, so I have written the Ensembl gene IDs as A+B+C in the geneIDsinsubset.txt file.

However, when I updated the genesForSubset and tried to run the dxd = dxd[geneIDs( dxd ) %in% genesForSubset,] command, exactly the same error appeared. Was there anything else you did to get this to work that you've not stated above? Does the command need editing when IDs are grouped like this?

ADD REPLY • link 3.9 years ago by rowan-taylor1 • 0

1

Entering edit mode

Check if you by mistake copied the bin numbers present after the ":" something like this:-

ENSG00000157315.5+ENSG00000213380.15:E001

remove

:E001

or if you forgot to put like .15 present in one of my ensemble IDs. I would like to see the list of IDS you are using because small problems like this have kept me scratching my head.

ADD REPLY • link 3.9 years ago by rohitsatyam102 ▴ 850

0

Entering edit mode

@rowan-taylor1 please give a thumbs up if it solved your problem.

ADD REPLY • link 3.9 years ago by rohitsatyam102 ▴ 850

0

Entering edit mode

I sorted it out, Rohit. I moved my original answer to a comment, and moved your comment to an answer. Thanks!

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k