I'm having some trouble understanding what the collapse option does with RNA-seq data and whether I should be using it.
I'm using GSEA 4.0.3 to analyze RNA-seq data. I was able to follow the instructions here with the hallmarks collection and Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip; it ran, and I thought everything had worked perfectly.
But then I also tried to run GSEA preranked with the -log10 of the p-values from sleuth and got error 1020: multiple rows mapped to RDH16 (just an example). I thought that was impossible, because I had dropped all duplicates using pandas when I created the rank file. That is when I realized that my initial GSEA run had used the default option to collapse my dataset, whereas the default for GSEA preranked is remap only. So I investigated what collapsing the dataset was doing to my data and found the following:
When I read the documentation, the collapse tool seems like it was meant for collapsing multiple probes to one gene, not for collapsing multiple genes to one gene. But then I looked at the genes present in the Hallmark collection gene sets, and only RDH16 is present (for example, Rdh1, Rdh6, and Rdh16f2 are not), so it seems like the chip file is meant to match the genes present in the gene sets? Does anyone have advice on the best practice for using or not using the collapse tool with RNA-seq data, and on the best practice for getting GSEA preranked to run when multiple rows map to the same gene (i.e. RDH16)?
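To make the preranked question concrete, here is a minimal sketch of one workaround I could imagine: doing the mapping and collapsing by hand in pandas before handing the rank file to GSEA. It is only an illustration, not necessarily what GSEA does internally. The assumptions are that the chip file has "Probe Set ID" and "Gene Symbol" columns, that the rank table has columns I've called `gene_id` and `score`, and that keeping the maximum score per symbol is acceptable (reasonable for -log10 p-values, which are non-negative). The IDs and scores below are made-up placeholders, and `collapse_ranks` is a hypothetical helper name.

```python
import pandas as pd


def collapse_ranks(ranks: pd.DataFrame, chip: pd.DataFrame) -> pd.DataFrame:
    """Map source IDs to gene symbols, then keep one row per symbol.

    ranks: columns "gene_id" and "score" (hypothetical names).
    chip:  columns "Probe Set ID" and "Gene Symbol" (assumed chip layout).
    """
    # An inner join drops IDs that have no entry in the chip file.
    merged = ranks.merge(
        chip, left_on="gene_id", right_on="Probe Set ID", how="inner"
    )
    # Several source IDs can share a symbol; keep the largest score for each.
    collapsed = merged.groupby("Gene Symbol", as_index=False)["score"].max()
    return collapsed.sort_values("score", ascending=False).reset_index(drop=True)


# Placeholder data: three made-up IDs that all map to RDH16, plus one other gene.
ranks = pd.DataFrame(
    {
        "gene_id": ["ID_A", "ID_B", "ID_C", "ID_D"],
        "score": [3.2, 1.1, 0.5, 2.0],
    }
)
chip = pd.DataFrame(
    {
        "Probe Set ID": ["ID_A", "ID_B", "ID_C", "ID_D"],
        "Gene Symbol": ["RDH16", "RDH16", "RDH16", "ACTB"],
    }
)

collapsed = collapse_ranks(ranks, chip)
print(collapsed)
```

The resulting table has one row per symbol, so it could be written out with `collapsed.to_csv("ranks.rnk", sep="\t", header=False, index=False)` and run through GSEA preranked with collapsing turned off. Whether taking the maximum is the right summary for paralogs that share a symbol is exactly the part I'm unsure about.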