Question: How to prepare .gct file for GSEA software?
gravatar for zekunmu
3.5 years ago by
zekunmu0 wrote:


I am working on my NGS data for gene set enrichment analysis now. I wonder how can I convert my .fastq CleanData received from sequencing company into .gct file for GSEA software?

I know some software like GenePattern can convert some file format such as .arff and .pcl into .gct file, but it seems no such tool is available for .fastq file!

I'd appreciate it a lot if some could help!!!

sequencing gsea next-gen • 2.4k views
ADD COMMENTlink modified 3 months ago by Barry Digby420 • written 3.5 years ago by zekunmu0

As WouterDeCoster said, you are really skipping too many steps.

If all you care about is GSEA, you can rank your genes using some criteria (such as from DEG analysis) and prepare a rank file. You can then feed the ranked gene list to GSEA.

ADD REPLYlink written 3.5 years ago by moxu460
gravatar for Barry Digby
3 months ago by
Barry Digby420
National University of Ireland, Galway
Barry Digby420 wrote:

For future users who end up here, run this script:

bash norm_counts.txt output_prefix

Script assumed you have saved normalised counts from DESeq2 in R:

norm_counts <- counts(dds, normalized=TRUE)
write.table(norm_counts, file="norm_counts.txt", append = TRUE, sep="\t", row.names = TRUE, col.names = TRUE, quote = FALSE)

Here is the bash script:

#!/usr/bin/env bash

## Script to convert DESeq2 normalized counts to GCT format

if [ "$1" == "-h" ]; then
  echo "Usage: bash `basename $0` [FILE] [OUT_PREFIX]"
  echo "Example: bash `basename $0` norm_counts.txt GSE118959"
  exit 0

if [ "$1" != "" ]; then
    echo "Counts File: $1"
    echo "Counts File is missing"

if [ "$2" != "" ]; then
    echo "Output Prefix: $2"
    echo "Output Prefix is missing"


TMP=`wc -l $FILE | awk '{print $1}'`
LEN=`expr $TMP - 1`
TMP=`awk '{print NF}' $FILE | sort -nu | tail -n 1`
SAMPLES=`expr $TMP - 1`

cat $FILE \
| awk -F"\t" '{if(NR==1) $1="NAME"FS$1}1' OFS="\t" \
| awk '{$1 = $1 OFS (NR==1?"Description":"na")}1' \
| sed 's/ /\t/g' \
 'BEGIN{print "#1.2""\n"LEN"\t"SAMPLES}1' > ${OUT}.gct
ADD COMMENTlink modified 3 months ago • written 3 months ago by Barry Digby420

Thanks! It's really helpful. I tried this script and it works but the row names and column names in the output file contains quote for each gene/cell. How to remove the quotes as GSEA gene set doesn't use quote.

ADD REPLYlink written 4 weeks ago by tianqi.ma19940

When saving the normalized counts in R set quote = FALSE .

Or just remove them using sed sed 's/"//g' your_file.gct > no_quotes.gct

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Barry Digby420
gravatar for WouterDeCoster
3.5 years ago by
WouterDeCoster44k wrote:

You are skipping an awful lot of steps of your analysis. Your fastq data first needs to be aligned to the genome, reads need to be counted and you need to perform differential expression analysis. The results of the last step can be used in GSEA.

ADD COMMENTlink written 3.5 years ago by WouterDeCoster44k

Thank you so much! I have worked it out, what you said is really true. It was my first time working on NGS data. I finally found out the sequencing company actually processed the data somewhat, such as alignment and reads counting.

ADD REPLYlink written 3.4 years ago by zekunmu0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 702 users visited in the last hour