Question

How to prepare .gct file for GSEA software?

0

7.2 years ago

zekunmu • 0

Hello~

I am working on gene set enrichment analysis of my NGS data. How can I convert the .fastq clean data I received from the sequencing company into a .gct file for the GSEA software?

I know that software such as GenePattern can convert formats like .arff and .pcl into .gct, but no such tool seems to be available for .fastq files.

I'd appreciate it a lot if someone could help!

next-gen sequencing GSEA • 7.0k views

updated 4.0 years ago by Barry Digby ★ 1.3k • written 7.2 years ago by zekunmu • 0

0

As WouterDeCoster said, you are really skipping too many steps.

If all you care about is GSEA, you can rank your genes by some criterion (for example, a statistic from a differential expression analysis) and save the result as a ranked list file. You can then feed that ranked gene list to GSEA.

7.2 years ago by moxu ▴ 510
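
To make the ranked-list idea above concrete, here is a minimal sketch of building a two-column rank file (gene, score) sorted from highest to lowest score. It assumes a tab-separated differential expression table with a header line, gene IDs in column 1 and a ranking statistic in column 2. The function name make_rnk, the file names and the column layout are all made up for illustration, so adjust them to whatever your own results table looks like.

```shell
# Sketch only: make_rnk and the input layout (gene in column 1, score in
# column 2, one header line) are assumptions, not a fixed format.
make_rnk() {
  in_file="$1"
  out_file="$2"
  # Skip the header, drop genes with a missing score, sort by score (descending).
  awk -F"\t" 'NR > 1 && $2 != "NA" && $2 != "" {print $1 "\t" $2}' "$in_file" \
    | sort -t "$(printf '\t')" -k2,2gr > "$out_file"
}

# Tiny made-up example table
demo_dir=$(mktemp -d)
printf 'gene\tstat\nTP53\t-3.2\nMYC\t5.1\nGAPDH\tNA\nEGFR\t0.4\n' > "$demo_dir/de_results.tsv"
make_rnk "$demo_dir/de_results.tsv" "$demo_dir/ranked.rnk"
cat "$demo_dir/ranked.rnk"
```

Which statistic to rank by is a judgment call; the sketch just takes whatever is in column 2.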

score 4 · Answer 1 · 2020-04-27

For future users who end up here, run this script:

bash format_GCT.sh norm_counts.txt output_prefix

The script assumes you have saved the normalised counts from DESeq2 in R:

norm_counts <- counts(dds, normalized=TRUE)
write.table(norm_counts, file="norm_counts.txt", sep="\t", row.names = TRUE, col.names = TRUE, quote = FALSE)
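
One thing worth knowing about this file: with row.names = TRUE and col.names = TRUE, write.table writes a header line that has one field fewer than the data rows, because the row-name column gets no header. The script below relies on that when it prepends NAME to the header. A quick way to check your own file is something like the following sketch (field_counts and the toy file are only illustrative names and data):

```shell
# Sketch only: field_counts is a hypothetical helper, and the toy file just
# mimics the layout write.table produces (header without a row-name column).
field_counts() {
  awk -F"\t" 'NR == 1 {h = NF} NR == 2 {d = NF} END {print "header=" h " data=" d}' "$1"
}

counts_dir=$(mktemp -d)
printf 'S1\tS2\tS3\nGeneA\t10.5\t0\t3.2\nGeneB\t7\t8.1\t9\n' > "$counts_dir/norm_counts.txt"
field_counts "$counts_dir/norm_counts.txt"
# prints: header=3 data=4
```

If the header and data counts come out equal, the file already has a name for the first column and the NAME step in the script would shift the header by one.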

Here is the format_GCT.sh bash script:

#!/usr/bin/env bash

## Script to convert DESeq2 normalized counts to GCT format

if [ "$1" == "-h" ]; then
  echo "Usage: bash `basename $0` [FILE] [OUT_PREFIX]"
  echo
  echo "Example: bash `basename $0` norm_counts.txt GSE118959"
  exit 0
fi

if [ "$1" != "" ]; then
    echo "Counts File: $1"
else
    echo "Counts File is missing" >&2
    exit 1
fi

if [ "$2" != "" ]; then
    echo "Output Prefix: $2"
else
    echo "Output Prefix is missing" >&2
    exit 1
fi

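FILE=$1
OUT=$2

# Number of genes: total lines minus the header line
LEN=$(( $(wc -l < "$FILE") - 1 ))
# Number of samples: widest row minus the gene-name column
TMP=$(awk '{print NF}' "$FILE" | sort -nu | tail -n 1)
SAMPLES=$(( TMP - 1 ))

# 1. Prepend "NAME" to the header (write.table leaves the row-name column unnamed)
# 2. Insert a "Description" column ("na" for every gene)
# 3. Turn the spaces introduced by awk's default OFS back into tabs
# 4. Print the two GCT header lines (#1.2, then genes<TAB>samples) before the table
cat "$FILE" \
| awk -F"\t" '{if(NR==1) $1="NAME"FS$1}1' OFS="\t" \
| awk '{$1 = $1 OFS (NR==1?"Description":"na")}1' \
| sed 's/ /\t/g' \
| awk -v LEN="$LEN" -v SAMPLES="$SAMPLES" \
 'BEGIN{print "#1.2""\n"LEN"\t"SAMPLES}1' > "${OUT}.gct"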
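
Before loading the result into GSEA it can help to sanity-check the file. The sketch below assumes the layout the script above writes: "#1.2" on line 1, the gene and sample counts on line 2, a header starting with NAME and Description on line 3, then one row per gene. The function name check_gct and the toy file are made up for illustration; this is a rough consistency check, not a full validator.

```shell
# Sketch only: check_gct is a hypothetical helper. It returns non-zero and
# prints a message if the declared dimensions do not match the table.
check_gct() {
  awk -F"\t" '
    NR == 1 {
      if ($0 != "#1.2") { print "line 1 should be #1.2"; bad = 1 }
      next
    }
    NR == 2 { rows = $1 + 0; cols = $2 + 0; next }
    NR == 3 && ($1 != "NAME" || $2 != "Description") {
      print "line 3 should start with NAME and Description"; bad = 1
    }
    # Header and data rows: everything after the first two columns is a sample
    (NF - 2) != cols {
      print "line " NR ": expected " cols " sample columns, found " (NF - 2); bad = 1
    }
    NR > 3 { n++ }
    END {
      if ((n + 0) != rows) { print "declared " rows " rows, found " (n + 0); bad = 1 }
      exit (bad + 0)
    }
  ' "$1"
}

# Toy file in the expected layout
gct_dir=$(mktemp -d)
printf '#1.2\n2\t3\nNAME\tDescription\tS1\tS2\tS3\nGeneA\tna\t10.5\t0\t3.2\nGeneB\tna\t7\t8.1\t9\n' > "$gct_dir/good.gct"
check_gct "$gct_dir/good.gct" && echo "good.gct looks consistent"
```

For a real run you would call it on the file the script produced, e.g. check_gct output_prefix.gct.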

score 1 · Answer 2 · 2017-02-23

1

7.2 years ago

WouterDeCoster 47k

You are skipping an awful lot of steps of your analysis. Your fastq data first needs to be aligned to the genome, reads need to be counted and you need to perform differential expression analysis. The results of the last step can be used in GSEA.

7.2 years ago by WouterDeCoster 47k

0

Thank you so much! I have worked it out; what you said is true. It was my first time working with NGS data. I eventually found out that the sequencing company had already done part of the processing, such as alignment and read counting.

7.1 years ago by zekunmu • 0