How to prepare .gct file for GSEA software?
2
0
Entering edit mode
7.2 years ago
zekunmu • 0

Hello~

I am working on my NGS data for gene set enrichment analysis now. I wonder how can I convert my .fastq CleanData received from sequencing company into .gct file for GSEA software?

I know some software like GenePattern can convert some file format such as .arff and .pcl into .gct file, but it seems no such tool is available for .fastq file!

I'd appreciate it a lot if some could help!!!

next-gen sequencing GSEA • 7.0k views
ADD COMMENT
0
Entering edit mode

As WouterDeCoster said, you are really skipping too many steps.

If all you care about is GSEA, you can rank your genes using some criteria (such as from DEG analysis) and prepare a rank file. You can then feed the ranked gene list to GSEA.

ADD REPLY
4
Entering edit mode
4.0 years ago
Barry Digby ★ 1.3k

For future users who end up here, run this script:

bash format_GCT.sh norm_counts.txt output_prefix

Script assumed you have saved normalised counts from DESeq2 in R:

norm_counts <- counts(dds, normalized=TRUE)
write.table(norm_counts, file="norm_counts.txt", append = TRUE, sep="\t", row.names = TRUE, col.names = TRUE, quote = FALSE)

Here is the format_GCT.sh bash script:

#!/usr/bin/env bash

## Script to convert DESeq2 normalized counts to GCT format

if [ "$1" == "-h" ]; then
  echo "Usage: bash `basename $0` [FILE] [OUT_PREFIX]"
  echo
  echo "Example: bash `basename $0` norm_counts.txt GSE118959"
  exit 0
fi

if [ "$1" != "" ]; then
    echo "Counts File: $1"
else
    echo "Counts File is missing"
fi

if [ "$2" != "" ]; then
    echo "Output Prefix: $2"
else
    echo "Output Prefix is missing"
fi

FILE=$1
OUT=$2

TMP=`wc -l $FILE | awk '{print $1}'`
LEN=`expr $TMP - 1`
TMP=`awk '{print NF}' $FILE | sort -nu | tail -n 1`
SAMPLES=`expr $TMP - 1`


cat $FILE \
| awk -F"\t" '{if(NR==1) $1="NAME"FS$1}1' OFS="\t" \
| awk '{$1 = $1 OFS (NR==1?"Description":"na")}1' \
| sed 's/ /\t/g' \
| awk -v LEN=$LEN -v SAMPLES=$SAMPLES \
 'BEGIN{print "#1.2""\n"LEN"\t"SAMPLES}1' > ${OUT}.gct
ADD COMMENT
0
Entering edit mode

Thanks! It's really helpful. I tried this script and it works but the row names and column names in the output file contains quote for each gene/cell. How to remove the quotes as GSEA gene set doesn't use quote.

ADD REPLY
1
Entering edit mode

When saving the normalized counts in R set quote = FALSE .

Or just remove them using sed sed 's/"//g' your_file.gct > no_quotes.gct

ADD REPLY
1
Entering edit mode
7.2 years ago

You are skipping an awful lot of steps of your analysis. Your fastq data first needs to be aligned to the genome, reads need to be counted and you need to perform differential expression analysis. The results of the last step can be used in GSEA.

ADD COMMENT
0
Entering edit mode

Thank you so much! I have worked it out, what you said is really true. It was my first time working on NGS data. I finally found out the sequencing company actually processed the data somewhat, such as alignment and reads counting.

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6