Help related to HTSEq usage specific for downstream DEXSeq analysis
0
0
Entering edit mode
9 months ago
achanda • 0

I am trying to generate count files from STAR aligned sequencing data from an RNA seq analysis. I am using HTSeq to do that with the ai of performing differential exon usage analysis using DEXSeq in R. I generated flattened GTF file using:

python ~/dexseq_prepare_annotation.py gencode.v44.annotation.gtf dexseq_prepared_annotation.gtf

Then I am running a slurm job:

#!/bin/bash

#SBATCH --job-name=htseq_count
#SBATCH --output=htseq_count_%j.out
#SBATCH --error=htseq_count_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128gb
#SBATCH --time=24:00:00

# Activate conda environment 
source activate htseq_env37

# Navigate to the directory containing your data
cd ~/STAR_output

# Create a directory for count files
mkdir -p $HOME/countfiles

# Run dexseq_count.py for each sample
for sampleDir in */; do
    sampleName=$(basename "$sampleDir")
    bamFile="${sampleDir}/${sampleName}_Aligned.sortedByCoord.out.bam"
    outputPath="$HOME/Arthur_countfiles/${sampleName}_counts.txt"
    htseq-count -f bam -r pos -s no -a 10 -t exonic_part -i gene_id "$bamFile" $
done

However, I am now realizing the -i gene_id part is wrong as it needs to be exon_id or similar. Any idea on what would be the correct attribute? Also, is there other edits required? For reference, the flattened gtf file is:

chr1    dexseq_prepare_annotation.py    aggregate_gene  11869   14409   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"
chr1    dexseq_prepare_annotation.py    exonic_part 11869   12009   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000456328.2"; exonic_part_number "001"
chr1    dexseq_prepare_annotation.py    exonic_part 12010   12057   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "002"
chr1    dexseq_prepare_annotation.py    exonic_part 12058   12178   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000456328.2"; exonic_part_number "003"
chr1    dexseq_prepare_annotation.py    exonic_part 12179   12227   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "004"
chr1    dexseq_prepare_annotation.py    exonic_part 12613   12697   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "005"
Splicing DEXSeq HTSeq • 446 views
ADD COMMENT
0
Entering edit mode

Your flattened GTF file looks fine (isn't it a .gff file??).

As an alternative to htseq-count, you could use the python scripts packaged with the DEXseq R package, which you already have if you used dexseq_prepare_annotation.py. See this basic workflow: https://github.com/BarryDigby/GSE37001/tree/main?tab=readme-ov-file#differentially-expressed-exons

ADD REPLY

Login before adding your answer.

Traffic: 1683 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6