Entering edit mode
9 months ago
achanda
•
0
I am trying to generate count files from STAR aligned sequencing data from an RNA seq analysis. I am using HTSeq to do that with the ai of performing differential exon usage analysis using DEXSeq in R. I generated flattened GTF file using:
python ~/dexseq_prepare_annotation.py gencode.v44.annotation.gtf dexseq_prepared_annotation.gtf
Then I am running a slurm job:
#!/bin/bash
#SBATCH --job-name=htseq_count
#SBATCH --output=htseq_count_%j.out
#SBATCH --error=htseq_count_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128gb
#SBATCH --time=24:00:00
# Activate conda environment
source activate htseq_env37
# Navigate to the directory containing your data
cd ~/STAR_output
# Create a directory for count files
mkdir -p $HOME/countfiles
# Run dexseq_count.py for each sample
for sampleDir in */; do
sampleName=$(basename "$sampleDir")
bamFile="${sampleDir}/${sampleName}_Aligned.sortedByCoord.out.bam"
outputPath="$HOME/Arthur_countfiles/${sampleName}_counts.txt"
htseq-count -f bam -r pos -s no -a 10 -t exonic_part -i gene_id "$bamFile" $
done
However, I am now realizing the -i gene_id
part is wrong as it needs to be exon_id or similar. Any idea on what would be the correct attribute? Also, is there other edits required? For reference, the flattened gtf file is:
chr1 dexseq_prepare_annotation.py aggregate_gene 11869 14409 . + . gene_id "ENSG00000290825.1+ENSG00000223972.6"
chr1 dexseq_prepare_annotation.py exonic_part 11869 12009 . + . gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000456328.2"; exonic_part_number "001"
chr1 dexseq_prepare_annotation.py exonic_part 12010 12057 . + . gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "002"
chr1 dexseq_prepare_annotation.py exonic_part 12058 12178 . + . gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000456328.2"; exonic_part_number "003"
chr1 dexseq_prepare_annotation.py exonic_part 12179 12227 . + . gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "004"
chr1 dexseq_prepare_annotation.py exonic_part 12613 12697 . + . gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "005"
Your flattened GTF file looks fine (isn't it a
.gff
file??).As an alternative to
htseq-count
, you could use the python scripts packaged with theDEXseq
R package, which you already have if you useddexseq_prepare_annotation.py
. See this basic workflow: https://github.com/BarryDigby/GSE37001/tree/main?tab=readme-ov-file#differentially-expressed-exons