Question: Using HTseq-count for Pseudomonas Aeruginosa data
0
gravatar for jspainhour3
15 months ago by
jspainhour310
jspainhour310 wrote:

I am trying to use HTseq to generate count data from a Pseudomonas Aeruginosa rna-seq sample. I end up with all of my output going to no feature.

__no_feature    6171314
__ambiguous 0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  478089

When I look at my gtf file:

##gff-version 2
##source-version rtracklayer 1.38.3
##date 2018-08-06
NC_002516   PseudoCAP   region  1   6264404 .   .   .   ID "NC_002516"; Name "Pseudomonas aeruginosa PAO1 NC_002516,complete genome."; Dbxref "refseq:NC_002516";
NC_002516   PseudoCAP   gene    483 2027    .   +   0   ID "gene134012"; Dbxref "GeneID:878417"; Alias "PA0001"; name "dnaA";
NC_002516   PseudoCAP   CDS 483 2027    .   +   0   ID "CDS134013"; name "chromosomal replication initiator protein DnaA"; Parent "gene134012"; locus "PA0001"
NC_002516   PseudoCAP   CDS 2056    3159    .   +   0   ID "CDS134019"; name "DNA polymerase III,beta chain"; Parent "gene134018"; locus "PA0002"
NC_002516   PseudoCAP   gene    2056    3159    .   +   0   ID "gene134018"; Dbxref "GeneID:879244"; Alias "PA0002"; name "dnaN";
NC_002516   PseudoCAP   CDS 3169    4278    .   +   0   ID "CDS134021"; name "RecF protein"; Parent "gene134020"; locus "PA0003"
NC_002516   PseudoCAP   gene    3169    4278    .   +   0   ID "gene134020"; Dbxref "GeneID:879229"; Alias "PA0003"; name "recF";

This gtf file was generated using rtracklayer from the gff file

##gff-version 3
##sequence-region chromosome 1 6264404
chromosome  PseudoCAP   region  1   6264404 .   .   .   ID=chromosome;Name=Pseudomonas aeruginosa PAO1 chromosome, complete genome.;Dbxref=refseq:NC_002516
chromosome  PseudoCAP   gene    483 2027    .   +   0   ID=gene134012;Alias=PA0001;name=dnaA;Dbxref=GeneID:878417
chromosome  PseudoCAP   CDS 483 2027    .   +   0   ID=CDS134013;Parent=gene134012;locus=PA0001;name=chromosomal replication initiator protein DnaA;
chromosome  PseudoCAP   CDS 2056    3159    .   +   0   ID=CDS134019;Parent=gene134018;locus=PA0002;name=DNA polymerase III, beta chain;
chromosome  PseudoCAP   gene    2056    3159    .   +   0   ID=gene134018;Alias=PA0002;name=dnaN;Dbxref=GeneID:879244
chromosome  PseudoCAP   CDS 3169    4278    .   +   0   ID=CDS134021;Parent=gene134020;locus=PA0003;name=RecF protein;
chromosome  PseudoCAP   gene    3169    4278    .   +   0   ID=gene134020;Alias=PA0003;name=recF;Dbxref=GeneID:879229

The GFF file and accompanying fasta file (Pseudomonas_aeruginosa_PAO1_107, both from ncbi) were used with Star 2.5.3 to generate the bam file. I changed the chromosome name from chromosome to NC_002516 so it would match the fasta file for star indexing. I used picard to deduplicate the bam file.

The GFF/GTF lists all entries as gene CDS or rna.

Does anyone have any suggestions on what I might be doing wrong or if there is some error in my formatting of the gtf file that may be causing this problem.

Thanks in advance for any advice.

rna-seq htseq-count • 583 views
ADD COMMENTlink modified 15 months ago • written 15 months ago by jspainhour310

Still working on the solution, thank you both very much.

ADD REPLYlink modified 15 months ago • written 15 months ago by jspainhour310

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLYlink written 15 months ago by genomax74k
1
gravatar for Ian
15 months ago by
Ian5.5k
University of Manchester, UK
Ian5.5k wrote:

When you run htseq-count check that the -i flag is set to ID (-i ID) and not 'gene_id', which is the GTF default.

ADD COMMENTlink written 15 months ago by Ian5.5k
1

And if you have only CDS features, you want to set -t CDS. Otherwise htseq-count is only looking for exon features.

ADD REPLYlink written 15 months ago by michael.ante3.5k

The output is the same except it now includes

CDS102784   0
CDS102786   0
CDS102788   0
CDS102792   0
CDS102794   0
CDS102796   0
CDS102798   0

But with the same final no_feature and alignment_not_unique outputs

My commands used are

htseq-count  -f bam -r pos -s no -i ID -t CDS my.bam my.gtf

I have also tried to use

--mode=intersection-nonempty

But it has had no effect on output. Am I misusing the comands? I have also looked at the bam files using IVG and there are counts found in the regions I list in the GTF file.

ADD REPLYlink modified 15 months ago • written 15 months ago by jspainhour310

What does the header for your BAM file look like? samtools view -H your.bam.

ADD REPLYlink written 15 months ago by genomax74k

The header looks like:

@HD     VN:1.4  SO:coordinate
@SQ     SN:NC_002516.2  LN:6264404
@PG     ID:STAR PN:STAR VN:STAR_2.5.3a  CL:STAR   --runThreadN 16   --genomeDir genome   --readFilesIn index13_AGTCAA_L002_R1_001.fastq   index13_AGTCAA_L002_R2_001.fastq      --outFileNamePrefix /home/d13/outputs/   --outSAMtype BAM   SortedByCoordinate      --outSAMstrandField intronMotif   --outSAMattributes NH   HI   AS   NM   MD      --outSAMunmapped Within
@PG     ID:MarkDuplicates       VN:1.131(cd60f90fdca902499c70a4472b6162ef37f919ce_1431022382)   CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/home/d13/in/sorted_bam/index13_AGTCAA_L002_R.genome.bam] OUTPUT=./index13_AGTCAA_L002_R.genome.deduplicated.bam METRICS_FILE=./index13_AGTCAA_L002_R.genome.duplication_metrics REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json  PN:MarkDuplicates
@CO     user command line: STAR --genomeDir genome --outSAMstrandField intronMotif --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outFileNamePrefix /home/d13/outputs/ --outSAMtype BAM SortedByCoordinate --runThreadN 16 --readFilesIn index13_AGTCAA_L002_R1_001.fastq index13_AGTCAA_L002_R2_001.fastq
ADD REPLYlink written 15 months ago by jspainhour310
1

Looks like you will have to add .2 to NC chromosome name in your annotation file. The name needs to exactly match between the files.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax74k

Thank you for your help and sharp eyes.

ADD REPLYlink written 15 months ago by jspainhour310

Hello,

May I ask how you changed the name from "chromosome" to NC_#" in your gtf file? I am getting 0 counts for my downstream htseq, and I realize I need to do the same thing to match the bam files.

Thank you!

ADD REPLYlink written 15 months ago by alexadeanfitz10

̶T̶h̶e̶ ̶c̶h̶r̶o̶m̶o̶s̶o̶m̶e̶ ̶n̶a̶m̶e̶ ̶i̶n̶ ̶t̶h̶e̶ ̶b̶a̶m̶ ̶f̶i̶l̶e̶ ̶i̶s̶ ̶b̶a̶s̶e̶d̶ ̶o̶n̶ ̶t̶h̶e̶ ̶f̶a̶s̶t̶a̶ ̶h̶e̶a̶d̶e̶r̶ ̶u̶s̶e̶d̶ ̶i̶n̶ ̶i̶n̶d̶e̶x̶ ̶g̶e̶n̶e̶r̶a̶t̶i̶o̶n̶.̶ ̶Y̶o̶u̶r̶ ̶G̶T̶F̶ ̶a̶n̶d̶ ̶y̶o̶u̶r̶ ̶F̶a̶s̶t̶a̶ ̶s̶h̶o̶u̶l̶d̶ ̶b̶e̶ ̶f̶r̶o̶m̶ ̶t̶h̶e̶ ̶s̶a̶m̶e̶ ̶s̶o̶u̶r̶c̶e̶.̶ ̶

It seems it was done by Rtracklayer - an R-tool for annotation manipulation see here

[EDIT] Mixing up things

ADD REPLYlink modified 15 months ago • written 15 months ago by michael.ante3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1267 users visited in the last hour