Question

HTSeq error: local variable 'NH' referenced before assignment

0

Entering edit mode

6.4 years ago

senowinski ▴ 30

I have some BAM files, which were trimmed using BBDUK:

java -ea -Xmx7174m -Xms7174m -cp jgi.BBDuk in=test.fastq.gz out=test.fastq.gz.trimmed_clean ref=polyA.fa.gz,truseq_rna.fa.gz k=13 ktrim=r useshortkmers=t mink=5 qtrim=r trimq=10 minlength=20

And aligned with STAR using the following example command:

STAR --genomeDir refgenome --readFilesIn test.fastq.gz.trimmed_clean --outFileNamePrefix test.fastq.gz.trimmed_clean_align --sjdbOverhang 73 --runThreadN 11 --outSAMtype BAM SortedByCoordinate  --twopassMode Basic --twopass1readsN -1 --outFilterScoreMinOverLread 0.5 --outFilterMatchNminOverLread 0.5

I sort and add on read groups (WTCHG_524616_7039_1) and after running ValidateSamFile I got a warning

java -jar /data/BCI-EvoCa2/jacob/Central/software/picard.jar ValidateSamFile I=output_test.sorted.bam MODE=SUMMARY

warning:

WARNING:MISSING_TAG_NM  7774993

which I fixed with:

samtools calmd -bAr test.sorted.bam ucsc.hg19.fasta > output_test.sorted.bam

However after when I run HTSeq to call counts using the following command:

python /data/BCI-EvoCa2/salpie/HTSeq-count-master/HTSeq/scripts/count.py -s no --format=sam output_test.sorted.bam /data/BCI-IBD/analysis/RNA/gencode.v19.annotation.gtf > test.sorted.sam.output

After the GTF file is read in I get the following error:

Error occured when processing SAM input (line 7461 of file output_test.sorted.sam):
  local variable 'NH' referenced before assignment
  [Exception type: UnboundLocalError, raised in count.py:201]

When I look convert the file to a sam and then rerun to find the line that's wrong in the sam file, it looks like this. Can anyone tell me what's wrong with it?

K00150:332:HVM5TBBXX:7:2124:9303:4532   16  chrM    221 255 43M10439N16M    *   0   0   GCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCCTAGTCTCAATCTCCAA FJJJFA7JJAJJJFFJJJJFJFFFJJFFFJJJFFJJJFFFFJJJJJJFF<-AJFF7<<A RG:Z:WTCHG_524616_7039_1    NH:i:1  HI:i:1  nM:i:1  AS:i:56 NM:i:1  MD:Z:42G16

Other lines look like this (before and after)

Line above

K00150:332:HVM5TBBXX:7:2124:19573:1930  16  chrM    221 255 52M *   0   0   GCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACA    EJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFFFAA    RG:Z:WTCHG_524616_7039_1    NH:i:1  HI:i:1  nM:i:0  AS:i:51 ZQ:Z:E@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   NM:i:0  MD:Z:52

Line below

K00150:332:HVM5TBBXX:7:2124:11353:4532  16  chrM    221 255 56M *   0   0   GCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAACCGCTTCCCACACAGA    EJJJJJFFJJJJJJJJJJJJJFJJJJJJFJJJJJJFFJ::JJJJ2/-/2FFAF<AA    RG:Z:WTCHG_524616_7039_1    NH:i:1  HI:i:1  nM:i:2  AS:i:51 ZQ:Z:E@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@PL@@@@X[][X@@@@@@@   NM:i:2  MD:Z:39G6T9

The 4th column in the line at fault reads "43M10439N16M" while the other okay lines read "52M " and "56M " is this causing the error and does anyone know how to fix this?

Thank you!

HTSeq RNA-Seq • 1.2k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 6.4 years ago by senowinski ▴ 30

score 1 · Answer 1 · 2019-02-12

This appears to be a bug in the version of HTSeq you're using. Your options are one of the following:

See if there's a newer version and upgrade (if not, please report this bug to the author).
Use a different tool, such as featureCounts from the subread package (this is much faster than htseq-count).
Rerun STAR using quant mode, which can produce BAM files and counts at the same time.

I personally prefer option (2), but to each their own.