I am trying my hand at RNA-Seq analysis using the pipeline TopHat -> HTSeq -> edgeR.
After TopHat, it is recommended to convert the BAM file to SAM and then sort it.
This is how I converted:
# convert bam file to sam file
samtools view -h -o out.sam in.bam
and this is how I sorted:
sort -s -k my_file.sam > my_file_sorted.sam
However, my sorted SAM file kept giving errors when run through HTSeq, such as an error when reading the SAM/BAM file, raised in count.py:84; for another file it said that 'seq' and 'qualstr' do not have the same length.
My HTSeq version is up to date, and I have PySam installed.
But when I ran unsorted BAM or SAM files through HTSeq, they were processed.
What is the significance of sorting a file, and why does HTSeq process unsorted files?
My HTSeq command is as follows:
nohup time -p python -m HTSeq.scripts.count -f bam -s yes --idattr=ID hits.bam anno.gff3 &>mylog &
The only time I got an error for an unsorted file was when it contained both paired-end and single-end reads in one file, which I separated using:
samtools view -bf 1 foo.bam > pair.bam
samtools view -bF 1 foo.bam > single.bam
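
For reference on the sorting step: `samtools view -h` writes the header lines (those starting with `@`) into the SAM file, and a plain `sort` treats them like any other line, so they can end up mixed in with the alignments. The following is only a minimal sketch on a toy file, assuming the intended sort key was the read name in column 1 (`-k 1,1` is an assumption, as are the file names `toy.sam` and `toy_sorted.sam`). It keeps the header at the top and sorts only the alignment lines:

```shell
# Build a toy SAM file: two header lines, then three tab-separated alignment lines.
# All names and values here are made up for illustration.
printf '@HD\tVN:1.0\tSO:unsorted\n' > toy.sam
printf '@SQ\tSN:chr1\tLN:1000\n' >> toy.sam
printf 'readC\t0\tchr1\t30\t50\t4M\t*\t0\t0\tACGT\tIIII\n' >> toy.sam
printf 'readA\t0\tchr1\t10\t50\t4M\t*\t0\t0\tACGT\tIIII\n' >> toy.sam
printf 'readB\t0\tchr1\t20\t50\t4M\t*\t0\t0\tACGT\tIIII\n' >> toy.sam

# Copy the header lines (starting with '@') through unchanged, then
# stable-sort only the alignment lines by read name (column 1).
{
  grep '^@' toy.sam
  grep -v '^@' toy.sam | LC_ALL=C sort -s -k 1,1
} > toy_sorted.sam

cat toy_sorted.sam
```

Whether a misplaced header is what caused the errors above is not something this sketch establishes; it only shows one way to sort a SAM file by read name without disturbing the header.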