Question

RNA-Seq Analysis: To sort or not to Sort?

0

Entering edit mode

8.4 years ago

kanika.151 ▴ 130

Hello,

I am trying my hand at RNA-Seq Analysis by going TopHat-> HTSeq-> edgeR

After TopHat, conversion of bam file to sam is recommended then to sort is recommended.

This is how I converted:

#convert bam file to sam file
samtools view -h -o out.sam in.bam

and this is how I sorted:

sort -s -k my_file.sam > my_file_sorted.sam

Although, my sorted.sam file kept on giving me errors while running through HTSeq such as error when reading sam/bam file raised in count.py:84 for another file it would say 'seq' and 'qualstr' do not have the same length.

My HTSeq versions are uptodate. Also, I have PySam installed.

But, when I ran unsorted bam files or sam files through HTSeq then they were getting processed.

I want to know the significance of sorting a file and also why is HTSeq processing unsorted files?

my HTSeq command is as follows:

nohup time -p python -m HTSeq.scripts.count -f bam -s yes --idattr=ID hits.bam anno.gff3 &>mylog &

The only time I got an error for unsorted file was when they had separate pair end and single end reads in one file which separated by using

samtools view -bf 1 foo.bam > pair.bam
samtools view -bF 1 foo.bam > single.bam

htseq sam rna-seq • 6.1k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.4 years ago by kanika.151 ▴ 130

Ram · Answer 1 · 2015-11-24

2

Entering edit mode

8.4 years ago

Sam ★ 4.7k

Why don't you use samtools sort directly?

Something like samtools sort sam > sort.sam

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.4 years ago by Sam ★ 4.7k

0

Entering edit mode

Okay. I will try that too. Can it be used to sort bam files too?

ADD REPLY • link 8.4 years ago by kanika.151 ▴ 130

0

Entering edit mode

Yes, it can. By the way, did you check if maybe your original files are already sorted? Some programmes sort files for you.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by mkulecka ▴ 360

0

Entering edit mode

All ready sorted? as in? we ran through NGS QC then through TopHat...I don't remember sorting them...

How do I check for it?

ADD REPLY • link 8.4 years ago by kanika.151 ▴ 130

3

Entering edit mode

cat your_file.sam | grep SO

If you get something like "SO:coordinate" then your files are already sorted.

With bam files:

samtools view -H your_bam_file.bam | grep SO

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by mkulecka ▴ 360

0

Entering edit mode

How does samtools work?

the command I used was:

samtools sort -n in.bam sort.sam

but it is giving me separate files in output:

sort.sam.001.bam
sort.sam.002.bam ...

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by kanika.151 ▴ 130

1

Entering edit mode

Hi !

You just need to wait. samtools sort first creates several sorted files that are merged in one final file at the end of the process. It can take some time.

ADD REPLY • link 8.4 years ago by Carlo Yague 8.7k

0

Entering edit mode

Yes it did! Thanks :)

ADD REPLY • link 8.4 years ago by kanika.151 ▴ 130

0

Entering edit mode

Hi Carlo Yague, would you know how long samtools sort usually take? I feel like it would depend on how big my bam file is (437 GB), but is >3 days normal? Should I not be worried as long as the number of temp files is increasing/ there are temp output files? Thank you!

ADD REPLY • link 3.6 years ago by GB • 0

0

Entering edit mode

Hi, to be honest, I never tried to sort a file that large, but yeah, the larger the file, the longer it takes... In such a case, it becomes especially important to optimize the -@ and -m parameters of samtools (number of threads and available memory per thread). Hope this helps. Carlo.

ADD REPLY • link 3.6 years ago by Carlo Yague 8.7k

score 1 · Answer 2 · 2015-11-24

To properly account for paired reads, htseq-count expects them to be sorted, either by name (-r name parameter, the default for htseq-count) or by position (-r pos). There is a bug with -r pos which causes htseq-count to crash, I think it is still unfixed.

I think htseq-count does not check if the reads are sorted, if you run it on an unsorted file (or sorted by position), it will not account for read pairing and you will get approximately double counts per feature - a bit less, because sometimes just one read of the pair maps. When I did this mistake, I got between 1.8x and 1.9x more counts per gene, on average.

P.S. edit: for a RNAseq differential expression analysis, you should consider Subread+featureCounts, or STAR with --quantMode GeneCounts, both replacing TopHat+HTSeq, then proceeding to edgeR as before.