Question: Running Cufflinks with input from HISAT2
dovah wrote (2.7 years ago):

Hi all,

I read in the Cufflinks manual that an input SAM file coming from a mapper other than TopHat must be sorted this way:

sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted

However, it takes ages and eventually crashes the server we run our calculations on. Is there any way to get the ordering Cufflinks wants without this sort step? I tried samtools sort, but apparently its output is not what Cufflinks needs.
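For reference, here is a minimal sketch of how the same ordering can be produced with fewer surprises, assuming GNU sort. It drops the header lines (which a plain sort would otherwise shuffle in among the alignments), forces byte-wise comparison with LC_ALL=C, and sets the memory buffer (-S) and the temporary directory (-T) explicitly. The toy file, the 1M buffer, and the directory names are made up for illustration; on real data you would pick a much larger buffer and a scratch disk with plenty of room, and GNU sort's --parallel option may also help. Whether Cufflinks accepts a headerless file in your setup is something to verify on a small subset first.

```shell
# Toy stand-in for hits.sam: header lines plus four unsorted alignments (made-up data).
workdir=$(mktemp -d)
tab=$(printf '\t')
{
  printf '@SQ\tSN:2L\tLN:23513712\n'
  printf '@SQ\tSN:3R\tLN:32079331\n'
  printf 'r1\t0\t3R\t16054001\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t0\t2L\t500\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r3\t0\t3R\t16053980\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r4\t0\t2L\t90\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} > "$workdir/hits.sam"

# Strip header lines (they start with '@'), then sort by chromosome name
# (field 3, byte order) and by position (field 4, numeric).
# -S sets the in-memory buffer and -T the directory for temporary files;
# both values here are placeholders sized for the toy input.
grep -v '^@' "$workdir/hits.sam" \
  | LC_ALL=C sort -t "$tab" -k 3,3 -k 4,4n -S 1M -T "$workdir" \
  > "$workdir/hits.sam.sorted"

# Show read name, chromosome and position of the sorted alignments.
cut -f 1,3,4 "$workdir/hits.sam.sorted"
```

The explicit tab separator keeps the sort keys limited to the chromosome and position columns, whatever the other fields contain.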

I also have alignments from STAR and MapSplice, and both are apparently also too big for sort to handle the way Cufflinks wants. If you are wondering why I am not using TopHat for the alignment: you probably can't imagine how slow it is. :P

If you know a valid alternative to Cufflinks, I am also open to other software.

Thanks in advance!

modified 2.7 years ago • written 2.7 years ago by dovah

And how many reads did you sequence? :-D

You ask for an alternative: what are you aiming to achieve?

written 2.7 years ago by WouterDeCoster

Hi! I have 251929648 reads (x2 because they are paired-end). I am using Cufflinks for reference-based transcriptome assembly from RNA-seq reads, which are aligned to the reference genome and currently in SAM/BAM format. The genome was indexed using the reference GTF file. In parallel, I am also running de novo transcriptome assembly with Trinity. The final aim is to quantify the isoforms in this dataset.

modified 2.7 years ago • written 2.7 years ago by dovah

Well, that is fairly deep sequencing and might indeed cause you some memory issues. Since HISAT is from the same group as TopHat, my guess is that its output might work with Cufflinks as is. You do not state it explicitly, but I assume you tested without sorting? And do you have enough memory and disk space on your machine? Sorting creates temporary files...
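On the temporary files: GNU sort writes them to the directory given with -T, or to $TMPDIR, or to /tmp if neither is set, and /tmp is often a small partition even on machines with terabytes of data storage. A quick check, as a sketch only (the variable names are mine, and it assumes a POSIX df):

```shell
# Directory sort will use for temporary files when -T is not given.
sort_tmp="${TMPDIR:-/tmp}"

# Free space there in KiB; -P forces the one-line-per-filesystem POSIX layout,
# where column 4 is the available space.
avail_kb=$(df -Pk "$sort_tmp" | awk 'NR == 2 { print $4 }')

echo "sort temp dir: $sort_tmp, free: ${avail_kb} KiB"
```

If the free space is smaller than the SAM file being sorted, pointing -T at a larger scratch disk is worth trying.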

written 2.7 years ago by WouterDeCoster

I tried with and without sorting (if by that you mean samtools sort). My command was:

cufflinks --GTF-guide Drosophila_melanogaster.BDGP6.84.gtf --library-type fr-firststrand hisat2_alignment.sam

If I do not sort the input file, I get this error:

You are using Cufflinks v2.2.1, which is the most recent release.
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File hisat2_alignment.sam doesn't appear to be a valid BAM file, trying SAM...
[14:06:41] Loading reference annotation.
[14:06:49] Inspecting reads and determining fragment length distribution.
> Processing Locus 211000022279132:0-1005      [                         ]   2%
Error: this SAM file doesn't appear to be correctly sorted!
    current hit is at 3R_dna:chromosome_chromosome:BDGP6:3R:1:32079331:1:16053980, last one was at 3R_dna:chromosome_chromosome:BDGP6:3R:1:32079331:1:16054001
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names 
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.
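
The condition in that message for files with SQ records (chromosomes appear in the same order as the @SQ header lines, and positions never decrease within a chromosome) can be tested before launching Cufflinks. Below is a rough sketch using awk; check_sam_order is a hypothetical helper name, the two toy files are made up, and the check reflects my reading of the message rather than Cufflinks' actual code.

```shell
# Report whether a SAM file with @SQ lines is ordered the way the message describes.
# Prints "sorted" and returns 0, or prints the first offending line and returns 1.
check_sam_order() {
  awk -F '\t' '
    # Remember the rank of each chromosome from the order of the @SQ lines.
    /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) rank[substr($i, 4)] = ++n; next }
    /^@/   { next }          # other header lines
    $3 == "*" { next }       # reads without a reference name
    {
      if (!($3 in rank)) {
        print "line " NR ": " $3 " is not in the @SQ header"; bad = 1; exit
      }
      if (seen && (rank[$3] < last_rank || (rank[$3] == last_rank && $4 + 0 < last_pos))) {
        print "line " NR ": " $3 ":" $4 " comes after " last_name ":" last_pos; bad = 1; exit
      }
      seen = 1; last_rank = rank[$3]; last_name = $3; last_pos = $4 + 0
    }
    END { if (!bad) print "sorted"; exit bad + 0 }
  ' "$1"
}

# Toy inputs (made-up reads; the two positions are the ones from the error above).
demo=$(mktemp -d)
printf '@SQ\tSN:2L\tLN:23513712\n@SQ\tSN:3R\tLN:32079331\n' > "$demo/header.sam"
{ cat "$demo/header.sam"
  printf 'r1\t0\t3R\t16054001\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t0\t3R\t16053980\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} > "$demo/unsorted.sam"
{ cat "$demo/header.sam"
  printf 'r0\t0\t2L\t90\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t0\t3R\t16053980\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r1\t0\t3R\t16054001\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} > "$demo/sorted.sam"

check_sam_order "$demo/sorted.sam"
check_sam_order "$demo/unsorted.sam" || echo "needs sorting"
```

Running it on the first few million lines of a real file (for example through head) gives a quick answer without reading the whole alignment.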

I have 3 TB of free disk space on the cluster, and RAM is not an issue either (~250 GB available).

modified 2.7 years ago • written 2.7 years ago by dovah

I actually meant the file just as produced by HISAT. With that amount of RAM and disk space I wouldn't expect a server to go down quickly. Have you monitored why it crashed?

written 2.7 years ago by WouterDeCoster