Question: Rnaseq: Can I Parallelize The Tophat Step ?
gravatar for Pierre Lindenbaum
6.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum126k wrote:

My Sample has been sequenced in several lanes and casava has split the fastqs.


I'm learning RNASeq: for an exome sequencing I would align and sort each pair (R1/R2) of fastq on our cluster and merge the BAMs later with picard.

In RNASeq , for the tophat2 step, can I parallelize the same way (and how should I merge the result) ? or should I run tophat with all the fastqs, like this ? (I'm not even sure how to write the Fastq R1 and R2...)

tophat2  -p $(NUM_THREADS) -o tophat_out \
 ref \
 BruceBanner_AGTCAA_L005_R1_001.fastq.gz,BruceBanner_AGTCAA_L008_R1_001.fastq.gz,(...) \ 

Note: @BioMickWatson said on twitter ( )

"depends if you are using coverage in the calc of splice junctions"

do you have any pointers to this ?

rnaseq tophat • 2.9k views
ADD COMMENTlink modified 5.0 years ago by Biostar ♦♦ 20 • written 6.3 years ago by Pierre Lindenbaum126k
gravatar for Devon Ryan
6.3 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

As @BioMickWatson suggested, this sort of depends on what you want to do with the data. If you're trying to find new splice junctions or anything else novel in terms of transcripts, then this splitting this up may not be ideal. If you're just interested in differential expression, are using mouse/human/or another more mature genome, and have a good annotation file (I would suggest using an Ensembl annotation), then splitting things on a cluster would be OK. If you're interested in finding novel splice forms or transcripts then you'll be shooting yourself in the foot by running things separately.

Having said that, give STAR a try if your cluster nodes have enough RAM. It provides enough of a speed boost that this question is moot.

BTW, yes, the command you wrote would be the correct specification. You could put your mind at ease by just concatenating things (remember that gzipped files needn't be decompressed to be concatenated, so this is pretty quick).

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Devon Ryan94k

thank you. As a beginner I will stick to tophat for now :-)

In consequence, I shouldn't use tophat options like --rg-platform-unit (lane) isn't it ? Won't it break downstream analysis (e.g: picard or gatk ) ?

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum126k

Right, I would just AddOrReplaceReadGroups after you merge the results. Are you trying to call SNPs from RNAseq data? Otherwise, there's no reason to use gatk or even add the read groups (unless you need them to keep track of samples).

ADD REPLYlink written 6.3 years ago by Devon Ryan94k

no I'm just interested in differential expression for now. Thanks.

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum126k

Apologies for reviving this old thread, but I am currently in the same boat as the OP and would like to probe your answer a bit further.  From what I understand, evidence for novel splice junctions comes from two sources: "split reads" (two segments from the same read mapped apart nearby on the same chromosome) and "coverage islands" (distinct regions of piled up reads in the alignment) which typically indicate exons.  When reads are >75bp, tophat by default disables "coverage-search" (usage of the latter source of evidence), because of the additional time and memory requirements, and also because the reads are long enough that split reads becomes a sufficient source of evidence on its own.  This has been discussed here and Coverage-Based Search In Tophat with the consensus that coverage-based search is generally unnecessary.

With this said, would you still argue it is better to concatenate the fastq's prior to alignment?  Wouldn't it be better to process separately to perform lane-specific QC on the alignments as well to check the results of downstream analysis?

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by ethan.kaufman360

There's rarely any lane-specific QC needed in RNAseq.

This is a matter of how much sensitivity is needed and how much sequencing is available. Using a coverage-based search will maximize sensitivity, which may or may not be that useful. While the documentation states that this isn't needed with longer reads, that's really an over simplification. This is needed whenever you have low coverage of an isoform and, thereby, may have poor spanning of a junction. With longer reads and higher coverage this becomes less of an issue. However, for the most lowly expressed novel isoforms this may still be needed.

ADD REPLYlink written 4.9 years ago by Devon Ryan94k

Appreciate you taking the time to respond.  Cheers.

ADD REPLYlink written 4.9 years ago by ethan.kaufman360
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1415 users visited in the last hour