Question

Rnaseq: Can I Parallelize The Tophat Step ?

5

Entering edit mode

11.7 years ago

Pierre Lindenbaum 166k

My Sample has been sequenced in several lanes and casava has split the fastqs.

./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R1_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R1_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R1_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R1_001.fastq.gz

I'm learning RNASeq: for an exome sequencing I would align and sort each pair (R1/R2) of fastq on our cluster and merge the BAMs later with picard.

In RNASeq , for the tophat2 step, can I parallelize the same way (and how should I merge the result) ? or should I run tophat with all the fastqs, like this ? (I'm not even sure how to write the Fastq R1 and R2...)

tophat2  -p $(NUM_THREADS) -o tophat_out \
 ref \
 BruceBanner_AGTCAA_L005_R1_001.fastq.gz,BruceBanner_AGTCAA_L008_R1_001.fastq.gz,(...) \ 
 BruceBanner_AGTCAA_L005_R2_001.fastq.gz,BruceBanner_AGTCAA_L008_R2_001.fastq.gz,(...)

Note: @BioMickWatson said on twitter ( https://twitter.com/BioMickWatson/status/400385456207974401 )

"depends if you are using coverage in the calc of splice junctions"

do you have any pointers to this ?

rnaseq tophat • 4.1k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 11.7 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2013-11-12

2

Entering edit mode

11.7 years ago

Devon Ryan 105k

As @BioMickWatson suggested, this sort of depends on what you want to do with the data. If you're trying to find new splice junctions or anything else novel in terms of transcripts, then this splitting this up may not be ideal. If you're just interested in differential expression, are using mouse/human/or another more mature genome, and have a good annotation file (I would suggest using an Ensembl annotation), then splitting things on a cluster would be OK. If you're interested in finding novel splice forms or transcripts then you'll be shooting yourself in the foot by running things separately.

Having said that, give STAR a try if your cluster nodes have enough RAM. It provides enough of a speed boost that this question is moot.

BTW, yes, the command you wrote would be the correct specification. You could put your mind at ease by just concatenating things (remember that gzipped files needn't be decompressed to be concatenated, so this is pretty quick).

ADD COMMENT • link 11.7 years ago by Devon Ryan 105k

0

Entering edit mode

thank you. As a beginner I will stick to tophat for now :-)

In consequence, I shouldn't use tophat options like --rg-platform-unit (lane) isn't it ? Won't it break downstream analysis (e.g: picard or gatk ) ?

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Right, I would just AddOrReplaceReadGroups after you merge the results. Are you trying to call SNPs from RNAseq data? Otherwise, there's no reason to use gatk or even add the read groups (unless you need them to keep track of samples).

ADD REPLY • link 11.7 years ago by Devon Ryan 105k

0

Entering edit mode

no I'm just interested in differential expression for now. Thanks.

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Apologies for reviving this old thread, but I am currently in the same boat as the OP and would like to probe your answer a bit further. From what I understand, evidence for novel splice junctions comes from two sources: "split reads" (two segments from the same read mapped apart nearby on the same chromosome) and "coverage islands" (distinct regions of piled up reads in the alignment) which typically indicate exons. When reads are >75bp, tophat by default disables "coverage-search" (usage of the latter source of evidence), because of the additional time and memory requirements, and also because the reads are long enough that split reads becomes a sufficient source of evidence on its own. This has been discussed here and here with the consensus that coverage-based search is generally unnecessary.

With this said, would you still argue it is better to concatenate the fastq's prior to alignment? Wouldn't it be better to process separately to perform lane-specific QC on the alignments as well to check the results of downstream analysis?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by ethan.kaufman ▴ 380

1

Entering edit mode

There's rarely any lane-specific QC needed in RNAseq.

This is a matter of how much sensitivity is needed and how much sequencing is available. Using a coverage-based search will maximize sensitivity, which may or may not be that useful. While the documentation states that this isn't needed with longer reads, that's really an over simplification. This is needed whenever you have low coverage of an isoform and, thereby, may have poor spanning of a junction. With longer reads and higher coverage this becomes less of an issue. However, for the most lowly expressed novel isoforms this may still be needed.

ADD REPLY • link 10.3 years ago by Devon Ryan 105k

0

Entering edit mode

Appreciate you taking the time to respond. Cheers.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by ethan.kaufman ▴ 380