Question

STAR Aligner Suddently changed output file???

1

Entering edit mode

6.8 years ago

jack.baines2 ▴ 10

Hi, I have been using STAR v2.5.2b to align RNA-Seq fastq.gz files and output as SAM. for some reason, without changing the .sh file, instead of outputting the SAM file into the specified directory, it places it into the designated directory/2pass/

All the appropriate files appear to be there, but I don't understand why this change has occurred and wonder whether the alignment has not successfully completed (although the .out file logs the job as successfully complete). More confusing is that this change appeared to happen mid array job with one of the samples outputting this way and then all subsequent files doing so.

I've compared the .sh script to earlier versions meticulously and can't see any reason for the difference.

Any idea why this has happened?

Script below:

#!/bin/bash
#$ -cwd -V
#$ -pe smp 20
#$ -l h_vmem=40G
#$ -N project1
#$ -hold_jid project_prep
#$ -o /logs/star.$JOB_ID.$TASK_ID.out 
#$ -e /logs/star.$JOB_ID.$TASK_ID.err 

module add apps/STAR/2.5.2b

source gatk_variables.sh

FASTQ1=$(ls ${WORK}/fastq/Sample_${SGE_TASK_ID}/*R1*)
FASTQ2=$(ls ${WORK}/fastq/Sample_${SGE_TASK_ID}/*R2*)

# 1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
# DO ONCE FOR ALL RUNS

GENOME=${WORK}/hg19

#mkdir ${GENOME}
#STAR --runMode genomeGenerate --genomeDir ${GENOME} \
#    --genomeFastaFiles ${HG19} \
#    --runThreadN ${NSLOTS}

# 2) Alignment jobs were executed as follows:

runDir=${TMPDIR}/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir ${GENOME} --readFilesIn ${FASTQ1} ${FASTQ2} \
    --readFilesCommand zcat --runThreadN ${NSLOTS}

# 3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=${TMPDIR}/hg19_2pass
mkdir ${genomeDir}
STAR --runMode genomeGenerate --genomeDir ${genomeDir} \
    --genomeFastaFiles ${HG19} --runThreadN ${NSLOTS} \
    --sjdbFileChrStartEnd ${TMPDIR}/1pass/SJ.out.tab --sjdbOverhang 75

# 4) The resulting index is then used to produce the final alignments as follows:

runDir=${TMPDIR}/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir ${genomeDir} --readFilesIn ${FASTQ1} ${FASTQ2} \
    --readFilesCommand zcat --runThreadN ${NSLOTS}

mv ${runDir} ${WORK}/bam/Sample_${SGE_TASK_ID}

STAR RNA UNIX ALIGNMENT 2PASS • 3.5k views

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 6.8 years ago by jack.baines2 ▴ 10

1

Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

It is really not clear what the problem is, you have there

runDir=${TMPDIR}/2pass
mkdir $runDir
cd $runDir

what is the problem really?

ADD REPLY • link 6.8 years ago by Istvan Albert 100k

0

Entering edit mode

Hi Istvan, Thank you for the reply.

The issue is that there has been a change in the behavior of the module which I can't account for. On one level this is simply frustrating as my workflow is disrupted and I have to edit my other scripts to account for this change of directory path.

However more of a concern, which I didn't previously allude to is that I am having trouble running data through my workflow more generally which up until recently I had no trouble with. It may or may not be connected to this, but if I can find out why this has happened it may help with my greater issues.

Best wishes

ADD REPLY • link 6.8 years ago by jack.baines2 ▴ 10

1

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

It is not clear if you have multiple samples or not, but for STAR 2.5.2b, the recommended method for 2pass mapping is described on sections 8.1 (multi-sample) and 8.2 (per sample). You are using an old method, described on section 8.3:

Since 2.4.1a, it is recommended to use the on the fly 2-pass options as described above.

You could avoid the perceived discrepancy using --outFileNamePrefix /path/to/${SGE_TASK_ID}, or something like it.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

Hi h.mon, thank you for replying.

As I said I am running multiple samples in an array and the change started by affecting one of the array jobs, then continuing to affect all subsequent runs. As the script says this is run once for all runs. Do you mean something else by multiple samples? Thank you for the advice though, I will read those sections.

I could definitely change the script to achieve different results, but my query is why has the result changed without changing the input.

Best Wishes

ADD REPLY • link 6.8 years ago by jack.baines2 ▴ 10