Aligning multiple fastq files with genome in one script/one line with STAR
2
0
Entering edit mode
3 days ago
j_eag • 0

Hi there! This is probably a VERY basic question but I don't have the best terminal skills so I'm struggling a little.

I want to apply what I wrote below for all my fastq scripts without doing a for loop or manually writing the code for each (ideally they all run in parallel, or at least in the same script):

STAR --runThreadN 12  --genomeDir /scratch/eg5/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eg51/trial2/trimmed_files/SRR12045658_1_trimmed.fq.gz --sjdbGTFfile /scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf  --outFileNamePrefix /scratch/eg51/trial2/align/results/STARresults --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate


I tried to use *.fq.gz, but that did not work and I couldn't specify the outFileNamePrefix as/ "* 

STAR --runThreadN 12  --genomeDir /scratch/eag519/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eag519/trial2/trimmed_files/*.fq.gz --sjdbGTFfile /scratch/eag519/trial2/align/annotation/genes/GRCm39.ens.gtf  --outFileNamePrefix /scratch/eag519/trial2/align/results/STARresults/* --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate


I also tried putting all the files in the same line but that resulted in the following error :

"SOLUTION:specify only one or two files in the --readFilesIn option. If file names contain spaces, use quotes: "file name" "

 STAR --runThreadN 12  --genomeDir /scratch/eag519/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eag519/trial2/trimmed_files/SRR12045658_1_trimmed.fq.gz, /scratch/eag519/trial2/trimmed_files/SRR12045661_1_trimmed.fq.gz, /scratch/eag519/trial2/trimmed_files/SRR12045664_1_trimmed.fq.gz, /scratch/eag519/trial2/trimmed_files/SRR12045667_1_trimmed.fq.gz --sjdbGTFfile /scratch/eag519/trial2/align/annotation/genes/GRCm39.ens.gtf  --outFileNamePrefix /scratch/eag519/trial2/align/alignmentresults/STARresults --limitGenomeGenerateRAM 32000000000 --outBAMsortingThreadN 4 --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonicalUnannotated --outSAMattributes All --outSAMtype BAM SortedByCoordinate


Anyone have any ideas? ie how to enter all the files in , and then have them output with their own file names. Sorry I'm sure this is a very basic question .

SAM STAR bioinformatics rna-seq sequencing • 341 views
2
Entering edit mode
3 days ago

with parallel

parallel --dry-run STAR --runThreadN 12 --genomeDir /scratch/eg5/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eg51/trial2/trimmed_files/{} --sjdbGTFfile /scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf --outFileNamePrefix /scratch/eg51/trial2/align/results/STARresults --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate ::: *trimmed.fq.gz  This is a dry-run and common string in files is trimmed.fq.gz. Change the common string as per your requirements. with find:  find * -type f -name "*trimmed.fq.gz" -exec echo STAR --runThreadN 12  --genomeDir /scratch/eg5/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eg51/trial2/trimmed_files/{} --sjdbGTFfile /scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf  --outFileNamePrefix /scratch/eg51/trial2/align/results/STARresults --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate \;


Check your output file name prefixes before you execute the parallel command and mind the threads.

0
Entering edit mode

Thanks alot! It didn't work for me for some reason but I'm sure I'll figure it out.

0
Entering edit mode

It's a dry run for both parallel and find. Remove dry-run if you are using parallel and remove echo if you are using find command. You need to post what you meant by not working for eg. error logs etc.

1
Entering edit mode
3 days ago

First and foremost simplify your command with environment variables:

export IDX=/scratch/eg5/trial2/align/sequence/STARindex
export GTF=/scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf
export INP=/scratch/eg51/trial2/trimmed_files/
export RES=/scratch/eg51/trial2/align/results/STARresults


now the command looks more manageable and reusable:

STAR --runThreadN 12  --genomeDir $IDX --readFilesCommand gunzip -c --readFilesIn$INP/SRR12045658_1_trimmed.fq.gz --sjdbGTFfile\
$GTF --outFileNamePrefix$RES --limitGenomeGenerateRAM 32000000000 \
--outSAMtype BAM SortedByCoordinate


Identify what changes in the command, suppose you want to replace $INP with SRR numbers, put those numbers into a file. Suppose the file contains: SRR12045658 SRR12045659  now write a parallel script that PRINTS to the screen  cat ids | parallel echo STAR --runThreadN 12 --genomeDir$IDX --readFilesCommand gunzip -c --readFilesIn $INP/{}_1_trimmed.fq.gz\ --sjdbGTFfile$GTF  --outFileNamePrefix $RES --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate  verify that the command it prints are valid. Now you can either remove the echo and directly run the commands, or even better, keep the echo have it write you a script:  cat ids | parallel echo STAR --runThreadN 12 --genomeDir$IDX --readFilesCommand gunzip -c --readFilesIn $INP/{}_1_trimmed.fq.gz\ --sjdbGTFfile$GTF  --outFileNamePrefix \$RES --limitGenomeGenerateRAM 32000000000 \
--outSAMtype BAM SortedByCoordinate > run.sh


you can now inspect the contents of run.sh and verify that each command is what you wanted, followed by running the script:

bash run.sh


finally add your variable to the start of run.sh so that the whole script contains all the information that is necessary for it to work.

0
Entering edit mode

Wow, thanks for the detailed explanation. I'll try this out. Sorry for the silly question but, what do you mean by add your variable to the start of run.sh ? Do you mean the ids?

0
Entering edit mode

I meant the section dealing with environment variables:

export IDX=/scratch/eg5/trial2/align/sequence/STARindex
export GTF=/scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf
export INP=/scratch/eg51/trial2/trimmed_files/
export RES=/scratch/eg51/trial2/align/results/STARresults


should be at the start of the script

0
Entering edit mode

Got it, yes I did that, thanks!

Would you by any chance know how to resolve this: I immediately receive an error for gunzip ( " align.sh: line 7: -c: command not found ") but everything continues to run (" ..... started STAR run"), and then after about 30 mins it says finished successfully but I don't actually have anything in the STARresults folder.

I checked the log file and it looks like everything should have worked ( though what do I know) yet all I have it is : STARresults STARresultsAligned.sortedByCoord.out.bam STARresultsLog.out STARresultsLog.progress.out STARresults_STARgenome STARresults_STARtmp

0
Entering edit mode

set -uex
`

at the top of the script, it will stop the run at the first error, plus it will print the commands as they get executed.

To troubleshoot run just one command (the first the fails) study it closely fix it up until it works. The devil is in the details. This