bash script for RNA seq alignment with STAR
3
4
Entering edit mode
4.8 years ago
nicole p ▴ 40

Hi all,

I need to come up with some kind of automated script that will align each sample to the my reference genome. I am using STAR to align reads. I have reads that look something like this (all paired end):

SRR112312_1.fastq.gz 
SRR112312_2.fastq.gz
SRR112313_1.fastq.gz
SRR112313_2.fastq.gz
SRR112372_1.fastq.gz
SRR112372_2.fastq.gz
...

I have tried with code like this:

for i in $(ls *.fastq.gz | rev | cut -c 11- | rev | uniq)

My typical STAR command looks like this:

STAR --genomeDir /indices/human --readFilesIn SRR112312_1.fastq.gz SRR112312_2.fastq.gz --runThreadN 8 --outFileNamePrefix /alignment/treg_NBP_8_ --quantMode GeneCounts --readFilesCommand zcat

I tried using the answer from this post: bash loop for alignment RNA-seq data but im not having any luck. I would like to align each PE read to the genome to get a BAM file. In this example, there would be 3 BAM files. Also, all of my reads end as in _1.fastq.gz and _2.fastq.gz

If anyone can offer any help that would be much appreciated! Just jumping into scripting and this has been giving me a hard time. Thanks!

script bash RNA-Seq STAR • 12k views
ADD COMMENT
5
Entering edit mode
4.8 years ago

This is the script I am currently using, with my default settings, so it can be called just as: star-wrapper.sh reads_1.fastq.gz reads_2.fastq.gz

#!/bin/sh
set -eu

## this is a wrapper for mapping single end and paired end data

STAR=`which STAR` # adapt to your needs
### lsalmonis genome for 76bp reads
DBDIR=licebase/genomedata/lsal76ribo

BASE=`basename $1`
DNAME=`dirname $1`
#BASE=VOD/$BASE
#ZCAT=
ZCAT="--readFilesCommand zcat"
THREADS=140
OPTS_P1="--outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 100000000000 --genomeLoad LoadAndKeep --seedSearchStartLmax 8 --outFilterMultimapNmax 100 --outFilterMismatchNoverLmax 0.5 --outFileNamePrefix $DNAME/${BASE}"

if [ "$#" -ge 3  ]; then
   echo "illegal number of parameters"
   exit 1
fi

if [ "$#" -eq 2 ]; then
CMD="$STAR --runThreadN $THREADS --outBAMsortingThreadN 10 $OPTS_P1 $ZCAT  --genomeDir $DBDIR --readFilesIn $1 $2"
echo $CMD
$CMD
fi

if [ "$#" -eq 1 ]; then
CMD="$STAR --runThreadN $THREADS --outBAMsortingThreadN 10 $OPTS_P1 $ZCAT  --genomeDir $DBDIR --readFilesIn $1"
echo $CMD
$CMD
fi

Then I have a little wrapper:

#!/bin/sh
set -eu

for F in $@ ; do
  R1=`echo $F | sed s/_2/_1/`
  R2=`echo $F | sed s/_1/_2/`
  echo "Processing $R1 $R2"
  ./star-wrapper.sh $R1 $R2
  echo "Done"
done

which can be called like nohup doit.sh *_1.fastq.gz for paired end reads on all reads in the directory.

ADD COMMENT
2
Entering edit mode
4.7 years ago
Sirus ▴ 810

You can to use bpipe, to automate and parallelize your pipeline.

For example, you can create a file STAR_mapping.groovy then write your script as follow (supposing that you already have done the adaptor trimming and quality control)

runSTAR = {
   def prefix="/alignment/treg_NBP_8_ "
   exec """
   STAR --genomeDir /indices/human 
   --readFilesIn $inputs
   --runThreadN 8 
   --outFileNamePrefix ${prefix}
   --quantMode GeneCounts 
   --readFilesCommand zcat
   """
 }

 run {
   "%_*.fastq.gz " * [runSTAR]
 }

Then you can run it using the following command:

bpipe run -r -n 2 STAR_mapping.groovy *.fastq.gz

Here: - r : generate an HTML report / documentation for the pipeline -n : how many parallel job you allow to run

Here -n 2 means, you'll be running two STAR mapping jobs in the same time. In your script, each job will need 8 threads. In total you'll be using 2*8=16 threads. Also keep in mind that STAR uses about 32Gb, so here you'll need 32 *2 =64Gb.

you can set -n 1 to allow just one job at a time if you don't have too much resources.

ADD COMMENT
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 2141 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6