Question

Running SortMeRNA for bulk RNAseq data

0

Entering edit mode

4.8 years ago

Farah ▴ 80

Hello,

I need to run the below commands in my university cluster to be able to do SortMeRNA for my fastq.gz files from RNA-seq experiment. I have 18 paired-end fastq.gz files naming like 30786524-PBG_NAM_18_R1.fastq.gz & 30786524-PBG_NAM_18_R2.fastq.gz

R1 shows forward read and R2 is reverse read.

All of my fastq.gz files have different names but they are common in R1.fastq.gz and R2.fastq.gz

I should save the below chunk of code in nano:

#!/bin/bash    
READ_FW="$1"
READ_RV="$2"

FILEBASE=$(basename "${READ_FW/_1.fq.gz/}")

echo "Uncompressing FASTQ data of $FILEBASE"
gunzip "$READ_FW" "$READ_RV"

READ_FW="${READ_FW%.gz}"
READ_RV="${READ_RV%.gz}"

echo "Merging pairs of $FILEBASE"
merge-paired-reads.sh "$READ_FW" "$READ_RV" "${FILEBASE}_interleaved.fq"

echo "Running SortMeRNA for $FILEBASE"
sortmerna --ref $SORTMERNA_DB --reads "${FILEBASE}_interleaved.fq" --aligned \
"${FILEBASE}-rRNA-hits" --other  "${FILEBASE}-sortmerna" --log -a 16 \
-v --paired_in --fastx

echo "Unmerging SortMeRNA filtered pairs for $FILEBASE"
unmerge-paired-reads.sh "${FILEBASE}-sortmerna.fq" \
"${FILEBASE}-sortmerna_1.fq" "${FILEBASE}-sortmerna_2.fq"

echo "Doing cleanup for $FILEBASE"
gzip "$READ_FW" "$READ_RV" "${FILEBASE}-sortmerna_1.fq" \
"${FILEBASE}-sortmerna_2.fq" "${FILEBASE}-rRNA-hits.fq"
rm "${FILEBASE}_interleaved.fq" "${FILEBASE}-sortmerna.fq"

Then, I should use a loop to execute this script:

mkdir ~/sortmerna
cd ~/sortmerna
find ../raw -name "*.fq.gz"  | sort | head -n 32 | while read READ_FW
do 
  read READ_RV
  bash ../runSortMeRNA.sh $READ_FW $READ_RV
done

As I am new in programming, may I know what is FILEBASE? Should I specify a name or a path for it?

Also, should I keep ${READ_FW/_1.fq.gz/}, READ_FW="$1" and READ_RV="$2" exactly like this or adapt them based on my data?

What about find ../raw -name "*.fq.gz" and bash ../runSortMeRNA.sh? Which parts of the codes I should adapt?

I would highly appreciate your help.

Best wishes,
Farah

sortmerna RNA-Seq • 2.8k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 4.8 years ago by Farah ▴ 80

0

Entering edit mode

Can you explain the role of --paired_in and --paired_out ? I am still confused about these two options after reading the usage. Thanks.

ADD REPLY • link 4.2 years ago by Shicheng Guo ★ 9.4k

score 1 · Answer 1 · 2019-07-04

1

Entering edit mode

4.8 years ago

Carambakaracho ★ 3.2k

Also, should I keep "${READ_FW/_1.fq.gz/}" , READ_FW="$1" and READ_RV="$2" exactly like this or adapt them based on my data?

You may need to adapt the _1.fq.gz part, it matches only forward read files ending in _1.fq.gz but doesn't work for _R1.fq.gz or _1.fastq.gz

What about find ../raw -name "*.fq.gz" and bash ../runSortMeRNA.sh ? which parts of the codes I should adapt?

../raw needs to be adapted to the path wherever your gzipped fastq files are. Same for ../runSortMeRNA.sh

ADD COMMENT • link 4.8 years ago by Carambakaracho ★ 3.2k

0

Entering edit mode

OK. Many thanks for your help.

ADD REPLY • link 4.8 years ago by Farah ▴ 80

score 0 · Answer 2 · 2019-07-04

With FILEBASE=$(basename "${READ_FW/_1.fq.gz/}") you define the new variable called $FILEBASE as a character string using the file name of the respective READ_FW file, but without the ending "_1.fq.gz" and without the whole filepath which originally is in the filename too.

The endings file endings should be the same in the script.

The script you have is written, so that you don't have to adjust it every time.