weird STAR output, bash scripting problem
2
0
Entering edit mode
5.3 years ago

Hi all. I am trying to align PE reads with STAR. I have trimmed files (used trimmomatic) that look like *_1.1P.fastq.gz *_1.1U.fastq.gz *_2.2P.fastq.gz *_2.2U.fastq.gz . I only want to align the*_1.1P.fastq.gzand*_2.2P.fastq.gz files. Here is my code:

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq) do STAR --genomeDir /indices/human --readFilesIn${i}.fastq.gz {i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/{i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat

done


When I run this, STAR seems to be aligning each _1 and _2 read separately instead of treating them as paired end reads. For instance, my output files look like this:

filtered.ABC4454232_1.1P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam filtered.ABC4454232_2.2P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

It should just be outputing one .bam file for this sample, such as. filtered.ABC4454232.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

Any ideas what is wrong? Thanks!

RNA-Seq STAR bash • 2.5k views
2
Entering edit mode
--readFilesIn ${i}.fastq.gz${i}.fastq.gz


You are instructing STAR to use the same fastq twice for a given value of ${i} ADD REPLY 1 Entering edit mode Have you looked at what values i is getting for each iteration of the loop? On a different note: Are you just running these scripts (or did you write them yourself)? In general it would be simpler to do something like ls -1 *_1.1P.fastq.gz | sed 's/_1.1P.fastq.gz//' to grab the file names. ADD REPLY 0 Entering edit mode I found the script online but modified it to try to fit my data. echo$i spits out values such as these:

filtered.ABC4454232_1.1P
filtered.ABC4454232_2.2P
filtered.ABC4454233_1.1P
filtered.ABC4454233_2.2P
....


Which are the correct values, i guess i do not understand how to tell bash to differentiate between the _1.1P and its corresponding _2.2P

0
Entering edit mode

Have you tried basename that you were using in a different question you had asked yesterday to differentiate between the names?

0
Entering edit mode

Yes, I thought of that but not sure how to incorporate basename into my existing code.

3
Entering edit mode
5.3 years ago

Grab the file name, an then do like:

--readFilesIn ${i}_1.1P.fastq.gz${i}_2.2P.fastq.gz

0
Entering edit mode

Thank you, but when I try this the output .bam file keeps getting written over.

0
Entering edit mode

Can you show an example of a full name?

A good method of testing a loop is:

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq) do echo STAR --genomeDir /indices/human --readFilesIn${i}.fastq.gz {i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/{i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat
done


Using the echo the command will not run but you will see how it looks like and what you should tweak further.

0
Entering edit mode

Exactly, because at every iteration STAR will output the file with the same name, and in the end for the loop you will get the file from the last iteration. To avoid this you can rename final files at each iteration.

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq) do STAR --genomeDir /indices/human --readFilesIn${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz other_commands mv Aligned.out.bam Log_${i}.bam
mv Log.final.out Log_${i}.final.out mv Log.out Log_${i}.out
mv Log.progress.out Log_{i}.progress.out

done


Bear in mind that you'd need to rename all output files in each iteration, so that nothing would be overwritten by the next iteration.

0
Entering edit mode

Erm and what about just using ${i} as part of the --outFileNamePrefix? You are overcomplicating this by renaming files. ADD REPLY 0 Entering edit mode Yes, your option is definitely more clean. Nevertheless, the trick with renaming files might be also useful in other situations. ADD REPLY 1 Entering edit mode 5.3 years ago h.mon 34k Your current question is essentially a duplicate of your previous question, and thats the reason I argued your previous question was not a bioinformatics question: you are just manipulating bash variables, and the bioinformatics software you want to use is of no importance to this. In fact, you could use the same solution as before for this problem, instead of the cumbersome $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq):

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in *_1.1P.fastq.gz
do
base=${i%%_1.1P.fastq.gz} STAR --genomeDir /indices/human --readFilesIn${base}_1.1P.fastq.gz {base}_2.2P.fastq.gz \ --runThreadN 8 --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat \ --outFileNamePrefix /trimmed/alignments/{base}.cd177negtreg_tumor.Out
done


As I hinted in my previous answer (and as others explicitly suggested here), have a look at your variables before running your commands:

for i in *_1.1P.fastq.gz
do
base=${i%%_1.1P.fastq.gz} echo "file1 is${base}_1.1P.fastq.gz, file2 is {base}_2.2P.fastq.gz" echo "output will be /trimmed/alignments/{base}.cd177negtreg_tumor.Out"
done


P.S.: it is generally (though not unanimous) considered bad practice to parse the output os ls, the main reason being file names can have characters which will break your script (such as spaces) unless you account for them.