Question: weird STAR output, bash scripting problem
0
gravatar for nikelle.petrillo
13 months ago by
Providence College, Providence, RI
nikelle.petrillo80 wrote:

Hi all. I am trying to align PE reads with STAR. I have trimmed files (used trimmomatic) that look like *_1.1P.fastq.gz *_1.1U.fastq.gz *_2.2P.fastq.gz *_2.2U.fastq.gz . I only want to align the*_1.1P.fastq.gzand*_2.2P.fastq.gz files. Here is my code:

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)

do
STAR --genomeDir /indices/human --readFilesIn ${i}.fastq.gz ${i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/${i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat

done

When I run this, STAR seems to be aligning each _1 and _2 read separately instead of treating them as paired end reads. For instance, my output files look like this:

filtered.ABC4454232_1.1P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam filtered.ABC4454232_2.2P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

It should just be outputing one .bam file for this sample, such as. filtered.ABC4454232.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

Any ideas what is wrong? Thanks!

bash rna-seq star • 916 views
ADD COMMENTlink modified 13 months ago by h.mon19k • written 13 months ago by nikelle.petrillo80
2
--readFilesIn ${i}.fastq.gz ${i}.fastq.gz

You are instructing STAR to use the same fastq twice for a given value of ${i}

ADD REPLYlink written 13 months ago by WouterDeCoster32k
1

Have you looked at what values i is getting for each iteration of the loop?

On a different note: Are you just running these scripts (or did you write them yourself)?

In general it would be simpler to do something like ls -1 *_1.1P.fastq.gz | sed 's/_1.1P.fastq.gz//' to grab the file names.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax56k

I found the script online but modified it to try to fit my data. echo $i spits out values such as these:

filtered.ABC4454232_1.1P
filtered.ABC4454232_2.2P
filtered.ABC4454233_1.1P
filtered.ABC4454233_2.2P
....

Which are the correct values, i guess i do not understand how to tell bash to differentiate between the _1.1P and its corresponding _2.2P

ADD REPLYlink written 13 months ago by nikelle.petrillo80

Have you tried basename that you were using in a different question you had asked yesterday to differentiate between the names?

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax56k

Yes, I thought of that but not sure how to incorporate basename into my existing code.

ADD REPLYlink written 13 months ago by nikelle.petrillo80
3
gravatar for grant.hovhannisyan
13 months ago by
grant.hovhannisyan1.1k wrote:

Grab the file name, an then do like:

--readFilesIn ${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz
ADD COMMENTlink written 13 months ago by grant.hovhannisyan1.1k

Thank you, but when I try this the output .bam file keeps getting written over.

ADD REPLYlink written 13 months ago by nikelle.petrillo80

Can you show an example of a full name?

A good method of testing a loop is:

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)
do
echo STAR --genomeDir /indices/human --readFilesIn ${i}.fastq.gz ${i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/${i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat
done

Using the echo the command will not run but you will see how it looks like and what you should tweak further.

ADD REPLYlink written 13 months ago by WouterDeCoster32k

Exactly, because at every iteration STAR will output the file with the same name, and in the end for the loop you will get the file from the last iteration. To avoid this you can rename final files at each iteration.

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)
do
STAR --genomeDir /indices/human --readFilesIn ${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz other_commands 
mv Aligned.out.bam Log_${i}.bam
mv Log.final.out Log_${i}.final.out
mv Log.out Log_${i}.out
mv Log.progress.out Log_{i}.progress.out

done

Bear in mind that you'd need to rename all output files in each iteration, so that nothing would be overwritten by the next iteration.

ADD REPLYlink modified 13 months ago • written 13 months ago by grant.hovhannisyan1.1k

Erm and what about just using ${i} as part of the --outFileNamePrefix? You are overcomplicating this by renaming files.

ADD REPLYlink written 13 months ago by WouterDeCoster32k

Yes, your option is definitely more clean. Nevertheless, the trick with renaming files might be also useful in other situations.

ADD REPLYlink written 13 months ago by grant.hovhannisyan1.1k
1
gravatar for h.mon
13 months ago by
h.mon19k
Brazil
h.mon19k wrote:

Your current question is essentially a duplicate of your previous question, and thats the reason I argued your previous question was not a bioinformatics question: you are just manipulating bash variables, and the bioinformatics software you want to use is of no importance to this. In fact, you could use the same solution as before for this problem, instead of the cumbersome $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq):

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in *_1.1P.fastq.gz
do
  base=${i%%_1.1P.fastq.gz}
  STAR --genomeDir /indices/human --readFilesIn ${base}_1.1P.fastq.gz ${base}_2.2P.fastq.gz \
    --runThreadN 8 --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat \
    --outFileNamePrefix /trimmed/alignments/${base}.cd177negtreg_tumor.Out 
done

As I hinted in my previous answer (and as others explicitly suggested here), have a look at your variables before running your commands:

for i in *_1.1P.fastq.gz
do
  base=${i%%_1.1P.fastq.gz}
  echo "file1 is ${base}_1.1P.fastq.gz, file2 is ${base}_2.2P.fastq.gz"
  echo "output will be /trimmed/alignments/${base}.cd177negtreg_tumor.Out"
done

P.S.: it is generally (though not unanimous) considered bad practice to parse the output os ls, the main reason being file names can have characters which will break your script (such as spaces) unless you account for them.

ADD COMMENTlink modified 13 months ago • written 13 months ago by h.mon19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1024 users visited in the last hour