Question: STAR align multiple files
ta_awwad230 wrote:

Hi everybody, I am aligning 36 paired-end samples with STAR. To make the task a little easier, I wrote a bash loop to align them all with the same command. Here is my loop:

for i in $(ls raw_data); do STAR --genomeDir index.150 \
--readFilesIn raw_data/$i\_1.fq.gz,raw_data/$i\_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

But something seems to be wrong: the alignment ran overnight and still had not finished.

Any recommendations?

Thanks very much.

rna-seq star chip-seq alignment • 5.4k views
ADD COMMENTlink modified 19 months ago by Bog20 • written 2.2 years ago by ta_awwad230
4

For 36 samples, you could speed things up by loading the index into shared memory once, and unloading it when mapping is finished:

STAR --genomeLoad LoadAndExit --genomeDir index.150

for i in $(ls raw_data | sed 's/_[12]\.fq\.gz$//' | sort -u)
do
    STAR [...]
done

STAR --genomeLoad Remove --genomeDir index.150
ADD REPLYlink written 2.2 years ago by h.mon28k

Thank you all for this priceless info.

ADD REPLYlink written 2.2 years ago by ta_awwad230

Hi h.mon,

Could you tell me what is the purpose of index.150 here? Can we just type the location of the genome after --genomeDir?

ADD REPLYlink written 9 months ago by c_u140
1

Yes. In the example given, index.150 is the name of the index from the original question. Replace it with yours.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax75k

If you load the genome before the for loop using:

STAR --genomeLoad LoadAndExit --genomeDir genomeDIR

do you still need to specify the --genomeDir parameter in the loop? I tried leaving it out, and STAR failed to run. Then I tried specifying the genome directory in the loop (even though the genome is loaded before the for loop), and it looks like each iteration of the loop is still loading the genome.

Can someone explain how to properly load the genome for multiple samples so that the loop is not iteratively loading it, please?

ADD REPLYlink written 7 months ago by ricardo388920

First you load the genome using --genomeDir $GENOMEDIR --genomeLoad LoadAndExit. For your alignment(s) you need --genomeDir $GENOMEDIR --genomeLoad LoadAndKeep
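
A minimal sketch of the whole pattern (load once, align with LoadAndKeep, remove at the end), assuming the index.150 directory and the raw_data/aligned layout from the original question. The samples file (one sample prefix per line), the run helper and the DRY_RUN variable are assumptions of this sketch, not STAR features; by default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: load the STAR index into shared memory once, align every sample
# against the loaded copy, then release the memory.
# Assumptions (not from the thread): a text file with one sample prefix per
# line, and DRY_RUN=1 (the default here) which only prints the commands.
GENOMEDIR=${GENOMEDIR:-index.150}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

align_all() {
    local samples_file=$1 sample
    # Load the index into shared memory and exit immediately.
    run STAR --genomeDir "$GENOMEDIR" --genomeLoad LoadAndExit
    # Read sample names on file descriptor 3 so STAR cannot touch the list.
    while IFS= read -r sample <&3; do
        [ -n "$sample" ] || continue
        # --genomeDir is still required; LoadAndKeep reuses the loaded index.
        run STAR --genomeDir "$GENOMEDIR" --genomeLoad LoadAndKeep \
            --readFilesIn "raw_data/${sample}_1.fq.gz" "raw_data/${sample}_2.fq.gz" \
            --readFilesCommand zcat \
            --runThreadN 20 \
            --outFileNamePrefix "aligned/${sample}."
    done 3< "$samples_file"
    # Free the shared memory once all samples are done.
    run STAR --genomeDir "$GENOMEDIR" --genomeLoad Remove
}
```

To try it, source the file and call align_all samples.txt; set DRY_RUN=0 to actually run STAR. The sorting, quantification and GTF options from the question are left out here: as far as I know, STAR restricts some of them when the genome sits in shared memory (for example, sorted BAM output needs an explicit --limitBAMsortRAM, and --sjdbGTFfile at the mapping step is rejected), so check the manual for your version before adding them back.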

ADD REPLYlink written 7 months ago by WouterDeCoster42k
2

When looping, test if your code is valid by adding an echo statement to see what the command is going to be:

for i in $(ls raw_data); do echo STAR --genomeDir index.150 \
--readFilesIn raw_data/$i\_1.fq.gz,raw_data/$i\_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

My guess is that the files raw_data/$i_1.fq.gz don't exist, because $i is taken directly from the listing of raw_data and so already contains the full file name.

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

Thanks very much for your reply, WouterDeCoster. I ran your code and got this:

STAR --genomeDir /index.150 --readFilesIn raw_data/KO_day3_1_1.fq.gz_1.fq.gz raw_data/KO_day3_1_1.fq.gz_2.fq.gz --runThreadN 20 --outFileNamePrefix aligned/KO_day3_1_1.fq.gz. --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile GRCm38.90.gtf --readFilesCommand zcat

You are right. The file names come out wrong.

Any suggestions for correcting this?

Thanks very much.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by ta_awwad230
1

Can you show a few examples of filenames of the fq.gz files?

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k
KO_day3_1_1.fq.gz
KO_day4_1_2.fq.gz
mESC_KO_3_1.fq.gz
mESC_KO_3_2.fq.gz
mESC_Wt3_1.fq.gz
mESC_Wt3_2.fq.gz
PG_4WT10_07_17_1.fq.gz
PG_4WT10_07_17_2.fq.gz
PG_7Swht16_07_17_1.fq.gz
PG_7Swht16_07_17_2.fq.gz
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by ta_awwad230
2

You could try something like:

# Paired-end mates are separated by a space in --readFilesIn; a comma would list files of the same mate.
for i in $(ls raw_data | sed 's/_[12]\.fq\.gz$//' | sort -u); do echo STAR --genomeDir index.150 \
--readFilesIn raw_data/${i}_1.fq.gz raw_data/${i}_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

I modified the $i to be shorter, and only keep unique hits since all samples will be in there twice.
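
As an alternative to parsing ls output, here is a small sketch that derives the sample prefixes with a glob and also checks that both mates exist. It assumes the naming shown above (sample_1.fq.gz and sample_2.fq.gz); the function name list_samples is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: print one line per sample prefix for paired files named
# <sample>_1.fq.gz / <sample>_2.fq.gz, warning on stderr when the
# second mate is missing.
list_samples() {
    local dir=$1 r1 sample
    for r1 in "$dir"/*_1.fq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        sample=$(basename "$r1" _1.fq.gz)     # strip directory and suffix
        if [ -e "$dir/${sample}_2.fq.gz" ]; then
            echo "$sample"
        else
            echo "warning: no mate file for $sample" >&2
        fi
    done
}
```

The loop could then start with for i in $(list_samples raw_data), which keeps the rest of the command unchanged as long as the sample names contain no whitespace.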

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

Thanks very much. It is running now, but I am not sure how long it will take. I will let you know if everything runs fine.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by ta_awwad230

It looks like it is stuck: no progress for 30 minutes. Is that normal?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by ta_awwad230

You can have a look with (h)top to see if it's still working. Also, check if it's producing output files.
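
Another way to check, sketched below: STAR writes a Log.progress.out file next to its other outputs and updates it periodically while mapping. The sketch assumes the aligned/SAMPLE. prefix used in the loop above; the function name show_progress is invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: print the most recent line of every STAR progress log in a
# directory, assuming files named <sample>.Log.progress.out.
show_progress() {
    local dir=${1:-aligned} log
    for log in "$dir"/*Log.progress.out; do
        # If the glob matched nothing, report it and stop.
        [ -e "$log" ] || { echo "no progress logs in $dir" >&2; return 1; }
        printf '%s: %s\n' "$(basename "$log" .Log.progress.out)" "$(tail -n 1 "$log")"
    done
}
```

Calling show_progress aligned a few minutes apart should show the read counts changing if STAR is still working; if no log exists yet, STAR may still be loading the genome.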

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

I think the problem was that STAR doesn't accept compressed files.

ADD REPLYlink written 2.2 years ago by ta_awwad230
1

It does accept them, but you need to specify: --readFilesCommand zcat

ADD REPLYlink written 2.2 years ago by Nicolas Rosewick8.5k

I did, and it did not work.

ADD REPLYlink written 2.2 years ago by ta_awwad230

Works just fine for me, use it all the time.

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

"It did not work" doesn't help us understand what went wrong. What is the error message? STAR does accept gz-compressed files.

ADD REPLYlink written 2.2 years ago by h.mon28k

It is just stuck: no error message, no progress.

ADD REPLYlink written 2.1 years ago by ta_awwad230
2

Try gunzip instead of zcat, i.e. --readFilesCommand gunzip -c. It works with that.
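
One way to find out which decompression command behaves on a given machine is to test the candidates on a single input file before starting the loop. This is only a sketch under the assumption that the inputs are gzip-compressed FASTQ files; pick_read_command is a made-up helper name, and zcat on some systems (macOS, for instance) expects .Z files and fails on .gz input.

```shell
#!/usr/bin/env bash
# Sketch: verify that a FASTQ file is an intact gzip archive, then print the
# first decompression command that can stream it to stdout.
pick_read_command() {
    local file=$1 cmd
    gzip -t "$file" 2>/dev/null || { echo "corrupt or not gzip: $file" >&2; return 1; }
    for cmd in "zcat" "gunzip -c" "gzip -cd"; do
        # $cmd is intentionally unquoted so that "gunzip -c" splits into words.
        # A FASTQ record starts with '@', so check the first decompressed line.
        if $cmd "$file" 2>/dev/null | head -n 1 | grep -q '^@'; then
            echo "$cmd"
            return 0
        fi
    done
    echo "no working decompression command for $file" >&2
    return 1
}
```

The printed command can be passed on as --readFilesCommand $(pick_read_command raw_data/KO_day3_1_1.fq.gz), left unquoted so that a two-word result such as gunzip -c reaches STAR as two arguments.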

ADD REPLYlink written 19 months ago by Bog20