Question: STAR multiple FASTQ alignment, can you use with a variable $?
0
gravatar for science_lizard
3 months ago by
science_lizard0 wrote:

Super basic question, probably more related to programming than RNA-seq itself...

I have multiple FASTQs that I want to align (from 3 separate lanes to provide more sequencing depth). According to the STAR manual, I should be able to separate these with commas in between all read 1s, then space, then commas between all read 2s. Like this: L1_R1,L2_R1,L3_R1 L1_R2,L2_R2,L3_R2

I keep getting the error: gzip: /Sample_1/fastq/*L003_001.R2.fastq.gz: No such file or directory

When I bring it out to the terminal to check, this is what I see:

echo $FASTQ1 $FASTQ2
/Sample_1/fastq/Sample_1_TTAGGC_BCCRUJANXX_L003_001.R1.fastq.gz /Sample_1/fastq/Sample_1_TTAGGC_BCCRUJANXX_L003_001.R2.fastq.gz

but

echo $FASTQ1,$FASTQ2
/Sample_1/fastq/*L003_001.R1.fastq.gz,/Sample_1/fastq/*L003_001.R2.fastq.gz

So clearly it recognizes the variable name, but can't deal with the comma in between. Is there a way around this or do I need to provide full variable names in the STAR script (which would be pretty annoying, since I have many files like this)?

Thanks! Here's the code for reference (I've modified it slightly since the whole path names are long):

IN=/Sample_1
genomeDir=/STAR
GTF=/gencode.v19.chr_patch_hapl_scaff.annotation.gtf

FASTQ1=$IN/fastq/*L003_001.R1.fastq.gz
FASTQ2=$IN/fastq/*L003_001.R2.fastq.gz
FASTQ3=$IN/fastq/*L004_001.R1.fastq.gz
FASTQ4=$IN/fastq/*L004_001.R2.fastq.gz
FASTQ5=$IN/fastq/*L005_001.R1.fastq.gz
FASTQ6=$IN/fastq/*L005_001.R2.fastq.gz

STAR --runThreadN 8 --genomeDir $genomeDir --sjdbGTFfile $GTF --sjdbOverhang 149 --bamRemoveDuplicatesType UniqueIdentical --readFilesIn $FASTQ1,$FASTQ3,$FASTQ5 $FASTQ2,$FASTQ4,$FASTQ6 --twopassMode Basic --outSAMtype BAM SortedByCoordinate Unsorted --quantMode TranscriptomeSAM GeneCounts --readFilesCommand zcat
ADD COMMENTlink written 3 months ago by science_lizard0
1
IN=/Sample_1

Defining variable this way is making the OS look for that folder/file at the base of the root directory when that variable is expanded (which is reflected in error you get gzip: /Sample_1/fastq/*L003_001.R2.fastq.gz: No such file or directory. Do you actually have a folder Sample_1 just under /?

Other than that what is going on with * in *L003_001.R1.fastq.gz? What is the * supposed to indicate?

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax91k

Thanks for your reply! I just cut out the rest of the path name for my question online because it's super long, but I don't think that part of the variable is the issue. I think you are right that the wildcard is causing the problem, but I don't understand why. Every sample is named something like this in each respective fastq folder:

Sample_1_TTAGGC_BCCRUJANXX_L003_001.R2.fastq.gz

so I wanted to make a universal script that could get rid of the sample name and adapter sequence and take only the part:

*L003_001.R2.fastq.gz

Doing this to all my scripts seems to have fixed the problem, after I cd into the proper folder:

test=Sample_1_TTAGGC_BCCRUJANXX_
FASTQ1=fastq/${test}L003_001.R1.fastq.gz

But this requires me to fill out that variable for every sample, which seems like a silly way to do it. You're right that the wildcard is the issue, but why?

ADD REPLYlink written 3 months ago by science_lizard0
1

Try using ${FASTQ1},${FASTQ2} instead of $FASTQ1,$FASTQ2 - I'm not sure if it will work, but it's worth a shot. What would be better is to use something like

--readFilesIn \
    $(find Sample1/fastq/ -name "*L00*R1*.fastq.gz" | sort | tr "\n" ",") \
    $(find Sample1/fastq/ -name "*L00*R2*.fastq.gz" | sort | tr "\n" ",")

or something that uses ls instead of find if all FASTQ are in the same folder:

--readFilesIn \
    $(/bin/ls -1 Sample1/fastq/*L00*R1*.fastq.gz | sort | tr "\n" ",") \
    $(/bin/ls -1 Sample1/fastq/*L00*R2*.fastq.gz | sort | tr "\n" ",")
ADD REPLYlink modified 3 months ago • written 3 months ago by RamRS30k

Thank you! Yeah the ${FASTQ1},${FASTQ2} doesn't work either, it's something about having the comma right between the two variables makes it break down when using a wildcard. But I'll give the other options a shot, thanks!

ADD REPLYlink written 3 months ago by science_lizard0
1

If your file names are Sample_1_TTAGGC_BCCRUJANXX_L003_001.R1.fastq.gz then name=$(basename ${i} _001.R2.fastq.gz) would be the best way to extract the part that is variant. Then you reconstitute the file names for actual run using ${name}_001.R1.fastq.gz and ${name}_001.R2.fastq.gz.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2049 users visited in the last hour