Question

STAR multiple FASTQ alignment, can you use with a variable $?

0

Entering edit mode

3.8 years ago

science_lizard ▴ 20

Super basic question, probably more related to programming than RNA-seq itself...

I have multiple FASTQs that I want to align (from 3 separate lanes to provide more sequencing depth). According to the STAR manual, I should be able to separate these with commas in between all read 1s, then space, then commas between all read 2s. Like this: L1_R1,L2_R1,L3_R1 L1_R2,L2_R2,L3_R2

I keep getting the error: gzip: /Sample_1/fastq/*L003_001.R2.fastq.gz: No such file or directory

When I bring it out to the terminal to check, this is what I see:

echo $FASTQ1 $FASTQ2
/Sample_1/fastq/Sample_1_TTAGGC_BCCRUJANXX_L003_001.R1.fastq.gz /Sample_1/fastq/Sample_1_TTAGGC_BCCRUJANXX_L003_001.R2.fastq.gz

but

echo $FASTQ1,$FASTQ2
/Sample_1/fastq/*L003_001.R1.fastq.gz,/Sample_1/fastq/*L003_001.R2.fastq.gz

So clearly it recognizes the variable name, but can't deal with the comma in between. Is there a way around this or do I need to provide full variable names in the STAR script (which would be pretty annoying, since I have many files like this)?

Thanks! Here's the code for reference (I've modified it slightly since the whole path names are long):

IN=/Sample_1
genomeDir=/STAR
GTF=/gencode.v19.chr_patch_hapl_scaff.annotation.gtf

FASTQ1=$IN/fastq/*L003_001.R1.fastq.gz
FASTQ2=$IN/fastq/*L003_001.R2.fastq.gz
FASTQ3=$IN/fastq/*L004_001.R1.fastq.gz
FASTQ4=$IN/fastq/*L004_001.R2.fastq.gz
FASTQ5=$IN/fastq/*L005_001.R1.fastq.gz
FASTQ6=$IN/fastq/*L005_001.R2.fastq.gz

STAR --runThreadN 8 --genomeDir $genomeDir --sjdbGTFfile $GTF --sjdbOverhang 149 --bamRemoveDuplicatesType UniqueIdentical --readFilesIn $FASTQ1,$FASTQ3,$FASTQ5 $FASTQ2,$FASTQ4,$FASTQ6 --twopassMode Basic --outSAMtype BAM SortedByCoordinate Unsorted --quantMode TranscriptomeSAM GeneCounts --readFilesCommand zcat

RNA-seq STAR Alignment bash variable • 2.6k views

ADD COMMENT • link 3.8 years ago by science_lizard ▴ 20

1

Entering edit mode

IN=/Sample_1

Defining variable this way is making the OS look for that folder/file at the base of the root directory when that variable is expanded (which is reflected in error you get gzip: /Sample_1/fastq/*L003_001.R2.fastq.gz: No such file or directory. Do you actually have a folder Sample_1 just under /?

Other than that what is going on with * in *L003_001.R1.fastq.gz? What is the * supposed to indicate?

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

Thanks for your reply! I just cut out the rest of the path name for my question online because it's super long, but I don't think that part of the variable is the issue. I think you are right that the wildcard is causing the problem, but I don't understand why. Every sample is named something like this in each respective fastq folder:

Sample_1_TTAGGC_BCCRUJANXX_L003_001.R2.fastq.gz

so I wanted to make a universal script that could get rid of the sample name and adapter sequence and take only the part:

*L003_001.R2.fastq.gz

Doing this to all my scripts seems to have fixed the problem, after I cd into the proper folder:

test=Sample_1_TTAGGC_BCCRUJANXX_
FASTQ1=fastq/${test}L003_001.R1.fastq.gz

But this requires me to fill out that variable for every sample, which seems like a silly way to do it. You're right that the wildcard is the issue, but why?

ADD REPLY • link 3.8 years ago by science_lizard ▴ 20

2

Entering edit mode

Try using ${FASTQ1},${FASTQ2} instead of $FASTQ1,$FASTQ2 - I'm not sure if it will work, but it's worth a shot. What would be better is to use something like

--readFilesIn \
    $(find Sample1/fastq/ -name "*L00*R1*.fastq.gz" | sort | tr "\n" ",") \
    $(find Sample1/fastq/ -name "*L00*R2*.fastq.gz" | sort | tr "\n" ",")

or something that uses ls instead of find if all FASTQ are in the same folder:

--readFilesIn \
    $(/bin/ls -1 Sample1/fastq/*L00*R1*.fastq.gz | sort | tr "\n" ",") \
    $(/bin/ls -1 Sample1/fastq/*L00*R2*.fastq.gz | sort | tr "\n" ",")

ADD REPLY • link 3.8 years ago by Ram 43k

0

Entering edit mode

Thank you! Yeah the ${FASTQ1},${FASTQ2} doesn't work either, it's something about having the comma right between the two variables makes it break down when using a wildcard. But I'll give the other options a shot, thanks!

ADD REPLY • link 3.8 years ago by science_lizard ▴ 20

1

Entering edit mode

If your file names are Sample_1_TTAGGC_BCCRUJANXX_L003_001.R1.fastq.gz then name=$(basename ${i} _001.R2.fastq.gz) would be the best way to extract the part that is variant. Then you reconstitute the file names for actual run using ${name}_001.R1.fastq.gz and ${name}_001.R2.fastq.gz.

ADD REPLY • link 3.8 years ago by GenoMax 141k