STAR alignment in a full directory
1
0
Entering edit mode
7 weeks ago
Pahi • 0

Hi!

I'm working with STAR and I would like to align multiple file, but separately. I have file paires like these two:

Dros_01_S48_L001_R1_001.fastq.gz
Dros_01_S48_L001_R2_001.fastq.gz


etc. Only the S[number] changes and the R1 and R2 in the names.

I have a code, what I find on other forums, and reworked it a bit, to match my files:

for i in $(ls 1_raw_data | sort -u); do echo STAR --genomeDir /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_STAR/ \ --readFilesIn 1_raw_data/${i}_R1_001.fastq.gz 1_raw_data/{i}_R2_001.fastq.gz \ --runThreadN 20 --outFileNamePrefix 3_aligned/i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_gtf/Drosophila_melanogaster.BDGP6.32.104.chr.gtf \


My problem, that during the readFilesIn part the files what STAR load in renamed as the following:

Dros_01_S48_L001_R1_001.fastq.gz_R1_001.fastq.gz
Dros_01_S48_L001_R1_001.fastq.gz_R2_001.fastq.gz


So basically it adds more to its name.

What could be the problem?

STAR loop • 458 views
0
Entering edit mode

for i in (ls 1_raw_data | sort -u); do you should use a workflow manager My problem, that during the readFilesIn what's in 1_raw_data ? ADD REPLY 0 Entering edit mode 1_raw_data is a folder where all the fastq.gz file are located. you should use a workflow manager I'm not sure what is that. ADD REPLY 0 Entering edit mode 1_raw_data is a folder where all the fastq.gz file are located. of course, can you please give us an example I'm not sure what is that. https://www.nextflow.io/ https://snakemake.readthedocs.io/en/stable/ ADD REPLY 2 Entering edit mode 7 weeks ago antonioggsousa ★ 2.4k Hi, If I understood right your problem, what you're doing does not make sense. You're looping over the forward (R1) and reverse (R2) fastq files inside the folder 1_raw_data. This means that for each sample you'll attempt to align it twice, one for the forward file and a second time for the reverse, because you're looping over the files inside the folder, which have different names for the forward and reverse fastq samples. Since you're looping over the file names that are inside the folder, the variable i will assume the names of the files, and of course, when you add to that the constant string _R1_001.fastq.gz or _R2_001.fastq.gz to i, it'll just append these strings to the file names. So, what you want to do is to loop over the forward or the reverse fastq files once for each sample, retrieve only the string that matches the sample, and align once the respective forward and reverse fastq samples against the reference. To do so you can attempt the following (first backup your files - never run analyses on the original copy!) - I'm using your loop above and I'm assuming that the sample names match between forward and reverse and the only difference between them is the R1 and R2 tags, respectively: for i in(ls 1_raw_data/*_R1_* | sort -u); do echo STAR --genomeDir /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_STAR/ \
--readFilesIn 1_raw_data/${i} 1_raw_data/${i/_R1_/_R2_} \
--runThreadN 20 --outFileNamePrefix 3_aligned/${i/_R1_001.fastq.gz/} \ --outSAMtype BAM SortedByCoordinate \ --quantMode GeneCounts \ --sjdbGTFfile /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_gtf/Drosophila_melanogaster.BDGP6.32.104.chr.gtf \ --readFilesCommand gunzip -c ; done;  Now, you'll only loop over the forward files (s 1_raw_data/*_R1_*) which is the same that looping over each sample once. Then you will make use of parameter expansion in linux to substitute the R1 tag from the forward file name to R2 in the reverse file name. You do this by specifying ${i/_R1_/_R2_} which means to pick up the name of the variable i which represents Dros_01_S48_L001_R1_001.fastq.gz, find _R1_ and replace by _R2_. I'm assuming that you want to use as prefix the sample name without the suffix _R1_001.fastq.gz that's why I changed the previous code to --outFileNamePrefix 3_aligned/{i/_R1_001.fastq.gz/}.

I just want to reinforce the message of Pierre Lindenbaum by strongly recommend you the use of a workflow manager. It is the best way to make your analysis more reproducible, scalable etc, and it will save you tons of time later.

I hope this helps,

António

0
Entering edit mode

Dear António!

Thank you for your, help and explanation. It helped me a lot. But still I had a problem with the names, so I removed the 1_raw_data part from:

--readFilesIn 1_raw_data/${i} 1_raw_data/${i/_R1_/_R2_} \


With that now I see that the code is just right. Thank you!

If I may bother you with another question, I had another data set, where the read names were:

12345_1.fastq.gz
12345_2.fastq.gz
etc.


On that data set, I used my original code, what I posted, where the readFilesIn part was:

--readFilesIn 1_raw_data/${i}_1.fastq.gz 1_raw_data/${i}_2.fastq.gz


And in that case it works. Was it becouse there are less "_" in the names?

Thank you again for your time and help!

1
Entering edit mode

Hi @Pahi!

Regarding your question, as far as I know bash, if you have used the same code as above with your second example, it shouldn't work either. I guess it depends on what ${i} represents in the 2 example. If ${i} represents 12345, yes it should work. If \${i} represents the whole file name as in your 1st example, it shouldn't work (at least I'm not able to understand why would work).

Having one or more _ is irrelevant in the case of your code (at least as far I can tell - I'm not an expert in bash).

But without having access to your computer to test it, it is difficult to understand why it worked before.

Best,

António