bash loop for alignment RNA-seq data
1
5
Entering edit mode
7.5 years ago
Paul ★ 1.4k

Dear all,

Could you help he please to modify my loop in bash to multiple alignment my FASTQ files?

I have a many fastq files R1 and R2 and I would like to align all those read in bash loop - lets say I have X1_R1_001.fastq + X1_R2_001-fastq and Y1_R1_001.fastq + Y1_R2_001.fastq. Sample X1 is pair-end read R1 + R2 and so on.

#!/bin/bash
for i in *fastq;
do tophat2 -o ${i%.fastq}tpout -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index${i}R1_001.fastq {i}R2_001.fastq --rg-id X1 --rg-sample X1 --rg-library rna-seq --rg-platform Illumina done;  And I would like to also change my --rg-id tag with name of my fastq files (in this example X1 in first loop and X2 in second loop). Output folder should have the name of the sample too. Please do you have any idea how to modify my bash loop? Thank you so much for any ideas! Paul RNA-Seq next-gen alignment • 18k views ADD COMMENT 1 Entering edit mode Hi, Quick question: Do your FASTQ files follow a consistent naming convention? ADD REPLY 0 Entering edit mode Hi Ram, thank you for reply, yeah there is allways number 1_S1_L001_R1_001.fastq + 1_S1_L001_R2_001.fastq, 2_S2_L001_R1_001.fastq + 2_S2_L001_R2_001.fastq and so on .... This is typical output form MiSeq.. Paul. ADD REPLY 0 Entering edit mode OK,so the _R[12]_001.fastq is the constant part of the names, anything before that is prefix. Cool, I'll get back to you in some time. ADD REPLY 0 Entering edit mode Thank you so much... I try to figure out some solution also :-) ADD REPLY 0 Entering edit mode for i in(ls *.fa.gz | rev | cut -c 4- | rev | uniq)
do
mixcr align -p rna-seq -OallowPartialAlignments=true -OrelativeMinVFR3CDR3Score=0.7 -OallowChimeras=true -OminSumScore=30 ${i}.fa.gz${i}.fa.gz  ../Alignments/{fq1%.fa.gz}.alignments.vdjca; done  Hi Ram, i used code to align my pair end RNA seq. we have performed single cells TCR seq and in one sample, i mixed several samples by adding unique adaptor to each samples. So after demultiplexing i got around 10 files of R1(as i used 10 barcode) and 10 files of R2. however name samples are same except before .fa.gz, it comes like _1.fa.gz and for next sample _2.fa.gz etc. Now when i used your script by editing specific thing and run the command. Scripting is not working. I get msg that there are no such file call .fa.gz Here i am pasting name of the samples for your consideration. Could you please help me solve the issue. 9_S9_L001_R1_001_1.fa.gz 9_S9_L001_R1_001_2.fa.gz 9_S9_L001_R1_001_3.fa.gz 9_S9_L001_R1_001_4.fa.gz 9_S9_L001_R1_001_5.fa.gz 9_S9_L001_R1_001_6.fa.gz ADD REPLY 0 Entering edit mode Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. ADD REPLY 0 Entering edit mode code for i in(ls *.fa.gz | rev | cut -c 4- | rev | uniq) prints sample_name.fa instead of sample_name. ${i}.fa.gz in your code appends extra fa (for eg. 9_S9_L001_R1_001_1.fa.fa.gz) to the sample name before extension .fa.gz. This might be one problem. In addition to this, you are using same file (${i}.fa.gz ${i}.fa.gz) twice. Is there a reason for this? I tried to echo the input with three of your sample files (for R1 and R2): $ for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq); do echo$i; echo ${i}.fa.gz;done 9_S9_L001_R1_001_1.fa 9_S9_L001_R1_001_1.fa.fa.gz 9_S9_L001_R1_001_2.fa 9_S9_L001_R1_001_2.fa.fa.gz 9_S9_L001_R1_001_3.fa 9_S9_L001_R1_001_3.fa.fa.gz 9_S9_L001_R2_001_1.fa 9_S9_L001_R2_001_1.fa.fa.gz 9_S9_L001_R2_001_2.fa 9_S9_L001_R2_001_2.fa.fa.gz 9_S9_L001_R2_001_3.fa 9_S9_L001_R2_001_3.fa.fa.gz$ ls *.gz
9_S9_L001_R1_001_1.fa.gz  9_S9_L001_R1_001_2.fa.gz  9_S9_L001_R1_001_3.fa.gz  9_S9_L001_R2_001_1.fa.gz  9_S9_L001_R2_001_2.fa.gz  9_S9_L001_R2_001_3.fa.gz

0
Entering edit mode

If i will use the sample name then i have to write sample name for all. It means i have total 1-49 samples and then for each samples 10 barcode so in total i have 490 R1 file and 490 R2 file. No there are no reason to use that but i am getting how to specify R1 and R2, as my sample unique name is 9_S9..._1.fa.gz. so i to specify that i am not getting.

0
Entering edit mode

What I meant was that code (provided by you) is appending an extra fa to your input files. In stead of passing 9_S9_L001_R2_001_3.fa.gz, it is passing 9_S9_L001_R2_001_3.fa.fa.gz for execution. That might be one reason why it is not finding files.

I printed the input files that are being sent to program in above command.

0
Entering edit mode

that true, how could modify this script so it can work?

0
Entering edit mode

See my response below. Unless you want the fa as part of the prefix, use a different index to cut with, or better yet, use sed.

0
Entering edit mode

I've updated my answer below with some tips.

0
Entering edit mode

These are shell questions, my friend - it takes a little trial and error to get them right. for examples, why a cut -c 4- when the common ending (.fa.gz) itself is 6 characters? Only a rev | cut -c 7- will give you a list of prefixes. In fact, you can just use ls *.fa.gz | sed 's/.fa.gz//'

0
Entering edit mode

for i in (ls *.gz | sed s/.fa.gz//) do mixcr align -p rna-seq -OallowPartialAlignments=true -OrelativeMinVFR3CDR3Score=0.7 -OallowChimeras=true -OminSumScore=30{i} ../Alignments/{i%.fa.gz}.alignments.vdjca; done hi Now i have changed things as you suggested. but still i say there are no such file or directory. Here i do not understand one point if i am not providing R1 and R2 separately then how will recognize the partner. i am sorry i am bit new to system that why my questions are bit basic level. ADD REPLY 0 Entering edit mode About file name based assumptions, I'm guessing that's on the tool, not on bash. If your tool (mixcr align) substitutes an R1. with a R2. automatically to look for a corresponding file, it can (in your words) recognize the partner. If not, like most other tools, it will have two input parameters, one for each file. Also, I'm guessing your changing-suffix-after-common-prefix as you describe below is the core challenge. In your command above, you've sed-substituted away the .fa.gz and still are using {i%.fa.gz}. Why is that?

0
Entering edit mode

It would help if you could post the example code for accepted and intended parameters for this tool (mixcr) with one paired end sample. Otherwise, visitors would not know how input files are passed to the tool and in what order.

18
Entering edit mode
7.5 years ago
Ram 34k

So, this is what I came up with, based on our discussion:

for i in $(ls *.fastq | rev | cut -c 13- | rev | uniq) do tophat2 -o${i%.fastq}tpout  -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index ${i}_R1_001.fastq${i}_R2_001.fastq --rg-id ${i} --rg-sample${i} --rg-library rna-seq --rg-platform Illumina
done


Hope this helps!

OK, this answer is gotten a little popular, and my shell knowledge has improved so I'm going to enhance this with some failsafes to address commonly observed challenges.

First off, the rev | cut -c 13- is just asking for trouble. Let's substitute that with a sed so the command makes a little more sense. Here, I'm trying to remove all _R[12]_001.fastq.

ls *.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq looks better. I've also edited the variable name from i so it makes a little more sense. And I've added some formatting changes plus quotes to avoid shell problems.

Here's the updated code:

for prefix in $(ls *.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq) do tophat2 -o${prefix%.fastq}tpout  \
-G path/to/my/reference.gtf -p 8 \
path/to/my/bowtie_index \
"${prefix}_R1_001.fastq" "${prefix}_R2_001.fastq" \
--rg-id "${prefix}" --rg-sample "${prefix}" --rg-library rna-seq --rg-platform Illumina
done

1
Entering edit mode

Wau that looks good - pleae could you explain the first line -

for i in $(ls *.fastq | rev | cut -c 13- | rev | uniq)  Thank you very much, Paul. ADD REPLY 1 Entering edit mode Thank you. Sure, the first line extracts unique prefixes. The pipe does something like this: 1. Pick all .fastq files 2. Reverse their names 3. Pick from char 13 to end 4. reverse again (as the last 13 characters are constant, we use that to pick first n characters) 5. take a uniq on this prefix set. Use this uniq set to run the script :) Edit: $(command) is an alternate, clearer way of writing command

0
Entering edit mode

Paul,

If this helped, I'd appreciate it if you accepted the answer please. Thank you!

1
Entering edit mode

Yeah of course Ram, thank you so much for your time and help!!!

Paul.

1
Entering edit mode

Thank you, Paul. It is always fun working on bash! Have a great weekend!

0
Entering edit mode

+1 for rev

0
Entering edit mode

It's a wonderful piece of our toolkit, isn't it? It is for these reasons that I love Unix!

0
Entering edit mode

do I have to run this thing from the same directory i have my fastq files ? or I can run it from anywhere ?

1
Entering edit mode

This script is customized to OP's situation. You may want to modify it to fit your use case. Given the script involves a ls *.fastq, you'll need to either run a valid version of it in the dir with the fastq files, or change the ls part of the script suitably.

0
Entering edit mode

so in place of ls i just defined the path that contains my fastq files so that would work?

0
Entering edit mode

as i defined the whole path which contains my fastq files its showing that its a directory ..the command is not being executed

1
Entering edit mode

How well did you understand/customize the script? The cut operates on exact character numbering, and might not be suitable to your case. If you cut the first 13 characters of a long path name, you might end up with a character string that doesn't mean much.

Have fun with bash! :)

0
Entering edit mode

Hi,

Just a general question about the arguments used for the loop designed here. What is the difference between ${i%.fastq} used to indicate the output or simply using${i}?

Thanks!

4
Entering edit mode

Let's say i="HELLO.txt.txt"

echo $i #HELLO.txt.txt echo$iX #No output - $iX is not a variable with any value here echo${i}X #HELLO.txt.txtX - ${} isolates the string inside the {} as the variable name echo${i%.txt} #HELLO.txt - removes shortest match for '.txt' from the end of the string


So, ${i%.fastq} removes the smallest .fastq from the end of the filename (so, its extension). Hope this helps. EDIT: With context, I'm using the (kinda) basename of the file - the name of the file stripped of its fastq extension to name the output. I'm sure there are other ways to achieve this, maybe with basename or seds, but this fit best here. ADD REPLY 0 Entering edit mode Hi Ram, thank you so much for the quick reply. Very helpful information for scripting loops! ADD REPLY 0 Entering edit mode I think its getting over written in the tpout folder ...can you suggest whats wrong with my loop , im using this have a look for i in$(ls *.fastq | rev | cut -c 9- | rev | uniq); do tophat2 -o $(%i.fastq)tout -G /run/media/punit/data2/gencode.v21.annotation.gtf -p 30 /run/media/punit/data2/BowtieIndex/hg38${i}_1.fastq ${i}_2.fastq ;done  ADD REPLY 1 Entering edit mode $() executes the statement within the (). ${} is used to delimit variable names. Are you sure you're doing it right with the $(%i.fastq)tout part?

0
Entering edit mode

Immediate issue is $(%i} in your code. You should use${i%}, not ${%i}, to my knowledge, in your code. However your variable depends on your sample names. could you paste your sample names? one for each type (forward and reverse)? ADD REPLY 0 Entering edit mode Yes, it's always ${} for parameter expansion/ string substitution

0
Entering edit mode

okay i'm doing as you mentioned but for this i would like to know "tophat2 -o ${i%.fastq}tpout" is it going to put my output in two different folders , as i have two pairs of paired end read ADD REPLY 0 Entering edit mode I don't know how tophat works, I was helping with bash here :-) ADD REPLY 0 Entering edit mode Thanks Ram! one of the issue with my samples are that they are coming from barcoding demultiplex that means all the name would be same except number just before .fa.gz such as _1.fa.gz. so how could i give my input sample name such as, example you have provided is common in all the samples so that will work well but in my sample only common part is .fa.gz${prefix}_R1_001.fastq" "${prefix}_R2_001.fastq" 9_S9_L001_R1_001_1.fa.gz 9_S9_L001_R1_001_2.fa.gz 9_S9_L001_R1_001_3.fa.gz  ADD REPLY 1 Entering edit mode I'd recommend breaking down your problem into at least 2 steps. For the first step, collect all filenames in a separate file. Once you do that, you can create dry-run commands that just echo stuff so you can test your cutting and sed-ing. For example, if a file named filelist.list had the 3 file names you gave me, I could do: sed 's/.fa.gz//' filelist.list | sed -r 's/(_[0-9])$/\t\1/' | tee prefixes_and_suffixes.txt


Output:

9_S9_L001_R1_001    _1
9_S9_L001_R1_001    _2
9_S9_L001_R1_001    _3


You could then use this new tab separated file (the arbitrarily named prefixes_and_suffixes.txt) to do what you will - awk and cut as well as uniq will be really useful here. The only way you'll learn bash is by this sort of trial and error. This will be a good exercise. Good luck!

0
Entering edit mode

Life would be so much simpler if you had chosen to include that _1, _2 before _L001_R1_001.

0
Entering edit mode

You can use nested loop as following:

$for i in$(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo$i"_R1"_$j;done;done.  Replace R2 for another read pair in place R1. My assumption is that, for 2 samples with paired end reads (R1 and R2), your files are : $ ls *.gz
9_S9_L001_R1_001_1.fa.gz
9_S9_L001_R2_001_2.fa.gz
9_S9_L001_R2_001_1.fa.gz
9_S9_L001_R1_001_2.fa.gz


output for R1:

  $for i in$(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo$i"_R1"_$j;done;done 9_S9_L001_R1_1.fa.gz 9_S9_L001_R1_2.fa.gz  output for R2: $ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in$(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R2"_$j;done;done
9_S9_L001_R2_1.fa.gz
9_S9_L001_R2_2.fa.gz


Let us say your program takes following commands and you have several samples with R1 and R2:

$program -p <R1>.fa <R2>.fa  Above script would be: $ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in$(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do program -p $i"_R1"_$j  $i"_R2"_$j;done;done


Try with two samples at first to see if program is working. Let me know if there are any issues.

0
Entering edit mode

Hi, thank you for this. Can you explain the sed -r 's/_R[12]_001[.]fastq// change?

0
Entering edit mode

Sure, that command uses sed to delete a part of a string from the string. Are you familiar with sed and regular expressions?