Question: bash loop for alignment RNA-seq data
5
gravatar for Paul
4.5 years ago by
Paul1.1k
European Union
Paul1.1k wrote:

Dear all,

Could you help he please to modify my loop in bash to multiple alignment my FASTQ files?

I have a many fastq files R1 and R2 and I would like to align all those read in bash loop - lets say I have X1_R1_001.fastq + X1_R2_001-fastq and Y1_R1_001.fastq + Y1_R2_001.fastq. Sample X1 is pair-end read R1 + R2 and so on.

#!/bin/bash
for i in *fastq;
do tophat2 -o ${i%.fastq}tpout -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index ${i}R1_001.fastq ${i}R2_001.fastq --rg-id X1 --rg-sample X1 --rg-library rna-seq --rg-platform Illumina
done;

And I would like to also change my --rg-id tag with name of my fastq files (in this example X1 in first loop and X2 in second loop). Output folder should have the name of the sample too.

Please do you have any idea how to modify my bash loop?

Thank you so much for any ideas!

Paul

rna-seq alignment next-gen • 9.5k views
ADD COMMENTlink modified 11 months ago by parmarsachinsingh0 • written 4.5 years ago by Paul1.1k
1

Hi,

Quick question: Do your FASTQ files follow a consistent naming convention?

ADD REPLYlink written 4.5 years ago by RamRS18k

Hi Ram,

thank you for reply, yeah there is allways number 1_S1_L001_R1_001.fastq + 1_S1_L001_R2_001.fastq, 2_S2_L001_R1_001.fastq + 2_S2_L001_R2_001.fastq and so on ....

This is typicla output form MiSeq.. 

Paul.

ADD REPLYlink written 4.5 years ago by Paul1.1k

OK,so the "_R[12]_001.fastq" is the constant part of the names, anything before that is prefix. Cool, I'll get back to you in some time.

ADD REPLYlink written 4.5 years ago by RamRS18k

Thank you so much... I try to figure out some solution also :-)

ADD REPLYlink written 4.5 years ago by Paul1.1k
for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq)
do
mixcr align -p rna-seq -OallowPartialAlignments=true -OrelativeMinVFR3CDR3Score=0.7 -OallowChimeras=true -OminSumScore=30 ${i}.fa.gz ${i}.fa.gz  ../Alignments/${fq1%.fa.gz}.alignments.vdjca; 
done

Hi Ram, i used code to align my pair end RNA seq. we have performed single cells TCR seq and in one sample, i mixed several samples by adding unique adaptor to each samples. So after demultiplexing i got around 10 files of R1(as i used 10 barcode) and 10 files of R2. however name samples are same except before .fa.gz, it comes like _1.fa.gz and for next sample _2.fa.gz etc. Now when i used your script by editing specific thing and run the command. Scripting is not working. I get msg that there are no such file call .fa.gz Here i am pasting name of the samples for your consideration. Could you please help me solve the issue.

9_S9_L001_R1_001_1.fa.gz

9_S9_L001_R1_001_2.fa.gz

9_S9_L001_R1_001_3.fa.gz

9_S9_L001_R1_001_4.fa.gz

9_S9_L001_R1_001_5.fa.gz

9_S9_L001_R1_001_6.fa.gz

ADD REPLYlink modified 11 months ago • written 11 months ago by parmarsachinsingh0

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax57k

code for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq) prints sample_name.fa instead of sample_name. ${i}.fa.gz in your code appends extra fa (for eg. 9_S9_L001_R1_001_1.fa.fa.gz) to the sample name before extension .fa.gz. This might be one problem. In addition to this, you are using same file (${i}.fa.gz ${i}.fa.gz) twice. Is there a reason for this?

I tried to echo the input with three of your sample files (for R1 and R2):

$ for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq); do echo $i; echo ${i}.fa.gz;done
9_S9_L001_R1_001_1.fa
9_S9_L001_R1_001_1.fa.fa.gz
9_S9_L001_R1_001_2.fa
9_S9_L001_R1_001_2.fa.fa.gz
9_S9_L001_R1_001_3.fa
9_S9_L001_R1_001_3.fa.fa.gz
9_S9_L001_R2_001_1.fa
9_S9_L001_R2_001_1.fa.fa.gz
9_S9_L001_R2_001_2.fa
9_S9_L001_R2_001_2.fa.fa.gz
9_S9_L001_R2_001_3.fa
9_S9_L001_R2_001_3.fa.fa.gz

$ ls *.gz
9_S9_L001_R1_001_1.fa.gz  9_S9_L001_R1_001_2.fa.gz  9_S9_L001_R1_001_3.fa.gz  9_S9_L001_R2_001_1.fa.gz  9_S9_L001_R2_001_2.fa.gz  9_S9_L001_R2_001_3.fa.gz
ADD REPLYlink modified 11 months ago • written 11 months ago by cpad01129.3k

If i will use the sample name then i have to write sample name for all. It means i have total 1-49 samples and then for each samples 10 barcode so in total i have 490 R1 file and 490 R2 file. No there are no reason to use that but i am getting how to specify R1 and R2, as my sample unique name is 9_S9..._1.fa.gz. so i to specify that i am not getting.

ADD REPLYlink written 11 months ago by parmarsachinsingh0

What I meant was that code (provided by you) is appending an extra fa to your input files. In stead of passing 9_S9_L001_R2_001_3.fa.gz, it is passing 9_S9_L001_R2_001_3.fa.fa.gz for execution. That might be one reason why it is not finding files.

I printed the input files that are being sent to program in above command.

ADD REPLYlink modified 11 months ago • written 11 months ago by cpad01129.3k

that true, how could modify this script so it can work?

ADD REPLYlink written 11 months ago by parmarsachinsingh0

See my response below. Unless you want the fa as part of the prefix, use a different index to cut with, or better yet, use sed.

ADD REPLYlink written 11 months ago by RamRS18k

I've updated my answer below with some tips.

ADD REPLYlink written 11 months ago by RamRS18k

These are shell questions, my friend - it takes a little trial and error to get them right. for examples, why a cut -c 4- when the common ending (.fa.gz) itself is 6 characters? Only a rev | cut -c 7- will give you a list of prefixes. In fact, you can just use ls *.fa.gz | sed 's/.fa.gz//'

ADD REPLYlink written 11 months ago by RamRS18k

for i in $(ls *.gz | sed s/.fa.gz//) do mixcr align -p rna-seq -OallowPartialAlignments=true -OrelativeMinVFR3CDR3Score=0.7 -OallowChimeras=true -OminSumScore=30 ${i} ../Alignments/${i%.fa.gz}.alignments.vdjca; done

hi Now i have changed things as you suggested. but still i say there are no such file or directory. Here i do not understand one point if i am not providing R1 and R2 separately then how will recognize the partner. i am sorry i am bit new to system that why my questions are bit basic level.

ADD REPLYlink written 11 months ago by parmarsachinsingh0

About file name based assumptions, I'm guessing that's on the tool, not on bash. If your tool (mixcr align) substitutes an R1. with a R2. automatically to look for a corresponding file, it can (in your words) recognize the partner. If not, like most other tools, it will have two input parameters, one for each file.

Also, I'm guessing your changing-suffix-after-common-prefix as you describe below is the core challenge.

In your command above, you've sed-substituted away the .fa.gz and still are using ${i%.fa.gz}. Why is that?

ADD REPLYlink written 11 months ago by RamRS18k

It would help if you could post the example code for accepted and intended parameters for this tool (mixcr) with one paired end sample. Otherwise, visitors would not know how input files are passed to the tool and in what order.

ADD REPLYlink written 11 months ago by cpad01129.3k
16
gravatar for RamRS
4.5 years ago by
RamRS18k
Houston, TX
RamRS18k wrote:

So, this is what I came up with, based on our discussion:

for i in $(ls *.fastq | rev | cut -c 13- | rev | uniq)
do
tophat2 -o ${i%.fastq}tpout  -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index ${i}_R1_001.fastq ${i}_R2_001.fastq --rg-id ${i} --rg-sample ${i} --rg-library rna-seq --rg-platform Illumina
done

Hope this helps!

OK, this answer is gotten a little popular, and my shell knowledge has improved so I'm going to enhance this with some failsafes to address commonly observed challenges.

First off, the rev | cut -c 13- is just asking for trouble. Let's substitute that with a sed so the command makes a little more sense. Here, I'm trying to remove all _R[12]_001.fastq.

ls *.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq looks better. I've also edited the variable name from i so it makes a little more sense. And I've added some formatting changes plus quotes to avoid shell problems.

Here's the updated code:

for prefix in $(ls *.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq)
do
tophat2 -o ${prefix%.fastq}tpout  \
    -G path/to/my/reference.gtf -p 8 \
    path/to/my/bowtie_index \
    "${prefix}_R1_001.fastq" "${prefix}_R2_001.fastq" \
    --rg-id "${prefix}" --rg-sample "${prefix}" --rg-library rna-seq --rg-platform Illumina
done
ADD COMMENTlink modified 11 weeks ago • written 4.5 years ago by RamRS18k
1

Wau that looks good - pleae could you explain the first line -

for i in $(ls *.fastq | rev | cut -c 13- | rev | uniq)

Thank you very much,

Paul.

ADD REPLYlink modified 3 months ago by RamRS18k • written 4.5 years ago by Paul1.1k
1

Thank you. Sure, the first line extracts unique prefixes. The pipe does something like this:

  1. Pick all .fastq files
  2. Reverse their names
  3. Pick from char 13 to end
  4. reverse again (as the last 13 characters are constant, we use that to pick first n characters)
  5. take a uniq on this prefix set.

Use this uniq set to run the script :)

Edit: $(command) is an alternate, clearer way of writing `command`

ADD REPLYlink modified 11 weeks ago • written 4.5 years ago by RamRS18k

Paul,

If this helped, I'd appreciate it if you accepted the answer please. Thank you!

--
Ram

ADD REPLYlink modified 3 months ago • written 4.5 years ago by RamRS18k
1

Yeah of course Ram, thank you so much for your time and help!!! 

Paul.

ADD REPLYlink written 4.5 years ago by Paul1.1k
1

Thank you, Paul. It is always fun working on bash! Have a great weekend!

ADD REPLYlink written 4.5 years ago by RamRS18k

+1 for rev

ADD REPLYlink written 3.8 years ago by geek_y8.7k

It's a wonderful piece of our toolkit, isn't it? It is for these reasons that I love Unix!

ADD REPLYlink written 3.7 years ago by RamRS18k

do I have to run this thing from the same directory i have my fastq files ? or I can run it from anywhere ?

ADD REPLYlink written 19 months ago by krushnach80390
1

This script is customized to OP's situation. You may want to modify it to fit your use case. Given the script involves a ls *.fastq, you'll need to either run a valid version of it in the dir with the fastq files, or change the ls part of the script suitably.

ADD REPLYlink written 19 months ago by RamRS18k

so in place of ls i just defined the path that contains my fastq files so that would work?

ADD REPLYlink written 19 months ago by krushnach80390

as i defined the whole path which contains my fastq files its showing that its a directory ..the command is not being executed

ADD REPLYlink written 19 months ago by krushnach80390
1

How well did you understand/customize the script? The cut operates on exact character numbering, and might not be suitable to your case. If you cut the first 13 characters of a long path name, you might end up with a character string that doesn't mean much.

Have fun with bash! :)

ADD REPLYlink written 19 months ago by RamRS18k

Hi,

Just a general question about the arguments used for the loop designed here. What is the difference between ${i%.fastq} used to indicate the output or simply using ${i}?

Thanks!

ADD REPLYlink written 17 months ago by lshepard130
4

I've found this page (and the website in general) very useful for advanced bash programming: http://wiki.bash-hackers.org/syntax/pe

Let's say i="HELLO.txt.txt"

echo $i #HELLO.txt.txt
echo $iX #No output - $iX is not a variable with any value here
echo ${i}X #HELLO.txt.txtX - ${} isolates the string inside the {} as the variable name
echo ${i%.txt} #HELLO.txt - removes shortest match for '.txt' from the end of the string

So, ${i%.fastq} removes the smallest .fastq from the end of the filename (so, its extension). Hope this helps.

EDIT: With context, I'm using the (kinda) basename of the file - the name of the file stripped of its fastq extension to name the output. I'm sure there are other ways to achieve this, maybe with basename or seds, but this fit best here.

ADD REPLYlink modified 17 months ago • written 17 months ago by RamRS18k

Hi Ram, thank you so much for the quick reply. Very helpful information for scripting loops!

ADD REPLYlink written 17 months ago by lshepard130

I think its getting over written in the tpout folder ...can you suggest whats wrong with my loop , im using this have a look

for i in $(ls *.fastq | rev | cut -c 9- | rev | uniq); do tophat2 -o $(%i.fastq)tout -G /run/media/punit/data2/gencode.v21.annotation.gtf -p 30 /run/media/punit/data2/BowtieIndex/hg38 ${i}_1.fastq ${i}_2.fastq ;done
ADD REPLYlink written 12 months ago by krushnach80390
1

$() executes the statement within the (). ${} is used to delimit variable names. Are you sure you're doing it right with the $(%i.fastq)tout part?

ADD REPLYlink written 12 months ago by RamRS18k

Immediate issue is $(%i} in your code. You should use ${i%}, not ${%i}, to my knowledge, in your code. However your variable depends on your sample names. could you paste your sample names? one for each type (forward and reverse)?

ADD REPLYlink modified 12 months ago • written 12 months ago by cpad01129.3k

Yes, it's always ${} for parameter expansion/ string substitution

ADD REPLYlink written 12 months ago by RamRS18k

okay i'm doing as you mentioned but for this i would like to know "tophat2 -o ${i%.fastq}tpout" is it going to put my output in two different folders , as i have two pairs of paired end read

ADD REPLYlink written 12 months ago by krushnach80390

I don't know how tophat works, I was helping with bash here :-)

ADD REPLYlink written 11 months ago by RamRS18k

Thanks Ram! one of the issue with my samples are that they are coming from barcoding demultiplex that means all the name would be same except number just before .fa.gz such as _1.fa.gz. so how could i give my input sample name such as, example you have provided is common in all the samples so that will work well but in my sample only common part is .fa.gz ${prefix}_R1_001.fastq" "${prefix}_R2_001.fastq"

9_S9_L001_R1_001_1.fa.gz

9_S9_L001_R1_001_2.fa.gz

9_S9_L001_R1_001_3.fa.gz
ADD REPLYlink written 11 months ago by parmarsachinsingh0
1

I'd recommend breaking down your problem into at least 2 steps. For the first step, collect all filenames in a separate file. Once you do that, you can create dry-run commands that just echo stuff so you can test your cutting and sed-ing.

For example, if a file named filelist.list had the 3 file names you gave me, I could do:

sed 's/.fa.gz//' filelist.list | sed -r 's/(_[0-9])$/\t\1/' | tee prefixes_and_suffixes.txt

Output:

9_S9_L001_R1_001    _1
9_S9_L001_R1_001    _2
9_S9_L001_R1_001    _3

You could then use this new tab separated file (the arbitrarily named prefixes_and_suffixes.txt) to do what you will - awk and cut as well as uniq will be really useful here. The only way you'll learn bash is by this sort of trial and error. This will be a good exercise. Good luck!

ADD REPLYlink modified 11 months ago • written 11 months ago by RamRS18k

Life would be so much simpler if you had chosen to include that _1, _2 before _L001_R1_001.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax57k

You can use nested loop as following:

$ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R1"_$j;done;done.

Replace R2 for another read pair in place R1.

My assumption is that, for 2 samples with paired end reads (R1 and R2), your files are :

$ ls *.gz
9_S9_L001_R1_001_1.fa.gz  
9_S9_L001_R2_001_2.fa.gz
9_S9_L001_R2_001_1.fa.gz 
9_S9_L001_R1_001_2.fa.gz

output for R1:

  $ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R1"_$j;done;done
9_S9_L001_R1_1.fa.gz
9_S9_L001_R1_2.fa.gz

output for R2:

$ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R2"_$j;done;done
9_S9_L001_R2_1.fa.gz
9_S9_L001_R2_2.fa.gz

Let us say your program takes following commands and you have several samples with R1 and R2:

$ program -p <R1>.fa <R2>.fa

Above script would be:

$ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do program -p $i"_R1"_$j  $i"_R2"_$j;done;done

Try with two samples at first to see if program is working. Let me know if there are any issues.

ADD REPLYlink modified 11 months ago • written 11 months ago by cpad01129.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1683 users visited in the last hour