Question: Run mutltiple Fastq files on Hisat2
2
gravatar for V
9 months ago by
V50
UK/London
V50 wrote:

Hello,

I've got a large number of fastq files generated from a paired end single cell RNAseq experiment.

I'm looking to align them back to mm10 using Hisat2, I can do this if I run every pair individually but is there a way to get hisat to do them all with one code one after the other?

I'm relatively new to the ubuntu coding environment so feel free to dumb it down. Thanks in advance

hisat2 hisat rnaseq • 1.3k views
ADD COMMENTlink modified 9 months ago by Jorge Amigo9.8k • written 9 months ago by V50

The different files are different samples, right?

You can use for loops in bash, pretty useful. If you can share the naming pattern of your files (R1, R2) and what else we can get this done easily.

ADD REPLYlink written 9 months ago by WouterDeCoster20k

hey, thanks for your help. Yes, two files per sample one for the forward one for the reverse. Their naming is quite complicated,

sample 1 F: WTCHG_272965_701502_01 sample 1 R: WTCHG_272965_701502_02

sample 2 F: WTCHG_272965_701503_01 sample 2 R: WTCHG_272965_701503_02

etc

ADD REPLYlink written 9 months ago by V50

Alright, that's not too complicated :p

So WTCHG_272965_701502_01.fastq is a filename? Or WTCHG_272965_701502_01.fastq.gz?

ADD REPLYlink written 9 months ago by WouterDeCoster20k

yes sorry, WTCHG_272965_701502_01.fastq.gz is the file name for the forward read

ADD REPLYlink written 9 months ago by V50
2

If I didn't make a mistake the loop would look like this:

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
hisat2 with inputfiles ${f}_01.fastq.gz ${f}_02.fastq.gz and output ${f}.bam
done

Perhaps you should test it first using a few echo statements:

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
echo ${f}_01.fastq.gz
echo ${f}.bam
done
ADD REPLYlink modified 9 months ago • written 9 months ago by WouterDeCoster20k

thank you for all your help!!! I will try it out and let you know how it goes!

ADD REPLYlink written 9 months ago by V50

okay, I've tried a number of things based on the above but can't seem to get anything to work properly.

I've created a new folder which contains the mm10 indexes (mm10idx -8 files) and 4 fastq.gz files which are the forward and reverse reads of 2 samples. Following the above naming pattern.

Changed my directory to that folder and run the script you've written but nothing seems to be happening...need some more help please? Maybe I've omitted something obvious. I know that if I was running it on one file I would type :

hisat2 -x mm10idx -1 forewardreads.fastq.gz -2 reversereads.fastq.gz -S example.bam

so can't see how what you've written relates to this or how it can be executed on loop. :/

ADD REPLYlink modified 8 months ago • written 8 months ago by V50

I assume you know that hisat2 with inputfiles ${f}_01.fastq.gz ${f}_02.fastq.gz and output ${f}.bam was still something you had to fill in? That would then be: (based on your example for a single file)

hisat2 -x mm10idx -1 ${f}_01.fastq.gz -2 ${f}_02.fastq.gz -S ${f}.bam

Did you check if the example with echo did show the expected input and output files?

ADD REPLYlink written 8 months ago by WouterDeCoster20k

Yes did so but the echo comes back wrong as shown in my reply below, with fastq appearing at the end of the file names twice and the same file being used for forward and reverse reads. I've left your first line unchanged.

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
ADD REPLYlink written 8 months ago by V50
1
gravatar for genomax
8 months ago by
genomax32k
United States
genomax32k wrote:

Try this instead and let us know.

for f in `ls -1 *_1.fastq.gz | sed 's/_1.fastq.gz//' `
do
echo hisat2 -x mm10idx -1 ${f}_1.fastq.gz -2 ${f}_2.fastq.gz -S ${f}.bam
done
ADD COMMENTlink modified 8 months ago • written 8 months ago by genomax32k
1
gravatar for Jorge Amigo
9 months ago by
Jorge Amigo9.8k
Santiago de Compostela, Spain
Jorge Amigo9.8k wrote:

if aligning them all in a single mapping step is an option, then you could try

zcat *_01.fastq.gz | gzip > pairs1.fq.gz
zcat *_02.fastq.gz | gzip > pairs2.fq.gz
hisat2 -x /path/to/hg19/indices -1 pairs1.fq.gz -2 pairs2.fq.gz | samtools view -Sb - > output.bam
ADD COMMENTlink modified 9 months ago • written 9 months ago by Jorge Amigo9.8k

Can I ask a naive question? So If I get this right, what the command you wrote does, is merge all of the forward reads of all files (which are labeled with *_01) and all the reads of the reverse files, and then aligns them to their equivalent?

Also, its finished running but has output one file only as "output.bam" which I can't seem to be able to open or anything....

ADD REPLYlink modified 8 months ago • written 8 months ago by V50

You should not merge files for multiple samples together. That should only be done for file pieces belonging to a single sample.

Take a look at this Up-to-date RNA-Seq Analysis Training/Courses/Papers (Oct 2016) to understand the basics of RNAseq analysis. @Sean Davis has a dedicated single cell RNAseq resources compilation.

ADD REPLYlink written 8 months ago by genomax32k

Hello,

Thanks for the links. I've done RNAseq analysis before so know the files etc and the overall workflow but it was all done on galaxy or on R so now struggle abit with the command line in ubuntu. I know what packages to use after I get the bam files but it's doing the alignment the issue. I just need a command that would loop to continue aligning the different files.

ADD REPLYlink written 8 months ago by V50
1

@Wouter's example above allows you to iterate over the files. Try the loop below to see if the commands look reasonable. Then remove echo to actually run the jobs. You will need to adjust full/relative paths for index and input files.

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
echo hisat2 -x mm10idx -1 ${f}_01.fastq.gz -2 ${f}_02.fastq.gz -S ${f}.bam
done
ADD REPLYlink modified 8 months ago • written 8 months ago by genomax32k

so the echo seems wrong to me as it appears to be using the same file for forward and reverse reads. P,S the files end in _1.fastq.gz not in _01 so i've changed it in the input code. Also it's added fastq.gz twich at the end of the file name this is what comes out:

hisat2 -x mm10idx -1 WTCHG_284763_229_1.fastq.gz_1.fastq.gz -2 WTCHG_284763_229_1.fastq.gz_2.fastq.gz -S WTCHG_284763_229_1.fastq.gz.bam
hisat2 -x mm10idx -1 WTCHG_284763_229_2.fastq.gz_1.fastq.gz -2 WTCHG_284763_229_2.fastq.gz_2.fastq.gz -S WTCHG_284763_229_2.fastq.gz.bam
hisat2 -x mm10idx -1 WTCHG_284763_230_1.fastq.gz_1.fastq.gz -2 WTCHG_284763_230_1.fastq.gz_2.fastq.gz -S WTCHG_284763_230_1.fastq.gz.bam
hisat2 -x mm10idx -1 WTCHG_284763_230_2.fastq.gz_1.fastq.gz -2 WTCHG_284763_230_2.fastq.gz_2.fastq.gz -S WTCHG_284763_230_2.fastq.gz.bam
ADD REPLYlink modified 8 months ago • written 8 months ago by V50
1

Earlier you wrote the files are WTCHG_272965_701502_01.fastq.gz ;-) That's why my sed doesn't provide the expected ${f} variable. Genomax2's solution probably solves that.

ADD REPLYlink modified 8 months ago • written 8 months ago by WouterDeCoster20k

works like a charm! Thank you both for your help and patience!!

ADD REPLYlink written 8 months ago by V50

yes, these commands do merge both types' all files and then map them in a single mapping step if aligning them all in a single mapping step is an option, outputting a single bam file. but note that, as @genomax2 pointed out, merging fastq files from different samples isn't recommended.

when I read "a large number of fastq files generated from a paired end single cell RNAseq experiment" I thought that you could be having different fastq pairs for the same sample, but if each pair comes from a different sample you should rather go for the looping solutions pointed out by @genomax2 and @Wouter.

ADD REPLYlink written 8 months ago by Jorge Amigo9.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 759 users visited in the last hour