Run mutltiple Fastq files on Hisat2
2
5
Entering edit mode
8.0 years ago
V ▴ 410

Hello,

I've got a large number of fastq files generated from a paired end single cell RNAseq experiment.

I'm looking to align them back to mm10 using Hisat2, I can do this if I run every pair individually but is there a way to get hisat to do them all with one code one after the other?

I'm relatively new to the ubuntu coding environment so feel free to dumb it down. Thanks in advance

Hisat2 hisat RNAseq • 14k views
ADD COMMENT
0
Entering edit mode

The different files are different samples, right?

You can use for loops in bash, pretty useful. If you can share the naming pattern of your files (R1, R2) and what else we can get this done easily.

ADD REPLY
0
Entering edit mode

hey, thanks for your help. Yes, two files per sample one for the forward one for the reverse. Their naming is quite complicated,

sample 1 F: WTCHG_272965_701502_01 sample 1 R: WTCHG_272965_701502_02

sample 2 F: WTCHG_272965_701503_01 sample 2 R: WTCHG_272965_701503_02

etc

ADD REPLY
0
Entering edit mode

Alright, that's not too complicated :p

So WTCHG_272965_701502_01.fastq is a filename? Or WTCHG_272965_701502_01.fastq.gz?

ADD REPLY
0
Entering edit mode

yes sorry, WTCHG_272965_701502_01.fastq.gz is the file name for the forward read

ADD REPLY
3
Entering edit mode

If I didn't make a mistake the loop would look like this:

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
hisat2 with inputfiles ${f}_01.fastq.gz ${f}_02.fastq.gz and output ${f}.bam
done

Perhaps you should test it first using a few echo statements:

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
echo ${f}_01.fastq.gz
echo ${f}.bam
done
ADD REPLY
1
Entering edit mode

okay, I've tried a number of things based on the above but can't seem to get anything to work properly.

I've created a new folder which contains the mm10 indexes (mm10idx -8 files) and 4 fastq.gz files which are the forward and reverse reads of 2 samples. Following the above naming pattern.

Changed my directory to that folder and run the script you've written but nothing seems to be happening...need some more help please? Maybe I've omitted something obvious. I know that if I was running it on one file I would type :

hisat2 -x mm10idx -1 forewardreads.fastq.gz -2 reversereads.fastq.gz -S example.bam

so can't see how what you've written relates to this or how it can be executed on loop. :/

ADD REPLY
0
Entering edit mode

I assume you know that hisat2 with inputfiles ${f}_01.fastq.gz ${f}_02.fastq.gz and output ${f}.bam was still something you had to fill in? That would then be: (based on your example for a single file)

hisat2 -x mm10idx -1 ${f}_01.fastq.gz -2 ${f}_02.fastq.gz -S ${f}.bam

Did you check if the example with echo did show the expected input and output files?

ADD REPLY
0
Entering edit mode

Yes did so but the echo comes back wrong as shown in my reply below, with fastq appearing at the end of the file names twice and the same file being used for forward and reverse reads. I've left your first line unchanged.

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
ADD REPLY
0
Entering edit mode

thank you for all your help!!! I will try it out and let you know how it goes!

ADD REPLY
5
Entering edit mode
8.0 years ago
GenoMax 147k

Try this instead and let us know.

for f in `ls -1 *_1.fastq.gz | sed 's/_1.fastq.gz//' `
do
echo hisat2 -x mm10idx -1 ${f}_1.fastq.gz -2 ${f}_2.fastq.gz -S ${f}.bam
done
ADD COMMENT
3
Entering edit mode
8.0 years ago

if aligning them all in a single mapping step is an option, then you could try

zcat *_01.fastq.gz | gzip > pairs1.fq.gz
zcat *_02.fastq.gz | gzip > pairs2.fq.gz
hisat2 -x /path/to/hg19/indices -1 pairs1.fq.gz -2 pairs2.fq.gz | samtools view -Sb - > output.bam
ADD COMMENT
0
Entering edit mode

Can I ask a naive question? So If I get this right, what the command you wrote does, is merge all of the forward reads of all files (which are labeled with *_01) and all the reads of the reverse files, and then aligns them to their equivalent?

Also, its finished running but has output one file only as "output.bam" which I can't seem to be able to open or anything....

ADD REPLY
0
Entering edit mode

You should not merge files for multiple samples together. That should only be done for file pieces belonging to a single sample.

Take a look at this Up-to-date RNA-Seq Analysis Training/Courses/Papers (Oct 2016) to understand the basics of RNAseq analysis. @Sean Davis has a dedicated single cell RNAseq resources compilation.

ADD REPLY
0
Entering edit mode

Hello,

Thanks for the links. I've done RNAseq analysis before so know the files etc and the overall workflow but it was all done on galaxy or on R so now struggle abit with the command line in ubuntu. I know what packages to use after I get the bam files but it's doing the alignment the issue. I just need a command that would loop to continue aligning the different files.

ADD REPLY
1
Entering edit mode

@Wouter's example above allows you to iterate over the files. Try the loop below to see if the commands look reasonable. Then remove echo to actually run the jobs. You will need to adjust full/relative paths for index and input files.

for f in `ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u`
do
echo hisat2 -x mm10idx -1 ${f}_01.fastq.gz -2 ${f}_02.fastq.gz -S ${f}.bam
done
ADD REPLY
0
Entering edit mode

so the echo seems wrong to me as it appears to be using the same file for forward and reverse reads. P,S the files end in _1.fastq.gz not in _01 so i've changed it in the input code. Also it's added fastq.gz twich at the end of the file name this is what comes out:

hisat2 -x mm10idx -1 WTCHG_284763_229_1.fastq.gz_1.fastq.gz -2 WTCHG_284763_229_1.fastq.gz_2.fastq.gz -S WTCHG_284763_229_1.fastq.gz.bam
hisat2 -x mm10idx -1 WTCHG_284763_229_2.fastq.gz_1.fastq.gz -2 WTCHG_284763_229_2.fastq.gz_2.fastq.gz -S WTCHG_284763_229_2.fastq.gz.bam
hisat2 -x mm10idx -1 WTCHG_284763_230_1.fastq.gz_1.fastq.gz -2 WTCHG_284763_230_1.fastq.gz_2.fastq.gz -S WTCHG_284763_230_1.fastq.gz.bam
hisat2 -x mm10idx -1 WTCHG_284763_230_2.fastq.gz_1.fastq.gz -2 WTCHG_284763_230_2.fastq.gz_2.fastq.gz -S WTCHG_284763_230_2.fastq.gz.bam
ADD REPLY
1
Entering edit mode

Earlier you wrote the files are WTCHG_272965_701502_01.fastq.gz ;-) That's why my sed doesn't provide the expected ${f} variable. Genomax2's solution probably solves that.

ADD REPLY
0
Entering edit mode

works like a charm! Thank you both for your help and patience!!

ADD REPLY
0
Entering edit mode

yes, these commands do merge both types' all files and then map them in a single mapping step if aligning them all in a single mapping step is an option, outputting a single bam file. but note that, as @genomax2 pointed out, merging fastq files from different samples isn't recommended.

when I read "a large number of fastq files generated from a paired end single cell RNAseq experiment" I thought that you could be having different fastq pairs for the same sample, but if each pair comes from a different sample you should rather go for the looping solutions pointed out by @genomax2 and @Wouter.

ADD REPLY
0
Entering edit mode

Thanks! Is the output.bam sorted and indexed? If not, is there a command to run so that the output.bam is sorted and indexed?

ADD REPLY
1
Entering edit mode

No the BAM is not sorted but you could do something like (with latest samtools):

hisat2 -x /path/to/hg19/indices -1 pairs1.fq.gz -2 pairs2.fq.gz | samtools sort --write-index -o sorted.bam -

This should sort the result BAM file and write an index at the same time.

ADD REPLY

Login before adding your answer.

Traffic: 2358 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6