Run mutltiple Fastq files on Hisat2
2
5
Entering edit mode
4.8 years ago
V ▴ 320

Hello,

I've got a large number of fastq files generated from a paired end single cell RNAseq experiment.

I'm looking to align them back to mm10 using Hisat2, I can do this if I run every pair individually but is there a way to get hisat to do them all with one code one after the other?

I'm relatively new to the ubuntu coding environment so feel free to dumb it down. Thanks in advance

Hisat2 hisat RNAseq • 8.5k views
0
Entering edit mode

The different files are different samples, right?

You can use for loops in bash, pretty useful. If you can share the naming pattern of your files (R1, R2) and what else we can get this done easily.

0
Entering edit mode

hey, thanks for your help. Yes, two files per sample one for the forward one for the reverse. Their naming is quite complicated,

sample 1 F: WTCHG_272965_701502_01 sample 1 R: WTCHG_272965_701502_02

sample 2 F: WTCHG_272965_701503_01 sample 2 R: WTCHG_272965_701503_02

etc

0
Entering edit mode

Alright, that's not too complicated :p

So WTCHG_272965_701502_01.fastq is a filename? Or WTCHG_272965_701502_01.fastq.gz?

0
Entering edit mode

yes sorry, WTCHG_272965_701502_01.fastq.gz is the file name for the forward read

3
Entering edit mode

If I didn't make a mistake the loop would look like this:

for f in ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u
do
hisat2 with inputfiles ${f}_01.fastq.gz${f}_02.fastq.gz and output ${f}.bam done  Perhaps you should test it first using a few echo statements: for f in ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u do echo${f}_01.fastq.gz
echo ${f}.bam done  ADD REPLY 0 Entering edit mode thank you for all your help!!! I will try it out and let you know how it goes! ADD REPLY 0 Entering edit mode okay, I've tried a number of things based on the above but can't seem to get anything to work properly. I've created a new folder which contains the mm10 indexes (mm10idx -8 files) and 4 fastq.gz files which are the forward and reverse reads of 2 samples. Following the above naming pattern. Changed my directory to that folder and run the script you've written but nothing seems to be happening...need some more help please? Maybe I've omitted something obvious. I know that if I was running it on one file I would type : hisat2 -x mm10idx -1 forewardreads.fastq.gz -2 reversereads.fastq.gz -S example.bam  so can't see how what you've written relates to this or how it can be executed on loop. :/ ADD REPLY 0 Entering edit mode I assume you know that hisat2 with inputfiles${f}_01.fastq.gz ${f}_02.fastq.gz and output${f}.bam was still something you had to fill in? That would then be: (based on your example for a single file)

hisat2 -x mm10idx -1 ${f}_01.fastq.gz -2${f}_02.fastq.gz -S ${f}.bam  Did you check if the example with echo did show the expected input and output files? ADD REPLY 0 Entering edit mode Yes did so but the echo comes back wrong as shown in my reply below, with fastq appearing at the end of the file names twice and the same file being used for forward and reverse reads. I've left your first line unchanged. for f in ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u do  ADD REPLY 4 Entering edit mode 4.8 years ago GenoMax 106k Try this instead and let us know. for f in ls -1 *_1.fastq.gz | sed 's/_1.fastq.gz//'  do echo hisat2 -x mm10idx -1${f}_1.fastq.gz -2 ${f}_2.fastq.gz -S${f}.bam
done

2
Entering edit mode
4.8 years ago

if aligning them all in a single mapping step is an option, then you could try

zcat *_01.fastq.gz | gzip > pairs1.fq.gz
zcat *_02.fastq.gz | gzip > pairs2.fq.gz
hisat2 -x /path/to/hg19/indices -1 pairs1.fq.gz -2 pairs2.fq.gz | samtools view -Sb - > output.bam

0
Entering edit mode

Can I ask a naive question? So If I get this right, what the command you wrote does, is merge all of the forward reads of all files (which are labeled with *_01) and all the reads of the reverse files, and then aligns them to their equivalent?

Also, its finished running but has output one file only as "output.bam" which I can't seem to be able to open or anything....

0
Entering edit mode

You should not merge files for multiple samples together. That should only be done for file pieces belonging to a single sample.

Take a look at this Up-to-date RNA-Seq Analysis Training/Courses/Papers (Oct 2016) to understand the basics of RNAseq analysis. @Sean Davis has a dedicated single cell RNAseq resources compilation.

0
Entering edit mode

Hello,

Thanks for the links. I've done RNAseq analysis before so know the files etc and the overall workflow but it was all done on galaxy or on R so now struggle abit with the command line in ubuntu. I know what packages to use after I get the bam files but it's doing the alignment the issue. I just need a command that would loop to continue aligning the different files.

1
Entering edit mode

@Wouter's example above allows you to iterate over the files. Try the loop below to see if the commands look reasonable. Then remove echo to actually run the jobs. You will need to adjust full/relative paths for index and input files.

for f in ls *.fastq.gz | sed 's/_0[12].fastq.gz//g' | sort -u
do
echo hisat2 -x mm10idx -1 ${f}_01.fastq.gz -2${f}_02.fastq.gz -S ${f}.bam done  ADD REPLY 0 Entering edit mode so the echo seems wrong to me as it appears to be using the same file for forward and reverse reads. P,S the files end in _1.fastq.gz not in _01 so i've changed it in the input code. Also it's added fastq.gz twich at the end of the file name this is what comes out: hisat2 -x mm10idx -1 WTCHG_284763_229_1.fastq.gz_1.fastq.gz -2 WTCHG_284763_229_1.fastq.gz_2.fastq.gz -S WTCHG_284763_229_1.fastq.gz.bam hisat2 -x mm10idx -1 WTCHG_284763_229_2.fastq.gz_1.fastq.gz -2 WTCHG_284763_229_2.fastq.gz_2.fastq.gz -S WTCHG_284763_229_2.fastq.gz.bam hisat2 -x mm10idx -1 WTCHG_284763_230_1.fastq.gz_1.fastq.gz -2 WTCHG_284763_230_1.fastq.gz_2.fastq.gz -S WTCHG_284763_230_1.fastq.gz.bam hisat2 -x mm10idx -1 WTCHG_284763_230_2.fastq.gz_1.fastq.gz -2 WTCHG_284763_230_2.fastq.gz_2.fastq.gz -S WTCHG_284763_230_2.fastq.gz.bam  ADD REPLY 1 Entering edit mode Earlier you wrote the files are WTCHG_272965_701502_01.fastq.gz ;-) That's why my sed doesn't provide the expected${f} variable. Genomax2's solution probably solves that.

0
Entering edit mode

works like a charm! Thank you both for your help and patience!!

0
Entering edit mode

yes, these commands do merge both types' all files and then map them in a single mapping step if aligning them all in a single mapping step is an option, outputting a single bam file. but note that, as @genomax2 pointed out, merging fastq files from different samples isn't recommended.

when I read "a large number of fastq files generated from a paired end single cell RNAseq experiment" I thought that you could be having different fastq pairs for the same sample, but if each pair comes from a different sample you should rather go for the looping solutions pointed out by @genomax2 and @Wouter.