Question

Rsubread: Aligning multiple paired-end files

0

Entering edit mode

5.7 years ago

imesi ▴ 20

Hello brain trust, Thanks for reading, please help. I have 18 paired-end files (9 forward and 9 reverse, file extension is .fq, each has a unique name) and I am trying to align them using Rsubread. I am working through a hpc cluster, not in R directly. I slurm/sbatch to submit the R script.

If I align a set of paired-end files individually, everything works well. However, when I try to align the entire 18 files, I get an error message.

Here is my R script for the individual PE files: (this works fine)

align(index="mm10",readfile1="Control-1_R1_001_paired.fq",readfile2="Control-1_R2_001_paired.fq",input_format="FASTQ",output_file="rsubread.bam",output_format="BAM",nthreads=16)

Here is my R script for all of the files: (ps: I edited the path to the directory to "path to file" for de-identification reasons)

align(index="mm10",readfile1="/path to file/forward_reads_list",readfile2="/path to file/reverse_reads_list",input_format="FASTQ",output_file="rsubread.bam",output_format="BAM",nthreads=16)

The output_file "rsubread.bam" does not get listed in the working directory. For the slurm/sbatch, there is no error message as the .err file is empty. However, my output, .out is below: (Again, I edited the path to the directory to "path to file" for de-identification reasons).

 Function      : Read alignment (RNA-Seq)                                   ||
|| Input file 1  : /path to file/forward_ ... ||
|| Input file 2  : /path to file/reverse_ ... ||
|| Output file   : rsubread.bam (BAM)                                         ||
|| Index name    : mm10                                                       ||
||                                                                            ||
||                    ------------------------------------                    ||
||                                                                            ||
||                       Threads : 16                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : no                                         ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================ Running (22-Aug-2018 06:13:12, pid=36855) =================\\
||                                                                            ||
|| WARNING format issue in file '/path to file/ ... ||
||         The required format is : FASTQ or FASTA                            ||
||         The file format is unknown.                                        ||
|| A wrong format may result in wrong results or crash the program.           ||
|| Please refer to the manual for file format options.                        ||
|| If the file is in the correct format, please ignore this message.          ||
||                                                                            ||
|| WARNING format issue in file '/path to file/ ... ||
||         The required format is : FASTQ or FASTA                            ||
||         The file format is unknown.                                        ||
|| A wrong format may result in wrong results or crash the program.           ||
|| Please refer to the manual for file format options.                        ||
|| If the file is in the correct format, please ignore this message.

If you are still reading at this point, thank you :). Any ideas on how to solve this?

rna-seq alignment Rsubread • 7.8k views

ADD COMMENT • link 5.7 years ago by imesi ▴ 20

0

Entering edit mode

Apologies for the delay. I was barred from responding because apparently their is a 5-post in 6 hours limit for newcomers. Had no idea. Here is my script. I slurm submit the script in unix, so it doesn't run on the command line directly.

#!/usr/bin/env Rscript
library(Rsubread)

reads1 <- list.files( path = "/path/to/file/forward_reads_list", pattern = "*_R1_001.fq$" )
reads2 <- list.files( path = "/path/to/file/reverse_reads_list", pattern = "*_R2_001.fq$" )
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)

Thanks.

ADD REPLY • link updated 5.7 years ago by GenoMax 141k • written 5.7 years ago by imesi ▴ 20

0

Entering edit mode

Hello, when I run this script I get the ERROR: unable to open file 'NA'. File name might be incorrect, or you do not have the permission to read the file.

ibrary(Rsubread)
reads1 <- list.files( path="/home/weiyan/Desktop/RNA_SEQ/files/fastq/trimmed/rif1", pattern = "*_1.fq.gz$" )
reads2 <- list.files( path="/home/weiyan/Desktop/RNA_SEQ/files/fastq/trimmed/rif1", pattern = "*_2.fq.gz$" )
all.equal(length(reads1),length(reads2))
align(index="mm10",readfile1=reads1, readfile2=reads2, input_format="gzFASTQ", 
output_file=paste(as.character(reads1)), output_format="BAM", nthreads=20)

Any suggestions? Thanks!

ADD REPLY • link 5.1 years ago by weiyanjia2008 ▴ 30

score 2 · Answer 1 · 2018-08-22

2

Entering edit mode

5.7 years ago

Santosh Anand 5.7k

According to the documentation https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/align

readfile1: a character vector including names of files that include sequence reads to be aligned. For paired-end reads, this gives the list of files including first reads in each library. File format is FASTQ/FASTA by default. See input_format option for more supported formats.

readfile2: a character vector giving names of files that include second reads in paired-end read data. Files included in readfile2 should be in the same order as their mate files included in readfile1 .NULL by default.

So you need to pass the names of files explicitly, not a list (or directory name).

ADD COMMENT • link 5.7 years ago by Santosh Anand 5.7k

0

Entering edit mode

Thanks Santosh. Each of my files has a unique name, I guess I can write them in as *_R1_001.fq and *_R2_001.fq. I have tried that, but no luck.

ADD REPLY • link 5.7 years ago by imesi ▴ 20

1

Entering edit mode

No, you can not write *_R1_001.fq and *_R2_001.fq with the align() function. R is not a Unix shell, and glob expansions do not work generally.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

@hmon is very right.. R will not expand the globs (*, ? etc.) automatically. Instead you need to use list.files

list.files(pattern="*_R1_001.fq")

ADD REPLY • link 5.7 years ago by Santosh Anand 5.7k

h.mon · Answer 2 · 2018-08-22

1

Entering edit mode

5.7 years ago

h.mon 35k

I find using Rsubread cumbersome and it adds R overhead to an already heavy task. In addition, I had experienced Rsubread crashes, where subread-align is rock-solid.

You have to pass a vector with all file names, and either leave out output_file=, or pass a vector of output file names:

reads1 <- list.files( path = "/path/to/files", pattern = "*_1.fastq$" )
reads2 <- list.files( path = "/path/to/files", pattern = "*_2.fastq$" )
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)

ADD COMMENT • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Thank you h.mon I just ran your suggested script and this is the error:

Error: unexpected string constant in: " reads1 <- list.files( path = "/path/to/files", pattern = ""

ADD REPLY • link 5.7 years ago by imesi ▴ 20

0

Entering edit mode

You have to adjust the script to match file names you have. _R1_001.fq$ instead of _1.fastq$ as in example. Did you do that?

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Eeek! I did not. Thanks!

ADD REPLY • link 5.7 years ago by imesi ▴ 20

0

Entering edit mode

You anonymize too much your commands and error messages, and becomes difficult to help. You probably need pattern = "*_R1_001_paired.fq" and pattern = "*_R2_001_paired.fq", or something similar.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

True, h.mon, I am paranoid like that, probably because of healthcare privacy laws. I realize how that can be a hindrance in bioinformatics. I will do better. Thanks.

So I reran the script, and the error message is:

Error in align(index = "mm10", readfile1 = reads1, readfile2 = reads2, :

The number of input file names is different from the number of output file names.

Execution halted

This is the order of my files in the list forward_reads_list

Control-1_R1_001_paired.fq
Control-2_R1_001_paired.fq
Control-3_R1_001_paired.fq
IFNg-1_R1_001_paired.fq
IFNg-2_R1_001_paired.fq
IFNg-3_R1_001_paired.fq
M10_R1_001_paired.fq
M6_R1_001_paired.fq
M7_R1_001_paired.fq

reverse_reads_list

Control-1_R2_001_paired.fq
Control-2_R2_001_paired.fq
Control-3_R2_001_paired.fq
IFNg-1_R2_001_paired.fq
IFNg-2_R2_001_paired.fq
IFNg-3_R2_001_paired.fq
M10_R2_001_paired.fq
M6_R2_001_paired.fq
M7_R2_001_paired.fq

ADD REPLY • link updated 5.7 years ago by h.mon 35k • written 5.7 years ago by imesi ▴ 20

1

Entering edit mode

Can you copy / paste the command you run? Did you use the output_file= parameter?

ADD REPLY • link 5.7 years ago by h.mon 35k

1

Entering edit mode

check if lengths of your lists are same and you are using latest Rsubread package. Try following (change the path appropriately):

reads1 <- list.files(path=".",pattern = "_R1_.*.fq$" )
reads2 <- list.files(path=".",pattern = "_R2_.*.fq$" )
all.equal(length(reads1),length(reads2))

library(Rsubread)
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=6).

From the error "The file format is unknown." , it seems to me there is an error in one of your files. Validate your fastq files and/or make sure that no other files exist in the raw data directories other than fq/fastq files.

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

Thanks a gazillion cpad0112, it is running!!!! I am doing my happy dance! This is the script:

reads1 <- list.files(path=".",pattern = "_R1_.*.fq$" )
reads2 <- list.files(path=".",pattern = "_R2_.*.fq$" )
all.equal(length(reads1),length(reads2))
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)

Thank you hmon, genomax and Santosh Anand, you guys are awesome!

ADD REPLY • link updated 5.7 years ago by GenoMax 141k • written 5.7 years ago by imesi ▴ 20

0

Entering edit mode

good to know that things are working. If script is running in current directory, then you can as well remove, path="." to shorten further. By default, R searches in current directory. imesi

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

It is running in the current directory, but I won't touch it: if ain't broke..... I tried t modify the script to specify an output_file per the manual but it failed to run.

ADD REPLY • link 5.7 years ago by imesi ▴ 20

0

Entering edit mode

Okay. I agree with you. If it is not broken, don't fix it. Did the script run successfully till the end?

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

It is still running and will for a little while. I have 9 files and each is about 170 million reads. The first 2 files were completed in about 280 minutes each. I guess I am looking at a total run time of over a day.

ADD REPLY • link 5.7 years ago by imesi ▴ 20