Question: Rsubread: Aligning multiple paired-end files
0
gravatar for imesi
8 months ago by
imesi20
imesi20 wrote:

Hello brain trust, Thanks for reading, please help. I have 18 paired-end files (9 forward and 9 reverse, file extension is .fq, each has a unique name) and I am trying to align them using Rsubread. I am working through a hpc cluster, not in R directly. I slurm/sbatch to submit the R script.

If I align a set of paired-end files individually, everything works well. However, when I try to align the entire 18 files, I get an error message.

Here is my R script for the individual PE files: (this works fine)

align(index="mm10",readfile1="Control-1_R1_001_paired.fq",readfile2="Control-1_R2_001_paired.fq",input_format="FASTQ",output_file="rsubread.bam",output_format="BAM",nthreads=16)

Here is my R script for all of the files: (ps: I edited the path to the directory to "path to file" for de-identification reasons)

align(index="mm10",readfile1="/path to file/forward_reads_list",readfile2="/path to file/reverse_reads_list",input_format="FASTQ",output_file="rsubread.bam",output_format="BAM",nthreads=16)

The output_file "rsubread.bam" does not get listed in the working directory. For the slurm/sbatch, there is no error message as the .err file is empty. However, my output, .out is below: (Again, I edited the path to the directory to "path to file" for de-identification reasons).

 Function      : Read alignment (RNA-Seq)                                   ||
|| Input file 1  : /path to file/forward_ ... ||
|| Input file 2  : /path to file/reverse_ ... ||
|| Output file   : rsubread.bam (BAM)                                         ||
|| Index name    : mm10                                                       ||
||                                                                            ||
||                    ------------------------------------                    ||
||                                                                            ||
||                       Threads : 16                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : no                                         ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================ Running (22-Aug-2018 06:13:12, pid=36855) =================\\
||                                                                            ||
|| WARNING format issue in file '/path to file/ ... ||
||         The required format is : FASTQ or FASTA                            ||
||         The file format is unknown.                                        ||
|| A wrong format may result in wrong results or crash the program.           ||
|| Please refer to the manual for file format options.                        ||
|| If the file is in the correct format, please ignore this message.          ||
||                                                                            ||
|| WARNING format issue in file '/path to file/ ... ||
||         The required format is : FASTQ or FASTA                            ||
||         The file format is unknown.                                        ||
|| A wrong format may result in wrong results or crash the program.           ||
|| Please refer to the manual for file format options.                        ||
|| If the file is in the correct format, please ignore this message.

If you are still reading at this point, thank you :). Any ideas on how to solve this?

rna-seq rsubread alignment • 846 views
ADD COMMENTlink modified 8 months ago • written 8 months ago by imesi20

Apologies for the delay. I was barred from responding because apparently their is a 5-post in 6 hours limit for newcomers. Had no idea. Here is my script. I slurm submit the script in unix, so it doesn't run on the command line directly.

#!/usr/bin/env Rscript
library(Rsubread)

reads1 <- list.files( path = "/path/to/file/forward_reads_list", pattern = "*_R1_001.fq$" )
reads2 <- list.files( path = "/path/to/file/reverse_reads_list", pattern = "*_R2_001.fq$" )
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)

Thanks.

ADD REPLYlink modified 8 months ago by genomax65k • written 8 months ago by imesi20

Hello, when I run this script I get the ERROR: unable to open file 'NA'. File name might be incorrect, or you do not have the permission to read the file.

ibrary(Rsubread)
reads1 <- list.files( path="/home/weiyan/Desktop/RNA_SEQ/files/fastq/trimmed/rif1", pattern = "*_1.fq.gz$" )
reads2 <- list.files( path="/home/weiyan/Desktop/RNA_SEQ/files/fastq/trimmed/rif1", pattern = "*_2.fq.gz$" )
all.equal(length(reads1),length(reads2))
align(index="mm10",readfile1=reads1, readfile2=reads2, input_format="gzFASTQ", 
output_file=paste(as.character(reads1)), output_format="BAM", nthreads=20)

Any suggestions? Thanks!

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by weiyanjia200820
2
gravatar for Santosh Anand
8 months ago by
Santosh Anand4.7k
Santosh Anand4.7k wrote:

According to the documentation https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/align

readfile1: a character vector including names of files that include sequence reads to be aligned. For paired-end reads, this gives the list of files including first reads in each library. File format is FASTQ/FASTA by default. See input_format option for more supported formats.

readfile2: a character vector giving names of files that include second reads in paired-end read data. Files included in readfile2 should be in the same order as their mate files included in readfile1 .NULL by default.

So you need to pass the names of files explicitly, not a list (or directory name).

ADD COMMENTlink written 8 months ago by Santosh Anand4.7k

Thanks Santosh. Each of my files has a unique name, I guess I can write them in as *_R1_001.fq and *_R2_001.fq. I have tried that, but no luck.

ADD REPLYlink written 8 months ago by imesi20
1

No, you can not write *_R1_001.fq and *_R2_001.fq with the align() function. R is not a Unix shell, and glob expansions do not work generally.

ADD REPLYlink written 8 months ago by h.mon24k

@hmon is very right.. R will not expand the globs (*, ? etc.) automatically. Instead you need to use list.files

list.files(pattern="*_R1_001.fq")
ADD REPLYlink written 8 months ago by Santosh Anand4.7k
1
gravatar for h.mon
8 months ago by
h.mon24k
Brazil
h.mon24k wrote:

I find using Rsubread cumbersome and it adds R overhead to an already heavy task. In addition, I had experienced Rsubread crashes, where subread-align is rock-solid.

You have to pass a vector with all file names, and either leave out output_file=, or pass a vector of output file names:

reads1 <- list.files( path = "/path/to/files", pattern = "*_1.fastq$" )
reads2 <- list.files( path = "/path/to/files", pattern = "*_2.fastq$" )
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)
ADD COMMENTlink modified 8 months ago • written 8 months ago by h.mon24k

Thank you h.mon I just ran your suggested script and this is the error:

Error: unexpected string constant in: " reads1 <- list.files( path = "/path/to/files", pattern = ""

ADD REPLYlink written 8 months ago by imesi20

You have to adjust the script to match file names you have. _R1_001.fq$ instead of _1.fastq$ as in example. Did you do that?

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax65k

Eeek! I did not. Thanks!

ADD REPLYlink written 8 months ago by imesi20

You anonymize too much your commands and error messages, and becomes difficult to help. You probably need pattern = "*_R1_001_paired.fq" and pattern = "*_R2_001_paired.fq", or something similar.

ADD REPLYlink modified 8 months ago • written 8 months ago by h.mon24k

True, h.mon, I am paranoid like that, probably because of healthcare privacy laws. I realize how that can be a hindrance in bioinformatics. I will do better. Thanks.

So I reran the script, and the error message is:

Error in align(index = "mm10", readfile1 = reads1, readfile2 = reads2, :

The number of input file names is different from the number of output file names.

Execution halted

This is the order of my files in the list forward_reads_list

Control-1_R1_001_paired.fq
Control-2_R1_001_paired.fq
Control-3_R1_001_paired.fq
IFNg-1_R1_001_paired.fq
IFNg-2_R1_001_paired.fq
IFNg-3_R1_001_paired.fq
M10_R1_001_paired.fq
M6_R1_001_paired.fq
M7_R1_001_paired.fq

reverse_reads_list

Control-1_R2_001_paired.fq
Control-2_R2_001_paired.fq
Control-3_R2_001_paired.fq
IFNg-1_R2_001_paired.fq
IFNg-2_R2_001_paired.fq
IFNg-3_R2_001_paired.fq
M10_R2_001_paired.fq
M6_R2_001_paired.fq
M7_R2_001_paired.fq
ADD REPLYlink modified 8 months ago by h.mon24k • written 8 months ago by imesi20
1

Can you copy / paste the command you run? Did you use the output_file= parameter?

ADD REPLYlink written 8 months ago by h.mon24k
1

check if lengths of your lists are same and you are using latest Rsubread package. Try following (change the path appropriately):

reads1 <- list.files(path=".",pattern = "_R1_.*.fq$" )
reads2 <- list.files(path=".",pattern = "_R2_.*.fq$" )
all.equal(length(reads1),length(reads2))

library(Rsubread)
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=6).

From the error "The file format is unknown." , it seems to me there is an error in one of your files. Validate your fastq files and/or make sure that no other files exist in the raw data directories other than fq/fastq files.

ADD REPLYlink modified 8 months ago • written 8 months ago by cpad011211k

Thanks a gazillion cpad0112, it is running!!!! I am doing my happy dance! This is the script:

reads1 <- list.files(path=".",pattern = "_R1_.*.fq$" )
reads2 <- list.files(path=".",pattern = "_R2_.*.fq$" )
all.equal(length(reads1),length(reads2))
align(index="mm10",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)

Thank you hmon, genomax and Santosh Anand, you guys are awesome!

ADD REPLYlink modified 8 months ago by genomax65k • written 8 months ago by imesi20

good to know that things are working. If script is running in current directory, then you can as well remove, path="." to shorten further. By default, R searches in current directory. imesi

ADD REPLYlink modified 8 months ago • written 8 months ago by cpad011211k

It is running in the current directory, but I won't touch it: if ain't broke..... I tried t modify the script to specify an output_file per the manual but it failed to run.

ADD REPLYlink written 8 months ago by imesi20

Okay. I agree with you. If it is not broken, don't fix it. Did the script run successfully till the end?

ADD REPLYlink written 8 months ago by cpad011211k

It is still running and will for a little while. I have 9 files and each is about 170 million reads. The first 2 files were completed in about 280 minutes each. I guess I am looking at a total run time of over a day.

ADD REPLYlink written 8 months ago by imesi20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1681 users visited in the last hour