Question

Mapping Paired-End RNA-Seq Data With RUM

1

12.2 years ago

Steffi ▴ 580

Hi,

I want to map paired-end RNA-Seq data with RUM. I first clipped adapters with cutadapt, so I now have two files, one for read 1 and one for read 2. Both files contain the same number of reads, so all mate pairs are intact. However, because of the clipping, some reads are shorter than the original length, and the two files therefore differ in size. Unfortunately, RUM requires the two input files to be exactly the same size.

Any ideas for a workaround? Do I have to parse my files and pad the too-short reads and their corresponding quality values?

I have talked to the author of RUM. He will have a look at this restriction. In the meantime, he recommended padding the shorter reads with Ns.

best, steffi

rna adaptor • 3.8k views

12.1 years ago by Steffi ▴ 580

0

Can you point to where it says the files must be the same size?

12.2 years ago by Jeremy Leipzig 22k

0

I started the mapping with RUM, which writes a log file while it runs. The log says: "The forward and reverse files are different size. They should be the exact same size".

12.2 years ago by Steffi ▴ 580

0

See this:

Synchronization Of Pair-End Reads

updated 4.6 years ago by zx8754 11k • written 12.2 years ago by brentp 24k

0

I do not have a problem with the pairing of my reads; I have already worked that out. I do not want to discard every read in which an adapter was found; I just want to use the shortened reads. So I guess I will have to write a script that pads them back to the original length.
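A padding script of the kind described here could look roughly like the sketch below. It is a minimal illustration and not something taken from RUM or cutadapt: the function names `pad_record` and `pad_fastq`, the four-lines-per-record FASTQ layout, and the default pad quality character `"B"` (a low-quality value under Phred+64 encoding; `"#"` or `"!"` would play that role under Phred+33) are all assumptions. Whether RUM's size check is satisfied once every read has the same length is also an assumption that would need testing.

```python
import io


def pad_record(seq, qual, target_len, pad_base="N", pad_qual="B"):
    """Pad one read and its quality string up to target_len.

    pad_qual="B" is an assumed low-quality character (Phred+64);
    adjust it to match the quality encoding of your data.
    """
    missing = target_len - len(seq)
    if missing > 0:
        seq += pad_base * missing
        qual += pad_qual * missing
    return seq, qual


def pad_fastq(fin, fout, target_len, pad_qual="B"):
    """Copy FASTQ records from fin to fout, padding short reads.

    Assumes plain four-line records (header, sequence, '+', quality).
    """
    while True:
        record = [fin.readline() for _ in range(4)]
        if not record[0].strip():
            break  # end of input (or a trailing blank line)
        header, seq, plus, qual = (line.rstrip("\r\n") for line in record)
        seq, qual = pad_record(seq, qual, target_len, pad_qual=pad_qual)
        fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")


if __name__ == "__main__":
    # Hypothetical record: a 4-base read padded to an original length of 6.
    demo = io.StringIO("@read1/1\nACGT\n+\nbbbb\n")
    out = io.StringIO()
    pad_fastq(demo, out, target_len=6)
    print(out.getvalue(), end="")
```

With real data, `fin` and `fout` would be opened file handles, and the same `target_len` (the untrimmed read length) would be used for both the read 1 and read 2 files.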

12.2 years ago by Steffi ▴ 580

0

I haven't looked at the source code, but I bet that it would be fairly easy to remove the code implementing that check.

12.2 years ago by Chris Miller 22k

score 0 · Answer 1 · 2012-02-07

Either that is a poorly worded error message on RUM's part, or RUM is being absurdly strict. Why would it need exactly the same number of nucleotides in both sequences of every pair?

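cutadapt is pair-safe, i.e. it will not create widows and orphans unless you want it to. In the session below, a read that consists entirely of adapter is trimmed to an empty record rather than dropped: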

[leipzig@localhost testpairsafety]$ cat pair2.fq
@HWI-ST431_52:1:1:1259:1981/1
ATCTCGTATGCCGTCTTCTGCTTG
+
b`ZUYZKYUSV[[_[cad\\W\[X
[leipzig@localhost testpairsafety]$ cutadapt -a ATCTCGTATGCCGTCTTCTGCTTG pair2.fq > pair2.trimmed.fq
cutadapt version 0.9.5
Command line parameters: -a ATCTCGTATGCCGTCTTCTGCTTG pair2.fq
Maximum error rate: 10.00%
   Processed reads: 1
     Trimmed reads: 1 (100.0%)
   Too short reads: 0 (  0.0% of processed reads)
    Too long reads: 0 (  0.0% of processed reads)
        Total time:      0.00 s
     Time per read:      0.00 ms

=== Adapter 1 ===

Adapter 'ATCTCGTATGCCGTCTTCTGCTTG', length 24, was trimmed 1 times.

Histogram of adapter lengths
length  count
24  1

[leipzig@localhost testpairsafety]$ cat pair2.trimmed.fq 
@HWI-ST431_52:1:1:1259:1981/1

+

[leipzig@localhost testpairsafety]$

The question is whether RUM will accept empty sequences. If not, you might have to substitute a single low-quality "N" for those.
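If that substitution turns out to be necessary, it could be done along the lines of the sketch below. This is only an illustration under stated assumptions: `fix_empty_reads` is a hypothetical helper, the input is assumed to be plain four-line FASTQ records, and the quality character `"B"` is assumed to mean low quality (Phred+64, as the quality string in the session above suggests; use `"#"` or `"!"` for Phred+33 data).

```python
def fix_empty_reads(lines, base="N", qual_char="B"):
    """Yield FASTQ lines, replacing empty reads with a single base.

    lines: iterable of FASTQ lines in four-line records.
    qual_char="B" is an assumed low-quality character (Phred+64).
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip stray blank lines between or after records
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if not seq:
            # The read was trimmed to nothing: substitute one low-quality N.
            seq, qual = base, qual_char
        yield from (header, seq, plus, qual)


if __name__ == "__main__":
    # The fully trimmed record from the session above.
    trimmed = ["@HWI-ST431_52:1:1:1259:1981/1\n", "\n", "+\n", "\n"]
    for line in fix_empty_reads(trimmed):
        print(line)
```

Applied to the trimmed record shown above, this yields the same header followed by `N`, `+`, and `B`. Reads that still contain sequence pass through unchanged.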