Question

khmer-2.0 normalize-by-median.py fails to recognise paired reads

0

Entering edit mode

7.4 years ago

chacazoulou • 0

Dear all, I'm transitioning from khmer-0.7.1 to khmer-2.0 and my pipepline (PE Illumina raw reads->Trimomatic->interleave-reads.py->extract-paired-reads.py->normalize-by-mean.py) fails to recognise the reads as paired reads at the digital normalization step, as suggested by the following error message from normalize-by-median.py in paired mode :

making countgraph
** ERROR: Unpaired reads when require_paired is set!
** Failed on interleaved_trimmed_WTCHG_322035_270_1.fastq:
** Exiting!

I went to the script but can't find where the problem apart from maybe this is_paired variable(?) from line 100 onwards.

i found some information (http://lists.idyll.org/pipermail/khmer/2015-December/000766.html) regarding is_pair from the interleave-reads.py script and it seems that it requires the reads to be sorted /1 then /2, which they are in my files.

But I'm not sure if this is anyhow related with is_paired from normalize-by-median.py, and I can't find any more information as to why normalize-by-median.py might not recognize my input file as paired reads. Does anyone have any idea what I might be doing wrong (Im sure there's plenty), and what I could try to solve this issue?

Please let me know if I should be providing more/less information (or different information) ; this is my first on such forums. Many thanks Clement

software error alignment next-gen sequencing • 2.2k views

ADD COMMENT • link 7.4 years ago by chacazoulou • 0

0

Entering edit mode

Perhaps khmer is expecting the PE reads to be in independent (R1/R2) files (are you using an interleaved reads file for khmer step)?

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

Hi, yes I use interleaved reads, but that's what's recommended for khmer (http://khmer-protocols.readthedocs.io/en/v0.8.2/mrnaseq/1-quality.html , @ adaptors trimming step). Also, there's only one input file possible, so I don't believe that independent could even be specified.

ADD REPLY • link 7.4 years ago by chacazoulou • 0

0

Entering edit mode

Is it possible that you have lost pairing of the reads in your trimmed data? Do you get the error right away or after sometime? You could use reformat.sh from BBMap to verify pairing of the reads like reformat.sh in=reads.fastq verifypairing

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

thanks for the reply,

I get the error from the beginning of normalize-by-mean.py ; it doesn't seem to process any read. (the error meassage cited in my original post is the first piece of log that is prompted)

I doubt that the trimmed data contains orphaned reads or has headers in a format that isn't recognized by khmer, because the khmer-2.0 in-house scripts used upstream in the pipeline, check for pairing of the trimmed reads, and they work just fine :

Interleaving:
        trimmed_WTCHG_322035_225_1.fastq
        trimmed_WTCHG_322035_225_2.fastq
... 0 pairs
... 100000 pairs
... 200000 pairs
... 300000 pairs
etc... until end of files

Whereas unpaired files give the following error message (well, paired but in obsolete format apparently): Interleaving:

        norm_225_1.fq
        norm_225_2.fq
... 0 pairs
ERROR: This doesn't look like paired data! K00150:134:HF5LHBBXX:8:1101:2027:1244/1 K00150:134:HF5LHBBXX:8:1101:2027:1244/2

In addition to that, I then run the extract-paired-reads.py script from khmer-2.0 which is specifiacally designed at seperating orphaned reads from paired in an interleaved reads file, and it tells me that 100% of my reads are paired /0% orphaned:

reading file "khmer-interleaved_270.fq"
outputting interleaved pairs to "khmer-interleaved_270.fq.pe"
outputting orphans to "khmer-interleaved_270.fq.se"
... 100000
... 200000
... 300000
etc...
DONE; read 2062100 sequences, 1031050 pairs and 0 singletons

So, it would seem that my input files are indeed paired but not recognised as such by khmer-2.0 normalize-by-median.py for some reason. Maybe, a relevant piece of information would be the format of my input files, so here it is :

dops0510@evo:/var/tmp/Clement/CCUV_mutants$ head khmer-interleaved_270.fq.pe
@K00150:134:HF5LHBBXX:8:1101:6197:1261 1:N:0:NTCAATCA
TGCAGAAGCCATTCCTGGCAGAGTTTGTGCGTGACAGAGTGGCCGAACGAGTATCCTAGATCAGACGCTGGAAG
+
AAFFJJJFJJJJJJJJJAJJJJJJJJFJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJ
@K00150:134:HF5LHBBXX:8:1101:6197:1261 2:N:0:NTCAATCA
ACTGAGAAGAATTGCGTGCTGATCGATTCCTCCGCCGTGAATTGTTTTTGCCTCCTCCGCAGCAGGTCCATCG
+
A<FFJJJFFJJFFJJJJJJJJJJJJ7<AJJFJJJJJJA-A<-AA-AJJFFJJ7F7<JJJJ7<J7F)7AAAFFJ

Thanks again

ADD REPLY • link updated 7.4 years ago by GenoMax 141k • written 7.4 years ago by chacazoulou • 0

score 0 · Answer 1 · 2016-11-17

Alright, it's solved. Thanks again for your support genomax2 (and the reformating of my messages, couldn't get that right).

Basically, it was not an issue with the input files (which is probably why I was I puzzled, to say the least), but with the way that I was calling the normalize-by-median.py script. I was calling it with system from within a perl script , and for reasons beyond my (poorly educated) understanding, calling it directly from the command line with the exact same parameters, worked fine! (the paths was correct to the normalize-by-median.py script were also the same)

Cheers Clement