Question: Parse Fastq File - Pad Reads With N'S
1
gravatar for Steffi
8.5 years ago by
Steffi570
Germany
Steffi570 wrote:

I would like to parse a fastq file as follows: Nominal read length is 100, but some reads are shorter than others. In that case I want to pad these reads with N's and their corresponding quality values as well. So in the end all reads should have the same length. Any idea how to do this most quickly? Shell scripting or perl? Or does somebody knows a tool who already offers something along these lines? Thanx!

fastq parsing • 4.0k views
ADD COMMENTlink modified 3.5 years ago by Longshotx 50 • written 8.5 years ago by Steffi570
1

I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.

ADD REPLYlink written 8.5 years ago by lh332k
1

this is just nuts, just contact the RUM people about this issue

ADD REPLYlink written 8.5 years ago by Jeremy Leipzig19k

Is Python OK for you? Do you have a particular quality score in mind for the padded N bases?

ADD REPLYlink written 8.5 years ago by Peter5.8k

Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.

ADD REPLYlink written 8.5 years ago by Steffi570

Ah sorry, I did not answer the quality score question: I think it does not matter, just put anything (valid).

ADD REPLYlink written 8.5 years ago by Steffi570

I have done this, but til they have checked this I just want to go on with the data analysis...

ADD REPLYlink written 8.5 years ago by Steffi570
8
gravatar for Istvan Albert
8.5 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

As long as the sequences in the fastq file do not span multiple lines you can get by with a very elegant and fast python solution that uses iterators. This avoids all conditionals and other complexities, it could look like this:

import itertools, string, sys

# strip newlines from the standard input
stream = itertools.imap(string.strip, sys.stdin)

# this works only if sequences span only one line
for head in stream:

    # advance the stream
    seq  = stream.next()
    tmp  = stream.next()
    qual = stream.next()

    # this is how much to pad
    size = 100 - len(seq)

    # pad each line
    seq  = seq  + "N" * size
    qual = qual + "2" * size

    # print the joined elements
    print "\n".join( (head, seq, tmp, qual) )
ADD COMMENTlink written 8.5 years ago by Istvan Albert ♦♦ 84k
1

Very short and elegant solution! Only 4 lines at a time are loaded in memory? It reads from the stdin, so I can use it like this? python awesome.py < myfile.fastq > mypaddedfile.fastq

ADD REPLYlink written 8.5 years ago by Frédéric Mahé3.0k

yes to all, perhaps I should add it to the description

ADD REPLYlink written 8.5 years ago by Istvan Albert ♦♦ 84k

Thanks! I have used it and it worked perfectly fine and very fast!

ADD REPLYlink written 8.5 years ago by Steffi570

This is wonderful. Thanks! A note to anyone else who might run into the same thing I did: I had to change size to 101.

ADD REPLYlink written 8.4 years ago by User 238310

This is an old thread but still.. any reason why you pad the quality line with "2"s? Wouldn't the lowest quality score "!" make more sense here?

ADD REPLYlink written 4.9 years ago by aebmad20

I forgot the exact reason and it may not be applicable  here - I think the rationale in general when generating qualities is that some tools ignore bases that are very low in quality so we need to avoid making them too low. But then why not make them maximal? There is a reason there too, a quality such as say E can come from different types of encoding, so using that  makes the encoding the file ambiguous so that too may be a problem. Picking 2 seems like a good choice between indicating Sanger encoding and avoiding bases being ignored. 

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 84k
3
gravatar for Rok
8.5 years ago by
Rok190
Trondheim, Norway
Rok190 wrote:

GNU Awk version:

awk 'NR%4==0 {print t $0}; NR%4==2 {while(a++<100-length($0)){s=s "Q"; t=t "N"};print s $0}; NR%2==1 {print}' file.fastq

You can define different characters for replacement in sequence line and quality score line(N and Q above).

GNU Sed version (do not use, really slow):

sed -e :a -e '2~4 s/^.\{1,99\}$/N&/;ta' -e '4~4 s/^.\{1,99\}$/Q&/;ta' file.fastq
ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Rok190

this is neat, we have kind of a code-golf going on

ADD REPLYlink written 8.5 years ago by Istvan Albert ♦♦ 84k

I thought awk version would need less time than Python version, but on 2 million lines they needed the same time.

ADD REPLYlink written 8.5 years ago by Rok190

I would have thought the other way around since awk concatenates one by one - real life tests are often surprising

ADD REPLYlink written 8.5 years ago by Istvan Albert ♦♦ 84k

Is the awk version correct? It pads at the beginning of the line, not at the end. Plus, quality and sequence paddings seem to be inverted. Rok can you edit your post, please. I propose this corrected solution: awk 'NR%4==0 {print $0 t}; NR%4==2 {while(a++<100-length($0)){s=s "N"; t=t "Q"};print $0 s}; NR%2==1 {print}' file.fastq

ADD REPLYlink written 8.5 years ago by Frédéric Mahé3.0k
3
gravatar for Gustavo
8.5 years ago by
Gustavo530
Seattle
Gustavo530 wrote:

No Perl version yet? :)

#!/bin/env perl
$target = 100;
while (<>) {
    if (/^(\@|\+)/) {
        print;
        $_ = <>;
        chomp;
        print $_, ($1 eq '@' ? 'N' : '2') x ($target-length()), "\n";
    }
}
ADD COMMENTlink written 8.5 years ago by Gustavo530
1
gravatar for Damian Kao
8.5 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Here's something in python. Doesn't require biopython. It's pretty ugly, but should get the job done:

import sys

inFile = open(sys.argv[1],'r')

header = seq = qual = ''
seqs = quals = False

for line in inFile:
    if line[0] == "@":
        if header != '':
            if len(seq) < 100:
                pad = 100-len(seq)
                seq += 'N' * pad
                qual += '>' * pad

            print "@" + header + "\n" + seq + "\n" + "+" + header + "\n" + qual

        header = line[1:].strip()
        quals = False
        seqs = True
        qual = seq = ''
    elif line[0] == "+":
        if quals:
            qual += line.strip()
        else:
            quals = True
            seqs = False
    else:
        if quals:
            qual += line.strip()
        else:
            seq += line.strip()

if len(seq) < 100:
    pad = 100-len(seq)
    seq += 'N' * pad
    qual += '>' * pad

print "@" + header + "\n" + seq + "\n" + "+" + header + "\n" + qual

use it by: python theScript.py yourFile.fastq > output.fastq

ADD COMMENTlink written 8.5 years ago by Damian Kao15k

thanks a lot!! I will try it and let you know how it worked!

ADD REPLYlink written 8.5 years ago by Steffi570
1
gravatar for Frédéric Mahé
8.5 years ago by
France, Montpellier, CIRAD
Frédéric Mahé3.0k wrote:

Here's a pure Bash solution. The code could be factorized and a little bit simplified, but it works.

#! /bin/bash
# name: pad_fastq.sh
# usage: bash pad_fastq.sh myfile.fastq

header=""
seq=""
plus=""
qual=""

while read line ; do
    if [[ ${line:0:1} == "@" ]] ; then
        header="${line}"
    elif [[ ${line:0:1} == "+" ]] ; then
        plus="${line}"
    elif [[ "${plus}" ]] && [[ "${line}" ]] ; then
        length=${#line}
        if [[ "${length}" -lt 100 ]] ; then
            diff=$(( 100 - length ))
            padding=$(for((i=1 ; i<=diff ; i++)) ; do printf "%s" ">" ; done)
            qual="${line}${padding}"
        else
            qual="${line}"            
        fi
        echo -e "${header}\n${seq}\n${plus}\n${qual}"
        header=""
        seq=""
        plus=""
        qual=""
    elif [[ "${header}" ]] && [[ "${line}" ]] ; then
        length=${#line}
        if [[ "${length}" -lt 100 ]] ; then
            diff=$(( 100 - length ))
            padding=$(for((i=1 ; i<=diff ; i++)) ; do printf "%s" "N" ; done)
            seq="${line}${padding}"
        else
            seq="$line"            
        fi
    fi
done < "${1}"

exit 0
ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Frédéric Mahé3.0k
0
gravatar for Longshotx
3.5 years ago by
Longshotx 50
Longshotx 50 wrote:

Can this be done with fasta files?

ADD COMMENTlink written 3.5 years ago by Longshotx 50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 641 users visited in the last hour