Question

How to output fastq files from in silico digests?

0

Entering edit mode

7.2 years ago

gdholmes007 • 0

Hi All,

I'm a bit of a scripting novice and would greatly appreciate some advice. I have fastq read files from numerous sequenced ddRAD libraries and on which I'd like to perform in silico digests (using MseI: A^ATT) and retain all fragments greater than a certain length. The output needs to be in fastq format for downstream processing. I'm able to do something like this using the Bio.Restriction package of Biopython, but the output is fasta (i.e., there's no read quality data). Does anyone have an idea on what a suitable approach would be?

Many thanks in advance.

fastq in silico digest ddRAD • 1.7k views

ADD COMMENT • link updated 7.2 years ago by Pierre Lindenbaum 166k • written 7.2 years ago by gdholmes007 • 0

score 2 · Answer 1 · 2018-05-17

using awk:

BEGIN   {
    FS="[\t]";
    MIN_LEN=6;
    }
    {
     D=$2;
     Q=$4;
     for(;;)
        {
        n=index(D,"AATT");
        if( n ==0 ) {
            if(length(D)>=MIN_LEN) printf("%s\n%s\n%s\n%s\n",$1,D,$3,Q);
            break;
            }
        if(n>=MIN_LEN)
            {
            printf("%s\n%s\n%s\n%s\n",$1,substr(D,1,n),$3,substr(Q,1,n));
            }
        D=substr(D,n+1);
        Q=substr(Q,n+1);
        }
    }

usage:

gunzip -c in.fq.gz | paste - - - -  | awk -f script.awk

score 0 · Answer 2 · 2018-05-17

0

Entering edit mode

7.2 years ago

finswimmer 16k

Hello,

you could add a dummy value for the phred scores. Look here.

fin swimmer

ADD COMMENT • link 7.2 years ago by finswimmer 16k

0

Entering edit mode

I think OP wants to digest fastq reads and then retain fragments longer than a certain length still in fastq format with original Q scores. This may need to be done with a custom script.

ADD REPLY • link 7.2 years ago by GenoMax 152k

0

Entering edit mode

That was my consideration too before commit my answer. But the OP wrote:

I'm able to do something like this using the Bio.Restriction package of Biopython, but the output is fasta (i.e., there's no read quality data).

So I thought it's ok to show a way from fasta to fastq. If the original quality values are realy needed, than of course my answer isn't valid.

fin swimmer

ADD REPLY • link 7.2 years ago by finswimmer 16k

0

Entering edit mode

Fair enough. Depending on OP's response we can decide.

ADD REPLY • link 7.2 years ago by GenoMax 152k

0

Entering edit mode

Thanks very much for your thoughts on this. I really do need the Q scores as these are needed for downstream filtering in ipyrad. I'll give Pierre's script a go and see if this does the job.

ADD REPLY • link 7.2 years ago by gdholmes007 • 0