Question: How to output fastq files from in silico digests?
0
gravatar for gdholmes007
9 days ago by
gdholmes0070 wrote:

Hi All,

I'm a bit of a scripting novice and would greatly appreciate some advice. I have fastq read files from numerous sequenced ddRAD libraries and on which I'd like to perform in silico digests (using MseI: A^ATT) and retain all fragments greater than a certain length. The output needs to be in fastq format for downstream processing. I'm able to do something like this using the Bio.Restriction package of Biopython, but the output is fasta (i.e., there's no read quality data). Does anyone have an idea on what a suitable approach would be?

Many thanks in advance.

ADD COMMENTlink modified 8 days ago by Pierre Lindenbaum107k • written 9 days ago by gdholmes0070
2
gravatar for Pierre Lindenbaum
8 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum107k wrote:

using awk:

BEGIN   {
    FS="[\t]";
    MIN_LEN=6;
    }
    {
     D=$2;
     Q=$4;
     for(;;)
        {
        n=index(D,"AATT");
        if( n ==0 ) {
            if(length(D)>=MIN_LEN) printf("%s\n%s\n%s\n%s\n",$1,D,$3,Q);
            break;
            }
        if(n>=MIN_LEN)
            {
            printf("%s\n%s\n%s\n%s\n",$1,substr(D,1,n),$3,substr(Q,1,n));
            }
        D=substr(D,n+1);
        Q=substr(Q,n+1);
        }
    }

usage:

gunzip -c in.fq.gz | paste - - - -  | awk -f script.awk
ADD COMMENTlink written 8 days ago by Pierre Lindenbaum107k
0
gravatar for finswimmer
9 days ago by
finswimmer2.1k
Germany
finswimmer2.1k wrote:

Hello,

you could add a dummy value for the phred scores. Look here.

fin swimmer

ADD COMMENTlink modified 9 days ago • written 9 days ago by finswimmer2.1k

I think OP wants to digest fastq reads and then retain fragments longer than a certain length still in fastq format with original Q scores. This may need to be done with a custom script.

ADD REPLYlink modified 9 days ago • written 9 days ago by genomax48k

That was my consideration too before commit my answer. But the OP wrote:

I'm able to do something like this using the Bio.Restriction package of Biopython, but the output is fasta (i.e., there's no read quality data).

So I thought it's ok to show a way from fasta to fastq. If the original quality values are realy needed, than of course my answer isn't valid.

fin swimmer

ADD REPLYlink written 8 days ago by finswimmer2.1k

Fair enough. Depending on OP's response we can decide.

ADD REPLYlink written 8 days ago by genomax48k

Thanks very much for your thoughts on this. I really do need the Q scores as these are needed for downstream filtering in ipyrad. I'll give Pierre's script a go and see if this does the job.

ADD REPLYlink written 8 days ago by gdholmes0070
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 654 users visited in the last hour