Question: Remove specific reads from fastq file
3.7 years ago by
chrys40
Germany
chrys40 wrote:

Hi, I have a fastq file with reads of the following type:

>     @SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.2 SOLEXA-1GA-2_1:1:1:0:752 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.2 SOLEXA-1GA-2_1:1:1:0:752 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.3 SOLEXA-1GA-2_1:1:1:0:1166 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.3 SOLEXA-1GA-2_1:1:1:0:1166 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.4 SOLEXA-1GA-2_1:1:1:0:1804 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.4 SOLEXA-1GA-2_1:1:1:0:1804 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.5 SOLEXA-1GA-2_1:1:1:0:286 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.5 SOLEXA-1GA-2_1:1:1:0:286 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Is there a way to remove all reads in which no, or very few, bases were called? Is it correct to assume the following about this header:

@SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36

Does the zero in it indicate the quality, so that I can somehow remove reads with this quality?

Thanks,

fastq reads • 1.6k views
modified 3.7 years ago by GenoMax92k • written 3.7 years ago by chrys40

> @SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>
> This zero indicates the quality and I can somehow remove reads with this quality?

I don't think that is correct. Check the Illumina FASTQ header format. The SRR066153.1 part comes from dumping the reads from SRA with fastq-dump without the -F option.
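To make the header layout concrete, here is a minimal Python sketch that splits the header from the question into its fields. It assumes the older Illumina naming scheme (instrument, lane, tile, x, y, separated by colons) behind the SRA accession. The function name and dictionary keys are made up for illustration. Under that assumption the `0` is the cluster's x-coordinate on the tile, not a quality value.

```python
def parse_old_illumina_header(header_line):
    """Split an SRA-style FASTQ header that wraps an old Illumina read name.

    Assumes the layout '@<sra_id> <instrument>:<lane>:<tile>:<x>:<y> ...'.
    """
    parts = header_line.lstrip("@").split()
    sra_id = parts[0]
    read_name = parts[1]
    # Split from the right so colons inside the instrument name are kept.
    instrument, lane, tile, x, y = read_name.rsplit(":", 4)
    return {
        "sra_id": sra_id,
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }


fields = parse_old_illumina_header(
    "@SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36"
)
print(fields)
```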

Did you look farther down in the file? In old Illumina data like this (it seems to be original GA data), the first several reads in the file often consisted entirely of N's.

modified 3.7 years ago • written 3.7 years ago by GenoMax92k

Ah, OK, thank you very much for that information. I was not aware that old Illumina data contains all-N reads. Maybe just removing the first x reads of the file will do the trick.

written 3.7 years ago by chrys40

BBMap should auto-detect the quality encoding, but if it does not, be sure to add qin=64, since this is old Solexa-format data (see the Wikipedia link above). You can also convert the Q-scores to the more recent Illumina/Sanger format by adding qout=33 when you run reformat.sh.
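As a rough illustration of what the offset means, the Python sketch below guesses the encoding from the lowest quality character and shifts a +64 string to +33. It is not how BBMap does it. The function names are hypothetical, and the conversion assumes Phred+64 scores, not true Solexa log-odds scores, which need a different formula. Note that the `!` characters in the excerpt from the question have ASCII code 33, which is below the range either +64 encoding uses.

```python
def guess_phred_offset(quality_strings):
    """Return 33 if any character can only be Phred+33, else None.

    Characters below ASCII 59 (';') never occur in Solexa+64 or
    Phred+64 data, so seeing one implies an offset of 33.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return 33
    return None  # undetermined: high-quality +33 and +64 ranges overlap


def to_phred33(quality, offset_in=64):
    """Shift a quality string from offset_in to offset 33."""
    return "".join(chr(ord(ch) - offset_in + 33) for ch in quality)


print(guess_phred_offset(["!" * 36]))
print(to_phred33("hhhh"))
```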

written 3.7 years ago by GenoMax92k

Did you run FastQC and check the quality?

written 3.7 years ago by venu6.7k

Yes, I did. The overall quality is rather poor, which I don't mind, but for my purpose it is important that I don't have too many ambiguous reads. There seem to be quite a few reads with a lot of uncalled bases, and those are the ones I would like to remove.

written 3.7 years ago by chrys40
3.7 years ago by
GenoMax92k
United States
GenoMax92k wrote:

You can use reformat.sh from the BBMap suite with the maxns= option (if set to 0 or greater, reads with more Ns than this value, counted after trimming, are discarded).
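For readers who want to see the idea behind such a filter, here is a minimal Python sketch that drops reads with more than a given number of N bases. It is not BBMap's implementation. It assumes plain four-line FASTQ records and does no trimming, and the function name and the `max_ns` parameter are made up for illustration.

```python
import io


def filter_by_n_count(handle, max_ns=0):
    """Yield 4-line FASTQ records whose sequence has at most max_ns Ns."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        sequence = record[1].strip()
        if sequence.upper().count("N") <= max_ns:
            yield record


# Small in-memory example; a real file would be opened with open().
example = io.StringIO(
    "@read1\nNNNNNNNN\n+\n!!!!!!!!\n"
    "@read2\nACGTNACG\n+\nIIIIIIII\n"
    "@read3\nACGTACGT\n+\nIIIIIIII\n"
)
kept = [rec[0].strip() for rec in filter_by_n_count(example, max_ns=1)]
print(kept)
```

With BBMap itself, the corresponding call would look something like `reformat.sh in=reads.fastq out=filtered.fastq maxns=1`, where the file names are placeholders.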

written 3.7 years ago by GenoMax92k