Question

Remove specific reads from fastq file

0


7.1 years ago

chrys ▴ 60

Hi, I have a fastq file with reads of the following type:

>     @SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.2 SOLEXA-1GA-2_1:1:1:0:752 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.2 SOLEXA-1GA-2_1:1:1:0:752 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.3 SOLEXA-1GA-2_1:1:1:0:1166 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.3 SOLEXA-1GA-2_1:1:1:0:1166 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.4 SOLEXA-1GA-2_1:1:1:0:1804 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.4 SOLEXA-1GA-2_1:1:1:0:1804 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.5 SOLEXA-1GA-2_1:1:1:0:286 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.5 SOLEXA-1GA-2_1:1:1:0:286 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Is there a way to remove all reads in which few or no bases were called? Is it correct to assume that:

@SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36

This zero indicates the quality and I can somehow remove reads with this quality?

Thanks,

fastq reads • 2.7k views

ADD COMMENT • link updated 7.1 years ago by GenoMax 141k • written 7.1 years ago by chrys ▴ 60

1


> @SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>
> This zero indicates the quality and I can somehow remove reads with this quality?

I don't think that is correct. Check the Illumina FASTQ header format. The SRR066153.1 part comes from dumping the reads from SRA with fastq-dump without the -F option.

Did you look farther down in the file? For old Illumina data like this (it seems to be original GA data), the first several reads in the file used to be all N's.
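To make the header point concrete, here is a minimal Python sketch that splits the header from the question into fields. It assumes the older Illumina naming scheme (instrument:lane:tile:x:y) described in the Wikipedia FASTQ article, preceded by the SRA id and followed by the length tag. The function name and field names are made up for illustration. Under that assumption, the 0 would be the cluster's x-coordinate on the tile rather than a quality value.

```python
def parse_old_illumina_header(header: str) -> dict:
    """Split an SRA-style FASTQ header into its parts.

    Assumed layout (not confirmed for this run):
    '@<SRA id> <instrument>:<lane>:<tile>:<x>:<y> length=<n>'
    """
    sra_id, original_name, *_ = header.lstrip("@").split()
    # Split from the right so colons in the instrument name would not break parsing.
    instrument, lane, tile, x, y = original_name.rsplit(":", 4)
    return {
        "sra_id": sra_id,
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }


fields = parse_old_illumina_header("@SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36")
print(fields)
```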

ADD REPLY • link 7.1 years ago by GenoMax 141k

0


Ah, OK, thank you very much for that information. I was not aware that old Illumina data contains all-N reads. Maybe just removing the first x reads from the file will do the trick.

ADD REPLY • link 7.1 years ago by chrys ▴ 60

0


BBMap should auto-detect the quality encoding, but if it does not, be sure to add qin=64, since this is old Solexa-format data (see the Wikipedia link above). You can also convert the Q-scores to the more recent Illumina/Sanger encoding by using qout=33 when you run reformat.sh.
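If you want to sanity-check the encoding yourself before setting qin, a rough Python sketch along these lines may help. It is not how BBMap detects the encoding; it only relies on the fact that offset-64 encodings cannot contain characters below ASCII 59 (`;`), so any lower character implies offset 33. The function name is hypothetical. Note that the `!` characters in the reads shown in the question are ASCII 33, so those particular lines can only be offset 33.

```python
def guess_phred_offset(quality_lines):
    """Return 33 if the quality strings can only be offset-33, else None.

    Solexa+64 starts at ';' (ASCII 59) and Phred+64 at '@' (ASCII 64),
    so any character below 59 rules out both offset-64 encodings.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line.strip())
    if lowest < 59:
        return 33
    # Ambiguous: could be offset 64, or offset 33 with uniformly high scores.
    return None


print(guess_phred_offset(["!" * 36]))  # quality line from the question
print(guess_phred_offset(["hhhhhhhh"]))
```

Returning `None` for the ambiguous case is a deliberate choice here: a file of high-quality offset-33 reads looks the same as offset-64 data to this check, so it is safer to inspect more reads than to guess.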

ADD REPLY • link 7.1 years ago by GenoMax 141k

0


Did you run FastQC and check the quality?

ADD REPLY • link 7.1 years ago by venu 7.1k

0


Yes, I did. The quality is rather poor, but I don't mind the overall quality. For my purpose, it is important that I don't have too many ambiguous reads, and there seem to be quite a few with a lot of uncalled bases, which I would like to remove.
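To put a number on "quite a few", one option is to tally reads by how many N's they contain. This is only a sketch: it assumes plain four-line FASTQ records with no blank lines, and the function name and sample records are invented for illustration.

```python
from collections import Counter
from itertools import islice


def n_count_histogram(lines):
    """Map 'number of N bases in a read' to 'number of reads with that count'.

    `lines` is any iterable of FASTQ lines, four lines per record.
    """
    sequences = islice(lines, 1, None, 4)  # the 2nd line of each record
    return Counter(seq.strip().upper().count("N") for seq in sequences)


# Made-up sample records; in practice pass an open file handle instead.
sample = [
    "@r1", "NNNNNNNN", "+", "!!!!!!!!",
    "@r2", "ACGTNACG", "+", "IIII!III",
    "@r3", "ACGTACGT", "+", "IIIIIIII",
    "@r4", "NNNNNNNN", "+", "!!!!!!!!",
]
hist = n_count_histogram(sample)
for n_count in sorted(hist):
    print(n_count, hist[n_count])
```

The resulting counts can help in picking a sensible cutoff for how many uncalled bases to tolerate per read.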

ADD REPLY • link 7.1 years ago by chrys ▴ 60

score 2 · Answer 1 · 2017-03-28

2


7.1 years ago

GenoMax 141k

You can use reformat.sh from the BBMap suite with the maxns= option (if 0 or greater, reads with more Ns than this, after trimming, will be discarded).
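If BBMap is not available, the same idea can be sketched in a few lines of Python. This is not BBMap's implementation and does no trimming; it assumes plain four-line FASTQ records, and the function name, the `max_ns` parameter, and the sample records are hypothetical.

```python
def filter_by_max_ns(lines, max_ns):
    """Yield (header, seq, plus, qual) records with at most `max_ns` N bases.

    `lines` is any iterable of FASTQ lines, four lines per record;
    an incomplete trailing record is silently dropped.
    """
    it = iter(lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, plus, qual in zip(it, it, it, it):
        if seq.upper().count("N") <= max_ns:
            yield header, seq, plus, qual


# Made-up sample records; in practice pass an open file handle instead.
sample = [
    "@r1\n", "NNNNNNNN\n", "+\n", "!!!!!!!!\n",
    "@r2\n", "ACGTNACG\n", "+\n", "IIII!III\n",
    "@r3\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
]
kept = list(filter_by_max_ns(sample, max_ns=1))
print("".join(line for record in kept for line in record), end="")
```

With `max_ns=1` the all-N read is dropped and the other two are kept; `max_ns=0` would keep only reads without any uncalled base.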

ADD COMMENT • link 7.1 years ago by GenoMax 141k