Remove specific reads from fastq file
7.1 years ago
chrys ▴ 60

Hi, I have a fastq file with reads of the following type:

>     @SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.2 SOLEXA-1GA-2_1:1:1:0:752 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.2 SOLEXA-1GA-2_1:1:1:0:752 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.3 SOLEXA-1GA-2_1:1:1:0:1166 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.3 SOLEXA-1GA-2_1:1:1:0:1166 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.4 SOLEXA-1GA-2_1:1:1:0:1804 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.4 SOLEXA-1GA-2_1:1:1:0:1804 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     @SRR066153.5 SOLEXA-1GA-2_1:1:1:0:286 length=36
>     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>     +SRR066153.5 SOLEXA-1GA-2_1:1:1:0:286 length=36
>     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Is there a way to remove all reads in which few or no bases were called? Is it correct to assume that, in this header:

@SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36

This zero indicates the quality and I can somehow remove reads with this quality?

Thanks,

fastq reads • 2.7k views

> @SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36
>
> This zero indicates the quality and I can somehow remove reads with this quality?

I don't think that is correct. Check the Illumina FASTQ header format. The SRR066153.1 part comes from dumping the reads from SRA with fastq-dump without the -F option.

Did you look farther down in the file? In old Illumina data like this (it seems to be original GA data), the first several reads in the file often consisted entirely of N's.
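To illustrate the header point, here is a minimal Python sketch that splits the header from the question into its parts. It assumes the older Illumina read-name layout of instrument:lane:tile:x:y, under which the 0 would be the x-coordinate of the cluster rather than a quality value. The function name `parse_old_illumina_header` is made up for this example.

```python
def parse_old_illumina_header(header):
    """Split an SRA-style FASTQ header that carries an old Illumina read name.

    Assumed layout: '@<SRA id> <instrument>:<lane>:<tile>:<x>:<y> length=<n>'
    """
    fields = header.lstrip("@").split()
    sra_id, read_name = fields[0], fields[1]
    # Split from the right so colons inside the instrument name would survive.
    instrument, lane, tile, x, y = read_name.rsplit(":", 4)
    return {
        "sra_id": sra_id,
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }


info = parse_old_illumina_header("@SRR066153.1 SOLEXA-1GA-2_1:1:1:0:715 length=36")
print(info)
```

Under that assumption, the field the question asks about ends up in `info["x"]`, next to the y-coordinate 715.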


Ah OK, thank you very much for that information. I was not aware that old Illumina data contains all-N reads. Maybe just removing the first x reads from the start of the file will do the trick.


BBMap should auto-detect the quality encoding, but if it does not, be sure to add qin=64, since this is old Solexa-format data (see the Wikipedia link above). You can also convert the Q-scores to the more recent Illumina/Sanger encoding by adding qout=33 when you run reformat.sh.


Did you run FastQC and check the quality?


Yes, I did. The overall quality is rather poor, which I don't mind, but for my purpose it is important that I don't have too many ambiguous reads. There seem to be quite a few reads with many uncalled bases, and I would like to remove those.

7.1 years ago
GenoMax 141k

You can use reformat.sh from the BBMap suite with the maxns= option (if 0 or greater, reads with more Ns than this, after trimming, will be discarded).
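If you want to see what such a filter does, or BBMap is not at hand, the same idea can be sketched in a few lines of Python. This is only an illustration, not how reformat.sh is implemented: it assumes strict 4-line FASTQ records with no line wrapping, and the function name `filter_by_n_count` and its `max_ns` parameter are invented here to mirror the maxns= option.

```python
import io


def filter_by_n_count(lines, max_ns=0):
    """Yield 4-line FASTQ records whose sequence has at most max_ns N bases.

    Assumes each record is exactly four lines: header, sequence, '+', qualities.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line; count uncalled bases in it.
            if record[1].upper().count("N") <= max_ns:
                yield record
            record = []


# Tiny in-memory example; a real run would iterate over an open file instead.
example = io.StringIO(
    "@r1\nNNNN\n+\n!!!!\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACNT\n+\nII!I\n"
)
kept = [rec[0] for rec in filter_by_n_count(example, max_ns=0)]
print(kept)
```

With `max_ns=0` only reads without any N are kept. Raising the threshold lets reads with a few uncalled bases through.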
