searching reads with a certain sequence in fastq file
4
5
Entering edit mode
6.1 years ago

I have a file say master.fastq which looks like :

 

@M00990:202:000000000-ADM27:1:1101:21678:1536 2:N:0:291 CCTTTTACCGACCCGCTCTTTCTCTCCTACGCTTATTTCCGTCTACCCTTCTCTTCACTCGCTATTTCTATTCTTAAAACTATCTTAATGTTCTGCCTTTGCTCTTTTCTTTTTTCTATAACCTCTCTACAGCCAACTCACCCATCTCCTTCCCTGCTACGCTATTCCTCTGTTAGTTTTTCTTCATCATACTTTTCTCATCTCACACTACCTTTGCACTTCTTCCTTTCCACGTCCCCTTTCTCCTACC + -----,,6;6,@+8++6+,,,<C5@,,,,,,+8,,;,,<,,,,,,,;,,;,,,<C,,,;,,8,,8CC,C,,,5,C,,,,99,+,+4,,,3,,9,,6,@<,,,,,,,9,,,,,,,4),,0**,,5),,5**,)59*0),,*)5,)))*,9,3,0++))5D:)))))5;+;+0)*;+*6++++),******3**50,6,+++**,0*,,31)88*0*1*5*1)0*:*7>C;3,035:0)))8).*2**.*:) @M00990:202:000000000-ADM27:1:1101:22685:1539 2:N:0:291 CTTATCACCGACTCTCTCCTTCTCTTCCAAGTTTATTTCCGACTCCCCTTATCTTCACTTGCTATTTCTATTCTTAAAACTATCTCGACCTTTCACCTTTCCCTCTTTCCTTCTTTTCTCTCCTTCTACACTCCCACCCACTCTTACTTCTTTCTTGTCACCGTTTCCATATTATACTTTCTTCTCTTACATAATTTTCTTCCTGCAAACTATTTAAGCAATCTCTTTCTTTCACCCCTTTTATCTCGCC + -----,,<;67@+B,,6,,,,;C5@,,,,,66<,,6,,<,,,66,,;,66,,;;C,,,;,4<,,<CC6C,;,,,;,,6,:,6?A9=,,+++2,5,,,9<?D,,,,,:C?,@,@,,5,?*3*,9**,,0**,*)93))4+))19*0**,,52,56+**5*03*)3)))42+2***5*+=3+,*4*2****,**2*,3,+++*0,50,,**5*****0****)5)***0**,,***3*)0)))3***0*)))

I want only reads that have the sequence "AAGTTGATAACGGACTAGCCTTATTTT" in them. I tried grep but lose the fastq format. Can you suggest how I can retain the fastq format in the output, thanks

fastq grep • 15k views
ADD COMMENT
6
Entering edit mode
6.1 years ago
iraun ★ 3.8k

Try grep with the following arguments:

grep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/--/d' > out.fq

 

grep's -A 2 option will give you two line after and  -B 1 will give you one line before the match of the grep. Also add a sed command in order to remove the '--' lines that grep adds to output.

 

ADD COMMENT
0
Entering edit mode

This does not work as the sed removes some lines of fastq qual scores

ADD REPLY
0
Entering edit mode

Most linux grep programs take the --no-group-separator flag, which does what it says on the tin.  Don't think it works on OS X, though.

ADD REPLY
0
Entering edit mode
Try with:

sed '/^--$/d'
ADD REPLY
0
Entering edit mode
grep followed by sed '/^--$/d' worked well, thanks
ADD REPLY
0
Entering edit mode

using the command LC_ALL=C fgrep instead of grep would be much faster. Because the string is fixed and does not contain a regular expression.

ADD REPLY
5
Entering edit mode
6.1 years ago

I'd generally suggest BBDuk for this kind of use:

bbduk.sh in=reads.fq out=matched.fq outu=unmatched.fq k=27 mm=f literal=AAGTTGATAACGGACTAGCCTTATTTT

This has the advantage that you can specify, for example, "hdist=1" to get all the reads that contain the sequence with up to 1 mismatch, it works for formats other than fastq, and it also looks for the reverse-complement (unless you add the flag "rcomp=f").

ADD COMMENT
1
Entering edit mode

Beyond allowing for mismatch and handling reverse_complement, it also looks like from here this tool has the advantage that it will also grab the mate if paired-end reads supplied.

ADD REPLY
3
Entering edit mode
6.1 years ago

This is an alternative solution. Preserves fastq format, only unix tools:

zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "AAGTTGATAACGGACTAGCCTTATTTT" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz

 

 

ADD COMMENT
0
Entering edit mode
6.1 years ago
Varun Gupta ★ 1.2k

Using only grep with something like this:

grep -A 2 -B 1 "AAGTTGATAACGGACTAGCCTTATTTT" file.fq |grep -v "^\-\-$"  > output.fastq
ADD COMMENT

Login before adding your answer.

Traffic: 1574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6