Question: searching reads with a certain sequence in fastq file
sumithrasank75 (United States) wrote, 3.2 years ago:

I have a file, say master.fastq, which looks like this:

 

@M00990:202:000000000-ADM27:1:1101:21678:1536 2:N:0:291
CCTTTTACCGACCCGCTCTTTCTCTCCTACGCTTATTTCCGTCTACCCTTCTCTTCACTCGCTATTTCTATTCTTAAAACTATCTTAATGTTCTGCCTTTGCTCTTTTCTTTTTTCTATAACCTCTCTACAGCCAACTCACCCATCTCCTTCCCTGCTACGCTATTCCTCTGTTAGTTTTTCTTCATCATACTTTTCTCATCTCACACTACCTTTGCACTTCTTCCTTTCCACGTCCCCTTTCTCCTACC
+
-----,,6;6,@+8++6+,,,<C5@,,,,,,+8,,;,,<,,,,,,,;,,;,,,<C,,,;,,8,,8CC,C,,,5,C,,,,99,+,+4,,,3,,9,,6,@<,,,,,,,9,,,,,,,4),,0**,,5),,5**,)59*0),,*)5,)))*,9,3,0++))5D:)))))5;+;+0)*;+*6++++),******3**50,6,+++**,0*,,31)88*0*1*5*1)0*:*7>C;3,035:0)))8).*2**.*:)
@M00990:202:000000000-ADM27:1:1101:22685:1539 2:N:0:291
CTTATCACCGACTCTCTCCTTCTCTTCCAAGTTTATTTCCGACTCCCCTTATCTTCACTTGCTATTTCTATTCTTAAAACTATCTCGACCTTTCACCTTTCCCTCTTTCCTTCTTTTCTCTCCTTCTACACTCCCACCCACTCTTACTTCTTTCTTGTCACCGTTTCCATATTATACTTTCTTCTCTTACATAATTTTCTTCCTGCAAACTATTTAAGCAATCTCTTTCTTTCACCCCTTTTATCTCGCC
+
-----,,<;67@+B,,6,,,,;C5@,,,,,66<,,6,,<,,,66,,;,66,,;;C,,,;,4<,,<CC6C,;,,,;,,6,:,6?A9=,,+++2,5,,,9<?D,,,,,:C?,@,@,,5,?*3*,9**,,0**,*)93))4+))19*0**,,52,56+**5*03*)3)))42+2***5*+=3+,*4*2****,**2*,3,+++*0,50,,**5*****0****)5)***0**,,***3*)0)))3***0*)))

I want only the reads that contain the sequence "AAGTTGATAACGGACTAGCCTTATTTT". I tried grep, but the output loses the FASTQ format. Can you suggest how I can retain the FASTQ format in the output? Thanks.

iraun (Norway) wrote, 3.2 years ago:

Try grep with the following arguments:

grep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/--/d' > out.fq

 

grep's -A 2 option prints the two lines after each match, and -B 1 prints the one line before it, so each hit comes out as a complete four-line record. The sed command removes the '--' separator lines that grep adds between groups of context lines.

 


This does not work: the sed command also removes some of the FASTQ quality-score lines.

Reply written 3.2 years ago by sumithrasank75

Most Linux grep programs (GNU grep) accept the --no-group-separator flag, which does what it says on the tin. I don't think it works on OS X, though.

Reply written 3.2 years ago by george.ry
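
A minimal sketch of that suggestion, assuming GNU grep. The demo file, its read names and its sequences are made up for illustration, and the fallback branch is an added assumption for grep builds that reject the flag:

```shell
# Hypothetical demo file: read1 and read3 contain the target, read2 does not.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
pattern=AAGTTGATAACGGACTAGCCTTATTTT
printf '%s\n' \
  '@read1' "GG${pattern}CC" '+' '-----,,6;6,@+8++6+,,,<C5@,,,,,,' \
  '@read2' 'CCTTTTACCGACCCGCTCTTTCTCTCCTACG' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@read3' "${pattern}ACGT" '+' '--,,,6;6,@+8++6+,,,<C5@,,,,,,+8' > demo.fq

# GNU grep: suppress the "--" separators at the source.
# If that command fails (for example because the flag is unsupported),
# fall back to grep plus a sed filter anchored to the exact separator line.
if ! grep --no-group-separator -A 2 -B 1 "$pattern" demo.fq > out.fq 2>/dev/null; then
  grep -A 2 -B 1 "$pattern" demo.fq | sed '/^--$/d' > out.fq
fi

wc -l < out.fq    # 8: two complete four-line records
```

Either branch leaves the quality lines untouched, including the ones that start with dashes.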
Try with:

sed '/^--$/d'
Reply written 3.2 years ago by iraun
grep followed by sed '/^--$/d' worked well, thanks
Reply written 3.2 years ago by sumithrasank75
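
To see why the anchored pattern matters, here is a small sketch on a made-up three-read file (read names and sequences are hypothetical). Quality strings can legitimately contain runs of '-', so an unanchored /--/ deletes them along with grep's separator:

```shell
# Hypothetical demo file: read1 and read3 match and have dashes in their quality lines.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
pattern=AAGTTGATAACGGACTAGCCTTATTTT
printf '%s\n' \
  '@read1' "GG${pattern}CC" '+' '-----,,6;6,@+8++6+,,,<C5@,,,,,,' \
  '@read2' 'CCTTTTACCGACCCGCTCTTTCTCTCCTACG' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@read3' "${pattern}ACGT" '+' '--,,,6;6,@+8++6+,,,<C5@,,,,,,+8' > demo.fq

# Unanchored: deletes every line that contains "--" anywhere.
grep -A 2 -B 1 "$pattern" demo.fq | sed '/--/d' > unanchored.fq

# Anchored: deletes only lines that consist of exactly "--".
grep -A 2 -B 1 "$pattern" demo.fq | sed '/^--$/d' > anchored.fq

wc -l < unanchored.fq   # 6: both quality lines were lost
wc -l < anchored.fq     # 8: two intact four-line records
```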

Using LC_ALL=C fgrep (equivalent to grep -F) instead of grep would be much faster, because the search string is fixed and does not contain a regular expression.

Reply written 3.2 years ago (modified 3.2 years ago) by Prakki Rama
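
Combined with the grep answer above, that suggestion might look like the sketch below. It uses grep -F rather than fgrep, since the two are equivalent and recent GNU grep releases flag fgrep as obsolescent; the two-read demo file is made up for illustration:

```shell
# Hypothetical demo file: only read1 contains the target sequence.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
pattern=AAGTTGATAACGGACTAGCCTTATTTT
printf '%s\n' \
  '@read1' "GG${pattern}CC" '+' '-----,,6;6,@+8++6+,,,<C5@,,,,,,' \
  '@read2' 'CCTTTTACCGACCCGCTCTTTCTCTCCTACG' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' > demo.fq

# LC_ALL=C applies to this grep invocation only: it selects byte-wise matching,
# and -F treats the pattern as a literal string instead of a regular expression.
LC_ALL=C grep -F -A 2 -B 1 "$pattern" demo.fq | sed '/^--$/d' > out.fq

cat out.fq
```

How much time this saves depends on the grep version and the size of the file, so it is worth timing on real data.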
Brian Bushnell (Walnut Creek, USA) wrote, 3.2 years ago:

I'd generally suggest BBDuk for this kind of use:

bbduk.sh in=reads.fq outm=matched.fq out=unmatched.fq k=27 mm=f literal=AAGTTGATAACGGACTAGCCTTATTTT

This has several advantages: you can specify, for example, "hdist=1" to get all the reads that contain the sequence with up to 1 mismatch; it works for formats other than fastq; and it also looks for the reverse complement (unless you add the flag "rcomp=f").
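
For readers without BBDuk, the reverse-complement part of that behaviour can be approximated with plain Unix tools. This is only a sketch and not how BBDuk works internally: it finds exact matches on either strand, with no mismatch tolerance, and the demo reads and variable names are made up:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
pattern=AAGTTGATAACGGACTAGCCTTATTTT

# Reverse complement: reverse the string, then swap A<->T and C<->G.
revcomp=$(printf '%s\n' "$pattern" | rev | tr 'ACGT' 'TGCA')

# Hypothetical demo file: one forward hit, one reverse-complement hit, one miss.
printf '%s\n' \
  '@fwd_read' "GG${pattern}CC" '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@rc_read'  "TT${revcomp}AA" '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@other'    'CCTTTTACCGACCCGCTCTTTCTCTCCTACG' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' > demo.fq

# Put each record on one tab-separated line, keep it if the sequence field
# contains either orientation as a literal substring, then restore four lines.
paste - - - - < demo.fq \
| awk -F '\t' -v OFS='\n' -v fwd="$pattern" -v rc="$revcomp" \
    'index($2, fwd) || index($2, rc) {print $1, $2, $3, $4}' > both_strands.fq

awk 'NR % 4 == 1' both_strands.fq   # prints @fwd_read and @rc_read
```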


Beyond allowing for mismatches and handling the reverse complement, it looks like (from here) this tool also has the advantage that it will grab the mate if paired-end reads are supplied.

Reply written 11 months ago by Wayne220
dariober (Glasgow - UK) wrote, 3.2 years ago:

This is an alternative solution. It preserves the FASTQ format and uses only Unix tools:

zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "AAGTTGATAACGGACTAGCCTTATTTT" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz
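
The two steps of that pipeline can be sketched on a tiny made-up file as follows. The read names are hypothetical, and gzip -dc stands in for zcat because zcat behaves differently on some systems:

```shell
# Hypothetical gzipped demo file: only read1 contains the target sequence.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
pattern=AAGTTGATAACGGACTAGCCTTATTTT
printf '%s\n' \
  '@read1 1:N:0:1' "GG${pattern}CC" '+' '-----,,6;6,@+8++6+,,,<C5@,,,,,,' \
  '@read2 1:N:0:1' 'CCTTTTACCGACCCGCTCTTTCTCTCCTACG' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  | gzip > reads.fq.gz

# Step 1: paste - - - - joins every four input lines into one tab-separated
# line, so each read becomes a single row with four fields.
gzip -dc reads.fq.gz | paste - - - - | awk -F '\t' '{print NF " fields: " $1}'

# Step 2: test only field 2 (the sequence), then print the four fields on
# separate lines again to restore the FASTQ layout.
gzip -dc reads.fq.gz \
| paste - - - - \
| awk -F '\t' -v OFS='\n' -v pat="$pattern" '$2 ~ pat {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz

gzip -dc filtered.fq.gz
```

Because the match is restricted to the sequence field, header and quality lines can never trigger a hit, and no separator lines need to be cleaned up afterwards.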

 

 

Varun Gupta (United States) wrote, 3.2 years ago:

You can do this using only grep, with something like this:

grep -A 2 -B 1 "AAGTTGATAACGGACTAGCCTTATTTT" file.fq | grep -v '^--$' > output.fastq