Question: searching reads with a certain sequence in fastq file
5
gravatar for sumithrasank75
3.7 years ago by
United States
sumithrasank75100 wrote:

I have a file say master.fastq which looks like :

 

@M00990:202:000000000-ADM27:1:1101:21678:1536 2:N:0:291 CCTTTTACCGACCCGCTCTTTCTCTCCTACGCTTATTTCCGTCTACCCTTCTCTTCACTCGCTATTTCTATTCTTAAAACTATCTTAATGTTCTGCCTTTGCTCTTTTCTTTTTTCTATAACCTCTCTACAGCCAACTCACCCATCTCCTTCCCTGCTACGCTATTCCTCTGTTAGTTTTTCTTCATCATACTTTTCTCATCTCACACTACCTTTGCACTTCTTCCTTTCCACGTCCCCTTTCTCCTACC + -----,,6;6,@+8++6+,,,<C5@,,,,,,+8,,;,,<,,,,,,,;,,;,,,<C,,,;,,8,,8CC,C,,,5,C,,,,99,+,+4,,,3,,9,,6,@<,,,,,,,9,,,,,,,4),,0**,,5),,5**,)59*0),,*)5,)))*,9,3,0++))5D:)))))5;+;+0)*;+*6++++),******3**50,6,+++**,0*,,31)88*0*1*5*1)0*:*7>C;3,035:0)))8).*2**.*:) @M00990:202:000000000-ADM27:1:1101:22685:1539 2:N:0:291 CTTATCACCGACTCTCTCCTTCTCTTCCAAGTTTATTTCCGACTCCCCTTATCTTCACTTGCTATTTCTATTCTTAAAACTATCTCGACCTTTCACCTTTCCCTCTTTCCTTCTTTTCTCTCCTTCTACACTCCCACCCACTCTTACTTCTTTCTTGTCACCGTTTCCATATTATACTTTCTTCTCTTACATAATTTTCTTCCTGCAAACTATTTAAGCAATCTCTTTCTTTCACCCCTTTTATCTCGCC + -----,,<;67@+B,,6,,,,;C5@,,,,,66<,,6,,<,,,66,,;,66,,;;C,,,;,4<,,<CC6C,;,,,;,,6,:,6?A9=,,+++2,5,,,9<?D,,,,,:C?,@,@,,5,?*3*,9**,,0**,*)93))4+))19*0**,,52,56+**5*03*)3)))42+2***5*+=3+,*4*2****,**2*,3,+++*0,50,,**5*****0****)5)***0**,,***3*)0)))3***0*)))

I want only reads that have the sequence "AAGTTGATAACGGACTAGCCTTATTTT" in them. I tried grep but lose the fastq format. Can you suggest how I can retain the fastq format in the output, thanks

grep fastq • 8.5k views
ADD COMMENTlink modified 3.6 years ago by Varun Gupta1.1k • written 3.7 years ago by sumithrasank75100
6
gravatar for iraun
3.7 years ago by
iraun3.5k
Norway
iraun3.5k wrote:

Try grep with the following arguments:

grep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/--/d' > out.fq

 

grep's -A 2 option will give you two line after and  -B 1 will give you one line before the match of the grep. Also add a sed command in order to remove the '--' lines that grep adds to output.

 

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by iraun3.5k

This does not work as the sed removes some lines of fastq qual scores

ADD REPLYlink written 3.7 years ago by sumithrasank75100

Most linux grep programs take the --no-group-separator flag, which does what it says on the tin.  Don't think it works on OS X, though.

ADD REPLYlink written 3.7 years ago by george.ry1.1k
Try with:

sed '/^--$/d'
ADD REPLYlink written 3.7 years ago by iraun3.5k
grep followed by sed '/^--$/d' worked well, thanks
ADD REPLYlink written 3.7 years ago by sumithrasank75100

using the command LC_ALL=C fgrep instead of grep would be much faster. Because the string is fixed and does not contain a regular expression.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Prakki Rama2.2k
5
gravatar for Brian Bushnell
3.6 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I'd generally suggest BBDuk for this kind of use:

bbduk.sh in=reads.fq out=matched.fq outu=unmatched.fq k=27 mm=f literal=AAGTTGATAACGGACTAGCCTTATTTT

This has the advantage that you can specify, for example, "hdist=1" to get all the reads that contain the sequence with up to 1 mismatch, it works for formats other than fastq, and it also looks for the reverse-complement (unless you add the flag "rcomp=f").

ADD COMMENTlink written 3.6 years ago by Brian Bushnell16k
1

Beyond allowing for mismatch and handling reverse_complement, it also looks like from here this tool has the advantage that it will also grab the mate if paired-end reads supplied.

ADD REPLYlink written 17 months ago by Wayne290
3
gravatar for dariober
3.6 years ago by
dariober9.7k
Glasgow - UK
dariober9.7k wrote:

This is an alternative solution. Preserves fastq format, only unix tools:

zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "AAGTTGATAACGGACTAGCCTTATTTT" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz

 

 

ADD COMMENTlink written 3.6 years ago by dariober9.7k
0
gravatar for Varun Gupta
3.6 years ago by
Varun Gupta1.1k
United States
Varun Gupta1.1k wrote:

Using only grep with something like this:

grep -A 2 -B 1 "AAGTTGATAACGGACTAGCCTTATTTT" file.fq |grep -v "^\-\-$"  > output.fastq
ADD COMMENTlink written 3.6 years ago by Varun Gupta1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1150 users visited in the last hour