Question

searching reads with a certain sequence in fastq file

5

Entering edit mode

9.0 years ago

sumithrasank75 ▴ 140

I have a file say master.fastq which looks like :

@M00990:202:000000000-ADM27:1:1101:21678:1536 2:N:0:291 CCTTTTACCGACCCGCTCTTTCTCTCCTACGCTTATTTCCGTCTACCCTTCTCTTCACTCGCTATTTCTATTCTTAAAACTATCTTAATGTTCTGCCTTTGCTCTTTTCTTTTTTCTATAACCTCTCTACAGCCAACTCACCCATCTCCTTCCCTGCTACGCTATTCCTCTGTTAGTTTTTCTTCATCATACTTTTCTCATCTCACACTACCTTTGCACTTCTTCCTTTCCACGTCCCCTTTCTCCTACC + -----,,6;6,@+8++6+,,,<C5@,,,,,,+8,,;,,<,,,,,,,;,,;,,,<C,,,;,,8,,8CC,C,,,5,C,,,,99,+,+4,,,3,,9,,6,@<,,,,,,,9,,,,,,,4),,0**,,5),,5**,)59*0),,*)5,)))*,9,3,0++))5D:)))))5;+;+0)*;+*6++++),******3**50,6,+++**,0*,,31)88*0*1*5*1)0*:*7>C;3,035:0)))8).*2**.*:) @M00990:202:000000000-ADM27:1:1101:22685:1539 2:N:0:291 CTTATCACCGACTCTCTCCTTCTCTTCCAAGTTTATTTCCGACTCCCCTTATCTTCACTTGCTATTTCTATTCTTAAAACTATCTCGACCTTTCACCTTTCCCTCTTTCCTTCTTTTCTCTCCTTCTACACTCCCACCCACTCTTACTTCTTTCTTGTCACCGTTTCCATATTATACTTTCTTCTCTTACATAATTTTCTTCCTGCAAACTATTTAAGCAATCTCTTTCTTTCACCCCTTTTATCTCGCC + -----,,<;67@+B,,6,,,,;C5@,,,,,66<,,6,,<,,,66,,;,66,,;;C,,,;,4<,,<CC6C,;,,,;,,6,:,6?A9=,,+++2,5,,,9<?D,,,,,:C?,@,@,,5,?*3*,9**,,0**,*)93))4+))19*0**,,52,56+**5*03*)3)))42+2***5*+=3+,*4*2****,**2*,3,+++*0,50,,**5*****0****)5)***0**,,***3*)0)))3***0*)))

I want only reads that have the sequence "AAGTTGATAACGGACTAGCCTTATTTT" in them. I tried grep but lose the fastq format. Can you suggest how I can retain the fastq format in the output, thanks

grep fastq • 25k views

ADD COMMENT • link updated 9 months ago by elina2410 • 0 • written 9.0 years ago by sumithrasank75 ▴ 140

0

Entering edit mode

Hi! I got from fastqc report on overrepresented sequences, I copied one of these sequences (GACTACTAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAA) and tried to find it in fastq file using:

grep 'GACTACTAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAA' N1_S50_L001_R2_001.fastq.gz

there is no result. Why?

ADD REPLY • link updated 9 months ago by GenoMax 141k • written 9 months ago by elina2410 • 0

0

Entering edit mode

grep can not directly work on a gzip-compressed file.

gzip -d N1_S50_L001_R2_001.fastq.gz | grep GACTACTAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAA

ADD REPLY • link 9 months ago by shenwei356 8.4k

0

Entering edit mode

You could use zgrep instead of plain grep, if it is available. It will work with compressed files.

ADD REPLY • link 9 months ago by GenoMax 141k

0

Entering edit mode

Yes, it worked. So simple.. Thank you!

ADD REPLY • link 9 months ago by elina2410 • 0

Ram · Answer 1 · 2015-04-17

6

Entering edit mode

9.0 years ago

iraun 6.2k

Try grep with the following arguments:

grep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/--/d' > out.fq

grep's -A 2 option will give you two lines after and -B 1 will give you one line before the match of the grep. Also add a sed command in order to remove the '--' lines that grep adds to output.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.0 years ago by iraun 6.2k

0

Entering edit mode

This does not work as the sed removes some lines of fastq qual scores

ADD REPLY • link 9.0 years ago by sumithrasank75 ▴ 140

0

Entering edit mode

Most linux grep programs take the --no-group-separator flag, which does what it says on the tin. Don't think it works on OS X, though.

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by george.ry ★ 1.2k

0

Entering edit mode

Try with:

sed '/^--$/d'

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by iraun 6.2k

0

Entering edit mode

grep followed by sed '/^--$/d' worked well, thanks

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by sumithrasank75 ▴ 140

0

Entering edit mode

using the command LC_ALL=C fgrep instead of grep would be much faster. Because the string is fixed and does not contain a regular expression.

ADD REPLY • link 9.0 years ago by Prakki Rama ★ 2.7k

Ram · Answer 2 · 2015-04-20

6

Entering edit mode

9.0 years ago

Brian Bushnell 20k

I'd generally suggest BBDuk for this kind of use:

bbduk.sh in=reads.fq out=matched.fq outu=unmatched.fq k=27 mm=f literal=AAGTTGATAACGGACTAGCCTTATTTT

This has the advantage that you can specify, for example, hdist=1 to get all the reads that contain the sequence with up to 1 mismatch, it works for formats other than fastq, and it also looks for the reverse-complement (unless you add the flag rcomp=f).

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.0 years ago by Brian Bushnell 20k

1

Entering edit mode

Beyond allowing for mismatch and handling reverse_complement, it also looks like from here this tool has the advantage that it will also grab the mate if paired-end reads supplied.

ADD REPLY • link 6.8 years ago by Wayne ★ 1.9k

Ram · Answer 3 · 2015-04-20

3

Entering edit mode

9.0 years ago

dariober 14k

This is an alternative solution. Preserves fastq format, only unix tools:

zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "AAGTTGATAACGGACTAGCCTTATTTT" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.0 years ago by dariober 14k

Ram · Answer 4 · 2015-04-20

1

Entering edit mode

9.0 years ago

Varun Gupta ★ 1.3k

Using only grep with something like this:

grep -A 2 -B 1 "AAGTTGATAACGGACTAGCCTTATTTT" file.fq |grep -v "^\-\-$"  > output.fastq

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.0 years ago by Varun Gupta ★ 1.3k