Count and location of strings in fastq file reads
7.7 years ago
rubic ▴ 270

Hi,

I would like to count the occurrences of a set of strings in the reads of FASTQ files, broken down by the location of the match within the read sequence. More specifically, I'd like to count matches to a string such as ACAGTA**TATAAGTATGG, where each * can be any nucleotide.

So the result would be, for each read position, how many matches were counted for the query.

Is there any tool, preferably an R package, that can do that efficiently?

RNA-Seq kmer suffix tree

See @Brian's answer below.

I assume you have generated the 150 bp reads referred to in the other thread. You can use seal.sh to try and demultiplex them as shown below.


See this thread: searching reads with a certain sequence in fastq file. Adding the -c option to grep should get you counts.
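
A minimal sketch of that approach, assuming gzipped input named reads.fastq.gz; each . in the extended regex stands for one wildcard position, and this only gives a total count, not per-position counts:

# keep only the sequence lines, then count reads containing the pattern
zcat reads.fastq.gz | paste - - - - | cut -f 2 | grep -Ec "ACAGTA..TATAAGTATGG"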

7.7 years ago

Here's a solution using seqkit.

Prepare a pattern file in FASTA format, replacing * with N:

$ more patterns.fa 
>Pattern_A
ACAGTANNTATA
>Pattern_B
ATACGACGTANNCG

Search a FASTQ file containing 9 million single-end 100 bp reads:

$ seqkit locate --degenerate --pattern-file patterns.fa dataset_C.fq > result.tsv

elapsed time: 1m:22s
peak rss: 15.05 MB

The computational bottleneck is finding all matches with regular expressions, which is not particularly efficient in Go. Run time scales linearly with the number of patterns:

1 pattern   1m:01s
2 patterns  1m:21s
3 patterns  1m:44s

Result:

seqID                                     patternName   pattern          strand   start   end   matched
K00137:236:H7NLVBBXX:6:2103:25702:14344   Pattern_A     ACAGTANNTATA     +        35      46    ACAGTATCTATA
K00137:236:H7NLVBBXX:6:1104:15412:4462    Pattern_A     ACAGTANNTATA     -        32      43    ACAGTACCTATA
K00137:236:H7NLVBBXX:6:1107:31081:13429   Pattern_A     ACAGTANNTATA     -        9       20    ACAGTAAATATA
K00137:236:H7NLVBBXX:6:1117:29894:16295   Pattern_B     ATACGACGTANNCG   -        58      71    ATACGACGTAAACG
K00137:236:H7NLVBBXX:6:1226:5152:25931    Pattern_A     ACAGTANNTATA     -        26      37    ACAGTATCTATA

Counting needs some help from the shell:

$ seqkit fx2tab patterns.fa | cut -f 2 | while read pattern; do echo -e "$pattern\t$(grep -c "$pattern" result.tsv)"; done
ACAGTANNTATA    1241
ATACGACGTANNCG  24
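
Since the original question asks for counts per read position, the start column of result.tsv can also be tallied per pattern; a small sketch, assuming the column layout shown above (pattern name in column 2, match start in column 5, one header line):

$ awk 'NR>1 {print $2"\t"$5}' result.tsv | sort | uniq -c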
7.7 years ago

You can count how many reads match the pattern with BBDuk (and capture them), like this:

bbduk.sh in=reads.fq outm=matched.fq literal=ACAGTANNTATAAGTATGG k=19 copyundefined mm=f

It's extremely fast. That will also match reverse-complements, so if you don't want that, add "rcomp=f".
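
For example, to restrict matching to the forward strand only:

bbduk.sh in=reads.fq outm=matched.fq literal=ACAGTANNTATAAGTATGG k=19 copyundefined mm=f rcomp=f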


Can this be used to demultiplex data? (That is where @rubic's question is coming from: there are inline barcodes involved, and every read needs to be trimmed by a fixed length on the right.) I seem to vaguely recall that being a possibility.


Seal can demultiplex, and has the same "copyundefined" functionality for generating all possible variations of degenerate bases. For example:

seal.sh in=reads.fq ref=barcodes.fa pattern=out_%.fq outu=unmatched.fq mm=f k=19

For the file "barcodes.fa":

>X
ACAGTANNTATAAGTATGG
>Y
ACAGTANNTATAAGTACCC
>Z
ACAGTANNTATAAGTAAAA

...it would produce 4 output files:

out_X.fq
out_Y.fq
out_Z.fq
unmatched.fq

If the match is only desired in the first 19 bp, due to inline barcodes, the flag "restrictleft=19" or "restrictright=19" could be added. The inline barcodes could then be trimmed in a second pass with the "ftr" or "ftl" flag, as sketched below.
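
A sketch of that two-pass approach, assuming the inline barcode occupies the first 19 bp of each read (output file names are illustrative):

# pass 1: demultiplex, only allowing matches within the first 19 bp
seal.sh in=reads.fq ref=barcodes.fa pattern=out_%.fq outu=unmatched.fq mm=f k=19 restrictleft=19

# pass 2: trim the 19 bp inline barcode from the left of each demultiplexed file
bbduk.sh in=out_X.fq out=out_X_trimmed.fq ftl=19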

7.7 years ago
# keep only the sequence lines, record the 1-based start position of ACAGTA whenever
# TATAAGTATGG starts exactly two bases after it (the two wildcard positions; the first
# occurrence of each anchor is used), then count how many reads matched at each position
gunzip -c input.fastq.gz |\
paste - - - - | cut -f 2 |\
awk '{x1=index($0,"ACAGTA"); if(x1==0) next; x2=index($0,"TATAAGTATGG"); if(x2==x1+8) print x1;}' |\
sort -n | uniq -c