Question: Identify Identical Sequences In Fastq File
nupurgupta0806 wrote:

I have a FASTQ file with a lot of reads. I expect sets of identical sequences: in fact, I will be counting the occurrences of each unique sequence.

I am using Python and Biopython, and am trying to optimize this problem for a large file. I was wondering if there are any suggestions on how to do this?

What I have so far uses Biopython's fast FASTQ iterator and MD5 hashes:

    from Bio.SeqIO.QualityIO import FastqGeneralIterator
    from hashlib import md5

    for title, seq, quals in FastqGeneralIterator(file_read_handle):
        # hash each read's sequence so lookups compare short digests, not full strings
        seq_digest = md5(seq.encode()).digest()
        if seq_digest in list_digest:
            ...
        else:
            list_digest.append(seq_digest)
            ...

Is there any other technique for searching for exact sequence matches which might be more efficient?

Thanks very much.

Tags: fastq, biopython

You may want to check for the reverse complement as well.

— Leszek
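One way to handle that is to canonicalize each read before hashing, for example by keeping the lexicographically smaller of a sequence and its reverse complement, so that a read and its reverse complement end up under the same key. A minimal sketch, assuming plain DNA strings and Biopython's Seq class (the helper name is illustrative):

    from Bio.Seq import Seq

    def canonical(seq):
        # return whichever of the sequence and its reverse complement sorts first,
        # so both orientations map to the same dictionary/set key
        rc = str(Seq(seq).reverse_complement())
        return min(seq, rc)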
Istvan Albert wrote:

Right off the bat, make sure that you use a set and not a list as your storage data structure. List lookups are linear, so your program will be unusably slow even for small files.
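For counting occurrences, a set alone only tells you whether a sequence has been seen; a dictionary keyed by the digest also gives you the counts, with constant-time lookups. A rough sketch of that idea (not code from the original answer; the file name is illustrative):

    from collections import Counter
    from hashlib import md5
    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    counts = Counter()
    with open("reads.fastq") as handle:  # illustrative file name
        for title, seq, quals in FastqGeneralIterator(handle):
            # count each distinct sequence by its MD5 digest; Counter is a dict with O(1) lookups
            counts[md5(seq.encode()).digest()] += 1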

A simple command-line tool for this is fastx_collapser.

You can find a few good suggestions on how to do it in the thread How to remove the same sequences in the FASTA files.

Most of those techniques will work on FASTQ files as well.


Thanks very much! These are good suggestions. I have found that the 32-bit download of FASTX only comes as version 0.0.12, which does not support Sanger-style quality scores. Strange. But I will try the other suggestions. Thanks again.

— nupurgupta0806

You just need to pass the -Q33 flag to fastx to turn on Sanger mode.

— Istvan Albert

Thank you so much!

— nupurgupta0806
Ryan Thompson wrote:

If you have a reference to map to, it is better to identify identical reads by mapping location rather than by sequence. If all you do is collapse identical sequences, you will miss duplicates that differ only by sequencing errors. If you map allowing an appropriate number of mismatches and then collapse reads mapped to the same location, you will be more tolerant of sequencing errors.
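As an illustration of that approach (a sketch under assumptions, not code from this answer), one could walk a coordinate-sorted BAM with pysam and treat reads that start at the same reference position on the same strand as duplicates; the file name and the "keep the first read seen" policy are assumptions:

    import pysam

    seen = set()
    kept = []
    with pysam.AlignmentFile("mapped.sorted.bam", "rb") as bam:  # assumed file name
        for read in bam:
            if read.is_unmapped:
                continue
            # reads starting at the same reference position on the same strand count as duplicates
            key = (read.reference_id, read.reference_start, read.is_reverse)
            if key not in seen:
                seen.add(key)
                kept.append(read)  # keep the first read seen at each position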


Ideally the way this should work is that one first detects the duplicates, then aligns only the non-duplicated reads (thereby speeding up the process, assuming de-duplication is much faster than the other processing), and then performs a second round of de-duplication. Depending on the duplication rate, the added complexity may not be worth it.

Another issue to deal with is which read to keep: the one with the highest quality, the lowest quality, a random one, etc.
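For instance, keeping the read with the highest mean quality per unique sequence could look like the sketch below (not from the comment above; Sanger/Phred+33 quality encoding and the file name are assumptions):

    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    best = {}  # sequence -> (mean quality, title, quality string)
    with open("reads.fastq") as handle:  # illustrative file name
        for title, seq, quals in FastqGeneralIterator(handle):
            # mean Phred score, assuming Sanger (Phred+33) encoded quality strings
            mean_q = sum(ord(c) - 33 for c in quals) / len(quals)
            if seq not in best or mean_q > best[seq][0]:
                best[seq] = (mean_q, title, quals)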

Edit: it just occurred to me that bwa, for example, assigns a read that maps to multiple locations to just one location selected at random. Therefore one may not be able to de-duplicate by looking at the alignment alone.

— Istvan Albert