Tool: Efficiently process (view, analyze, clip ends, convert, demultiplex, dereplicate) SFF/FastQ files
BioApps720 (Spain) wrote, 4.6 years ago:

Hi. I am a biologist in need of a good graphic/visual/fast FastQ editor. Starting from a Biostars thread, a few days ago I implemented my own SFF/FastQ editor. I hope this is the most complete SFF/FastQ editor available. If you want a specific feature implemented, just let me know.

 


Features

Supported files

  • SFF, FastQ, FQ, Fasta (soon)

Filters

  • Cut reads with average QV under specified threshold
  • Cut reads if they contain N bases (the user can specify how many)
  • Cut low complexity reads
  • Cut reads that are too short
  • Cut reads that are too long
  • Cut low quality ends. Automatically detect and cut low quality bases at the end of each read
  • Cut poly-A/T tails
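For readers who want the gist of these filters, here is a minimal Python sketch of the same rules. The function names and default thresholds are invented for illustration (the tool's actual cutoffs are configurable), and Phred+33 quality encoding is assumed:

```python
# Minimal sketch of the read filters above, assuming Phred+33 quality
# encoding. Function names and default thresholds are illustrative only.

def avg_quality(qual: str) -> float:
    """Average Phred quality of a read (Phred+33)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

def keep_read(seq: str, qual: str,
              min_avg_q: int = 20,
              max_n: int = 0,
              min_len: int = 30,
              max_len: int = 1000) -> bool:
    """Return True if the read passes all filters."""
    if len(seq) < min_len or len(seq) > max_len:
        return False
    if seq.upper().count("N") > max_n:
        return False
    if avg_quality(qual) < min_avg_q:
        return False
    return True

def trim_poly_a_t(seq: str, qual: str, min_tail: int = 10):
    """Trim a poly-A or poly-T tail from the 3' end, if present."""
    for base in ("A", "T"):
        stripped = seq.rstrip(base)
        if len(seq) - len(stripped) >= min_tail:
            return stripped, qual[:len(stripped)]
    return seq, qual
```

Low-complexity filtering and quality-end trimming need a bit more machinery (windowed scoring), but follow the same per-read pattern.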

Tools and converters

  • Dereplicate sequences (to be released soon!)
  • Split multiplexed files (MID/barcode splitter)
  • Remove contaminants (search over represented sequences against a contaminant database)
  • File splitter: Split huge FastQ/SFF file in chunks of x reads
  • File splitter: Cut all sequences in the specified range
  • Compact FastQ files
  • Convert SFF to FastQ
  • Convert SFF to Fasta
  • Convert FastQ to Fasta (multiFasta)
  • Convert FastQ file to a different encoding (under development)
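As a sense of scale for the simplest of these converters: FastQ to (multi)Fasta amounts to dropping the '+' and quality lines. A hypothetical sketch, not the tool's actual code:

```python
# Hypothetical sketch of a FastQ-to-multiFasta converter: read 4-line
# FastQ records and emit 2-line Fasta records, streaming line by line.

def fastq_to_fasta(fastq_lines):
    """Yield Fasta lines from an iterable of FastQ lines."""
    it = iter(fastq_lines)
    for header in it:
        seq = next(it)
        next(it)          # '+' separator line, discarded
        next(it)          # quality line, discarded
        yield ">" + header.strip()[1:]   # '@name' -> '>name'
        yield seq.strip()
```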

Graphs and data analysis

  • Sequence viewer - Show all reads: Read name, Base sequence, average quality, sequence length
  • Sequence length distribution graph
  • Per base sequence quality graph
  • Per base GC content graph
  • Per base sequence content graph
  • Per base N content graph (integrated in the 'Per Base Content' graph)
  • Per sequence quality scores graph
  • Graphs can be expanded to full screen
  • All graphs are updated in real time as the file is processed
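As an illustration of what these graphs compute, per-base GC content is just a column-wise tally over all reads (a rough sketch, not the program's implementation):

```python
from collections import defaultdict

def per_base_gc(reads):
    """Fraction of G/C bases at each read position, over all reads."""
    gc = defaultdict(int)     # position -> G/C count
    total = defaultdict(int)  # position -> reads covering that position
    for seq in reads:
        for i, base in enumerate(seq.upper()):
            total[i] += 1
            if base in "GC":
                gc[i] += 1
    return [gc[i] / total[i] for i in range(len(total))]
```

The per-base quality and per-base N graphs follow the same shape, swapping the tallied predicate.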

 

Download link

Version 3.2.3 (released August 2015) can be downloaded here. The size of this program is about 4 MB. No installer needed.

Dereplication is now also available (as a standalone app). Statistics about clusters are included in the Dereplicator.

'Follow' this post to stay up to date.

http://www.dnabaser.com/download/nextgen-fastq-editor/screenshot_big.png

 

Requirements:

  • <3MB of disk space
  • no installation
  • no Java
  • no .Net
  • no admin permissions
  • no money :)

 

Speed & mem footprint:

On an old Toshiba laptop (i5, 2.2GHz) it loads a 0.5GB file in under 11 sec (if no processing is applied). This also includes the time needed to determine the file encoding (Solexa, Illumina, Sanger). The memory footprint should not exceed 15-30MB. I am thinking about doing the file decoding and the data processing in separate threads.
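Encoding detection of the kind mentioned here is commonly done by inspecting the ASCII range of the observed quality characters. A heuristic sketch using the commonly cited boundaries (not necessarily the exact rules this program applies):

```python
def guess_encoding(qual_strings):
    """Guess FastQ quality encoding from observed quality characters.

    Uses the commonly cited ASCII ranges: Sanger/Phred+33 starts at '!'
    (33), Solexa+64 at ';' (59), Illumina 1.3+/Phred+64 at '@' (64).
    This is a heuristic sketch, not the tool's actual detector.
    """
    lo = min(min(q) for q in qual_strings if q)
    hi = max(max(q) for q in qual_strings if q)
    if ord(lo) < 59:
        return "Sanger"          # Phred+33
    if ord(lo) < 64:
        return "Solexa"          # Solexa+64 (scores can go down to -5)
    if ord(hi) > 74:
        return "Illumina 1.3+"   # Phred+64
    return "ambiguous"
```

Because the ranges overlap, a detector normally scans until it sees a character that rules the alternatives out, which is why detection cost scales with file size.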

 

Your feedback

The program was built on feedback from users. So, please comment on things such as:

  • Feature requests
  • Platform you are interested in (Windows, Mac, Linux) - This is very important!
  • Statistics about your files (file type, how many, file size) and your working station (CPU/RAM)
  • Which of the existing modules are you interested in (so we can improve them)
  • New request from users: Allow program resizing so it can fit on very small laptop screens

 

This tool integrates with Avalanche Workbench.

sff • tool • next-gen • fastq • sequence • 16k views
modified 3.2 years ago • written 4.6 years ago by BioApps720

Linux, Linux, Linux and Linux. Without Linux support, you are excluding >90% of potential users.

written 4.2 years ago by lh331k

And OS X. Biased sample, maybe, but I just don't see too many folks here with Windows laptops doing informatics work. It's all Linux and OS X for real work.

written 4.2 years ago by Alex Reynolds26k

I agree. There are lots of Mac users in the biology field. The Linux/Mac port is scheduled.

written 4.2 years ago by BioApps720

Until the Linux port is available (I promise it will be), the program can be used under Linux via Wine.

written 4.2 years ago by BioApps720

Wine is rarely used in bioinfo. For your next project, please treat Linux/Mac as a prerequisite, not an afterthought. Thank you.

written 4.2 years ago by lh331k

We often want to look at one file in a run, but would almost never open all files in a sequencing run.

Usually they share many characteristics. Your software should have the option of running as a command line tool as well.

written 4.6 years ago by Istvan Albert

>command line

You mean to access the tools via the command line?
 

written 4.6 years ago by BioApps720

Yes; like fastqc, the program should run from the command line if just some non-graphical functionality is needed.

written 4.6 years ago by Istvan Albert

I forgot to mention that it requires that much time only when you open a file for the first time. Opening the file again afterwards takes under 1 sec.

 

written 4.6 years ago by BioApps720

It should never need to have the file in memory, except for dereplication, so it should be able to handle files of any size except for that function, correct?

written 4.6 years ago by brentp22k

Yes. As you can see in the screenshot, the program needs only 38MB to show a 500MB file.

written 4.6 years ago by BioApps720

Cool. Then why this limitation: "On a modest computer (with 3GB RAM) the program should theoretically open files up to 40GB"?

written 4.6 years ago by brentp22k

Well, the index is loaded in memory. The more sequences you have, the larger the index. Some arithmetic shows that it should parse a file with up to 375 million sequences, which is equivalent to an 80GB file IF the sequences are about 100 bases each (40GB is for 200 bases/sequence).
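For the curious, the arithmetic roughly checks out if each index entry is an 8-byte file offset. That per-entry size is an assumption (it would have to be slightly larger than 8 bytes to land exactly on the 375 million figure quoted above):

```python
# Rough check of the index arithmetic above, assuming one 8-byte file
# offset per read (the program's real per-entry cost is not stated).
ram_bytes = 3 * 1024**3            # ~3GB of RAM available for the index
entry_bytes = 8                    # assumed offset size per read
max_reads = ram_bytes // entry_bytes

# A 4-line FastQ record with a 100-base read is roughly 215 bytes
# (name + sequence + '+' + qualities), so:
record_bytes = 215
file_gb = max_reads * record_bytes / 1024**3

print(max_reads)       # 402653184, i.e. ~402 million reads
print(round(file_gb))  # 81, i.e. roughly the 80GB quoted above
```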

Obviously, on a computer with more RAM you could open even larger files. But for the moment the program is 32 bit. The 64 bit version should be ready soon. Then the Linux and Mac versions.

Now I am trying to integrate SFF into the same GUI.

modified 4.6 years ago • written 4.6 years ago by BioApps720

Ok, but why would you need to load the entire index into memory? After all, the user will not actually scroll through hundreds of millions of reads. There is a common flaw, often seen in text editors, where opening a large file loads it all into memory, yet a person only edits or looks at one page at a time.

written 4.6 years ago by Istvan Albert

Update: the program now takes 15MB of RAM no matter how large the file is. An update will be available for download in the next few days.

written 4.6 years ago by BioApps720

If you need to perform an operation like viewing, computing average quality, sorting, etc., you need to parse all samples. Therefore, you need the index. The index is probably needed for any operation that applies to all samples. Please let me know if you have a different approach. I will try to incorporate it if it results in a smaller memory footprint.

modified 4.6 years ago • written 4.6 years ago by BioApps720

I think one only needs an index to access random parts of the file at high speed. Reading all the records in a file to compute certain values should not need indices.

In your case the only use of an index would be to jump to a read with a given name; this is almost never needed in practice.

I would suggest making your tool exceedingly fast and using trivial amounts of memory (I mean megabytes) regardless of file size. This means a very fast parser and streaming the view rather than preloading it. Now that would be a tool that would set itself apart.

written 4.6 years ago by Istvan Albert

Hi Albert!

The technical reasons for keeping the index in memory:

1. I don't read the file in text mode (line by line). Instead I read it in binary mode because I have a library that buffers the I/O operations. This results in better performance.

2. The viewer (lets the user scroll and see all sequences)

3. This is not only a viewer. I intend to add all kinds of tools that will need random access to sequences (that is the main reason).

4. RAM is cheap. On a computer with 6-8GB RAM (which is quite common today, especially if you are a biologist working with large files :) ) the user will be able to open files of around 160-210GB. I am not sure what the largest FastQ file ever created is, but I think 160-210GB is a nice range.

modified 4.6 years ago • written 4.6 years ago by BioApps720

Well, I think you should be careful with this and find yourself a collaborator and solve their problems. Many things that make sense in theory do not make any difference in practice and do not scale for real work. Good software needs to solve real pain.

It is very rare, for example, that I would go back to a fastq file after checking it once, so the time it saves on the second opening is pointless. I just want to look at it once and verify that things look right. For that, I would hate to sit for minutes staring at a GUI while the memory is being consumed, when I can do it faster by other means; fastqc runs in the background and I can limit the memory it uses. Moreover, it is very likely that a project will have dozens of fastq files associated with it; I would hate having to click, open, and wait for each.

So you see how your tool would give me very little reason to use it.

 

modified 4.6 years ago • written 4.6 years ago by Istvan Albert

I also have to agree. Normally, nobody would ever look at the fastq file itself. Overall statistics are what matter here (take a look at fastqc). The sequences themselves are not really interesting; there are far too many of them anyway. How long can you scroll down your list? ;)

I would recommend: talk to biologists or bioinformaticians and ask them what they really need!

What I would like/need:

  • very fast adapter prediction and clipping algorithm (for my 16GB fastq files the available ones are very slow, so I skip it)
  • fast bam statistics
    - percentage of mapped reads, mapped mates, unmapped mates
      (multiple mappings should be counted once)
    - percentage +/- strand mappings
    - library complexity
    - multiple mappings statistics
    - DNA: mean + median + stdev of coverage
  • nice visualization tool for fusion transcripts

These are just three of many many more.... :)

 

written 4.6 years ago by David Langenberger8.3k

(I agree with Istvan) The reason for an index is random access. The things you are proposing, e.g. viewing parts of a file or doing operations on the entire file, do not require an index. For example, if the user wants to scroll backwards, you can seek().

written 4.6 years ago by brentp22k

I edited the original post to show that this is an editor rather than a simple viewer (true, for the moment only one editing function is ready :) )

written 4.6 years ago by BioApps720

Ok guys. Thanks for the feedback. I hear you. I will get rid of the index :)

Please keep sending good feedback!

written 4.6 years ago by BioApps720

Limitation removed!

written 4.2 years ago by BioApps720

I would need it to view gzipped files. Our fastq are stored as gzip and dumping them out is a lot of IO hassle. Then let us know when the "linux port" is ready, because the windows machines are kept far away from the valuable data (more IO hassles). 

written 4.6 years ago by karl.stamm3.4k

Hi Karl. You mean you want to work on your packed FastQ file without unpacking the whole file (this may be quite difficult), or you want the program to quietly unpack the file in the background and work on that temporary file? But yes, I see how this may be a nice feature. Most Linux users probably have their files packed this way.

PS: One possible solution to your packing problem (but it works only on Windows) would be not to pack your files with Zip, but to use the default compression offered by NTFS ("Compress content to save disk space"). An 86MB file gets compressed to 43MB using NTFS and 25MB using a zip algorithm. Which is not bad at all if you consider that 'unpacking' the file is instantaneous.

modified 4.6 years ago • written 4.6 years ago by BioApps720

Have a look at zlib. Reading gzip files is very easy.

written 4.5 years ago by lh331k

Thanks. I was happy to see there is a port for Delphi also. I will take a look.

written 4.5 years ago by BioApps720


Adding support for SFF

written 4.5 years ago by BioApps720
BioApps720 (Spain) wrote, 4.5 years ago:

Release history:

 

v2.0 / August 2014

  • Redesigned GUI
  • Save reports in HTML format
  • New tool: Demultiplexing - Split based on FastQ internal info (only for Illumina files) - Load info about adaptor clipping from sequence name (comment) line
  • New tool: Demultiplexing - Split multiplex file based on barcode sequence(s) provided by user. Reverse complement sequences are also supported.
  • New tool: Demultiplexing - Split based on FastQ internal info radio box is automatically disabled if the sequence does not contain the info
  • New tool: File splitter: Split huge FastQ/SFF file in chunks of x reads
  • New tool: File splitter: Cut all sequences in the specified range
  • New tool: Revamped file converter. Converts between Fasta, FastQ, SFF
  • New tool: Remove overrepresented sequences.
  • New tool: Remove contaminants. Search overrepresented sequences against a contaminant database (allow user to add/remove seq from database)
  • New function in Adaptor Trimming: Cut x bases at 3' / 5' end
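A barcode splitter of the kind listed above can be sketched in a few lines. This toy version does exact 5'-prefix matching only (real demultiplexers usually tolerate mismatches), and also checks reverse complements as the changelog mentions:

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]

def assign_barcode(seq, barcodes, check_revcomp=True):
    """Return the barcode that matches the 5' end of the read, or None.

    Exact prefix matching only; each barcode is also tried as its
    reverse complement when check_revcomp is set.
    """
    for bc in barcodes:
        if seq.startswith(bc):
            return bc
        if check_revcomp and seq.startswith(revcomp(bc)):
            return bc
    return None
```

A splitter would then route each read to an output file keyed by the returned barcode, with a catch-all file for None.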


v1.9 / June 2014

  • New report: Sequence duplication level
  • New report: Overrepresented sequences

v1.7

  • Massive SFF/FastQ parsing speed optimization using buffered files
  • Important speed optimization when using the 'Refresh button'
  • The program is a bit more responsive when processing large files
  • Silently cut samples that have 0 good bases
  • Cut reads with GC under 15% or over 85%

v1.5

  • Added SFF support (processing, statistics, etc)
  • Tools - File splitter. Split huge FastQ/SFF file in chunks of x reads
  • Tools - Compact FastQ files (remove duplicate content of the + line)
  • Tools - Convert SFF to FastQ
  • Tools - Convert SFF to Fasta
  • Tools - Convert FastQ to Fasta (multiFasta)
  • Graph - All graphs are updated in real time (as filters are applied)
  • Graph - Sequence length distribution graph
  • Graph - Per base sequence quality graph
  • Graph - Per base GC content
  • Graph - Per sequence GC content
  • Graph - Per base sequence content
  • Graph - Per base N content (integrated in the 'Per Base Content' graph)
  • Graph - Show the 'Per sequence GC content' graph as dots instead of lines
  • Graph - Resize graphs automatically
  • Graph - Remember height of each graph panels
  • Graph - Remember status of each graph panel (collapsed/expanded)
  • Graph - Let user scroll graphs using mouse scroll
  • Graph - Button to expand some graphs. Support for all graphs will be added soon
  • Graph - Added vertical scroll bar in graph's panel so the user can make any graph as long as he wants

v1.2

  • Tools - Trim poly-A/T tails
  • Tools - Cut reads with average QV under specified threshold
  • Tools - Cut reads if they contain N bases (the user can specify how many)
  • Tools - Cut reads longer than x bases
  • Tools - Ask where to save the file (at conversion)
  • Tools - Cut low complexity reads
  • Tools - Trim low quality ends. Automatically detect and cut low quality bases at the end of each read. Three parameters are used by this function.
  • Tools - Cut reads shorter than x bases.
  • Tools - Save the filtered file to disk (use the 'Refresh graph and save...' button).
  • Tools - Encoding auto detection was checked and works correctly.
  • Graph - Let user choose row height
  • Graph - Show all reads (no matter how many they are). It can show: Read name, Base sequence, average quality, sequence length, mini chromatogram.
  • Graph - Per sequence quality scores graph

Download link.

 

 

modified 3.3 years ago • written 4.5 years ago by BioApps720

your tool needs a name

written 4.5 years ago by Istvan Albert

I know !!!!!!!!!! :)

modified 4.5 years ago • written 4.5 years ago by BioApps720

Next thing to come: FastQ speed improvement!

modified 4.5 years ago • written 4.5 years ago by BioApps720

Version 2.0 was ready a while ago, but I didn't have the time to test it properly, so I left for an 'important business' (read as 'holiday') before having the chance to publish the program. Sorry. I will release v2 soon.

written 4.2 years ago by BioApps720

Hi,

It seems the Mac and Linux download links are not working. Do you know why, or is the tool available only for Windows?

Thanks,

 

written 4.4 years ago by GP10

Hi Gmax.

The program has not YET been ported to Mac (or Linux). I intend to add a few more features, some GUI improvements, bug fixes and lots of testing. Once I have a final-final version I will port it to Mac and later to Linux.

 

The ETA for v2.0 is ~7 days. Once we are there I will start porting. For the moment the program should run without problems on Mac via CrossOver and on Linux via CrossOver or Wine.

modified 4.4 years ago • written 4.4 years ago by BioApps720

Ok, Thanks very much!

G

written 4.4 years ago by GP10

I must say this is starting to look very very impressive, so here are my two cents:

As others have said, you need Linux compatibility; some people don't take Wine seriously.

Next thing, the command line. I see you have the option for HTML reports; it would be great to have something like "./ngs-workbench myfile.fastq -o my-file-dataset1" which would create an HTML report of everything, so the user doesn't need to specify 1000 different options to do it. The point of the report is for the tool to guess as much as it can from the given dataset and tell the user all of its findings:

  • Did I find several multiplex barcodes? How many reads in each library?
  • Does my data look like paired end?
  • What common adapters was I able to find in the dataset? Are they represented significantly? Can they tell me something about the data? Maybe I found a Nextera paired-end adapter, maybe I found a mate-pair adapter?
  • k-mer graph, tells us about expected genome size, frequency of polymorphisms, presence of contaminants.
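The k-mer graph requested here starts from a k-mer frequency spectrum. A naive counting sketch (memory grows with the number of distinct k-mers, so production tools use more compact counters):

```python
from collections import Counter

def kmer_spectrum(reads, k=21):
    """Histogram: multiplicity -> number of distinct k-mers seen that
    many times. The main peak of this spectrum estimates coverage, and
    (total k-mers / coverage) estimates genome size; a low-multiplicity
    shoulder hints at sequencing errors or polymorphisms."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return Counter(counts.values())
```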

Lastly, source code... this is up to you, but giving people the option to compile it themselves will give you lots of credibility, because it is open source. That way it can work even under Cygwin on Windows systems.

written 4.2 years ago by Adrian Pelin2.2k

Hi Adrian.

Thanks a lot for your suggestions. They are very good; I will implement all of them. The k-mer graph was already on the to-do list; I just need the time to implement it.

 

As I already said, there will definitely be a port for the Mac/Linux platforms, but first I want to finish this (and several other) tools. Then I will start the porting. Until then I am sure that the scientists who REALLY need my tools (if they really, really want them) can use Wine/CrossOver/etc. I don't think their pride will be hurt that much. The final purpose of bioinformatics is the 'bio' part... finding the answers to biology-related questions. The tools (the programs, the OS, the emulators) are just... well... tools. Biologists will understand that.

 

Related to the source code: unfortunately, this will never be available. I got permission to use some bioinformatics libraries that are closed source. For the Windows and Mac world this is not a problem at all, since most programs are not open source (most programs are not even free). But Biostars is a Linux-biased community, so it is normal for people here to ask for the source code. Since I released the first version, many biologists have contacted me with platform-related questions, but none asked for the source code. Probably even if I distributed it, they wouldn't know what to do with it :) They just want a 'double-click and run' tool.

 

Thanks again for your precious feedback.

written 4.2 years ago by BioApps720

Regarding releasing the source code: would it not be possible to dynamically link against the closed-source libraries you are using, so they can be distributed as binaries while the code of your tool is free?

I am looking forward to giving your tool a try when you finish the Linux version.

written 3.9 years ago by lelle770

A number of open-source media players and transcoders do dynamic linking, given codecs that are closed source or cannot be redistributed under open-source licensing terms.

written 3.9 years ago by Alex Reynolds26k

I don't see why we could not do that :)

Are you interested in a specific module? Maybe I can write a special function that does exactly what you need.

Or maybe you could start the program with the GUI hidden and pass some parameters on the command line. The program would process the file and exit silently.

 

Anyway, if you need something specific just let me know.

written 3.9 years ago by BioApps720

I will also look into plugins. I have never done this but it doesn't seem so complicated.

written 3.9 years ago by BioApps720

Which part of your tool is closed source?

written 3.9 years ago by lh331k
Powered by Biostar version 2.3.0