Sequence Database Curator

Question

Tool:Sequence Database curator


7.2 years ago

Eslam Samir ▴ 110


https://github.com/Eslam-Samir-Ragab/Sequence-database-curator

It is a very fast program and it can deal with:

Nucleotide sequences
Protein sequences

It runs on the following operating systems:

Windows
Mac
Linux

It also supports the following formats:

Fasta format
Fastq format

This program curates nucleotide and/or protein databases by removing redundant and partially redundant sequences for a specific gene and/or any group of genes.

Input:

A file containing all the downloaded sequences in FASTA or FASTQ format.
Or a directory containing all your desired files, all with the same extension.
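
As a rough illustration of the directory input described above, the following sketch collects every file with a given extension from a folder and ignores the rest. The function name collect_input_files and the demo file names are hypothetical; this is not taken from the program's source.

```python
import tempfile
from pathlib import Path


def collect_input_files(directory, extension):
    """Return the files in `directory` whose extension matches, sorted by name.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    suffix = "." + extension.lstrip(".").lower()
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


# Demo on a throwaway directory holding two FASTA files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("a.fasta", "b.fasta", "notes.txt"):
        Path(tmp, file_name).write_text(">x\nACGT\n")
    found = [path.name for path in collect_input_files(tmp, "fasta")]

print(found)
```

Files with other extensions are simply skipped, so a mixed folder can be passed as is.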

Processing:

It removes the redundant sequences.
It removes partial sequences that are an exact part (substring) of other sequences in your database.
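
The two processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's actual implementation: the function name dereplicate is made up, comparison is case-insensitive, reverse complements are not considered, and the containment check is quadratic, so it only suits small inputs.

```python
def dereplicate(records):
    """Drop exact duplicates and sequences fully contained in a longer one.

    `records` is a list of (name, sequence) tuples. For duplicated
    sequences, the first name encountered is the one that is kept.
    """
    # Step 1: remove exact duplicates, keeping the first occurrence.
    unique = {}
    for name, seq in records:
        unique.setdefault(seq.upper(), name)

    # Step 2: go from longest to shortest and keep a sequence only if it
    # is not a substring of a sequence that has already been kept.
    kept = []
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)

    return [(unique[seq], seq) for seq in kept]


records = [
    ("seq1", "ATGCCGTA"),
    ("seq2", "ATGCCGTA"),  # exact duplicate of seq1
    ("seq3", "GCCG"),      # exact part of seq1
    ("seq4", "TTTTAAAA"),  # unrelated sequence
]
curated = dereplicate(records)
print(curated)
```

Here seq2 is dropped as a duplicate and seq3 as a partial sequence, leaving seq1 and seq4.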

Options:

Working on either protein (p) or nucleotide (n) databases.
Two approaches (largest possible length and optimum length).
- largest possible length approach: gives the longest sequence even if it exceeds the length of your gene.
- optimum length approach: gives only your gene, provided you supply the approximate length of your protein.

Version 2.0 Updates:

You can filter the sequences by keywords alone (separated by commas), inclusively or exclusively, by adding the -kw argument to your usual command line.
You can keep your sequences in their original order after dereplication and/or sequence filtration by adding -org_order to your usual command line.
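
To make the inclusive and exclusive keyword modes concrete, here is a small sketch of the idea. The function filter_by_keywords and the case-insensitive matching are assumptions for illustration only; the program itself may match names differently.

```python
def filter_by_keywords(records, keywords, approach="inclusive"):
    """Filter (name, sequence) tuples by comma-separated keywords.

    inclusive: keep records whose name contains at least one keyword.
    exclusive: keep records whose name contains none of the keywords.
    The input order of the records is preserved.
    """
    if approach not in ("inclusive", "exclusive"):
        raise ValueError("approach must be 'inclusive' or 'exclusive'")

    wanted = [kw.strip().lower() for kw in keywords.split(",") if kw.strip()]
    keep_hits = approach == "inclusive"

    result = []
    for name, seq in records:
        hit = any(kw in name.lower() for kw in wanted)
        if hit == keep_hits:
            result.append((name, seq))
    return result


records = [
    ("P1 nitrogenase NifH", "MAMRQ"),
    ("P2 hypothetical protein", "MKKLL"),
    ("P3 NifD alpha chain", "MSREE"),
]
included = filter_by_keywords(records, "nifh, nifd", "inclusive")
excluded = filter_by_keywords(records, "nifh, nifd", "exclusive")
print([name for name, _ in included])
print([name for name, _ in excluded])
```

Because the records are scanned in input order, the output keeps the original order of the sequences.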

How to use:

You need to install Python 2.7 or Python 3 on your machine.
You need to install NumPy and Biopython.
You need to install the future module with pip.
Click “Clone or download” > “Download ZIP”, then extract the downloaded file.
Open the file “sddc.py” with Python:
- Windows: open it with python.exe
- Unix/Linux: use the command chmod u+x database_curator.py
- Mac: use the command python sddc.py
State your variables and press Enter.

The program's options are summarized in the Read Me file.

Examples

To dereplicate protein sequences, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep

To dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order

To dereplicate protein sequences with a minimum length of 30 when the sequences are spread over multiple files, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi

To dereplicate nucleotide sequences with the optimum approach and a normal protein length of 300, use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300

To filter protein sequences inclusively by name (i.e. you want to retrieve only the sequences whose names you have specified), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive

To filter protein sequences inclusively by keyword(s) (i.e. you want to retrieve only the sequences whose names contain the keywords you have specified, separated by commas), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw

To filter protein sequences exclusively by name (i.e. you want to retrieve the sequences that aren't present in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive

To filter protein sequences exclusively by keyword(s) in their names (i.e. you want to retrieve the sequences whose names do not contain the comma-separated keywords listed in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw

To filter nucleotide sequences by sequence (exclusive only), use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)

Updates:

Sequence Database Curator accepts a directory of files, which makes it easier to process all of them even when they are mixed with files of other extensions.
NEW!! Sequence Database Curator now offers faster processing and command-line usage.
Check SDDC v2.0.



score 4 · Accepted Answer · 2017-02-04

Hi Eslam Samir,

Since you asked for suggestions and have bumped this thread again, I had a look at your Python code, starting with https://github.com/Eslam-Samir-Ragab/Sequence-database-curator/blob/master/database_curator.3.py.

My suggestions probably apply to the other scripts as well.
I'm not suggesting your code is a disaster, but since you welcome any suggested improvements I'll be pedantic and give you my unfiltered thoughts. If you (or someone else) disagree on some points, I would be happy to discuss this. I think we can all learn from each other.

Use the with open(filename) as f syntax for opening and closing files. This syntax is recommended over separate open() and .close() calls.
Use Biopython SeqIO for parsing fasta files. Although your own function probably works fine for most cases, one day an exceptional fasta will be presented and there will be dragons. SeqIO takes a bit longer to parse but takes all corner cases into account.
Add some blank lines to your code to separate code blocks, which increases readability (and readability is crucial in Python).
Biopython also has a reverse complement function, saving you the pain of writing your own. Although it's good to reinvent a wheel because then you know how the wheel works, it's often not necessary and you can benefit from using someone else's work. Have a look at the Biopython Tutorial and Cookbook.
Why not import sys once at the beginning of your code, instead of importing it again every time? Most commonly, you load all required modules at the very top of your script.
A lot of your code could be simplified/improved using list comprehensions, so you might want to take a look into those. They're more concise and faster.
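
To illustrate a few of the points above (imports at the top, the with statement, and list comprehensions), here is a small sketch. It sticks to the standard library, so the Biopython suggestions are not covered, and the helper read_names and the example data are made up for the demonstration rather than taken from the reviewed script.

```python
# All imports go once at the top of the script.
import os
import tempfile


def read_names(path):
    """Return the non-empty lines of a text file, e.g. a list of sequence names."""
    # The with statement closes the file automatically, even if an error occurs.
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Write a small throwaway file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")
    with open(path, "w") as handle:
        handle.write("seq1\n\nseq2\n")
    names = read_names(path)

lengths = {"seq1": 12, "seq2": 45, "seq3": 30}

# Explicit loop version.
long_names_loop = []
for name, length in lengths.items():
    if length >= 30:
        long_names_loop.append(name)

# Equivalent list comprehension: same result in a single expression.
long_names = [name for name, length in lengths.items() if length >= 30]

print(names)
print(long_names)
```

The comprehension and the loop produce the same list; the comprehension just states the filter in one line.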

I would be happy to review your code again in a newer version. There are more elements that can be improved, but let's take this step by step.

Cheers,
Wouter