Tool: Sequence Database Curator
7.2 years ago
Eslam Samir ▴ 110

Sequence Database Curator

https://github.com/Eslam-Samir-Ragab/Sequence-database-curator

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It runs on the following operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

This program curates nucleotide and/or protein databases by removing redundant and partially redundant sequences, for a specific gene and/or any group of genes.

Input:

  • File containing all the different downloaded sequences in FASTA or FASTQ format.
  • Or a directory containing all your desired files in the same extension.

Processing:

  1. It removes the redundant sequences.
  2. It removes partial sequences that are an exact part of other sequences in your database.
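The two processing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the program's actual code: the function name `dereplicate` and the `(name, sequence)` tuple layout are hypothetical, comparison is assumed to be case-insensitive, and reverse complements are not considered.

```python
def dereplicate(records):
    """Return the records that survive both curation steps, in input order.

    records: list of (name, sequence) tuples (hypothetical layout).
    """
    # Step 1: drop exact duplicates, keeping the first occurrence.
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))

    # Step 2: drop sequences that are an exact part of a longer kept sequence.
    # Visiting the longest sequences first means a candidate only ever has to
    # be compared against sequences at least as long as itself.
    order = sorted(range(len(unique)),
                   key=lambda i: len(unique[i][1]),
                   reverse=True)
    kept_sequences = []
    kept_indices = set()
    for i in order:
        seq = unique[i][1].upper()
        if not any(seq in longer for longer in kept_sequences):
            kept_sequences.append(seq)
            kept_indices.add(i)

    return [unique[i] for i in sorted(kept_indices)]


example = [
    ("a", "ATGCCGTA"),
    ("b", "ATGCCGTA"),   # exact duplicate of "a"
    ("c", "GCCG"),       # exact part of "e"
    ("d", "TTTT"),
    ("e", "atgccgtaa"),  # contains "a" and "c"
]
curated = dereplicate(example)
print(curated)
```

The substring check is quadratic in the number of sequences, so this sketch is meant to show the idea rather than to match the speed of the program.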

Options:

  1. Working on either protein (p) or nucleotide (n) databases.
  2. Two approaches (largest possible length and optimum length).
    • largest possible length approach: gives the longest sequence even if it exceeds the length of your gene.
    • optimum length approach: gives only your gene, provided you supply the approximate length of your protein.

Version 2.0 Updates:

  1. You can filter the sequences by keywords (separated by a comma), inclusively or exclusively, by adding the (-kw) argument to your usual command line.

  2. You can keep your sequences in their original order after dereplication and/or sequence filtration by adding (-org_order) to your usual command line.
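As a rough illustration of what inclusive and exclusive keyword filtering means, here is a small Python sketch. It is not taken from the program: `filter_by_keywords` and the `(name, sequence)` tuples are hypothetical, and matching is assumed to be a case-insensitive substring test on the sequence name. Because it walks the records in input order, the original order is preserved.

```python
def filter_by_keywords(records, keywords, approach="inclusive"):
    """Keep (inclusive) or drop (exclusive) records whose name has a keyword.

    records:  list of (name, sequence) tuples (hypothetical layout).
    keywords: iterable of keyword strings.
    """
    # Normalise the keywords once and ignore empty entries.
    lowered = [kw.strip().lower() for kw in keywords if kw.strip()]

    def matches(name):
        return any(kw in name.lower() for kw in lowered)

    if approach == "inclusive":
        return [(name, seq) for name, seq in records if matches(name)]
    if approach == "exclusive":
        return [(name, seq) for name, seq in records if not matches(name)]
    raise ValueError("approach must be 'inclusive' or 'exclusive'")


records = [
    ("sp|P1 serine kinase", "MKT"),
    ("sp|P2 hypothetical protein", "MAA"),
    ("sp|P3 tyrosine Kinase, partial", "MK"),
]
# Keywords separated by a comma, as in the description above.
keywords = "kinase, transferase".split(",")

included = filter_by_keywords(records, keywords, "inclusive")
excluded = filter_by_keywords(records, keywords, "exclusive")
print([name for name, _ in included])
print([name for name, _ in excluded])
```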

How to use:

  1. You need to install Python 2.7 or Python 3 on your machine.
  2. You need to install NumPy and Biopython.
  3. You need to install the future module with the pip command.
  4. Click “Clone or download” > “Download ZIP” > extract the downloaded file.
  5. Open the file “sddc.py” with (python.exe).
    • Windows
    • U/Linux : use the command chmod u+x database_curator.py
    • Mac : use the command python sddc.py
  6. State your variables and press Enter.

The program's options are summarized in the Read Me file.

Examples

If you want to dereplicate protein sequences, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep

If you want to dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order

If you want to dereplicate protein sequences with a minimum length of 30 when the sequences are spread over multiple files, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi

If you want to dereplicate nucleotide sequences using the optimum length approach with a normal protein length of 300, use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300

If you want to filter protein sequences inclusively by name (i.e. you want to retrieve only the sequences whose names you have specified), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive

If you want to filter protein sequences inclusively by keyword(s) (i.e. you want to retrieve only the sequences whose names contain the keywords (separated by a comma) that you have specified), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw

If you want to filter protein sequences exclusively by name (i.e. you want to retrieve the sequences that aren't present in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive

If you want to filter protein sequences exclusively by keyword(s) in their names (i.e. you want to retrieve the sequences whose names do not contain the keywords (separated by a comma) listed in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw

If you want to filter nucleotide sequences by sequence (only exclusive), use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)
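The commands above all share the same shape, so when many files need processing it can be convenient to assemble them from a script. The following is a small sketch under stated assumptions: `build_sddc_command` is a hypothetical helper, not part of the program, and it only strings together flags that appear in the examples above. It assumes `sddc.py` sits in the current working directory.

```python
import sys


def build_sddc_command(input_path, output_path, seq_type="p",
                       mode="derep", extra=()):
    """Assemble an argument list in the shape of the examples above.

    seq_type: "p" for protein or "n" for nucleotide.
    extra:    further flags copied from the examples, e.g. ["-org_order"].
    """
    if seq_type not in ("p", "n"):
        raise ValueError("seq_type must be 'p' or 'n'")
    return [
        sys.executable, "sddc.py",
        "-in", input_path,
        "-" + seq_type,
        "-out", output_path,
        "-mode", mode,
    ] + list(extra)


command = build_sddc_command("proteins.fasta", "curated.fasta",
                             seq_type="p", mode="derep",
                             extra=["-org_order"])
# Show everything after the interpreter path, which differs per machine.
print(" ".join(command[1:]))

# To actually run it (not executed here):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when file names contain spaces.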

Updates:

  1. Sequence Database Curator accepts a directory of files, making it easier to process all the files at once, even if they are mixed with other files of different extensions.

  2. NEW!! Sequence Database Curator now offers faster processing and command-line usage.

  3. Check SDDC v2.0.

sequence fastq sequencing blast • 2.8k views
7.2 years ago

Hi Eslam Samir,

Since you ask for suggestions and have bumped this thread again I had a look at your python code, starting with https://github.com/Eslam-Samir-Ragab/Sequence-database-curator/blob/master/database_curator.3.py.

My suggestions probably apply to the other scripts as well.
I'm not suggesting your code is a disaster, but since you welcome any suggested improvements I'll be pedantic and give you my unfiltered thoughts. If you (or someone else) disagree on some points, I would be happy to discuss this. I think we can all learn from each other.

  1. Use the with open(filename) as f syntax for opening and closing files. This syntax is recommended over separate open() and .close() calls, because the file is closed even if an error occurs in between.
  2. Use Biopython SeqIO for parsing fasta files. Although your own function probably works fine for most cases, one day an exceptional fasta will be presented and there will be dragons. SeqIO takes a bit longer to parse but takes all corner cases into account.
  3. Add some blank lines in your code to separate code blocks, thereby increasing readability (which is crucial in Python).
  4. Biopython also has a reverse complement function, saving you the pain of writing your own. Although it's good to invent a wheel because then you know how the wheel works, it's often not necessary and you can benefit from using someone else's work. Have a look at the Biopython Tutorial and Cookbook.
  5. Why not import sys once at the beginning of your code, instead of importing it every time again? Most commonly, you load all required modules at the very top of your script.
  6. A lot of your code could be simplified/improved using list comprehensions, so you might want to take a look into those. They're more concise and often faster.
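To make points 1, 5 and 6 concrete, here is a short sketch that uses only the standard library. It is not taken from the script under review: `read_names` is a hypothetical helper, and the file layout (one sequence name per line) is an assumption made for the example.

```python
# Point 5: all imports live once, at the top of the script.
import os
import tempfile


def read_names(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    # Point 1: the with-block closes the file automatically,
    # even if an exception is raised while reading.
    with open(path) as handle:
        # Point 6: a list comprehension replaces an explicit
        # loop that appends to a list line by line.
        return [line.strip() for line in handle if line.strip()]


# Small self-contained demonstration using a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    names_path = os.path.join(tmp_dir, "filter_names.txt")
    with open(names_path, "w") as handle:
        handle.write("seq1\n\n  seq3  \n")
    names = read_names(names_path)

print(names)
```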

I would be happy to review your code again in a newer version. There are more elements that can be improved, but let's take this step by step.

Cheers,
Wouter


Hi Wouter, First of all, I'm so glad for your comments; this is a typical way of learning for me (as it is my first code and I'm not a developer). Second, it was a great improvement, so here is the modified file database_curator.3.py. Thanks again for your help. Best Regards, Eslam


Hi Eslam,

Looks like you did a great job, it's an impressive improvement indeed. Never stop learning and happy coding!

Wouter
