Sequence Database Curator
https://github.com/Eslam-Samir-Ragab/Sequence-database-curator
It is a very fast program that can handle:
- Nucleotide sequences
- Protein sequences
It runs on the following operating systems:
- Windows
- Mac
- Linux
It supports the following file formats:
- FASTA format
- FASTQ format
This program curates nucleotide and/or protein databases by removing redundant and partially redundant sequences, for a specific gene and/or any group of genes.
Input:
- A file containing all the downloaded sequences in FASTA or FASTQ format.
- Or a directory containing all your desired files, all with the same extension.
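To make the input concrete, here is a minimal sketch of reading a FASTA file into (name, sequence) pairs using only the standard library. The `read_fasta` helper is hypothetical and is not taken from the program itself; in practice Biopython's `Bio.SeqIO.parse` is the usual way to read FASTA and FASTQ files.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Illustrative in-memory input instead of a real downloaded file.
example = io.StringIO(">seq1 first\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(read_fasta(example))
print(records)
```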
Processing:
- It removes redundant sequences.
- It removes partial sequences that are an exact part of other sequences in your database.
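The two processing steps above can be sketched as follows. This is only an illustration of the idea (drop exact duplicates, then drop sequences that are exact substrings of longer ones); the `dereplicate` function is hypothetical, uses a simple quadratic containment check, and is not the program's actual implementation.

```python
def dereplicate(records):
    """Drop exact duplicates and sequences fully contained in a longer one.

    `records` is a list of (name, sequence) tuples. Sequences are compared
    case-insensitively and returned in upper case (an assumption of this sketch).
    """
    # Keep the first name seen for each distinct sequence.
    unique = {}
    for name, seq in records:
        unique.setdefault(seq.upper(), name)

    # Visit longer sequences first, so a shorter sequence can only be
    # contained in something that has already been kept.
    kept = []
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return [(unique[seq], seq) for seq in kept]


records = [
    ("seq1", "ATGGCCATTGTAATG"),
    ("seq2", "ATGGCCATTGTAATG"),  # exact duplicate of seq1
    ("seq3", "GCCATTG"),          # exact part of seq1
    ("seq4", "TTTTCCCC"),
]
curated = dereplicate(records)
print(curated)
```

Note that the result comes back sorted by length, not in input order.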
Options:
- Work on either protein (p) or nucleotide (n) databases.
- Two approaches (largest possible length and optimum length).
  - Largest possible length approach: gives the longest sequence, even if it exceeds the length of your gene.
  - Optimum length approach: gives only your gene, provided you supply the approximate length of your protein.
Version 2.0 Updates:
You can filter the sequences by keywords only (separated by commas), inclusively or exclusively, by adding the (-kw) argument to your normal command line.
You can keep your sequences in their original order after dereplication and/or sequence filtration by adding (-org_order) to your normal command line.
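As an illustration of what preserving the original order means, the sketch below re-sorts the surviving records by the position of their names in the input. The `restore_original_order` helper is hypothetical and only shows the idea; it is not how the (-org_order) option is implemented in the program.

```python
def restore_original_order(original_names, surviving):
    """Sort surviving (name, sequence) records by their position in the input."""
    position = {}
    for index, name in enumerate(original_names):
        # If a name occurs more than once, remember its first position.
        position.setdefault(name, index)
    return sorted(surviving, key=lambda record: position[record[0]])


original_names = ["seq1", "seq2", "seq3", "seq4"]
# Survivors as they might come out of a length-sorted dereplication step.
surviving = [("seq3", "ATGGCCATTGTAATG"), ("seq1", "TTTTCCCC")]
ordered = restore_original_order(original_names, surviving)
print(ordered)
```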
How to use:
- You need to install Python 2.7 or Python 3 on your machine.
- You need to install NumPy and Biopython.
- You need to install the future module using the pip command.
- Click “Clone or download” > “Download ZIP” > extract the downloaded file.
- Open the file “sddc.py”:
  - Windows: open it with (python.exe).
  - U/Linux: use the command
    chmod u+x database_curator.py
  - Mac: use the command
    python sddc.py
- State your variables and press Enter.
The program's options are summarized in the Read Me file.
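Before running the program, the dependencies listed under "How to use" can be checked with a short script. This is a sketch for Python 3 only; it assumes the usual import names for the three packages (`numpy` for NumPy, `Bio` for Biopython, `future` for the future module), and the `missing_modules` helper is hypothetical.

```python
import importlib.util

# Assumed import names: NumPy -> "numpy", Biopython -> "Bio", future -> "future".
REQUIRED = ("numpy", "Bio", "future")


def missing_modules(names=REQUIRED):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules, install them with pip:", ", ".join(missing))
else:
    print("All required modules are available.")
```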
Examples
If you want to dereplicate protein sequences, use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode derep
If you want to dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order
If you want to dereplicate protein sequences with a minimum length of 30 when the sequences are in multiple files, use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi
If you want to dereplicate nucleotide sequences with the optimum approach and a normal protein length of 300, use the following command:
python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300
If you want to filter protein sequences inclusively by name (i.e. you want to retrieve only the sequences whose names you have specified), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive
If you want to filter protein sequences inclusively by keyword(s) (i.e. you want to retrieve only the sequences whose names contain the keywords you have specified, separated by commas), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw
If you want to filter protein sequences exclusively by name (i.e. you want to retrieve the sequences that aren't present in your filter file), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive
If you want to filter protein sequences exclusively by keyword(s) in their names (i.e. you want to retrieve the sequences whose names contain none of the keywords, separated by commas, listed in your filter file), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw
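The inclusive and exclusive name and keyword filters shown in the commands above can be sketched as one small function. This is an illustration only: the `filter_by_name` helper is hypothetical, and the case-insensitive keyword matching is an assumption of this sketch, not documented behaviour of the program.

```python
def filter_by_name(records, terms, approach="inclusive", keywords=False):
    """Filter (name, sequence) records by exact names or by keywords in names.

    approach="inclusive" keeps matching records; "exclusive" keeps the rest.
    """
    if keywords:
        # Keyword mode: a record matches if any keyword occurs in its name.
        lowered = [term.strip().lower() for term in terms if term.strip()]

        def matches(name):
            return any(term in name.lower() for term in lowered)
    else:
        # Name mode: a record matches only if its full name is listed.
        wanted = set(terms)

        def matches(name):
            return name in wanted

    keep_matching = approach == "inclusive"
    return [record for record in records if matches(record[0]) == keep_matching]


records = [
    ("P1 serine kinase", "MKT"),
    ("P2 acetyl transferase", "MLL"),
    ("P3 hypothetical protein", "MAA"),
]
# Keywords as they might appear on one comma-separated line of a filter file.
keyword_terms = "kinase, transferase".split(",")

included = filter_by_name(records, keyword_terms, "inclusive", keywords=True)
excluded = filter_by_name(records, keyword_terms, "exclusive", keywords=True)
by_name = filter_by_name(records, ["P3 hypothetical protein"], "inclusive")
print(included, excluded, by_name, sep="\n")
```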
If you want to filter nucleotide sequences by sequence (only exclusive), use the following command:
python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)
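Exclusive filtering by sequence can be pictured as removing every record whose sequence appears in the filter file. The sketch below assumes exact, case-insensitive matching of whole sequences; that matching rule and the `exclude_by_sequence` helper are assumptions for illustration, not the program's documented behaviour.

```python
def exclude_by_sequence(records, unwanted_sequences):
    """Drop (name, sequence) records whose sequence is in `unwanted_sequences`."""
    unwanted = {seq.upper() for seq in unwanted_sequences}
    return [(name, seq) for name, seq in records if seq.upper() not in unwanted]


records = [("seq1", "ATGCGT"), ("seq2", "ttgaca"), ("seq3", "GGGCCC")]
remaining = exclude_by_sequence(records, ["TTGACA", "AAAA"])
print(remaining)
```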
Updates:
Sequence Database Curator accepts a directory of files for easier processing of all the files, even if they are mixed with other files of different extensions.
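Picking out the relevant files from a mixed directory can be sketched with the standard library as follows. The `collect_inputs` helper and the file names are hypothetical and only illustrate selecting files by extension; they are not taken from the program.

```python
import tempfile
from pathlib import Path


def collect_inputs(directory, extension):
    """Return the files in `directory` whose extension matches `extension`."""
    wanted = "." + extension.lstrip(".").lower()
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == wanted
    )


# Demonstration on a throwaway directory that mixes FASTA files with a text file.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("genes_a.fasta", "genes_b.fasta", "notes.txt"):
        Path(tmp, filename).write_text(">seq\nATGC\n")
    found_names = [path.name for path in collect_inputs(tmp, "fasta")]

print(found_names)
```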
NEW!! The Sequence Database Curator program now offers faster processing and command line usage.
Hi Wouter,
First of all, I'm so glad for your comments that is a typical way of learning for me (as it is my first code while I'm not a developer). Second, it was a great improvement so, Here is the modified file database_curator.3.py
Thanks again for your help.
Best Regards,
Eslam
Hi Eslam,
Looks like you did a great job, it's an impressive improvement indeed. Never stop learning and happy coding!
Wouter