Question: Genomic Restriction Finder
0
gravatar for rhasbunz
4.5 years ago by
rhasbunz10
Chile
rhasbunz10 wrote:

Dear all

I need to create a bed file with all restriction sites of a genome in FASTA. The bed table need to have in parallel the result for three different enzymes and it is necessary to put the strand of site.

An example of an extract table is:

chr18 10786 10790 CCGC 1000 +
chr18 10790 10794 CCGC 1000 +
chr18 10822 10826 CCGC 1000 +
chr18 10828 10832 CCGC 1000 +
chr18 10830 10834 GCGG 1000 -
chr18 10947 10951 GCGG 1000 -
chr18 10976 10980 GCGG 1000 -
chr18 11005 11009 GCGG 1000 -
chr18 11034 11038 GCGG 1000 -
chr18 11063 11067 GCGG 1000 -
chr18 11098 11102 GCGG 1000 -
chr18 11133 11137 GCGG 1000 -
chr18 11255 11259 GCGG 1000 -
chr18 11399 11403 CCGC 1000 +
chr18 11430 11434 CCGC 1000 +
chr18 11738 11742 GCGG 1000 -
chr18 11760 11764 CCGC 1000 +

I am using a script written by Ryan Dale for CGAT modules but it generate an incomplete table for my species.

scaffold_1    571    575
scaffold_1    2276    2280
scaffold_1    2610    2614
scaffold_1    2632    2636
scaffold_1    2639    2643
scaffold_1    2670    2674
scaffold_1    2773    2777
scaffold_1    2860    2864
scaffold_1    4544    4548
scaffold_1    5764    5768
scaffold_1    7321    7325
scaffold_1    7603    7607

How could complete the script (see below) to generate the table I that I wish?

Best and I hope your suggestions


#!/usr/bin/python
usage ="""

Makes a BED file of the cut sites of a specified restriction enzyme.

Example usage:

# Get BED file of DpnI sites in dm3.fa
python restriction-finder.py --fasta dm3.fa --enzyme DpnI --bed DpnI-sites.bed

# can pipe to BedTools to get, e.g, sites in genes::
python restriction-finder.py --fasta myfasta.fa --enzyme DpnI | intersectBed -a stdin -b genes.bed > DpnI-in-genes.bed


Created 13 Aug 2010 by Ryan Dale"""
try:
    from Bio import SeqIO
    from Bio import Restriction
except ImportError:
    sys.stderr.write("\nPlease install BioPython to use this script <http://biopython.org/wiki/Biopython>\n")
import optparse
import sys
import os

op = optparse.OptionParser(usage=usage)
op.add_option('--fasta', help='Required FASTA file containing sequences to search')
op.add_option('--enzyme', help='Required enzyme name, case sensitive (e.g., DpnI or EcoRI)')
op.add_option('--bed',help='BED file to create. If not specified, output will print to stdout.')
options,args = op.parse_args()

# Input error checking...
def err(s):
    op.print_help()
    sys.stderr.write('\n***ERROR: %s***\n\n'%s)
    sys.exit(1)
# Hack to import just the enzyme you want from the Restriction module
if options.enzyme is None:
    err('Please specify an enzyme with --enzyme')
if options.fasta is None:
    err('Please specify a FASTA file with --fasta')
try:
    exec('from Bio.Restriction import %s as restr' % options.enzyme)
except ImportError:
    err('No restriction enzyme "%s" found!' % options.enzyme)

if not os.path.exists(options.fasta):
    err('FASTA file %s not found'%options.fasta)

if options.bed is None:
    fout = sys.stdout
else:
    fout = open(options.bed,'w')


# Let BioPython do the work...
parser = SeqIO.parse(options.fasta,'fasta')
for chrom in parser:
    sys.stderr.writechrom.name+'\n')
    hits = restr.search(chrom.seq)
    for hit in hits:
        values = [chrom.name,
                  str(hit),
                  str(hit+4)]
        fout.write('\t'.join(values)+'\n')
        fout.flush()
if options.bed is not None:
    fout.close()
fasta • 5.2k views
ADD COMMENTlink modified 3.1 years ago • written 4.5 years ago by rhasbunz10

I think I've fixed all of your code formatting. You can denote something as code be beginning a line with 4 spaces. Particularly with python, it's tough to follow something otherwise.

ADD REPLYlink written 4.5 years ago by Devon Ryan79k

Have you read the relevant section of the biopython cookbook?

ADD REPLYlink written 4.5 years ago by Devon Ryan79k

the "strand of site" for a palindromic sequence ?

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

Not all restriction enzymes are palindromic :)

ADD REPLYlink written 4.5 years ago by Devon Ryan79k

that's why I said, "palindromic" ;-)

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

All my enzymes are not palindromic; HpaII AciI HinGI

ADD REPLYlink written 4.5 years ago by rhasbunz10

I see. OK.

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

see also In silico genome digestion ?/

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

My problem is here in the script:

for chrom in parser:
    sys.stderr.writechrom.name+'\n')
    hits = restr.search(chrom.seq)
    for hit in hits:
        values = [chrom.name, str(hit), str(hit+4)]
        fout.write('\t'.join(values)+'\n')
        fout.flush()

I don't know how to incorporate the enzyme name and strand of site in the bed table.

ADD REPLYlink modified 4.5 years ago by Devon Ryan79k • written 4.5 years ago by rhasbunz10

You'll want to create a RestrictionBatch. When you search with that, everytime you run the search you'll get a dict returned, so you'll know which enzyme is doing the cutting. For the strand, you'll have to manually look as I don't think the biopython methods return that (they might assume palindromic sequences, I don't know).

ADD REPLYlink written 4.5 years ago by Devon Ryan79k
2
gravatar for Pierre Lindenbaum
4.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum107k wrote:

I wrote a tool to digest a genome: https://github.com/lindenb/jvarkit/wiki/Biostar86480

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz" |\
gunzip -c  |\
java -jar dist/biostar86480.jar -E AarI -E EcoRI 



chr3    60645    60651    GAATTC    1000    +    EcoRI    G^AATTC
chr3    60953    60959    GAATTC    1000    +    EcoRI    G^AATTC
chr3    68165    68172    GCAGGTG    1000    -    AarI    CACCTGC(4/8)
chr3    70263    70269    GAATTC    1000    +    EcoRI    G^AATTC
chr3    70945    70952    GCAGGTG    1000    -    AarI    CACCTGC(4/8)
chr3    71140    71146    GAATTC    1000    +    EcoRI    G^AATTC
chr3    72264    72270    GAATTC    1000    +    EcoRI    G^AATTC
chr3    74150    74156    GAATTC    1000    +    EcoRI    G^AATTC
chr3    75063    75069    GAATTC    1000    +    EcoRI    G^AATTC
chr3    78438    78444    GAATTC    1000    +    EcoRI    G^AATTC
chr3    81052    81059    CACCTGC    1000    +    AarI    CACCTGC(4/8)
chr3    84498    84504    GAATTC    1000    +    EcoRI    G^AATTC
chr3    84546    84552    GAATTC    1000    +    EcoRI    G^AATTC
chr3    84780    84787    CACCTGC    1000    +    AarI    CACCTGC(4/8)
chr3    87771    87777    GAATTC    1000    +    EcoRI    G^AATTC
chr3    95344    95351    GCAGGTG    1000    -    AarI    CACCTGC(4/8)
chr3    96358    96364    GAATTC    1000    +    EcoRI    G^AATTC
chr3    96734    96740    GAATTC    1000    +    EcoRI    G^AATTC
chr3    105956    105962    GAATTC    1000    +    EcoRI    G^AATTC
chr3    107451    107457    GAATTC    1000    +    EcoRI    G^AATTC
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Pierre Lindenbaum107k

Hi Pierre, thanks for the answer. I use your instruction but the result was the next:

curl -s "ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Egrandis/assembly/Egrandis_201.fa.gz" |\

gunzip -c |\

java -jar dist/biostar86480.jar -E HpaII AciI Hin6I

Unable to access jarfile dist/biostar86480.jar

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by rhasbunz10

did you compile the tool ?

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

Hi Pierre, no doubt that the development tool is perfect for my purpose but I can't install jvarkit. Given my inexperience I don't understand the instruction in https://github.com/lindenb/jvarkit/wiki/Compilation#edit-buildproperties Can you help me?

Rodrigo

$ ant biostar86480 Buildfile: /Users/rhasbun/jvarkit/build.xml

biostar86480: [echo] if the next javac statement fails, check that the 'build.properties' has been properly configured [javac] Compiling 1 source file to /Users/rhasbun/jvarkit/tmp

BUILD FAILED /Users/rhasbun/jvarkit/build.xml:974: The following error occurred while executing this line: /Users/rhasbun/jvarkit/build.xml:213: The following error occurred while executing this line: /Users/rhasbun/jvarkit/build.xml:153: /commun/data/packages/picard-tools-1.100 does not exist.

ADD REPLYlink written 4.5 years ago by rhasbunz10

download and extract the picard library from https://sourceforge.net/projects/picard/files/picard-tools/1.100/ with a text editor open build.properties and set the path in picard.dir=/path/to/the/directory/containing/picard-100

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

Perfect Pierre, It's work!!! Only one question: How can incorporate one enzyme? For Hin6I the error message is: Cannot find enzyme null in RE list

Best

ADD REPLYlink written 4.5 years ago by rhasbunz10

You're rright, it's missing in src/main/java/com/github/lindenb/jvarkit/util/bio/Rebase.java . Add it in the code and recompile or use HhaI (cuts GCG^C)

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

Fantastic...it's perfect....thanks so much Pierre

ADD REPLYlink written 4.5 years ago by rhasbunz10

Hi Pierre, it's possible to use this tool for detect all CpG sites of a genome in FASTA? The output formate (bed table) would be the same

ADD REPLYlink written 4.5 years ago by rhasbunz10

ask this as a new question please.

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum107k

Hi Pierre,

I would like to apply the tool for a local reference genome stored in my MAC, How can make it?

Best

Rodrigo

ADD REPLYlink written 3.6 years ago by rhasbunz10

java -jar dist/biostar86480.jar -E AarI -E EcoRI < your.genome.fa

ADD REPLYlink written 3.6 years ago by Pierre Lindenbaum107k

Hi Pierre, thank you for your tool that is very useful for me. But still one question, can i output the results to a file and count the number of the sites automaticly for each enzyme?

ADD REPLYlink written 3.5 years ago by austindai0

"can i output the results to a file" : see http://www.tldp.org/LDP/abs/html/io-redirection.html

"count the number of the sites automaticly for each enzyme" : http://www.linuxjournal.com/article/7396

ADD REPLYlink written 3.5 years ago by Pierre Lindenbaum107k

Sorry, i am the beginner of linux, thank you for your help. Additional no related question is OK?

I have analyzed the HpaII and MboI sites in all promoter sequence of refgenes. I get the results as follows:

For example,

NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    109    113    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    176    180    CCGG    1000    +    HpaII    C^CGG
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    204    208    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    831    835    CCGG    1000    +    HpaII    C^CGG
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    980    984    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    1047    1051    CCGG    1000    +    HpaII    C^CGG
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    1358    1362    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    1911    1915    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    717    721    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1361    1365    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1791    1795    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1860    1864    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1954    1958    CCGG    1000    +    HpaII    C^CGG

So next i want to pick the genes in which the seqences cantain both HpaII and MspI sites, if possible set the at least number for each site as parameter, then print out the name of the genes (Line1 of the results) and the number for each site, how should i do, could you help me , thank you very much!

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by austindai0

Hi Pierre, Thanks a lot for this awesome tool. I am not sure if commenting on this old thread is good, but I had a quick question about the tool. I wanted to cut hg18 using 2 enzymes whose name I do not know (it's provided by a company) but I do know that the restriction sites for the 2 are - GATC^, and GANT^C (N=A/T/G/C). Do you think there is a way to do this using this tool? Thanks!

ADD REPLYlink written 10 days ago by chahatupretiiitk50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 624 users visited in the last hour