Question: Annotating/Finding "NNNNNN" scaffold gaps in genome assembly
1
gravatar for Adrian Pelin
3.6 years ago by
Adrian Pelin2.2k
Canada
Adrian Pelin2.2k wrote:

Hello,

After denovo assembly, most algorithms (I am using Allpaths in this case) do scaffolding. This places contigs into scaffolds bridged by NNNNNN sequences.

I wanted to visually validate these bridges of NNNN sequences by mapping mate pairs back to the assemblies and checking the links. This is difficult to do because I have to manually search for these bridges.

Is there any tool, that provided a fasta file can annotate all "NNNNNN" regions into a bed file let's say? I am wondering if this can even be coded with perl simply.

Thanks,

Adrian

assembly scaffolding ngs bed • 2.3k views
ADD COMMENTlink modified 8 months ago by michael0 • written 3.6 years ago by Adrian Pelin2.2k
3
gravatar for James Ashmore
3.6 years ago by
James Ashmore2.6k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.6k wrote:

If you have BioPython installed you could use this small Python script to print such regions in BED3 format.

#!/usr/bin/env python3

# Import necessary packages
import argparse
import re
from Bio import SeqIO

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("fasta")
args = parser.parse_args()

# Open FASTA, search for masked regions, print in BED3 format
with open(args.fasta) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        for match in re.finditer('N+', str(record.seq)):
            print(record.id, match.start(), match.end(), sep='\t')

 

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by James Ashmore2.6k

Excellent, this works really well! Since I am not a python expert, what do I need to replace "with open('mm10.fa') as handle:" so that I can simply issue python FindN.py FASTA-file.fasta? Thanks again! I can also use awk '($3-$2) >= 50' x.bed to filter based on how long I want intervals to be.

ADD REPLYlink written 3.6 years ago by Adrian Pelin2.2k

I've edited the code in my answer so that you can specify the FASTA file on the command-line.

ADD REPLYlink written 3.6 years ago by James Ashmore2.6k

Excellent, works! Thanks!

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Adrian Pelin2.2k
0
gravatar for michael
8 months ago by
michael0
Germany/Stuttgart/University of Hohenheim
michael0 wrote:

Hi there and thank you for the script. I have trouble getting it running. It says "

... , line 17 printrecord.id, match.start(), match.end(), sep='\t') ^ SyntaxError: invalid syntax

I did run it with Python27 and Biopython on a Win10 computer. Any suggestions?

ADD COMMENTlink written 8 months ago by michael0

The script was written for Python 3. Either install Python 3 or try changing line 17 to :

print record.id, match.start(), match.end()
ADD REPLYlink written 8 months ago by James Ashmore2.6k

Yes, sorry I did overlook that. However, I made it work with Python3 and changed the output to gff3 . In case someone needs this:

 #!/usr/bin/env python3

 # Import necessary packages
 import argparse
 import re
 from Bio import SeqIO

 # Parse command-line arguments
 parser = argparse.ArgumentParser()
 parser.add_argument("fasta")
 args = parser.parse_args()

 # Open FASTA, search for masked regions, print in GFF3 format
 with open(args.fasta) as handle:
     for record in SeqIO.parse(handle, "fasta"):
         for match in re.finditer('N+', str(record.seq)):
             print record.id, ".", "assembly gap", match.start(), match.end(), ".", ".", ".", ".", sep='\t')

 #use the following at CMD: FILENAME.py FILENAME.fasta >> FILENAME.gff3  here

ADD REPLYlink modified 8 months ago • written 8 months ago by michael0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1183 users visited in the last hour