Question

Annotating/Finding "NNNNNN" scaffold gaps in genome assembly

1

Entering edit mode

8.7 years ago

Adrian Pelin ★ 2.6k

Hello,

After denovo assembly, most algorithms (I am using Allpaths in this case) do scaffolding. This places contigs into scaffolds bridged by NNNNNN sequences.

I wanted to visually validate these bridges of NNNN sequences by mapping mate pairs back to the assemblies and checking the links. This is difficult to do because I have to manually search for these bridges.

Is there any tool, that provided a fasta file can annotate all "NNNNNN" regions into a bed file let's say? I am wondering if this can even be coded with perl simply.

Thanks,

Adrian

bed NGS scaffolding Assembly • 6.7k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by Adrian Pelin ★ 2.6k

1

Entering edit mode

perl one-liner to do this.

ADD REPLY • link 5.0 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Hi there and thank you for the script. I have trouble getting it running. It says "

... , line 17 printrecord.id, match.start(), match.end(), sep='\t') ^ SyntaxError: invalid syntax

I did run it with Python27 and Biopython on a Win10 computer. Any suggestions?

ADD REPLY • link 5.8 years ago by michael • 0

0

Entering edit mode

The script was written for Python 3. Either install Python 3 or try changing line 17 to :

print record.id, match.start(), match.end()

ADD REPLY • link 5.8 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

Yes, sorry I did overlook that. However, I made it work with Python3 and changed the output to gff3 . In case someone needs this:

 #!/usr/bin/env python3

 # Import necessary packages
 import argparse
 import re
 from Bio import SeqIO

 # Parse command-line arguments
 parser = argparse.ArgumentParser()
 parser.add_argument("fasta")
 args = parser.parse_args()

 # Open FASTA, search for masked regions, print in GFF3 format
 with open(args.fasta) as handle:
     for record in SeqIO.parse(handle, "fasta"):
         for match in re.finditer('N+', str(record.seq)):
             print record.id, ".", "assembly gap", match.start(), match.end(), ".", ".", ".", ".", sep='\t')

 #use the following at CMD: FILENAME.py FILENAME.fasta >> FILENAME.gff3  here

ADD REPLY • link 5.8 years ago by michael • 0

Ram · Answer 1 · 2015-07-29

9

Entering edit mode

8.7 years ago

James Ashmore ★ 3.4k

If you have BioPython installed you could use this small Python script to print such regions in BED3 format.

#!/usr/bin/env python3

# Import necessary packages
import argparse
import re
from Bio import SeqIO

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("fasta")
args = parser.parse_args()

# Open FASTA, search for masked regions, print in BED3 format
with open(args.fasta) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        for match in re.finditer('N+', str(record.seq)):
            print(record.id, match.start(), match.end(), sep='\t')

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

Excellent, this works really well! Since I am not a python expert, what do I need to replace "with open('mm10.fa') as handle:" so that I can simply issue python FindN.py FASTA-file.fasta? Thanks again! I can also use awk '($3-$2) >= 50' x.bed to filter based on how long I want intervals to be.

ADD REPLY • link 8.7 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

I've edited the code in my answer so that you can specify the FASTA file on the command-line.