Annotating/Finding "NNNNNN" scaffold gaps in genome assembly
1
1
Entering edit mode
8.7 years ago
Adrian Pelin ★ 2.6k

Hello,

After denovo assembly, most algorithms (I am using Allpaths in this case) do scaffolding. This places contigs into scaffolds bridged by NNNNNN sequences.

I wanted to visually validate these bridges of NNNN sequences by mapping mate pairs back to the assemblies and checking the links. This is difficult to do because I have to manually search for these bridges.

Is there any tool, that provided a fasta file can annotate all "NNNNNN" regions into a bed file let's say? I am wondering if this can even be coded with perl simply.

Thanks,

Adrian

bed NGS scaffolding Assembly • 6.7k views
ADD COMMENT
1
Entering edit mode

perl one-liner to do this.

ADD REPLY
0
Entering edit mode

Hi there and thank you for the script. I have trouble getting it running. It says "

... , line 17 printrecord.id, match.start(), match.end(), sep='\t') ^ SyntaxError: invalid syntax

I did run it with Python27 and Biopython on a Win10 computer. Any suggestions?

ADD REPLY
0
Entering edit mode

The script was written for Python 3. Either install Python 3 or try changing line 17 to :

print record.id, match.start(), match.end()
ADD REPLY
0
Entering edit mode

Yes, sorry I did overlook that. However, I made it work with Python3 and changed the output to gff3 . In case someone needs this:

 #!/usr/bin/env python3

 # Import necessary packages
 import argparse
 import re
 from Bio import SeqIO

 # Parse command-line arguments
 parser = argparse.ArgumentParser()
 parser.add_argument("fasta")
 args = parser.parse_args()

 # Open FASTA, search for masked regions, print in GFF3 format
 with open(args.fasta) as handle:
     for record in SeqIO.parse(handle, "fasta"):
         for match in re.finditer('N+', str(record.seq)):
             print record.id, ".", "assembly gap", match.start(), match.end(), ".", ".", ".", ".", sep='\t')

 #use the following at CMD: FILENAME.py FILENAME.fasta >> FILENAME.gff3  here

ADD REPLY
9
Entering edit mode
8.7 years ago
James Ashmore ★ 3.4k

If you have BioPython installed you could use this small Python script to print such regions in BED3 format.

#!/usr/bin/env python3

# Import necessary packages
import argparse
import re
from Bio import SeqIO

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("fasta")
args = parser.parse_args()

# Open FASTA, search for masked regions, print in BED3 format
with open(args.fasta) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        for match in re.finditer('N+', str(record.seq)):
            print(record.id, match.start(), match.end(), sep='\t')
ADD COMMENT
0
Entering edit mode

Excellent, this works really well! Since I am not a python expert, what do I need to replace "with open('mm10.fa') as handle:" so that I can simply issue python FindN.py FASTA-file.fasta? Thanks again! I can also use awk '($3-$2) >= 50' x.bed to filter based on how long I want intervals to be.

ADD REPLY
0
Entering edit mode

I've edited the code in my answer so that you can specify the FASTA file on the command-line.

ADD REPLY
0
Entering edit mode

Excellent, works! Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6