Question: Filter contigs by size: different output between QUAST report and Python output
Asked 13 months ago by Dave Th:

Hi all,

I'm trying to split my contig dataset into separate files by minimum length (500 bp, 1 kb, 2 kb, and so on). I'm using the code below to produce the output.

from Bio import SeqIO

def contigs_filter_by_length(fasta_input, size, fasta_output):
    long_contigs = []  # Records that pass the length cutoff
    for record in SeqIO.parse(fasta_input, "fasta"):
        # Keep contigs whose sequence length is at least `size`
        if len(record.seq) >= size:
            long_contigs.append(record)
    print("Found %i contigs" % len(long_contigs))
    SeqIO.write(long_contigs, fasta_output, "fasta")

The problem is that when I cross-checked the QUAST report for my input file against the output of this code, the counts differed substantially: QUAST reports 119787 contigs >= 500 bp, while the FASTA output from the code contains 122046 contigs >= 500 bp.

Is there anything wrong in my code that could lead to this difference?

sequence assembly • 380 views

I don't see anything wrong in your code. Have you compared the results directly? Try finding some contigs that your Python code reports but QUAST does not, to see what causes the difference.
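
As a starting point for that comparison, here is a minimal sketch that tallies contig counts at several length cutoffs so they can be set side by side with the corresponding rows of the QUAST report. It uses a small plain-Python FASTA reader rather than Biopython, so it gives a count that is independent of the original script. The function names `fasta_lengths` and `count_at_thresholds` and the default cutoffs are illustrative assumptions, not part of QUAST or Biopython.

```python
def fasta_lengths(lines):
    """Yield (contig_id, length) for each record in an iterable of FASTA lines."""
    contig_id, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if contig_id is not None:
                yield contig_id, length
            # Use the first whitespace-delimited token of the header as the ID.
            contig_id = (line[1:].split() or [""])[0]
            length = 0
        elif contig_id is not None:
            length += len(line)
    if contig_id is not None:
        yield contig_id, length


def count_at_thresholds(lines, thresholds=(0, 500, 1000, 2000)):
    """Count contigs whose length is >= each threshold (hypothetical cutoffs)."""
    counts = {t: 0 for t in thresholds}
    for _, length in fasta_lengths(lines):
        for t in thresholds:
            if length >= t:
                counts[t] += 1
    return counts


# Example usage (the file name is a placeholder):
# with open("contigs.fasta") as handle:
#     print(count_at_thresholds(handle))
```

If these counts match the Python script but not QUAST, the difference most likely lies in how QUAST was run or what it counts, not in the filtering code.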

written 13 months ago by yztxwd

I think this might be the key.

QUAST may be doing some additional filtering, such as deduplication or removal of 'junk' sequences that are obvious misassembly artefacts.

I'm not 100% certain, but that would be my immediate guess.
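
To test that guess on the input itself, a rough sketch like the one below can report exact duplicate sequences and contigs made up entirely of N among those passing the length cutoff. Whether QUAST actually drops such records is not confirmed here; the code only shows whether the input contains enough of them to account for the gap. The names `fasta_records` and `find_suspect_records` are hypothetical helpers written for this example.

```python
from collections import defaultdict


def fasta_records(lines):
    """Yield (contig_id, sequence) for each record in an iterable of FASTA lines."""
    contig_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if contig_id is not None:
                yield contig_id, "".join(chunks)
            contig_id = (line[1:].split() or [""])[0]
            chunks = []
        elif contig_id is not None:
            chunks.append(line)
    if contig_id is not None:
        yield contig_id, "".join(chunks)


def find_suspect_records(lines, min_size=500):
    """Return (duplicate_groups, all_n_ids) among contigs with length >= min_size.

    Assumption: 'suspect' here means an exact (case-insensitive) duplicate
    sequence or a sequence consisting only of N. This is a guess at what a
    tool might filter, not QUAST's documented behaviour.
    """
    by_sequence = defaultdict(list)  # keeps whole sequences in memory
    all_n = []
    for contig_id, seq in fasta_records(lines):
        if len(seq) < min_size:
            continue
        upper = seq.upper()
        by_sequence[upper].append(contig_id)
        if upper and set(upper) <= {"N"}:
            all_n.append(contig_id)
    duplicates = [ids for ids in by_sequence.values() if len(ids) > 1]
    return duplicates, all_n
```

For very large assemblies, storing a hash of each sequence instead of the sequence itself would reduce memory use.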

written 13 months ago by Joe