Question: Software for cleaning FASTA files
2.7 years ago by l.souza (70), Brasilia, Brazil, wrote:

What software can I use to remove sequences that:

  • contain a lot of unknown nucleotides (N)
  • are duplicated
  • are too short?

sequences alignment cleaning • 844 views
modified 2.7 years ago • written 2.7 years ago by l.souza (70)

Hi Lucas Souza,

This question contains insufficient information to be answered. Please elaborate, for example on the technology used to generate the data. We are really bad at reading your mind (that's a bug in Biostars which will get fixed in an upcoming release). So for now, put some effort into your question and don't make it hard for people trying to help you.

Cheers,
Wouter

written 2.7 years ago by WouterDeCoster (43k)

Edited. Is it better?

written 2.7 years ago by l.souza (70)

Obviously, all the information you just added is vital for answering the question. By now we have lost 9 hours since you asked it. For future questions, remember that it's important to be as informative as possible.

written 2.7 years ago by WouterDeCoster (43k)

Must you use entire genomes? Are there particular genes that are perhaps informative enough?

modified 2.7 years ago • written 2.7 years ago by genomax (78k)

The entire genome produces a single polyprotein that is cleaved post-translationally. For this reason I think it would be better to use the entire genome, but I may be wrong about this.

modified 2.7 years ago • written 2.7 years ago by l.souza (70)

Lucas Souza: Please do not change the entire contents of an original question. That makes the chain of responses and comments here meaningless. Consider this a fair warning. I will tag @Istvan to see if he can restore the original question.

modified 2.7 years ago • written 2.7 years ago by genomax (78k)

Tagging: Istvan Albert

written 2.7 years ago by genomax (78k)
2.7 years ago by Brian Bushnell (17k), Walnut Creek, USA, wrote:

Hi Lucas,

You can accomplish most of what you want using the BBMap package.

Filtering Ns and short sequences:

bbduk.sh in=file.fasta out=filtered.fasta maxns=50 minlen=1000

Removing exact duplicates and containments:

dedupe.sh in=file.fasta out=deduped.fasta

Unfortunately BBDuk does not have a mechanism for removing sequences based on the % of Ns, just the absolute number, but I may add that.
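
For the percentage-based N filter mentioned above, a small standalone script could serve as a stopgap. The following is only a minimal Python sketch, not part of BBMap: the function names (`read_fasta`, `n_fraction`, `filter_by_n_fraction`) and the 5% default threshold are assumptions made for illustration, and it reads plain, uncompressed FASTA text only.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq: str) -> float:
    """Fraction of the sequence made up of N/n characters (0.0 if empty)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def filter_by_n_fraction(
    records: Iterable[Tuple[str, str]],
    max_n_fraction: float = 0.05,  # hypothetical default, adjust to your data
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose N fraction is at most max_n_fraction."""
    for header, seq in records:
        if n_fraction(seq) <= max_n_fraction:
            yield header, seq
```

To use it on a file, open the FASTA with `open("file.fasta")`, pass the handle to `read_fasta`, and write each surviving record back out as `>header` followed by the sequence. The output could then go through bbduk.sh and dedupe.sh as shown above for the length and duplicate filters.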

written 2.7 years ago by Brian Bushnell (17k)

Wait, there is a task that none of the BB-Tools can do? :)

written 2.7 years ago by finswimmer (13k)

Geez, like I said, I may add it :) I already wrote myself a note!

written 2.7 years ago by Brian Bushnell (17k)