Question: Remove unwanted characters from FASTA file
4.5 years ago by
seta1.2k wrote:

Hi everybody,

I have a FASTA file of nucleotide sequences in which many sequences contain invalid characters such as 'e', 'q', 'i' and 'l'. Some headers also carry an item named "unavailable sequence". MEME-ChIP reported these characters as errors. Could anybody please let me know how to find these characters and the "unavailable sequence" items, and how to remove them? Any suggestions or commands are warmly welcomed.

rna-seq next-gen alignment • 6.9k views
4.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

Clean up the FASTA file:

 sed -e '/^[^>]/s/[^ATGCatgc]/N/g' file.fa
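
To first see which lines are affected before replacing anything, a minimal sketch along these lines may help. It assumes one header line starting with '>' per record and treats only A, T, G, C (either case) as valid, as the sed command above does. The helper name find_invalid and the example file are made up for illustration.

```shell
# List sequence lines that contain anything other than A, T, G, C (either case).
# find_invalid is a hypothetical helper name; pass it the path to a FASTA file.
find_invalid() {
  # !/^>/ skips header lines; NR is the line number within the file.
  awk '!/^>/ && /[^ATGCatgc]/ { print NR ": " $0 }' "$1"
}

# Small made-up input, for demonstration only.
printf '>seq1\nACGTqe\n>seq2\nACGT\n' > example_invalid.fa
find_invalid example_invalid.fa
```

A line holding the text "Sequence unavailable" would be listed as well, since it is not made of nucleotide letters.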

Many thanks, and sorry for such a basic question. I am not familiar with the above command. Could you please tell me which unwanted characters it removes? And how can I remove a header together with its content when that content is actually nothing and is marked "unavailable sequence"? Thanks.

written 4.5 years ago by seta1.2k

The sed command replaces every character other than A, T, C or G (upper or lower case) with 'N' on sequence lines and leaves header lines untouched. Read sed commands like this:

sed -CMD_LINE_OPTIONS '/<pattern_that_matches_line_to_process>/<operation>/<text_or_expression_to_be_replaced>/<new_text_or_expression>/<options>' INPUT_FILE
written 4.5 years ago by RamRS22k
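
As an illustration, here is the command from the answer mapped onto that template piece by piece. This is only a sketch: the wrapper name clean_fasta and the example file are made up, and the comments describe standard sed behaviour.

```shell
# clean_fasta is a hypothetical wrapper around the command from the answer.
clean_fasta() {
  # -e '...'       command-line option: the script to run
  # /^[^>]/        pattern: only lines whose first character is not ">"
  # s              operation: substitute
  # [^ATGCatgc]    text to be replaced: any single character outside this set
  # N              new text
  # g              option: replace every match on the line, not just the first
  sed -e '/^[^>]/s/[^ATGCatgc]/N/g' "$1"
}

# Small made-up input, for demonstration only.
printf '>seq1 example header\nACGTqeACGT\n' > example_clean.fa
clean_fasta example_clean.fa
```

The header line is printed unchanged, and the sequence line comes out as ACGTNNACGT.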

Thanks a lot for the clear explanation. As I mentioned, some items read "Sequence unavailable" under the related header, like ">AT1G01740|AT1G01740.1 Sequence unavailable". With the sed command this changes to ">AT1G01740|AT1G01740.1 NNNNNNcNNNNaNaNNaNNN". Could you please tell me how to remove these headers together with their contents, which are actually nothing? Many thanks for your helpful commands.

modified 4.5 years ago • written 4.5 years ago by seta1.2k
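
One possible way to drop such records is sketched below with awk. It assumes that every record starts with a header line beginning with '>' and that the text "Sequence unavailable" (in any letter case) appears either on the header line or on the lines under it. The function name drop_unavailable and all example IDs except AT1G01740 are made up for illustration.

```shell
# drop_unavailable is a hypothetical helper: it prints every FASTA record
# except those containing the text "sequence unavailable" (case-insensitive).
drop_unavailable() {
  awk '
    # Print the buffered record unless it is marked as unavailable.
    function emit() {
      if (header != "" && tolower(header "\n" body) !~ /sequence unavailable/)
        printf "%s\n%s", header, body
    }
    /^>/ { emit(); header = $0; body = ""; next }  # new record starts here
    { body = body $0 "\n" }                        # collect sequence lines
    END { emit() }                                 # do not forget the last record
  ' "$1"
}

# Small made-up input, for demonstration only.
printf '>AT1G01010|AT1G01010.1\nACGTACGT\nGGCC\n>AT1G01740|AT1G01740.1\nSequence unavailable\n' > example_unavailable.fa
drop_unavailable example_unavailable.fa
```

This would need to run before the sed cleanup, because afterwards the text has already been turned into N characters and can no longer be matched.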