Question

How to extract a fasta sequence row using AWK into new file that has no "_" in it.

0

Entering edit mode

8.4 years ago

usamaabid3 • 0

cat myseq.fasta | grep '>' | head -n 5

>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448

The IDs represent a unique identifier; those that have the _ have been grouped by sequence similarity based on BLAST results. For example,PZ7180000000447_A and PZ15399_A are in the same group, whilePZ7180000000079_AF and PZ5729_AF are together in a different group. Those without any _ suffix had no homology to other sequences, and so weren't put in a group.

I want to put non-group in file named no_suffix.fasta

I want to do this all using AWK, Grep and SED. Please help

sequence awk fasta • 3.3k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by usamaabid3 • 0

0

Entering edit mode

Why only awk, sed and grep? Is this an assignment?

ADD REPLY • link 8.4 years ago by Ram 43k

0

Entering edit mode

No I have studied those I also understand bioawk... I tried python but it didn't worked for me was tough :(

ADD REPLY • link 8.4 years ago by usamaabid3 • 0

0

Entering edit mode

It might be a good idea to use bioawk then. You should not have to design yet another parser for an ad hoc task.

ADD REPLY • link 8.4 years ago by Ram 43k

1

Entering edit mode

8.4 years ago

Ram 43k

This is simple text processing, not bioinformatics, but I'll let the other mods decide if this question should remain open. Essentially, you can use bioawk to parse the fasta. If you do not wish to use any intelligent parser (which is a bad decision), use grep with '^>[^_]+' to get the relevant headers, and use it to get your sequences.

Or, use an existing parser so you can compare the header and copy over the sequence in a single step without overly complicating your command.

ADD COMMENT • link 4.4 years ago by Ram 43k

0

Entering edit mode

8.4 years ago

Daniel ★ 4.0k

I think this is much simpler than above, using the -A flag for "number of lines after match". If your fasta is already in two line format (no newlines within the sequence) just do:

grep '_' -A 1 myfile.fasta >groups_only.fasta
grep -v '_' -A 1 myfile.fasta >no_groups.fasta

If you need to convert a multi-line fasta into a single-line fasta, I use this simple script:

#!/usr/bin/perl
$filename = $ARGV[0];
chomp(@lines = <>);

foreach $line (@lines){
        if($line =~ /^>/){
                $line = "\n$line\n";
        }
        if($line != /^$/){
                print "$line";
        }
}

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Daniel ★ 4.0k

Ram · Accepted Answer · 2015-12-13

2

Entering edit mode

8.4 years ago

Pierre Lindenbaum 161k

linearize the fasta (see here) , filter out the sequences containing a _ using grep -v '_', convert back to fasta and redirect to no_suffix.fasta

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Why linearize, Pierre?

ADD REPLY • link 8.4 years ago by Ram 43k

Ram · Accepted Answer · 2015-12-13

2

Entering edit mode

8.4 years ago

Matt Shirley 10k

$ pip install pyfaidx
$ faidx --regex [^_] input.fasta > output.fasta

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Matt Shirley 10k