Question: How to extract a fasta sequence row using AWK into new file that has no "_" in it.
0
gravatar for usamaabid3
3.5 years ago by
usamaabid30
usamaabid30 wrote:

cat myseq.fasta | grep '>' | head -n 5;

>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448

The IDs represent a unique identifier; those that have the _ have been grouped by sequence similarity based on BLAST results. For example,PZ7180000000447_A and PZ15399_A are in the same group, whilePZ7180000000079_AF and PZ5729_AF are together in a different group. Those without any _ suffix had no homology to other sequences, and so weren’t put in a group.

I want to put non-group in file named no_suffix.fasta

I want to do this all using AWK, Grep and SED. Please help

awk sequence fasta • 1.4k views
ADD COMMENTlink modified 3.5 years ago by Daniel3.7k • written 3.5 years ago by usamaabid30

Why only awk, sed and grep? Is this an assignment?

ADD REPLYlink written 3.5 years ago by RamRS22k

No I have studied those I also understand bioawk... I tried python but it didn't worked for me was tough :(

ADD REPLYlink written 3.5 years ago by usamaabid30

It might be a good idea to use bioawk then. You should not have to design yet another parser for an ad hoc task.

ADD REPLYlink written 3.5 years ago by RamRS22k
2
gravatar for Pierre Lindenbaum
3.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

linearize the fasta ( see ) , filter out the sequences containing a '_' using `grep -v '_'  , convert back to fasta and redirect to no_suffix.fasta

ADD COMMENTlink written 3.5 years ago by Pierre Lindenbaum121k

Why linearize, Pierre?

ADD REPLYlink written 3.5 years ago by RamRS22k
2
gravatar for Matt Shirley
3.5 years ago by
Matt Shirley9.0k
Cambridge, MA
Matt Shirley9.0k wrote:
$ pip install pyfaidx
$ faidx --regex [^_] input.fasta > output.fasta
ADD COMMENTlink written 3.5 years ago by Matt Shirley9.0k
1
gravatar for RamRS
3.5 years ago by
RamRS22k
Houston, TX
RamRS22k wrote:

This is simple text processing, not bioinformatics, but I'll let the other mods decide if this question should remain open. Essentially, you can use bioawk to parse the fasta. If you do not wish to use any intelligent parser (which is a bad decision), use grep with '^>[^_]+' to get the relevant headers, and use it to get your sequences.

Or, use an existing parser so you can compare the header and copy over the sequence in a single step without overly complicating your command.

ADD COMMENTlink written 3.5 years ago by RamRS22k
0
gravatar for Daniel
3.5 years ago by
Daniel3.7k
Cardiff University
Daniel3.7k wrote:

I think this is much simpler than above, using the -A flag for "number of lines after match". If your fasta is already in two line format (no newlines within the sequence) just do:

grep '_' -A 1 myfile.fasta >groups_only.fasta

grep -v '_' -A 1 myfile.fasta >no_groups.fasta

If you need to convert a multi-line fasta into a single-line fasta, I use this simple script:

#!/usr/bin/perl
$filename = $ARGV[0];
chomp(@lines = <>);

foreach $line (@lines){
        if($line =~ /^>/){
                $line = "\n$line\n";
        }
        if($line != /^$/){
                print "$line";
        }
}
ADD COMMENTlink written 3.5 years ago by Daniel3.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1583 users visited in the last hour