Question: headers from multifasta files
ulises.rodriguez wrote:

I have a folder with multi-FASTA files and I would like to extract the headers from each of them. I used the following shell command:

 grep -e ">" *.fasta > prueba_nc.txt

The output looks like this:

Adenoviridae_genomas.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
Adenoviridae_genomas.fasta:>AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
Adenoviridae_genomas.fasta:>AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

...

I would like to extract only the part after the ">".

modified 7 months ago by Vijay Lakhujani • written 7 months ago by ulises.rodriguez
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

written 7 months ago by WouterDeCoster
Did you upvote just my answer? Please validate the other answers and provide feedback, if at all possible. Many thanks.

written 7 months ago by Kevin Blighe

You are absolutely right Kevin, I have upvoted it ;-)

written 7 months ago by Hugo

Okay, thanks. Did you look at the solutions posted by the others?

written 7 months ago by Kevin Blighe

Hi Ulises, although this can easily be achieved as other people have already explained, if you are working with FASTA files you may be interested in SEDA (http://www.sing-group.org/seda/). Please take a look, and feel free to contact us if you need assistance using it. Regards.

written 7 months ago by Hugo
Kevin Blighe (Republic of Ireland) wrote:

Proof of success:

cat and cut

cat prueba_nc.txt | cut -f2 -d'>'
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

cut

cut -f2 -d'>' prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

AWK

awk '{print $2}' FS=">" prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.
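
One caveat when splitting on '>': if a header happens to contain a second '>' character, field 2 stops at it. A minimal sketch of the difference between -f2 and -f2-, assuming a made-up two-line sample (demo_headers.txt and its contents are hypothetical, not taken from the original data):

```shell
# Hypothetical sample: the second header contains an extra '>' character.
workdir=$(mktemp -d)
printf '%s\n' \
  'a.fasta:>AC_000001 Ovine adenovirus A' \
  'a.fasta:>AC_000002 protein A>B variant' > "$workdir/demo_headers.txt"

# -f2 keeps only the text between the first and the second '>'
only_field2=$(cut -d'>' -f2 "$workdir/demo_headers.txt")

# -f2- keeps everything after the first '>'
from_field2=$(cut -d'>' -f2- "$workdir/demo_headers.txt")

printf '%s\n' "$from_field2"
```

For headers like the ones in the question, which contain no further '>', both forms give the same result.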
modified 7 months ago • written 7 months ago by Kevin Blighe
WouterDeCoster (Belgium) wrote:

So you want to get rid of the filename?

In that case, use the -h/--no-filename option of grep.

Or do you also want to get rid of the >? You could pipe the grep output to sed, e.g.:

grep -he ">" *.fasta | sed 's/^>//' > prueba_nc.txt
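
To try this on a small scale, here is a minimal sketch that builds two toy FASTA files in a temporary directory (the file names and sequences are made up) and runs the same grep and sed combination. It anchors the pattern as '^>' so that only lines starting with '>' are matched, which is an optional tightening rather than part of the command above:

```shell
# Hypothetical toy input: two small FASTA files in a temporary directory.
workdir=$(mktemp -d)
printf '>seq1 first record\nACGT\n' > "$workdir/a.fasta"
printf '>seq2 second record\nGGCC\n>seq3 third record\nTTAA\n' > "$workdir/b.fasta"

# -h suppresses the "filename:" prefix; '^>' matches only lines that start with '>'.
# sed then strips the leading '>' from each header line.
headers=$(grep -h '^>' "$workdir"/*.fasta | sed 's/^>//')

printf '%s\n' "$headers"
```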
written 7 months ago by WouterDeCoster
$ sed -n '/>/ s/>//p' *.fa

also works.

written 7 months ago by cpad0112
st.ph.n (Philadelphia, PA) wrote:

for file in *.fasta; do grep -e '>' "$file" | cut -f 2 -d '>' > "$(basename "$file" .fasta).headers.txt"; done

This takes each file with the .fasta extension in your current working directory, greps for the header lines, keeps everything after the '>', and writes the result to a file named after the original FASTA file with the extension .headers.txt (e.g. Adenoviridae_genomas.headers.txt).
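
The per-file approach can be checked on toy data. This is only a sketch under assumptions: the two FASTA files are made up, everything is written to a temporary directory rather than the cwd, and the loop is spread over several lines for readability.

```shell
# Hypothetical toy input: two small FASTA files in a temporary directory.
workdir=$(mktemp -d)
printf '>seq1 first record\nACGT\n' > "$workdir/a.fasta"
printf '>seq2 second record\nGGCC\n>seq3 third record\nTTAA\n' > "$workdir/b.fasta"

for file in "$workdir"/*.fasta; do
  # basename strips the directory and the .fasta suffix, leaving the prefix
  prefix=$(basename "$file" .fasta)
  # Keep header lines only, then drop the leading '>'
  grep '^>' "$file" | cut -d'>' -f2- > "$workdir/$prefix.headers.txt"
done

cat "$workdir/b.headers.txt"
```

Each input file ends up with its own .headers.txt file next to it, so headers from different files are never mixed.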

written 7 months ago by st.ph.n
Vijay Lakhujani (India) wrote:

Using seqkit

seqkit fx2tab -in *.fa > headers.txt
written 7 months ago by Vijay Lakhujani
$ seqkit seq -n test.fa

also works

written 7 months ago by cpad0112