Question: headers from multifasta files
ulises.rodriguez wrote (2.2 years ago):

I have a folder with multi-FASTA files and I would like to extract the headers from each of them. I've used the following command in the shell:

 grep -e ">" *.fasta > prueba_nc.txt

The output looks like this:

Adenoviridae_genomas.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
Adenoviridae_genomas.fasta:>AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
Adenoviridae_genomas.fasta:>AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

...

and I would like to extract only the fragment after the ">".
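In other words, the desired output would be:

AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.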

WouterDeCoster commented (2.2 years ago):

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar; see the image below:

[screenshot: the 101010 code-formatting button in the post editor toolbar]


Kevin Blighe commented (2.2 years ago):

Did you upvote just my answer? Please validate the other answers and provide feedback, if at all possible. Many thanks.


Hugo replied (2.2 years ago):

You are absolutely right, Kevin; I have upvoted it ;-)


Kevin Blighe replied (2.2 years ago):

Okay, thanks. Did you look at the solutions from the others as well?


Hugo commented (2.2 years ago):

Hi Ulises, although this can easily be achieved as other people have already explained, if you are working with FASTA files you may be interested in SEDA (http://www.sing-group.org/seda/). Please take a look and feel free to contact us if you need assistance using it. Regards.

Kevin Blighe wrote (2.2 years ago):

Proof of success:

cat and cut

cat prueba_nc.txt | cut -f2 -d'>'
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

cut

cut -f2 -d'>' prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

AWK

awk '{print $2}' FS=">" prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.
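The same thing can be written with awk's -F option, which sets the field separator up front:

awk -F'>' '{print $2}' prueba_nc.txt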
WouterDeCoster wrote (2.2 years ago):

So you want to get rid of the filename?

In that case, use the -h/--no-filename option of grep.
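For example:

grep -h ">" *.fasta > prueba_nc.txt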

Or do you also want to get rid of the >? You could pipe the grep output to sed, e.g.:

grep -he ">" *.fasta | sed 's/^>//' > prueba_nc.txt
cpad0112 replied (2.2 years ago):

$ sed -n '/>/ s/>//p' *.fa

also works.

st.ph.n wrote (2.2 years ago):
for file in *.fasta; do grep -e '>' "$file" | cut -f 2 -d '>' > "$(basename "$file" .fasta).headers.txt"; done

This takes each file with a .fasta extension in your cwd, greps for the header lines, cuts out everything after the '>', and places the result in a file named with the prefix of the original FASTA file and the extension .headers.txt.
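With the Adenoviridae_genomas.fasta file from the question, for example, this should give:

$ cat Adenoviridae_genomas.headers.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.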

lakhujanivijay wrote (2.2 years ago):

Using seqkit

seqkit fx2tab -in *.fa > headers.txt
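Note that the -i (--only-id) flag keeps only the sequence ID; if you want the full description line after the >, -n alone should do it (see seqkit fx2tab --help):

seqkit fx2tab -n *.fa > headers.txt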
cpad0112 replied (2.2 years ago):

$ seqkit seq -n test.fa

also works
