Question: headers from multifasta files
0
10 months ago by
ulises.rodriguez0 wrote:

I have a folder with multi-FASTA files and I would like to extract the headers from each of them. I used the following shell command:

 grep -e ">" *.fasta > prueba_nc.txt

The output looks like this:

Adenoviridae_genomas.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
Adenoviridae_genomas.fasta:>AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
Adenoviridae_genomas.fasta:>AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

...

I would like to extract only the part after the ">".

sequence • 521 views
ADD COMMENTlink modified 10 months ago by bioExplorer3.7k • written 10 months ago by ulises.rodriguez0
1

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

ADD REPLYlink written 10 months ago by WouterDeCoster37k
1

Did you upvote just my answer? Please validate the other answers and provide feedback, if at all possible. Many thanks.

ADD REPLYlink written 10 months ago by Kevin Blighe39k

You are absolutely right Kevin, I have upvoted it ;-)

ADD REPLYlink written 10 months ago by Hugo150

Okay, thanks. Did you look at the other solutions by the others?

ADD REPLYlink written 10 months ago by Kevin Blighe39k

Hi Ulises, although this can be easily achieved as other people have already explained, if you are working with FASTA files you may be interested in SEDA (http://www.sing-group.org/seda/). Please take a look, and feel free to contact us if you need assistance using it. Regards.

ADD REPLYlink written 10 months ago by Hugo150
2
10 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

Proof of success:

cat and cut

cat prueba_nc.txt | cut -f2 -d'>'
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

cut

   cut -f2 -d'>' prueba_nc.txt
   AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
   AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
   AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

AWK

awk '{print $2}' FS=">" prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.
ADD COMMENTlink modified 10 months ago • written 10 months ago by Kevin Blighe39k
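
One caveat with the cut and AWK commands above: `-f2` and `$2` return only the second field, so a header whose description itself contains a ">" would be cut short. A minimal sketch of a workaround, assuming a made-up demo file (`header_demo/demo_headers.txt`, not part of the original thread) in the same format as prueba_nc.txt; `-f2-` keeps everything after the first ">":

```shell
# Hypothetical sample in the same "filename:>header" format as prueba_nc.txt
mkdir -p header_demo
printf '%s\n' \
  'demo.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.' \
  'demo.fasta:>X_1 made-up header with a > inside' \
  > header_demo/demo_headers.txt

# -f2- selects every field from the second onward, so a later '>' survives
cut -d'>' -f2- header_demo/demo_headers.txt > header_demo/clean_headers.txt
cat header_demo/clean_headers.txt
```

With plain `-f2`, the second line would stop just before the inner ">".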
2
10 months ago by
Belgium
WouterDeCoster37k wrote:

So you want to get rid of the filename?

In that case, use the -h/--no-filename option of grep.

Or do you also want to get rid of the >? You could pipe the grep output to sed, e.g.:

grep -he ">" *.fasta | sed 's/^>//' > prueba_nc.txt
ADD COMMENTlink written 10 months ago by WouterDeCoster37k
$ sed -n '/>/ s/>//p' *.fa

also works.

ADD REPLYlink written 10 months ago by cpad011211k
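
To see what the `-h` flag changes in the grep answer above, here is a small sketch on two made-up FASTA files (the directory name, file names and sequences are invented for illustration). Anchoring the pattern as `^>` is an extra precaution that is not part of the original answer, so that only lines starting with ">" match.

```shell
# Two tiny made-up FASTA files in a demo directory
mkdir -p grep_demo
printf '>seq1 first made-up record\nACGT\n' > grep_demo/a.fasta
printf '>seq2 second made-up record\nTTGA\n' > grep_demo/b.fasta

# Without -h, grep prefixes each match with "filename:" when given several files
grep -e '^>' grep_demo/*.fasta

# With -h the prefix is dropped; sed then strips the leading ">"
grep -he '^>' grep_demo/*.fasta | sed 's/^>//' > grep_demo/headers.txt
cat grep_demo/headers.txt
```

The first grep shows the filename prefix from the question; the second pipeline leaves only the header text.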
2
10 months ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:
for file in *.fasta; do grep -e '>' "$file" | cut -f 2 -d '>' > "$(basename "$file" .fasta).headers.txt"; done

This takes each file with the .fasta extension in your current working directory, greps for the header lines, and cuts out everything after the '>'. The result is written to a file named after the original FASTA file, with the extension .headers.txt in place of .fasta.

ADD COMMENTlink written 10 months ago by st.ph.n2.4k
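
A sketch of the same per-file loop using shell parameter expansion (`${file%.fasta}`) instead of `basename`; the demo directory and file contents are made up for illustration. As assumptions beyond the answer above, the pattern is anchored as `^>` and `-f 2-` is used so that a header containing a second ">" is not truncated.

```shell
# One tiny made-up multi-FASTA file in a demo directory
mkdir -p loop_demo
printf '>seq1 made-up record\nACGT\n>seq2 another record\nGGCC\n' > loop_demo/viruses.fasta

for file in loop_demo/*.fasta; do
    # ${file%.fasta} strips the extension: viruses.fasta -> viruses.headers.txt
    grep -e '^>' "$file" | cut -d '>' -f 2- > "${file%.fasta}.headers.txt"
done
cat loop_demo/viruses.headers.txt
```

Because the expansion keeps the directory part of the path, each .headers.txt file lands next to its source FASTA file.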
1
10 months ago by
bioExplorer3.7k
bioExplorer3.7k wrote:

Using seqkit

seqkit fx2tab -in *.fa > headers.txt
ADD COMMENTlink written 10 months ago by bioExplorer3.7k
$ seqkit seq -n test.fa

also works

ADD REPLYlink written 10 months ago by cpad011211k