Question: Rename FASTA headers based on filename
0
SaltedPork100 wrote:

Hi

My FASTA headers look like:

>1570-13.segment.flu1_PB2
>1570-13.segment.flu2_PB1
>1570-13.segment.flu3_PA

etc

Filenames look like:

201301234.fasta

I want to have FASTA headers that look like:

>201301234_PB2
>201301234_PB1
>201301234_PA

I have seen this answer: "Change header of a Fasta file according to the file name". How can I modify it to preserve the _PB2...?

bash • 145 views
modified 25 days ago by SMK1.3k • written 25 days ago by SaltedPork100
2
Ghent, Belgium
SMK1.3k wrote:

Hi SaltedPork,

Try:

for i in $(ls *.fasta); do fname=$(basename ${i} .fasta); perl -pe "s/^>.+_/>${fname}_/g" ${fname}.fasta > reID_${fname}.fasta; done

Edited (better use *.fasta, see response from RamRS):

for i in *.fasta; do fname=$(basename "${i}" .fasta); perl -pe "s/^>.+_/>${fname}_/" "${i}" > "reID_${fname}.fasta"; done
modified 25 days ago • written 25 days ago by SMK1.3k

I'd recommend for i in *.fasta instead of for i in $(ls *.fasta): the latter spawns a sub-shell where a glob would suffice. Plus, ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too.

modified 25 days ago • written 25 days ago by RamRS21k

Thanks, RamRS!

ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too

Can you give some examples of this?

modified 25 days ago • written 25 days ago by SMK1.3k

I have a heavily customized shell. My ls is an example. My LSCOLORS setting interferes with the filename here. See sample output:

➜ for f in $(ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: cannot open `\033[0m\033[38;5;9mhs37d5_GRCm38p6.fasta.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gzip.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gz\033[0m' (No such file or directory)
: cannot open `\033[m' (No such file or directory)

➜ for f in $(/bin/ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

➜ for f in *.gz
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

With respect to filenames causing a problem: if a filename contains whitespace, the output of $(ls) is split on that whitespace and passed as separate words, whereas * expands each matching filename as a single word, spaces included. See below:

➜ touch a "b c"

➜ for f in $(/bin/ls *)
> file $f

a: empty
b: cannot open `b' (No such file or directory)
c: cannot open `c' (No such file or directory)

➜ for f in *
> file $f

a: empty
b c: empty
modified 25 days ago • written 25 days ago by RamRS21k
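
The whitespace case above can also be reproduced in a script. The following is only a sketch: count_with_ls and count_with_glob are made-up helper names, and each one simply counts how many times the loop body runs for a given directory.

```shell
#!/usr/bin/env bash
# Count loop iterations over the files of a directory, two ways.
# Both function names are hypothetical, chosen for this illustration.

count_with_ls() {
    local n=0 f
    # Unquoted $(ls ...) is split on whitespace, so "b c" becomes two words.
    for f in $(ls "$1"); do n=$((n + 1)); done
    echo "$n"
}

count_with_glob() {
    local n=0 f
    # Each glob match stays a single word, whatever characters it contains.
    for f in "$1"/*; do n=$((n + 1)); done
    echo "$n"
}

# Usage:
#   dir=$(mktemp -d); touch "$dir/a" "$dir/b c"
#   count_with_ls "$dir"     # loops over a, b, c
#   count_with_glob "$dir"   # loops over a, "b c"
```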

I see. Good points that I didn't think about. Thanks, RamRS.

written 25 days ago by SMK1.3k
1
India
Vijay Lakhujani4.1k wrote:

Using seqkit replace:

seqkit replace -p '\.segment\.flu\d+' -r '' <your_fasta_file>

Explanation

replace: replace name/sequence by regular expression.

-p, --pattern string         search regular expression
-r, --replacement string     replacement. supporting capture variables
modified 25 days ago • written 25 days ago by Vijay Lakhujani4.1k
1
USA/NIH
bari.ballew90 wrote:

Just using bash:

for i in *fasta; do n="${i%.fasta}"; sed -i.bak "s/>[^_]\+/>$n/" "$i"; done

This loops over all files in the current directory that end with "fasta". For each file:

  1. n="${i%.fasta}" removes the .fasta file extension (can be generalized to any extension by using n="${i%.*}")
  2. sed "s/>[^_]\+/>$n/" matches a string in the file that starts with ">" and is followed by one or more characters that are not underscores, and replaces it with the filename minus extension found in the previous step. Depending on your requirements, you may want to tighten up this regex.
  3. The -i.bak part just tells sed to replace the string in the original file, but make a backup called <originalname>.bak.
modified 25 days ago • written 25 days ago by bari.ballew90
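
One possible way to tighten the regex from the steps above, shown as a sketch rather than a drop-in replacement. The function name rename_with_sed is hypothetical. This variant anchors the match to the start of the line, requires the underscore to be present, and avoids \+, which is a GNU extension to basic regular expressions. It prints to standard output instead of editing in place.

```shell
#!/usr/bin/env bash
# Sketch: print a FASTA file with each header prefix replaced by the
# file's base name. rename_with_sed is an illustrative name.
# Assumes the base name contains no "&" or "\" characters, which are
# special in a sed replacement.
rename_with_sed() {
    local f=$1 n
    n=$(basename "$f")
    n=${n%.*}            # strip the last extension, whatever it is
    # ^> restricts the match to header lines; [^_]*_ matches everything
    # up to and including the first underscore.
    sed "s/^>[^_]*_/>${n}_/" "$f"
}

# Usage: rename_with_sed 201301234.fasta > renamed_201301234.fasta
```

Headers that contain no underscore are left untouched by this version, which differs from the one-liner above.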