Question: Rename FASTA headers based on filename
SaltedPork wrote (12 months ago):

Hi

My FASTA headers look like:

>1570-13.segment.flu1_PB2
>1570-13.segment.flu2_PB1
>1570-13.segment.flu3_PA

etc

The filenames look like:

201301234.fasta

I want FASTA headers that look like:

>201301234_PB2
>201301234_PB1
>201301234_PA

I have seen this answer: "Change header of a Fasta file according to the file name". How can I modify it to preserve the _PB2, _PB1, _PA etc. suffix?

SMK wrote (12 months ago):

Hi SaltedPork,

Try:

for i in $(ls *.fasta); do fname=$(basename ${i} .fasta); perl -pe "s/^>.+_/>${fname}_/g" ${fname}.fasta > reID_${fname}.fasta; done

Edited (better to use *.fasta; see the response from RamRS):

for i in *.fasta; do fname=$(basename "${i}" .fasta); perl -pe "s/^>.+_/>${fname}_/" "${i}" > "reID_${fname}.fasta"; done

I'd recommend for i in *.fasta instead of for i in $(ls *.fasta) - the latter adds a sub-shell where a glob would suffice. Plus, ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too.

Reply written 12 months ago by RamRS

Thanks, RamRS!

ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too

Can you give some examples of this?

Reply written 12 months ago by SMK

I have a heavily customized shell, and my ls is an example. My LSCOLORS setting makes ls emit color escape codes, which end up embedded in the filenames here. See sample output:

➜ for f in $(ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: cannot open `\033[0m\033[38;5;9mhs37d5_GRCm38p6.fasta.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gzip.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gz\033[0m' (No such file or directory)
: cannot open `\033[m' (No such file or directory)

➜ for f in $(/bin/ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

➜ for f in *.gz
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

As for filenames causing problems: if a filename contains whitespace, the output of $(ls) is split on that whitespace and passed on as separate words, whereas a * glob yields each filename as a single word, spaces included. See below:

➜ touch a "b c"

➜ for f in $(/bin/ls *)
> file $f

a: empty
b: cannot open `b' (No such file or directory)
c: cannot open `c' (No such file or directory)

➜ for f in *
> file $f

a: empty
b c: empty
Reply written 12 months ago by RamRS
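
The whitespace behavior described above can be reproduced in a scratch directory with a sketch like the following. It counts loop iterations instead of calling file, and the directory and variable names are chosen only for illustration.

```shell
set -eu

# Scratch directory containing one plain name and one name with a space.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b c"
cd "$demo_dir"

# Command substitution: the output of ls is split on whitespace,
# so "b c" arrives as two separate words.
ls_count=0
for f in $(ls); do
    ls_count=$((ls_count + 1))
done

# Glob: each filename is exactly one word, spaces included.
glob_count=0
for f in *; do
    glob_count=$((glob_count + 1))
done

echo "ls loop iterations:   $ls_count"
echo "glob loop iterations: $glob_count"
```

With the two files created here, the ls loop runs three times (a, b, c) and the glob loop runs twice (a, "b c").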

I see. Good points that I didn't think about. Thanks, RamRS.

Reply written 12 months ago by SMK
lakhujanivijay wrote (12 months ago):

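Using seqkit replace, with the filename from the question written out in the replacement:

seqkit replace -p '^.+_' -r '201301234_' 201301234.fasta

Explanation

replace: replace name/sequence by regular expression. Here the pattern matches everything in the header up to and including the last underscore, so the _PB2 style suffix is preserved.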

-p, --pattern string         search regular expression
-r, --replacement string     replacement. supporting capture variables
bari.ballew wrote (12 months ago):

Just using bash:

for i in *.fasta; do n="${i%.fasta}"; sed -i.bak "s/^>[^_]\+/>$n/" "$i"; done

This loops over all files in the current directory that end with ".fasta". For each file:

  1. n="${i%.fasta}" removes the .fasta file extension (can be generalized to any extension by using n="${i%.*}")
  2. sed "s/^>[^_]\+/>$n/" matches a ">" at the start of a line followed by one or more characters that are not underscores, and replaces the match with ">" plus the filename minus extension found in the previous step. Note that \+ is a GNU extension to basic regular expressions; [^_][^_]* is a portable equivalent. Depending on your requirements, you may want to tighten up this regex.
  3. The -i.bak part tells sed to edit the original file in place, but make a backup called <originalname>.bak.
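
To check the substitution before editing any files in place, a sketch like the one below writes the result to a new file and prints the resulting headers. It makes some assumptions for illustration: the sample sequences are made up, the .renamed suffix is hypothetical, and the regex uses the portable form [^_][^_]* in place of \+ (a GNU extension).

```shell
set -eu

# Scratch directory with a sample file named like the one in the question.
work_dir=$(mktemp -d)
cd "$work_dir"
printf '%s\n' \
    '>1570-13.segment.flu1_PB2' 'ACGT' \
    '>1570-13.segment.flu2_PB1' 'TTGA' > 201301234.fasta

for i in *.fasta; do
    n="${i%.fasta}"    # filename without the .fasta extension
    # Replace ">" plus everything before the first underscore with ">" plus
    # the filename; lines that do not start with ">" are left untouched.
    sed "s/^>[^_][^_]*/>$n/" "$i" > "$n.renamed"
done

result=$(grep '^>' 201301234.renamed)
printf '%s\n' "$result"
```

For the sample file this prints >201301234_PB2 and >201301234_PB1. Once the headers look right, the same expression can be used with -i.bak as in the one-liner.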