Question: Command or Perl script for changing headers of multiple FASTA files in a specific order listed in a txt file
adeena_hassan wrote, 10 weeks ago:

Assalam o alaikum everyone,

I am working with multiple genes. Each gene folder contains multiple FASTA files (70-75), and each FASTA file contains a single gene sequence, e.g.

AMY2b_Gene_folder

Chimpanzee_AMY2B_CDS.fasta
Human_AMY2B_CDS.fasta
Pygmy_chimpanzee_AMY2B_CDS.fasta
Western_gorrila_AMY2B_CDS.fasta

cat Chimpanzee_AMY2B_CDS.fasta

>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

Human_AMY2B_CDS.fasta

>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

I want to change the header of each FASTA file to a new one, taken in order from a text file.

cat Headers.txt
MP.C_AMY2B
FP.H_AMY2B

The output should look like this:

>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

I have tried the Perl scripts given in the following Biostars posts, but they did not work for multiple FASTA files that each contain a single gene sequence.

Renaming Entries In A Fasta File

Renaming fasta headers according to a matching name list

Is there a command-line solution for this? Kindly guide me.

modified 10 weeks ago by cpad0112 • written 10 weeks ago by adeena_hassan

Without a rule that maps the names in Headers.txt to the FASTA files, we can't rename them correctly.

Chimpanzee_AMY2B_CDS.fasta                 MP.C_AMY2B
Human_AMY2B_CDS.fasta                      FP.H_AMY2B
Pygmy_chimpanzee_AMY2B_CDS.fasta           ????
Western_gorrila_AMY2B_CDS.fasta            ?????
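
One way to make that mapping explicit is a two-column, tab-separated file that lists each FASTA file next to its new header. The sketch below assumes such a file (called mapping.tsv here, a hypothetical name that is not part of the original question) and assumes that each FASTA file holds a single record whose header is its first line. The function name rename_from_map and the output directory are likewise only illustrative.

```shell
# Sketch: rename FASTA headers from an explicit mapping file.
# Assumptions (not from the original post): the mapping file is tab-separated,
# "<fasta file><TAB><new header>", one pair per line, and each FASTA file has
# exactly one record with its header on line 1.
rename_from_map() (
    map=$1      # mapping file
    outdir=$2   # directory for the renamed copies; originals stay untouched
    mkdir -p "$outdir" || exit 1
    tab=$(printf '\t')
    # "|| [ -n "$file" ]" also handles a last line without a trailing newline
    while IFS=$tab read -r file header || [ -n "$file" ]; do
        [ -n "$file" ] || continue              # skip blank lines
        if [ ! -f "$file" ]; then
            echo "missing file: $file" >&2
            continue
        fi
        {
            printf '>%s\n' "$header"            # new header line
            tail -n +2 "$file"                  # sequence lines, unchanged
        } > "$outdir/$(basename "$file")"
    done < "$map"
)
```

For example, rename_from_map mapping.tsv renamed would write the renamed copies into a folder called renamed. Files listed in the mapping but not found on disk are reported and skipped.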
modified 10 weeks ago • written 10 weeks ago by shenwei3563.4k

cpad0112 wrote, 10 weeks ago:

Run this command:

$ for i in $(sed -n '=' headers.txt); do sed -n "$i"p headers.txt| sed 's/^/>/'; cat $(ls *.fasta| sed -n "$i"p)| sed '1d'; done

output:

>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
>FP.H_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

Assumptions

  1. The FASTA files, in the order listed by ls *.fasta, are in the same order as the new headers in headers.txt
  2. The user's default shell is bash

Note: The output is printed to the screen. You can redirect it to another FASTA file.

Input:

$ cat headers.txt 
MP.C_AMY2B
FP.H_AMY2B

$ cat *.fasta
>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
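
Because the one-liner above depends entirely on assumption 1, it may be worth printing the pairing before renaming anything. The sketch below (the function name show_pairing is made up for illustration) lists each FASTA file, in the shell's glob order, next to the header found on the same line number of the header file. It returns a non-zero status if the number of files and the number of header lines differ.

```shell
# Sketch: preview which header would be assigned to which file.
# Assumes one header per line in the header file, and that the glob order of
# *.fasta is the intended order.
show_pairing() (
    headers=$1          # e.g. headers.txt
    dir=${2:-.}         # directory with the FASTA files (default: current)
    n_headers=$(grep -c '' "$headers")      # number of lines in the header file
    i=0
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue             # the glob matched nothing
        i=$((i + 1))
        printf '%s\t%s\n' "$(basename "$f")" "$(sed -n "${i}p" "$headers")"
    done
    [ "$i" -eq "$n_headers" ]               # status 0 only if the counts match
)
```

With only the Chimpanzee and Human files present and the two headers from this thread, show_pairing headers.txt would list Chimpanzee_AMY2B_CDS.fasta next to MP.C_AMY2B and Human_AMY2B_CDS.fasta next to FP.H_AMY2B.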
modified 10 weeks ago • written 10 weeks ago by cpad0112

Thank you so much, your solution is useful. I want to redirect the output to multiple FASTA files, one per input file.

written 10 weeks ago by adeena_hassan

Create a folder called test (in the same folder where the FASTA files are located):

$ mkdir test

execute:

for i in $(seq 1 $(ls *.fasta |wc -l)); do sed -n "$i"p headers.txt| sed 's/^/>/'> test/$(ls *.fasta| sed -n "$i"p); cat $(ls *.fasta| sed -n "$i"p)| sed '1d' >>test/$(ls *.fasta| sed -n "$i"p); done

The files in the test folder will have exactly the same names as the FASTA files in the current directory. Let me know if there are any issues with the script.

modified 10 weeks ago • written 10 weeks ago by cpad0112

Thank you so much, it's working. You made my day :)

written 10 weeks ago by adeena_hassan

tiago211287 (Brazil) wrote, 10 weeks ago:

First, create a backup of all files before trying this approach.

Take a list of new headers, here Headers.txt:

cat Headers.txt
MP.C_AMY2B
FP.H_AMY2B

in the same order as the list of FASTA files:

find . -maxdepth 1 -name "*CDS.fasta" | sort
./Chimpanzee_AMY2B_CDS.fasta
./Human_AMY2B_CDS.fasta

Define the following function in your Linux shell:

function sedinho () { sed -i  "s/^.*\]/>$1/g" $2;}
export -f sedinho

Create two variables: the list of new headers (LIST1) and the list of input files (LIST2):

LIST1=($(cat Headers.txt))
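# -maxdepth 1 looks inside the folder itself but not in its subfolders (-maxdepth 0 would match no files)
LIST2=($(find /folder/with/fasta/files/ -maxdepth 1 -name "*CDS.fasta" | sort))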

parallel --xapply sedinho {1} {2} ::: ${LIST1[@]} ::: ${LIST2[@]}


>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
modified 10 weeks ago • written 10 weeks ago by tiago211287

Hi tiago211287, thank you for your reply. I have tried your solution, but the following error occurred:

zsh:1: command not found: sedinho
zsh:1: command not found: sedinho
zsh:1: command not found: sedinho

To put the function in my Linux environment, I simply copied it into a text file and copied that file into the bin folder, but the above error occurred again. Can you tell me how to fix it?

written 10 weeks ago by adeena_hassan

For this to work, you must execute these two lines in your terminal:

function sedinho () { sed -i  "s/^.*\]/>$1/g" $2;}
export -f sedinho

You don't need to put this inside a file.

written 10 weeks ago by tiago211287

Yes, I first executed the above lines in my terminal, but the same error occurred. :(

modified 10 weeks ago • written 10 weeks ago by adeena_hassan

Can you repeat what you're doing and post a screenshot of your screen here?

written 10 weeks ago by tiago211287

https://ibb.co/bz2ita

https://ibb.co/dd3v6v

https://ibb.co/e8tnmv

modified 10 weeks ago • written 10 weeks ago by adeena_hassan

You can't directly attach files to Biostars posts. Use a free image hosting provider (click on the icon next to "101" one in edit window to get some suggestions) to upload your images and then insert those links here.

modified 10 weeks ago • written 10 weeks ago by genomax37k

This is strange to me. It appears that your shell is not exporting the function properly.

written 10 weeks ago by tiago211287

This user appears to be using zsh and that may be the difference. Switching to bash may do the trick.

written 10 weeks ago by genomax37k
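
The zsh:1: prefix in the error messages suggests that the jobs were started by zsh, where a function exported with bash's export -f is not available. One way to sidestep the problem is to avoid both the exported function and parallel. The following is a sketch under that assumption, not the method from the answer above: a plain loop that uses only POSIX shell features, so it should behave the same whether it is pasted into bash or zsh. The function name rename_in_order is hypothetical, and it replaces the whole first line of each file rather than matching everything up to the last "]".

```shell
# Sketch: replace the first line of each FASTA file with the header on the
# matching line of a header file. Plain POSIX sh, no exported functions.
# Assumes one record per file and that the glob order matches the header order.
rename_in_order() (
    headers=$1      # e.g. Headers.txt
    dir=${2:-.}     # directory holding the *CDS.fasta files
    i=0
    for f in "$dir"/*CDS.fasta; do
        [ -e "$f" ] || continue             # the glob matched nothing
        i=$((i + 1))
        h=$(sed -n "${i}p" "$headers")      # i-th line of the header file
        if [ -z "$h" ]; then
            echo "no header for $f, left unchanged" >&2
            continue
        fi
        # Write the renamed record to a temporary file, then replace the original.
        { printf '>%s\n' "$h"; tail -n +2 "$f"; } > "$f.tmp" && mv "$f.tmp" "$f"
    done
)
```

As with the original approach, back up the files first. After that, rename_in_order Headers.txt /folder/with/fasta/files would rewrite the files in place.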

biolab wrote, 10 weeks ago:

Please check the following commands. First, run:

cd AMY2b_Gene_folder/; mkdir new/

Then run:

for f in *.fasta; do perl -e '$in=$ARGV[0]=~s/(.+?)\.fasta/$1/r; while (<>){if (/^>/) {print ">$in\n"} else {print} }' "$f" > "new/$f"; done

modified 10 weeks ago • written 10 weeks ago by biolab

Hi biolab,

Thank you for your reply. However, my new FASTA headers are in a separate text file, listed in the same order as the FASTA files. The code above replaces each header with the FASTA file name, not with the headers from the headers file. Can you guide me on where to use the headers file?

P.S. I'm new to the bioinformatics world and happy to give more information.

written 10 weeks ago by adeena_hassan
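
A possible way to bring the headers file into this "write copies into new/" layout is to read it first and hand out one header per FASTA file. The sketch below does that with awk instead of Perl. The function name rename_to_dir is hypothetical, and it assumes one record per FASTA file and one new header per line of the headers file, listed in the same order as the files given on the command line.

```shell
# Sketch: write renamed copies of the given FASTA files into an output
# directory, taking the Nth header from the header file for the Nth file.
# Assumes one record per FASTA file (header on line 1) and one new header
# per line of the header file, in the same order as the files.
rename_to_dir() (
    headers=$1      # e.g. Headers.txt
    outdir=$2       # e.g. new
    shift 2         # remaining arguments: FASTA files, in header order
    [ -s "$headers" ] || { echo "empty or missing: $headers" >&2; exit 1; }
    mkdir -p "$outdir" || exit 1
    awk -v outdir="$outdir" '
        NR == FNR { h[NR] = $0; next }      # first input: remember the headers
        FNR == 1 {                          # first line of each FASTA file
            if (out != "") close(out)       # finish the previous output file
            n++
            name = FILENAME
            sub(/.*\//, "", name)           # drop any directory part
            out = outdir "/" name
            printf(">%s\n", h[n]) > out     # new header replaces the old one
            next
        }
        { print > out }                     # sequence lines are copied as-is
    ' "$headers" "$@"
)
```

From inside the gene folder, rename_to_dir Headers.txt new *.fasta would leave the original files untouched and put the renamed copies in new/.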