Question: Command or Perl script for changing headers of multiple FASTA files in a specific order listed in a txt file
10 days ago by
adeenahassan7730 wrote:

Assalam o alaikum everyone,

I am working with multiple genes. Each gene folder contains multiple FASTA files (70-75), and each FASTA file contains a single gene sequence, e.g.

AMY2b_Gene_folder

Chimpanzee_AMY2B_CDS.fasta
Human_AMY2B_CDS.fasta
Pygmy_chimpanzee_AMY2B_CDS.fasta
Western_gorrila_AMY2B_CDS.fasta

cat Chimpanzee_AMY2B_CDS.fasta

>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

cat Human_AMY2B_CDS.fasta

>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

I want to change the header of each FASTA file according to a specific order given in a text file.

cat Headers.txt
MP.C_AMY2B
FP.H_AMY2B

The output should look like

>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

I have tried the Perl scripts given in the following Biostars posts, but they did not work for multiple FASTA files that each contain a single gene sequence.

Renaming Entries In A Fasta File

Renaming fasta headers according to a matching name list

Kindly guide me: is there a command-line solution for this?

modified 10 days ago by cpad01121.9k • written 10 days ago by adeenahassan7730

Without a rule that maps the names in Headers.txt to the FASTA files, we can't rename them correctly.

Chimpanzee_AMY2B_CDS.fasta                 MP.C_AMY2B
Human_AMY2B_CDS.fasta                      FP.H_AMY2B
Pygmy_chimpanzee_AMY2B_CDS.fasta           ????
Western_gorrila_AMY2B_CDS.fasta            ?????
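
One way to make the mapping explicit, as a minimal sketch: pair the file list with the header list line by line and inspect the resulting table before renaming anything. The function name make_mapping and the output name mapping.tsv are hypothetical. The sketch assumes the file arguments are already in the intended order and that file names contain no tabs or newlines.

```bash
#!/usr/bin/env bash
# make_mapping (hypothetical helper): print "file<TAB>new_header" pairs.
# Assumes the order of the file arguments matches the line order of the
# headers file.
make_mapping() {
    local headers_file="$1"
    shift                                   # remaining arguments are the FASTA files
    local n_files="$#"
    local n_headers
    # Count header lines (a final line without a newline still counts).
    n_headers=$(awk 'END { print NR }' "$headers_file")
    if [ "$n_files" -ne "$n_headers" ]; then
        echo "error: $n_files files but $n_headers header lines" >&2
        return 1
    fi
    # One file name per line on stdin, joined column-wise with the headers.
    printf '%s\n' "$@" | paste - "$headers_file"
}

# Example (run inside the gene folder):
#   make_mapping Headers.txt *_CDS.fasta > mapping.tsv
```

With the four files and two header lines shown in the question, the count check would fail, which is exactly the ambiguity raised above.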
modified 10 days ago • written 10 days ago by shenwei3563.3k

10 days ago by
cpad01121.9k wrote:

Run this command:

$ for i in $(sed -n '=' headers.txt); do sed -n "$i"p headers.txt| sed 's/^/>/'; cat $(ls *.fasta| sed -n "$i"p)| sed '1d'; done

output:

>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
>FP.H_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

Assumptions

  1. The FASTA files (in the order listed by ls *.fasta) are in the same order as the replacement headers in headers.txt
  2. The user's default shell is bash

Note: Output goes to the screen. You can redirect it to another FASTA file.

Input:

$ cat headers.txt 
MP.C_AMY2B
FP.H_AMY2B

$ cat *.fasta
>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
modified 10 days ago • written 10 days ago by cpad01121.9k

Thank you so much, your solution is useful. I want to redirect the output into the multiple input FASTA files.

written 10 days ago by adeenahassan7730

Create a folder called test (in the same folder where the FASTA files are located):

$ mkdir test

Then execute:

for i in $(seq 1 $(ls *.fasta |wc -l)); do sed -n "$i"p headers.txt| sed 's/^/>/'> test/$(ls *.fasta| sed -n "$i"p); cat $(ls *.fasta| sed -n "$i"p)| sed '1d' >>test/$(ls *.fasta| sed -n "$i"p); done

The files in the test folder will have exactly the same names as the FASTA files in the current directory. Let me know if there are any issues with the script.
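
The same idea can be sketched without parsing ls output or calling sed once per line. This is only an illustration under the same assumptions as above (one sequence per file, header on the first line, files passed in the same order as the header lines). The function name rename_headers_to_dir is hypothetical.

```bash
#!/usr/bin/env bash
# rename_headers_to_dir (hypothetical helper): write a copy of each FASTA
# file into outdir, with its first line replaced by the next header line.
rename_headers_to_dir() {
    local headers_file="$1" outdir="$2"
    shift 2                                 # remaining arguments are the FASTA files
    local fasta new_header
    mkdir -p "$outdir" || return 1
    for fasta in "$@"; do
        # Read the next header from file descriptor 3; also accept a final
        # line that lacks a trailing newline.
        if ! { IFS= read -r new_header <&3 || [ -n "$new_header" ]; }; then
            echo "error: no header left for $fasta" >&2
            return 1
        fi
        # Line 1 becomes the new header; every other line passes through.
        # Note: awk -v interprets backslash escapes, so headers are assumed
        # to contain no backslashes.
        awk -v h="$new_header" 'NR == 1 { print ">" h; next } { print }' \
            "$fasta" > "$outdir/$(basename "$fasta")"
    done 3< "$headers_file"
}

# Example:
#   rename_headers_to_dir headers.txt test *.fasta
```

The output directory should differ from the input directory, otherwise each input file would be truncated before it is read.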

modified 9 days ago • written 9 days ago by cpad01121.9k

Thank you so much, it's working. You made my day :)

written 9 days ago by adeenahassan7730

10 days ago by
tiago211287760 (Brazil) wrote:

First, create a backup of all files before trying this approach.

Take a list of new headers, namely Headers.txt:

cat Headers.txt
MP.C_AMY2B
FP.H_AMY2B

in the same order as the sorted list of FASTA files:

find . -maxdepth 1 -name "*CDS.fasta" | sort
./Chimpanzee_AMY2B_CDS.fasta
./Human_AMY2B_CDS.fasta

Put the following function in your Linux environment:

function sedinho () { sed -i  "s/^.*\]/>$1/g" $2;}
export -f sedinho

Create two array variables, the list of new headers (LIST1) and the list of input files (LIST2):

LIST1=($(cat Headers.txt))
LIST2=($(find /folder/with/fasta/files/ -maxdepth 1 -name "*CDS.fasta" | sort))

parallel --xapply sedinho {1} {2} ::: "${LIST1[@]}" ::: "${LIST2[@]}"


>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
modified 9 days ago • written 10 days ago by tiago211287760

Hi tiago211287, thank you for your reply. I tried your solution, but the following error occurred:

zsh:1: command not found: sedinho
zsh:1: command not found: sedinho
zsh:1: command not found: sedinho

To put the function in my Linux environment, I simply copied the function into a text file and copied that file into the bin folder, but the above error occurred again. Can you tell me how to fix it?

written 10 days ago by adeenahassan7730

For this to work, you must execute these two lines in your terminal:

function sedinho () { sed -i  "s/^.*\]/>$1/g" $2;}
export -f sedinho

You don't need to put them inside a file.

written 10 days ago by tiago211287760

Yes, I first executed the above lines in my terminal, but the same error occurred. :(

modified 10 days ago • written 10 days ago by adeenahassan7730

Can you repeat what you're doing and post a screenshot here?

written 10 days ago by tiago211287760

https://ibb.co/bz2ita

https://ibb.co/dd3v6v

https://ibb.co/e8tnmv

modified 10 days ago • written 10 days ago by adeenahassan7730

You can't directly attach files to Biostars posts. Use a free image hosting provider (click the icon next to the "101" one in the edit window to get some suggestions) to upload your images, then insert those links here.

modified 10 days ago • written 10 days ago by genomax33k

This is strange to me. It appears that your shell is not exporting the function properly.

written 9 days ago by tiago211287760

This user appears to be using zsh and that may be the difference. Switching to bash may do the trick.
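
The "zsh:1: command not found" message suggests the commands were run by zsh, and export -f is a bash feature, so the function was likely never visible to the processes that parallel started. A minimal sketch that sidesteps both exported functions and parallel: keep the function and the loop in one script and run it explicitly with bash. The names rename_headers.sh, rename_in_place and rename_all are hypothetical. The sketch assumes the header is the first line of each file and that new headers contain no |, & or backslash characters, which are special in the sed replacement.

```bash
#!/usr/bin/env bash
# rename_headers.sh (hypothetical name)
# Usage: bash rename_headers.sh Headers.txt file1.fasta file2.fasta ...

# Replace the first line of one FASTA file with ">new_header".
rename_in_place() {
    local new_header="$1" fasta="$2"
    local tmp status=0
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temporary file, then overwrite the
    # original's contents so its permissions are kept.
    sed "1s|.*|>$new_header|" "$fasta" > "$tmp" && cat "$tmp" > "$fasta" || status=1
    rm -f "$tmp"
    return "$status"
}

# Pair the Nth header line with the Nth file argument.
rename_all() {
    local headers_file="$1"
    shift
    local fasta new_header
    for fasta in "$@"; do
        if ! { IFS= read -r new_header <&3 || [ -n "$new_header" ]; }; then
            echo "error: no header left for $fasta" >&2
            return 1
        fi
        rename_in_place "$new_header" "$fasta" || return 1
    done 3< "$headers_file"
}

# Only act when arguments are given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    rename_all "$@"
fi
```

Because this overwrites the input files, the advice to back them up first still applies.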

written 9 days ago by genomax33k

10 days ago by
biolab950 wrote:

Please check the following commands. First, run:

cd AMY2b_Gene_folder/; mkdir new/

Then, run:

for f in *.fasta; do perl -e '$in=$ARGV[0]=~s/(.+?)\.fasta/$1/r; while (<>){if (/^>/) {print ">$in\n"} else {print} }' "$f" > "new/$f"; done

modified 10 days ago • written 10 days ago by biolab950

Hi biolab,

Thank you for your reply. But my new FASTA headers are in a separate text file, in the same order as the FASTA files. The code above replaces each header with the FASTA file name, not with the headers from the headers file. Can you guide me on where to use the headers file?

P.S. I'm new to the bioinformatics world and happy to give more information.

written 10 days ago by adeenahassan7730