Question: Renaming fasta headers according to a matching name list
2
gravatar for san.san
2.6 years ago by
san.san80
Wales, UK
san.san80 wrote:

Hi all,

Can anyone advise me on how to rename the header in my fasta file? Say I have a seq.fa file with transcript sequences:

>TR1|c0_g1_i1
GTCGAGCATGGTCTTGGTCATCTTCCTTTCAAAGAA
>TR6|c0_g1_i1
GTGGAATATCGCCAGTGACCATCACTGATTAACCTG

I also have a file with contigs matching the transcripts - names.txt:

TR1|c0_g1_i1    scaf0432344_50037.734_wgs
TR6|c0_g1_i1    scaf0159424_10142.072_wgs

How to I add contig names to the fasta file headers so that the "scaf0..." identifier comes before the "TR..."?

Desired output:

>scaf0432344_50037.734_wgs|TR1|c0_g1_i1
GTCGAGCATGGTCTTGGTCATCTTCCTTTCAAAGAA
>scaf0159424_10142.072_wgs|TR6|c0_g1_i1
GTGGAATATCGCCAGTGACCATCACTGATTAACCTG

Cheers!

ADD COMMENTlink modified 6 months ago by na.poorin0 • written 2.6 years ago by san.san80
3

There are a lot of posts regarding this issue. I'd suggest you to have a look before asking because I think that this topic has been widely covered:

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by iraun3.4k

I have looked for a couple of hours now but didn't manage to find anything matching exactly what I need to do. I'm not trained so things that are not exactly what I need, unfortunately, were not a help to me. Since I don't need to replace the header but rather modify it, this particular post wasn't helpful to me.

ADD REPLYlink written 2.6 years ago by san.san80
1

Did you try anything? If yes post what you've tried which is not working?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by venu5.6k

No, sorry, I have no clue how to go about it.

ADD REPLYlink written 2.6 years ago by san.san80

Your fasta headers look like Trinity output. Did you run Trinity on linux? Or were you given the fasta?

ADD REPLYlink written 2.2 years ago by st.ph.n2.3k
2
gravatar for iraun
2.6 years ago by
iraun3.4k
Norway
iraun3.4k wrote:

Using the solution proposed in replace fasta headers with another name in a text file by @Kenosis just with a little modification, you'll get the desired output:

  use strict;
  use warnings;

    my @arr;

    while (<>) {
        chomp;
        my @a = split /\t/;
        push @arr, $a[1] . "|" . $a[0] if length;
        last if eof; }

    while (<>) {
        print /^>/ ? ">" . shift(@arr) . "\n" : $_; }

Call the script in the following way: perl script.pl textFile fastaFile [>outFile]

I'd strongly suggest you to try to reach the solution yourself, otherwise you'll never learn. And it is recommended to show up your unsuccessful attempts to get the solution, otherwise people can think that you are asking just to avoid the effort of doing it yourself. Even if your attempts are not the best ones (all of us have been through that), I'd suggest you to explain to the community what you've tried.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by iraun3.4k
1

Thank you so much. Unfortunately learning in my own time isn't an option since there's pressure for me to produce results asap. Which really isn't helping anyone in the long run, but I don't have a say in it. But I get your point and next time will explain that I'm not just trying to bum off someone.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by san.san80

I am trying to use above mentioned perl script and modified it and trying to get a desirable output, but I am unable to get it. My input fasta file looks like-

V091201769_1 H648IQM04I6TDD orig_bc=AGCACTGTAG GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAGAGGAGAAGGAAGCTTGCCT V091209099_2 H648IQM04JTWVX orig_bc=ACGAGTGCGT GATAAACGCTGGCGGCGCACATAAGACATGCAAGTCGAACGAACTTAACCATTAGTTTAC V091201769_3 H648IQM04IRZLN orig_bc=AGCACTGTAG GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCACTTTATTTGATTTCCTTCGGGACT

and the text file-

V091200036 TR01555 V091205377 TR01556 V091201769 TR01557 V091209099 TR01558

The expected outcome which i need is-

TR01557_1 H648IQM04I6TDD orig_bc=AGCACTGTAG GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAGAGGAGAAGGAAGCTTGCCT TR01558_2 H648IQM04JTWVX orig_bc=ACGAGTGCGT GATAAACGCTGGCGGCGCACATAAGACATGCAAGTCGAACGAACTTAACCATTAGTTTAC TR01557_3 H648IQM04IRZLN orig_bc=AGCACTGTAG GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCACTTTATTTGATTTCCTTCGGGACT

I need to rename the header name before first underscore, and keep the rest of the header name as such.

Any suggestions?

ADD REPLYlink modified 21 months ago • written 21 months ago by shashank.gupta10
2
gravatar for shenwei356
2.2 years ago by
shenwei3564.1k
China
shenwei3564.1k wrote:

SeqKit supports this case now by subcommand replace, usage, download .

seqs:

$ cat seq.fa 
>TR1|c0_g1_i1
GTCGAGCATGGTCTTGGTCATCTTCCTTTCAAAGAA
>TR6|c0_g1_i1
GTGGAATATCGCCAGTGACCATCACTGATTAACCTG

names.txt should be tab-delimited:

$ cat names.txt 
TR1|c0_g1_i1    scaf0432344_50037.734_wgs
TR6|c0_g1_i1    scaf0159424_10142.072_wgs

replace

$ seqkit replace -p "(.+)" -r '{kv}|$1' -k names.txt seq.fa 
[INFO] read key-value file: names.txt
[INFO] 2 pairs of key-value loaded
>scaf0432344_50037.734_wgs|TR1|c0_g1_i1
GTCGAGCATGGTCTTGGTCATCTTCCTTTCAAAGAA
>scaf0159424_10142.072_wgs|TR6|c0_g1_i1
GTGGAATATCGCCAGTGACCATCACTGATTAACCTG
ADD COMMENTlink written 2.2 years ago by shenwei3564.1k
1
gravatar for san.san
2.6 years ago by
san.san80
Wales, UK
san.san80 wrote:

This is another solution that was proposed to me, if anyone finds it helpful:

awk 'FNR==NR{
  a[">"$1]=$2;next
}
$1 in a{
  sub(/>/,">"a[$1]"|",$1)
}1' names.txt seq.fa
ADD COMMENTlink written 2.6 years ago by san.san80
0
gravatar for shashank.gupta
21 months ago by
shashank.gupta10 wrote:

I am trying to use above mentioned perl script and modified it and trying to get a desirable output, but I am unable to get it. My input fasta file looks like-

>V091201769_1 H648IQM04I6TDD orig_bc=AGCACTGTAG
GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAGAGGAGAAGGAAGCTTGCCT
>V091209099_2 H648IQM04JTWVX orig_bc=ACGAGTGCGT
GATAAACGCTGGCGGCGCACATAAGACATGCAAGTCGAACGAACTTAACCATTAGTTTAC
>V091201769_3 H648IQM04IRZLN orig_bc=AGCACTGTAG
GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCACTTTATTTGATTTCCTTCGGGACT

and the text file-

V091200036 TR01555
V091205377 TR01556
V091201769 TR01557
V091209099 TR01558

The expected outcome which i need is-

>TR01557_1 H648IQM04I6TDD orig_bc=AGCACTGTAG
GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAGAGGAGAAGGAAGCTTGCCT
>TR01558_2 H648IQM04JTWVX orig_bc=ACGAGTGCGT
GATAAACGCTGGCGGCGCACATAAGACATGCAAGTCGAACGAACTTAACCATTAGTTTAC
>TR01557_3 H648IQM04IRZLN orig_bc=AGCACTGTAG
GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCACTTTATTTGATTTCCTTCGGGACT

I need to rename the header name before first underscore, and keep the rest of the header name as such.

Any suggestions?

ADD COMMENTlink modified 21 months ago • written 21 months ago by shashank.gupta10
2

Use SeqKit:

$ seqkit replace -p "^(.+?)_(\d+)" -r '{kv}_$2' -k names.txt seq.fa
ADD REPLYlink modified 21 months ago • written 21 months ago by shenwei3564.1k

That works. Thanks!

ADD REPLYlink written 21 months ago by shashank.gupta10
0
gravatar for na.poorin
6 months ago by
na.poorin0
na.poorin0 wrote:

Dear Shenwei, As I'm a biologist who is totally new to any programming task would you please tell me what would be the Seqkit's command if I want to do the following.

My input fasta is like

>AFA46815.1 chromosomal replication initiator protein DnaA [Acetobacterium woodii DSM 1030]
MDTSFKKIWDAALEIIKPDISPTSFNTWFLKIKPINYVNNTYYFLSENEFEKGILESRHIPLITNALAEVTGKTGQVKIV
LKEEDAMIAQNADSPNVSITNVEEKISNYQNNSMNPKYTFDSFVIGENNRFAHAACVAVSEAPSERYNPLFIYGGVGLGK
THLMQAIGHYILSYAPHKKVVYVSCEKFTNDFIDAIQNKTNISFRNKYRSADILLIDDIQFLAKKEGTQEEFFHTFNTLH
QENKQIVISSDRPPKEIPTLEERLRSRFESGLITDIYAPNFETRIAIIRKKAESFSDEIPNEVLSFIADSIHSNIRELEG
ALTTVFAYSKLHNAPINLDSAKNALKDIFRKKDDIVITSEYIKEVTAKYFNITVEDMNSKKRTRSISLPRQVAMYITREI
TDLSLPKIGDEFGGRDHSTVIHACQKITEEMTINTDFKNLILRIEREINGK
>AFA46816.1 DNA polymerase III subunit beta [Acetobacterium woodii DSM 1030]
MKFNCSKESLMAALTIAQKAVSSKTTLTILEGILFSAADNKLLLRSTDLEIGVEVTIPAEVEKEGEIVLTASIIGELIRK
MSGSDIFFESNENHQIKIECLLSNFTLKGLSSEEFPAFPEIIDDHIFSIDAGVLKELIKGTLFSVATNENIPVLTGVKIE
LHADDINLIALDGYRLALKSGKIKNNLTEDLSVIIPSKSLSEINKVLSNYSGDVTVKFSKNQIFFEMDNTQFTSRLLEGD
FINYKQIIPVEKATQVKINRRLLLESSERAALLAREGKNNLIKMDFNQDQLILTSNADIGDVFEVIPIENSGESLKIAFN
SKFLIEALRVIEGDELMIDMTTSVGPGVLLPIEGNNFIYLILPVRIAEEN
>AFA46817.1 RNA-binding protein containing S4 domain [Acetobacterium woodii DSM 1030]
MQIIEINTEFIKIDQLLKYAGIVGNGSDVKFMILDGLIRVNGELCTQRNKKIRDGDLVEIEDYEPLQVKGITTTCLSKR

and the text file is:

222528058
222528059
222528060

The expected output is:

>gi|222528058|ref|AFA46815.1| chromosomal replication initiator protein DnaA [Acetobacterium woodii DSM 1030]
MDTSFKKIWDAALEIIKPDISPTSFNTWFLKIKPINYVNNTYYFLSENEFEKGILESRHIPLITNALAEVTGKTGQVKIV
LKEEDAMIAQNADSPNVSITNVEEKISNYQNNSMNPKYTFDSFVIGENNRFAHAACVAVSEAPSERYNPLFIYGGVGLGK
THLMQAIGHYILSYAPHKKVVYVSCEKFTNDFIDAIQNKTNISFRNKYRSADILLIDDIQFLAKKEGTQEEFFHTFNTLH
QENKQIVISSDRPPKEIPTLEERLRSRFESGLITDIYAPNFETRIAIIRKKAESFSDEIPNEVLSFIADSIHSNIRELEG
ALTTVFAYSKLHNAPINLDSAKNALKDIFRKKDDIVITSEYIKEVTAKYFNITVEDMNSKKRTRSISLPRQVAMYITREI
TDLSLPKIGDEFGGRDHSTVIHACQKITEEMTINTDFKNLILRIEREINGK
>gi|222528059|ref|AFA46816.1| chromosomal replication initiator protein DnaA [Acetobacterium woodii DSM 1030]
MKFNCSKESLMAALTIAQKAVSSKTTLTILEGILFSAADNKLLLRSTDLEIGVEVTIPAEVEKEGEIVLTASIIGELIRK
MSGSDIFFESNENHQIKIECLLSNFTLKGLSSEEFPAFPEIIDDHIFSIDAGVLKELIKGTLFSVATNENIPVLTGVKIE
LHADDINLIALDGYRLALKSGKIKNNLTEDLSVIIPSKSLSEINKVLSNYSGDVTVKFSKNQIFFEMDNTQFTSRLLEGD
FINYKQIIPVEKATQVKINRRLLLESSERAALLAREGKNNLIKMDFNQDQLILTSNADIGDVFEVIPIENSGESLKIAFN
SKFLIEALRVIEGDELMIDMTTSVGPGVLLPIEGNNFIYLILPVRIAEEN
>gi|222528060|ref|AFA46817.1| chromosomal replication initiator protein DnaA [Acetobacterium woodii DSM 1030]
MQIIEINTEFIKIDQLLKYAGIVGNGSDVKFMILDGLIRVNGELCTQRNKKIRDGDLVEIEDYEPLQVKGITTTCLSKR

I will greatly appreciate if you could help

ADD COMMENTlink written 6 months ago by na.poorin0

Firstly preparing the mapping file of accession and GI:

If you have Unix/Linux, it's simple:

paste <(seqkit seq -n -i seqs.fasta) gi.txt
AFA46815.1      222528058
AFA46816.1      222528059
AFA46817.1      222528060

If not, you may need help of csvtk which has windows version:

seqkit seq -n -i seqs.fasta | csvtk sample -H -t -p 1 -n > n-acc.tsv

csvtk sample -H -t -p 1 -n gi.txt > n-gi.tsv

csvtk join -H -t n-acc.tsv n-gi.tsv | csvtk -H -t cut -f 2,3 > acc2gi.tsv

Then you can replace AFA46815.1 with gi|222528058|ref|AFA46815.1|:

seqkit replace -k acc2gi.tsv -p "^(.+?) (.+)$" -r "gi|{kv}|ref|\$1| \$2" seqs.fasta > seqs.renamed.fasta

seqkit seq -n seqs.renamed.fasta 
gi|222528058|ref|AFA46815.1| chromosomal replication initiator protein DnaA [Acetobacterium woodii DSM 1030]
gi|222528059|ref|AFA46816.1| DNA polymerase III subunit beta [Acetobacterium woodii DSM 1030]
gi|222528060|ref|AFA46817.1| RNA-binding protein containing S4 domain [Acetobacterium woodii DSM 1030]
ADD REPLYlink modified 6 months ago • written 6 months ago by shenwei3564.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1110 users visited in the last hour