How to change headers of fasta files with the help of another file (nucl_wgs.accession2taxid.gz) containing the desired names (taxID_ accession.version)
1
0
Entering edit mode
5 months ago
Sebastian • 0

How could I change the headers of different fasta files with the help of the large file nucl_wgs.accession2taxid.gz with bash scripting?

I show the following example. All fasta files in the header begin with the GenBank accession version:

 $ grep "^>" GCA_000002435.2_UU_WB_2.1_genomic.fna | head
>CM018789.1 Giardia intestinalis strain WB C6 chromosome 1, whole genome shotgun sequence
>CM018790.1 Giardia intestinalis strain WB C6 chromosome 2, whole genome shotgun sequence
>CM018791.1 Giardia intestinalis strain WB C6 chromosome 3, whole genome shotgun sequence
>CM018792.1 Giardia intestinalis strain WB C6 chromosome 4, whole genome shotgun sequence
>CM018793.1 Giardia intestinalis strain WB C6 chromosome 5, whole genome shotgun sequence
>AACB03000006.1 Giardia intestinalis strain WB C6 tig00000001, whole genome shotgun sequence
>AACB03000007.1 Giardia intestinalis strain WB C6 tig00000004, whole genome shotgun sequence

This accession version also appears in the large file nucl_wgs.accession2taxid.gz available in https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz:

 $ zcat nucl_wgs.accession2taxid.gz | head
accession       accession.version       taxid   gi
AAAA00000000    AAAA00000000.2  39946   54362548
AAAA02000001    AAAA02000001.1  39946   54312315
AAAA02000002    AAAA02000002.1  39946   54312316
AAAA02000003    AAAA02000003.1  39946   54312317
AAAA02000004    AAAA02000004.1  39946   54312318
AAAA02000005    AAAA02000005.1  39946   54312319
AAAA02000006    AAAA02000006.1  39946   54312320
AAAA02000007    AAAA02000007.1  39946   54312321
AAAA02000008    AAAA02000008.1  39946   54312322

I am interested in changing the headers of my fasta files in the way of changing accession.version to taxid_accession.version (column 3 and 2 of nucl_wgs.accession2taxid.gz file).

wgs GenBank fasta bash Taxonomy • 1.8k views
ADD COMMENT
0
Entering edit mode
5 months ago

seqkit replace could edit fasta header with regular expression, matched patterns could be replaced with corresponding values via a tabular key-value file.

(updated)

  1. Extracting accession.versions in the fasta file

     seqkit seq -n -i files.fna > acc.txt
    
  2. Extracting accession.versions and their taxids with help of csvtk grep

     zcat nucl_wgs.accession2taxid.gz \
         | csvtk grep -t -f accession.version -P acc.txt \
         | csvtk cut -t -f accession.version,taxid \
         > acc2taxid.txt
    
  3. Replacing accession.version with taxids

     seqkit replace -k acc2taxid.txt -p '^(\S+)' -r '{kv}_$1' \
         files.fna -o result.fna
    

The orignial solution produces a huge acc2taxid.txt that would exhaust the RAM.

  1. Preparing the key-value file.

     zcat nucl_wgs.accession2taxid.gz \
         | cut -f 2,3 > acc2taxid.txt
    
  2. Replacing accession.version with taxids

     seqkit replace -k acc2taxid.txt -p '^(\S+)' -r '{kv}_$1' \
         files.fna -o result.fna
    
ADD COMMENT
0
Entering edit mode

I obtain an error, the process runs but at the end is killed.

$ seqkit replace -k acc2taxid.txt -p '^(\S+)' -r '{kv}_$1' \ GCA_000002435.2_UU_WB_2.1_genomic.fna -o result
.fna

[INFO] read key-value file: acc2taxid.txt


Killed
ADD REPLY
0
Entering edit mode

I see, the acc2taxid.txt is too big to fill into the main memory. I've updated the answer which generates a subset acc2taxid.txt.

ADD REPLY

Login before adding your answer.

Traffic: 2602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6