Compare two protein FASTA files and output an Excel file showing the headers that share the same sequence
2.4 years ago
robert • 0

Dear all, I have two files, file1.fasta and file2.fasta. Both contain some identical sequences but with different headers. I want to find which headers in the two FASTA files correspond to each other, and ideally output them as two columns of an Excel file.

[Screenshot of 1.fasta: the red rectangle marks the header I want]

[Screenshot of 2.fasta: the rectangle marks the header I want]

BLAST FASTA • 1.7k views

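Assuming both FASTA files are flattened (one line per sequence), and sorting on the sequence column because join expects sorted input:

$ join -t $'\t' -1 2 -2 2 -o 1.1,2.1 <(seqkit fx2tab a.fa | sort -t $'\t' -k2,2) <(seqkit fx2tab b.fa | sort -t $'\t' -k2,2)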
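If seqkit is not installed, the same idea can be sketched with awk, sort and join alone. This is only an illustration under assumptions: `fasta_to_tab` is a hypothetical helper name, and the two toy FASTA files are made up for the demo. The helper also joins wrapped sequence lines, so the inputs do not need to be flattened first.

```shell
#!/bin/sh
# Sketch: map headers between two FASTA files by identical sequence.
set -eu

# fasta_to_tab FILE (hypothetical helper): print "header<TAB>sequence" per record,
# concatenating wrapped sequence lines and dropping Windows carriage returns.
fasta_to_tab() {
    awk '
        { sub(/\r$/, "") }
        /^>/ { if (hdr != "") print hdr "\t" seq; hdr = substr($0, 2); seq = ""; next }
        { seq = seq $0 }
        END { if (hdr != "") print hdr "\t" seq }
    ' "$1"
}

tab=$(printf '\t')
workdir=$(mktemp -d)

# Toy inputs (made up): geneA matches prot1, geneB matches prot2, prot3 has no partner.
printf '>geneA desc\nMKT\nLLV\n>geneB\nMSSQ\n' > "$workdir/a.fa"
printf '>prot2\nMSSQ\n>prot1\nMKTLLV\n>prot3\nMAAA\n' > "$workdir/b.fa"

# join needs both inputs sorted on the join field (column 2, the sequence).
fasta_to_tab "$workdir/a.fa" | sort -t "$tab" -k2,2 > "$workdir/a.tab"
fasta_to_tab "$workdir/b.fa" | sort -t "$tab" -k2,2 > "$workdir/b.tab"

# Output only the two header columns, tab-separated.
mapping=$(join -t "$tab" -1 2 -2 2 -o 1.1,2.1 "$workdir/a.tab" "$workdir/b.tab")
printf '%s\n' "$mapping"
rm -r "$workdir"
```

The tab-separated output can be pasted into, or opened directly by, Excel. Sequences are compared exactly here, so differences in letter case or gap characters would prevent a match.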
2.4 years ago

It seems you need the mapping between two versions of a genome annotation.

Why save it as an Excel file? Tab-delimited plain-text files with the extension .xls are recognized by MS Excel too.

Anyway, csvtk can export .xlsx files. On Windows, replace \ with ^ in the code below (ref: Split long commands in multiple lines through Windows batch file).

# 1) sequence id -> sequence / seq hash

seqkit seq --remove-gaps --lower-case file1.fasta \
    | seqkit fx2tab --seq-hash \
    | csvtk cut -Ht -f 1,4 \
    > file1.fasta.hash

seqkit seq --remove-gaps --lower-case file2.fasta \
    | seqkit fx2tab --seq-hash \
    | csvtk cut -Ht -f 1,4 \
    > file2.fasta.hash

# 2) join two tables according to the seq hash, and convert the result to XLSX

csvtk join -Ht -f 2 file1.fasta.hash file2.fasta.hash \
    | csvtk cut -Ht -f 1,3 \
    | csvtk add-header -Ht -n file1,file2 \
    | csvtk csv2xlsx -t -o result.xlsx
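As far as I know, csvtk join performs an inner join by default, so sequences present in only one file drop out of the result without notice. A quick way to see what was left out is to list the file1 IDs whose hash never occurs in file2. The sketch below assumes the two-column layout (ID, tab, hash) produced in step 1; the toy tables and their hash values are made up so that the example runs without seqkit.

```shell
#!/bin/sh
# Sketch: list ids from file1.fasta.hash whose sequence hash is absent from file2.fasta.hash.
set -eu

workdir=$(mktemp -d)

# Toy stand-ins for the real hash tables (id<TAB>hash); values are invented.
printf 'geneA\th1\ngeneB\th2\ngeneC\th3\n' > "$workdir/file1.fasta.hash"
printf 'prot1\th1\nprot2\th2\n' > "$workdir/file2.fasta.hash"

# First file read (file2): remember every hash.
# Second file read (file1): print the id when its hash was never seen.
unmatched=$(awk -F '\t' '
    NR == FNR { seen[$2] = 1; next }
    !($2 in seen) { print $1 }
' "$workdir/file2.fasta.hash" "$workdir/file1.fasta.hash")

printf '%s\n' "$unmatched"
rm -r "$workdir"
```

Swapping the two file arguments gives the file2 IDs that have no partner in file1.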

Thank you for such a detailed reply; I am very grateful. Could you tell me how to run this code? I am on Windows 10 and have conda and Docker installed. Should I install seqkit through Python or Docker before running it?

I am currently working on the metabolic network reconstruction of a eukaryotic microorganism. I have no coding experience, and my current network and KEGG metabolic information are in Excel format, so I want the result in Excel format too. I could then use VLOOKUP to map my existing network to the genome and incorporate data from multiple sources. I know it's a clumsy way, but I don't know any other way to do this.


Both seqkit and csvtk provide executable binaries for Windows. The download page shows how to install them on Windows (download, decompress, copy to C:\WINDOWS\system32).

See also: How to Run an EXE File From Command Prompt

2.4 years ago
Mark ★ 1.5k

If you want the headers of a fasta:

grep ">" file.fasta > file.txt

Then copy the contents of file.txt into Excel.
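If only the sequence IDs are wanted in Excel rather than the full header lines, the grep output can be trimmed first. This is a small sketch that assumes the usual convention that the ID is the text between ">" and the first space; the toy file.fasta is made up for the demo.

```shell
#!/bin/sh
# Sketch: extract just the sequence ids from FASTA headers.
set -eu

workdir=$(mktemp -d)

# Toy input (made up): one header with a description, one without.
printf '>sp|P1|X some protein\nMKT\n>geneB\nMSSQ\n' > "$workdir/file.fasta"

# Keep header lines only, strip the leading ">", keep the text before the first space.
ids=$(grep '^>' "$workdir/file.fasta" | sed 's/^>//' | cut -d ' ' -f 1)

printf '%s\n' "$ids"
rm -r "$workdir"
```

Note that this lists headers from a single file; pairing headers across two files still requires matching on the sequences.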
