I have a file containing all the protein fastas from my genome of interest (downloaded from NCBI):
I've been conducting some analysis and have a list of genes I'm interested in. However, I don't have the accession numbers, only a list of partial gene IDs assigned by the people who did the annotation:
Is there a way I can search the protein fastas file for a list of proteins using information from the protein headers?
I've tried grep -f genes_of_interest.txt protein_fastas_file.faa > output.fa in an attempt to at least pull out the headers so I can get the accession numbers, but all it did was return every single protein fasta (i.e. exactly the same file as protein_fastas_file.faa). I'm assuming this is because the sequence part doesn't actually exist on a new line or something?
Thanks in advance for any help anyone can give!
Input files are taken from GenoMax's post.
input:
output:
Download seqkit from here: https://bioinf.shenwei.me/seqkit/download/
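If installing seqkit is not an option, the same header-based lookup can be roughly sketched in plain Python. This is only a sketch and not what seqkit does internally. It assumes genes_of_interest.txt holds one partial gene ID per line and that each ID appears verbatim somewhere in the FASTA header line. The function names (read_fasta, load_patterns, filter_records, extract) are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def load_patterns(lines):
    """Return the non-empty, whitespace-stripped IDs from the list file."""
    # Blank lines are dropped: an empty pattern would match every header.
    return [p.strip() for p in lines if p.strip()]


def filter_records(records, patterns):
    """Keep records whose header contains at least one ID as a substring."""
    for header, seq in records:
        if any(p in header for p in patterns):
            yield header, seq


def extract(faa_path, ids_path, out_path):
    """Write the matching records from faa_path to out_path."""
    with open(ids_path) as fh:
        patterns = load_patterns(fh)
    with open(faa_path) as fa, open(out_path, "w") as out:
        for header, seq in filter_records(read_fasta(fa), patterns):
            out.write(f">{header}\n{seq}\n")
```

Calling extract("protein_fastas_file.faa", "genes_of_interest.txt", "output.fa") would write the matching proteins with each sequence on a single line. Because the match is a plain substring test, a short ID can also hit longer IDs that contain it, so it is worth checking the number of records returned against the number of IDs supplied.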