Hi, I want to use this pipeline to do a synteny analysis between a couple of genomes. I'm using sequences from NCBI as well as Phytozome (CDS and GFF files). One of the requirements of the pipeline is that I shorten the gene names and add unique identifiers. This is what the documentation says (https://github.com/zhaotao1987/SynNet-Pipeline/wiki/Genome-Preparation):
- Process the sequence file
Shorten the gene names and add unique identifiers
When genome sequence files in fasta are downloaded, we'll see that gene names can be named in all kinds of different fashions.
For example gene names in formats like:
Aradu.20JM2 genotype-assembly-annot=V14167.a1.M1
DTZ79_01g11390
AT1G01010.1 | NAC domain containing protein 1 | Chr1:3760-5630 FORWARD LENGTH=429 | 201606
Bv1_000040_cpku.t1 cDNAEvidence=88.9
Except for the second gene name (which may also need a meaningful species prefix), the other ones all need to be shortened. For example, remove all the characters following the first white space, replace dots in names, and add a unique 3-5 letter prefix for the species. So names like >ath_AT1G01010, >Aradu_20JM2, ... look good.
Note several other things as well:
- check whether the total number of sequences in the fasta matches the records in the GFF file (counted by 'gene' in the 3rd column)
- make sure there are no alternative transcripts or protein sequences in the fasta file for a single coding region
- remove '*' at the end of each sequence (which may cause a problem when building the sequence database with Diamond later)
Once it's ready, name the sequence file as "abc.pep", where abc is the species abbreviation.
My question is basically: is there an easy way to go through these fasta files and edit the gene names? I'm familiar with Python and Bash, but I'm struggling to come up with efficient code to do this. Could anyone help me with this?
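
To make the question concrete, here is a minimal Python sketch of the steps quoted above, using only the standard library. It is an illustration under assumptions, not the pipeline's own tooling: the function names (`clean_header`, `read_fasta`, `rename_records`, `count_gff_genes`, `write_pep`) are hypothetical, a trailing `.<digits>` on an ID is assumed to be a transcript/version suffix that can be dropped, and the GFF is assumed to be tab-separated with the feature type in the 3rd column.

```python
import re


def clean_header(header, prefix):
    """Turn a raw FASTA header (without '>') into '<prefix>_<short id>'."""
    name = header.split()[0]              # drop everything after the first whitespace
    # Assumption: a trailing '.<digits>' is a transcript suffix (AT1G01010.1 -> AT1G01010).
    name = re.sub(r"\.\d+$", "", name)
    name = name.replace(".", "_")         # replace remaining dots
    # Avoid doubling the prefix when the ID already starts with it (Aradu.20JM2).
    if name.lower().startswith(prefix.lower() + "_"):
        name = name[len(prefix) + 1:]
    return f"{prefix}_{name}"


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def rename_records(lines, prefix):
    """Yield (new_id, sequence) with cleaned IDs and the trailing '*' removed."""
    seen = set()
    for header, seq in read_fasta(lines):
        new_id = clean_header(header, prefix)
        if new_id in seen:
            # Usually means alternative transcripts are still in the file.
            raise ValueError(f"duplicate ID after renaming: {new_id}")
        seen.add(new_id)
        yield new_id, seq.rstrip("*")


def count_gff_genes(lines):
    """Count GFF records whose 3rd column is exactly 'gene'."""
    n = 0
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2 and cols[2] == "gene":
            n += 1
    return n


def write_pep(in_path, prefix, out_path):
    """Write the renamed records to out_path; return the number written."""
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for new_id, seq in rename_records(src, prefix):
            dst.write(f">{new_id}\n{seq}\n")
            count += 1
    return count
```

With this sketch, something like `write_pep("input.fa", "ath", "ath.pep")` would produce the renamed file, and the returned count could be compared with `count_gff_genes(open("annotation.gff"))` (both file names are placeholders). The duplicate check is there because dropping the version suffix makes two isoforms of one gene collide, which is also what the documentation asks to avoid.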