Genome Preparation for SynNet-Pipeline
3.9 years ago
galaxy • 0

Hi, I want to use this pipeline to run a synteny analysis between a couple of genomes. I'm using sequences from NCBI as well as Phytozome (CDS and GFF files). One of the requirements for the pipeline is that I shorten the gene names and add unique identifiers. This is what the documentation says (https://github.com/zhaotao1987/SynNet-Pipeline/wiki/Genome-Preparation):


  1. Process the sequence file

Shorten the gene names and add unique identifiers

When genome sequence files in fasta are downloaded, we'll see that gene names can be named in all kinds of different fashions.

For example gene names in formats like:

Aradu.20JM2 genotype-assembly-annot=V14167.a1.M1

DTZ79_01g11390

AT1G01010.1 | NAC domain containing protein 1 | Chr1:3760-5630 FORWARD LENGTH=429 | 201606

Bv1_000040_cpku.t1 cDNAEvidence=88.9

Except for the second gene name (which may also need a meaningful species prefix), the others all need to be shortened. For example, remove all the characters following the white space, preferably replace dots in names, and add a unique 3-5 letter prefix for the species. So names like >ath_AT1G01010, >Aradu_20JM2, ... look good.

Note several other things as well.

Check whether the total number of sequences in the fasta matches the records in the GFF file (counted by 'genes' in the 3rd column).

Make sure there are no alternative transcripts or protein sequences in the fasta file for a single coding region.

Remove '*' at the end of each sequence (which may cause a problem when building the sequence database with Diamond later).

Once it's ready, name the sequence file "abc.pep", where abc is the species abbreviation.
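
The first check quoted above (sequence count versus GFF gene records) could be sketched in Python roughly as follows. This is only a minimal sketch, not part of the pipeline: the function names are made up for illustration, and it assumes a tab-separated GFF whose third column uses the literal feature type `gene`, which may not hold for every annotation source.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def count_gff_features(lines, feature="gene"):
    """Count GFF rows whose 3rd column equals `feature` (assumed to be 'gene')."""
    total = 0
    for line in lines:
        # Skip comment/directive lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3 and columns[2] == feature:
            total += 1
    return total
```

Both functions accept any iterable of lines, so an open file handle can be passed directly. If the two counts differ, the fasta probably still contains alternative transcripts, or the GFF lists genes that have no sequence in the file.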


My question is basically: is there an easy way to go through these fasta files and edit the gene names? I'm familiar with Python and Bash, but I'm struggling to come up with efficient code to do this. Could anyone help me with this?
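
One possible starting point in Python, given as a hedged sketch rather than the pipeline's own method: keep the text before the first whitespace, replace dots and other unusual characters with underscores, prepend a species prefix, and strip a trailing '*' from sequence lines. The function names and the script name in the usage comment are hypothetical. Note that this sketch keeps the transcript suffix (AT1G01010.1 becomes ath_AT1G01010_1), whereas the wiki example drops it, so that part would need adjusting per dataset.

```python
import re
import sys


def shorten_id(header, prefix):
    """Build a short ID from a FASTA header (with or without the leading '>')."""
    # Keep only the text before the first whitespace.
    fields = header.lstrip(">").split()
    if not fields:
        raise ValueError("empty FASTA header")
    # Replace dots and any other unusual characters with underscores.
    core = re.sub(r"[^A-Za-z0-9_-]", "_", fields[0])
    return f"{prefix}_{core}"


def clean_fasta(lines, prefix):
    """Yield cleaned FASTA lines: short unique headers, no trailing '*'."""
    seen = set()
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            new_id = shorten_id(line, prefix)
            # Renaming must not collapse two records onto the same ID.
            if new_id in seen:
                raise ValueError(f"duplicate ID after renaming: {new_id}")
            seen.add(new_id)
            yield ">" + new_id
        else:
            # Drop the stop-codon marker at the end of a sequence line.
            sequence = line.rstrip("*")
            if sequence:
                yield sequence


# Hypothetical usage: python clean_fasta.py ath input.fa > ath.pep
if __name__ == "__main__" and len(sys.argv) == 3:
    species_prefix, fasta_path = sys.argv[1], sys.argv[2]
    with open(fasta_path) as handle:
        for cleaned in clean_fasta(handle, species_prefix):
            print(cleaned)
```

The file is streamed line by line, so memory use stays small even for large genomes. This sketch does not remove alternative transcripts; that step depends on how each source names its isoforms.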
