Question: Changing chromosome names in FASTA and GFF3 files
2.3 years ago by
blooming.daisy33390 wrote:

Hi, I'm new to Linux and NGS. I have a genome.fasta file and a transcripts.gff3 file. The two files use different chromosome naming patterns, so I'm unable to use them in Cufflinks, which gives the following error:

Warning: couldn't find fasta record for 'CA_chr13_BGI-A2_v1.0'!
This contig will not be bias corrected.

Can someone kindly help me make the chromosome names match in both files?

I mean the chromosome names in the FASTA file should be exactly the same as those in the GFF3 file, or vice versa. The chromosome names in the FASTA file are as follows:

gnl|BGIA2|CA_chr1

gnl|BGIA2|CA_chr2

while in the GFF3 file they are as follows:

CA_chr4_BGI-A2_v1.0

CA_chr6_BGI-A2_v1.0

It would be more convenient if they were shortened to simply chr1, chr2, chr3 and so on, rather than these long names. Also, chr1 in the FASTA file should be exactly chr1 in the GFF3 file.

Any help would be highly appreciated.

thanks

next-gen • 2.0k views
ADD COMMENTlink modified 2.3 years ago by genomax87k • written 2.3 years ago by blooming.daisy33390

You can try the following. Take a backup of your data and try it on sample data first. If you are on a Mac, note that sed -i requires a backup suffix argument (for example sed -i '' ...). Assuming that all the sequence headers (in the FASTA) and sequence IDs (in the GFF3) follow the naming patterns shown in the question, the code below should work. Please note that the example GFF3 below is not a valid GFF3 file: it contains only the first column, since the sequence IDs are in the first column only.

input gff3 with first column only:

$ cat test.gff3 
    ##gff-version 3.2.1
    ##sequence-region chr 1 1497228
    CA_chr4_BGI-A2_v1.0
    CA_chr6_BGI-A2_v1.0

output:

 $ sed  '/#/! s/.*_\(chr[0-9]\+\).*/\1/g' test.gff3 

##gff-version 3.2.1
##sequence-region chr 1 1497228
chr4
chr6

input fasta:

$ cat test.fa 
>gnl|BGIA2|CA_chr1
atgc
>gnl|BGIA2|CA_chr22
atgc
>gnl|BGIA2|CA_chr10
atgc
>gnl|BGIA2|CA_chr13
atgc

output fasta:

$ sed  '/>/ s/.*_\(chr[0-9]\+\)/>\1/g' test.fa 
>chr1
atgc
>chr22
atgc
>chr10
atgc
>chr13
atgc
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Thank you so much for the kind help. I'll try it and let you know. Could you kindly explain one thing: does chr[1-9] mean chromosomes 1-9? In that case I think it would work only for the first 9 chromosomes, while I have 13 chromosomes.

ADD REPLYlink written 2.3 years ago by blooming.daisy33390

It is chr[1-9]+, where + matches [1-9] one or more times (for chromosomes, up to two digits). However, it skips 10 because [1-9] does not include 0. I have updated the code and example data to use [0-9].
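
To see the difference between the two character classes, here is a minimal sketch using names in the format from this thread. The helper names capture_1_9 and capture_0_9 are made up for illustration, and sed -E (extended regular expressions) is assumed to be available.

```shell
# Hypothetical helpers: each reads names on stdin and prints the captured chrN.
capture_1_9() { sed -E 's/.*_(chr[1-9]+).*/\1/'; }
capture_0_9() { sed -E 's/.*_(chr[0-9]+).*/\1/'; }

names='CA_chr9_BGI-A2_v1.0
CA_chr10_BGI-A2_v1.0
CA_chr13_BGI-A2_v1.0'

# [1-9]+ stops at the first 0, so chr10 is silently captured as chr1.
printf '%s\n' "$names" | capture_1_9

# [0-9]+ accepts every digit, so chr10 and chr13 are captured whole.
printf '%s\n' "$names" | capture_0_9
```

With [1-9] the name chr10 collapses into chr1, which would collide with the real chr1, so [0-9] is the safer class here.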

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Dear cpad0112, I have tried this code on my GFF3 file:

sed  '/#/! s/.*_\(chr[0-9]\+\).*/\1/g' myfile.gff3

but there is no change in the naming of the first column. The chromosome names are the same as in the original file, i.e.

CA_chr4_BGI-A2_v1.0

CA_chr6_BGI-A2_v1.0

I simply put your command in front of my file name. The terminal output was as follows:

chr4

chr4

chr4

chr1

chr1

chr1

but there is no change in the first column of the file itself, and if I write the output to a new file it contains only the first column, while all the other columns are deleted.

I need your kind guidance, please.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by blooming.daisy33390

Could you please post a few GFF3 records here? E.g. the output of head file.gff3

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Yes, please see below:

##gff-version null
CA_chr4_BGI-A2_v1.0     GLEAN   mRNA    123284514       123288477       0.999991-       .       ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123288376       123288477       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287662       123287826       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287427       123287536       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287129       123287237       .      -1       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123286939       123287051       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123286180       123286330       .      -1       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123284514       123285671       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr9_BGI-A2_v1.0     GLEAN   mRNA    17802711        17803334        1      +.       ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1
ADD REPLYlink written 2.3 years ago by blooming.daisy33390
$ sed  '/#/! s/.*_\(chr[0-9]\+\).*\(\tGL*\)/\1\2/g' test.gtf

Try this and let me know. The assumption is that GLEAN is always in the second column.

##gff-version   null
chr4    GLEAN   mRNA    123284514   123288477   0.999991-   .   ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
chr4    GLEAN   CDS 123288376   123288477   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287662   123287826   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287427   123287536   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287129   123287237   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123286939   123287051   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123286180   123286330   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123284514   123285671   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr9    GLEAN   mRNA    17802711    17803334    1   +.  ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1
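
The command above depends on what the second column contains. A minimal sketch that touches only the first field, whatever the source column says, could look like the following. It assumes a sed that supports -E, first-column names that all contain _chrN, and comment lines that start with #. The helper name rename_gff_seqid and the sample records are made up for illustration.

```shell
# Hypothetical helper: shorten only the first field of each record.
# Lines starting with # are passed through unchanged.
rename_gff_seqid() {
  sed -E '/^#/!s/^[^[:space:]]*_(chr[0-9]+)[^[:space:]]*/\1/' "$1"
}

# Small tab-separated sample with three different sources in column 2.
tmp=$(mktemp -d)
printf '##gff-version 3\n' > "$tmp/sample.gff3"
printf 'CA_chr4_BGI-A2_v1.0\tGLEAN\tmRNA\t100\t200\t.\t-\t.\tID=a\n' >> "$tmp/sample.gff3"
printf 'CA_chr7_BGI-A2_v1.0\tCuff\texon\t300\t400\t.\t+\t.\tID=b\n' >> "$tmp/sample.gff3"
printf 'CA_chr13_BGI-A2_v1.0\tGeneWise\tCDS\t500\t600\t.\t+\t0\tID=c\n' >> "$tmp/sample.gff3"

# sed prints to the terminal by default; redirect the output to keep the result.
rename_gff_seqid "$tmp/sample.gff3" > "$tmp/sample.renamed.gff3"
cat "$tmp/sample.renamed.gff3"
```

Because the pattern is anchored at the start of the line and stops at the first whitespace, the remaining columns are left exactly as they were.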
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Thank you so much for the kind help and your precious time. I have tried this. I'm getting good results for some of the entries, while others are unchanged. Pasting some examples from the output file (opened in Excel) below:

chr3    GLEAN
chr4    GLEAN
chr4    GLEAN
chr4    GLEAN
chr11   GLEAN
chr11   GLEAN
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff

and

chr2    GLEAN
chr2    GLEAN
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
chr4    GeneWise
chr4    GeneWise
chr4    GeneWise

Please help me fix it.

thanks

ADD REPLYlink written 2.3 years ago by blooming.daisy33390
2.3 years ago by
cpad011213k
India
cpad011213k wrote:

Try this and let me know (assuming that header lines in the file start with #):

$ awk  '!/#/ {split($1, a, "_"); $1=a[2]}1'  OFS='\t' test.gtf

input gtf:

$ cat test.gtf 
##gff-version   null
CA_chr4_BGI-A2_v1.0 GLEAN   mRNA    123284514   123288477   0.999991-   .   ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123288376   123288477   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123287662   123287826   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123287427   123287536   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123287129   123287237   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 TEMP    CDS 123286939   123287051   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 OUT CDS 123286180   123286330   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 NYCTY   CDS 123284514   123285671   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr9_BGI-A2_v1.0 GLEAN   mRNA    17802711    17803334    1   +.  ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1

output gtf:

$ awk  '!/#/ {split($1, a, "_"); $1=a[2]}1'  OFS='\t' test.gtf
##gff-version   null
chr4    GLEAN   mRNA    123284514   123288477   0.999991-   .   ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
chr4    GLEAN   CDS 123288376   123288477   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287662   123287826   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287427   123287536   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287129   123287237   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    TEMP    CDS 123286939   123287051   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    OUT CDS 123286180   123286330   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    NYCTY   CDS 123284514   123285671   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr9    GLEAN   mRNA    17802711    17803334    1   +.  ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1
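
The one-liner above splits fields on any whitespace, so a space inside the attributes column would be turned into a tab when the line is rebuilt. A slightly more defensive sketch, assuming a tab-separated file whose comment lines start with #, is below. The helper name rename_gff_awk and the sample records are made up for illustration.

```shell
# Hypothetical helper: replace column 1 with its second "_"-separated piece
# (CA_chr4_BGI-A2_v1.0 becomes chr4), splitting and joining on tabs only.
rename_gff_awk() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/  { print; next }   # keep comment and header lines untouched
       { split($1, a, "_"); $1 = a[2]; print }' "$1"
}

# Sample record whose attributes column contains a space.
tmp=$(mktemp -d)
printf '##gff-version 3\n' > "$tmp/in.gff3"
printf 'CA_chr4_BGI-A2_v1.0\tGLEAN\tmRNA\t1\t9\t.\t-\t.\tID=g1;Note=hypothetical protein\n' >> "$tmp/in.gff3"

# Write the result to a new file instead of overwriting the input.
rename_gff_awk "$tmp/in.gff3" > "$tmp/out.gff3"
cat "$tmp/out.gff3"
```

Setting FS and OFS to a tab in BEGIN keeps the nine GFF3 columns intact, and anchoring the comment test with ^ avoids skipping records that merely contain a # somewhere in the attributes.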
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

It is working absolutely fine now. Thank you so much for the kind help. Could you please also help me resolve the same issue with the FASTA file? I need the same chromosome naming in the FASTA file too. The suggestions given below work only for a few chromosome headers and not for others, i.e. they change the chromosome name for some but not for others. Any help would be highly appreciated.

My awk version is GNU Awk 3.1.7, and gawk is also available.

thanks again

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by blooming.daisy33390

For fasta file, try this and let me know:

$ awk '/>/ {split($1,a,"_"); $1=">"a[2]}1' test.fa

This assumes that the headers all have the same format, with an _ before the chr (chromosome) entry (e.g. >gnl|BGIA2|CA_chr1).

input:

$ cat test.fa 
>gnl|BGIA2|CA_chr1
atgc
>gnl|BGIA2|CA_chr22
atgc
>gnl|BGIA2|CA_chr10
atgc
>gnl|BGIA2|CA_chr13
atgc

output:

$ awk '/>/ {split($1,a,"_"); $1=">"a[2]}1' test.fa 
>chr1
atgc
>chr22
atgc
>chr10
atgc
>chr13
atgc
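
After renaming, it can help to write the result to a new file and inspect the headers before using it. A minimal sketch, assuming the same header format as above (exactly one _ before chrN); the helper name rename_fasta_headers and the file names are made up for illustration.

```shell
# Hypothetical helper: on header lines only, keep ">" plus the piece after "_".
# Sequence lines are printed unchanged.
rename_fasta_headers() {
  awk '/^>/ { split($1, a, "_"); print ">" a[2]; next }
       { print }' "$1"
}

tmp=$(mktemp -d)
printf '>gnl|BGIA2|CA_chr1\natgc\n>gnl|BGIA2|CA_chr10\natgc\n>gnl|BGIA2|CA_chr13\natgc\n' > "$tmp/genome.fa"

# Keep the original file and write the renamed copy next to it.
rename_fasta_headers "$tmp/genome.fa" > "$tmp/genome.renamed.fa"

# List the new headers, then print any header that occurs more than once.
grep '^>' "$tmp/genome.renamed.fa"
grep '^>' "$tmp/genome.renamed.fa" | sort | uniq -d
```

If the last command prints nothing, every header is unique; a duplicate would mean two chromosomes were collapsed into the same name.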
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Yes, it is working absolutely fine. I'm extremely thankful to you for your kind help. The effort and time you have spent guiding a newbie like me are much appreciated.

thank you again

ADD REPLYlink written 2.3 years ago by blooming.daisy33390

Np. Good luck with your research.

ADD REPLYlink written 2.3 years ago by cpad011213k

Try again and let me know. There was a copy/paste mistake. Check whether you are using gawk by typing $ awk --version in the console to see the awk version. Please also check that the quote characters (apostrophes) were copied correctly, as copying special characters from the web sometimes creates issues.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k
2.3 years ago by
genomax87k
United States
genomax87k wrote:

While you could change the names in your fasta file by doing

sed -e 's/gnl|BGIA2|CA_chr1/CA_chr1_BGI-A2_v1.0/' -e 's/gnl|BGIA2|CA_chr2/CA_chr2_BGI-A2_v1.0/' fasta > new_fasta

you will need to samtools reheader the alignment files (if you have already done the alignments). It may be simpler to just do the alignments again after changing the names to ensure that everything remains in sync.
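
A quick way to confirm that the FASTA headers and the annotation's first column are in sync, sketched under the assumption that the annotation is tab-separated and its comment lines start with #. The helper name missing_in_fasta and the sample files are made up for illustration.

```shell
# Hypothetical helper: print sequence names used in the annotation (column 1)
# that have no matching header in the FASTA file. No output means they agree.
missing_in_fasta() {
  fa_names=$(mktemp)
  gff_names=$(mktemp)
  # FASTA names: header lines without ">" and without any description after a space.
  grep '^>' "$1" | sed -E 's/^>//; s/[[:space:]].*//' | sort -u > "$fa_names"
  # Annotation names: column 1 of every non-comment, non-empty line.
  awk -F '\t' '!/^#/ && NF { print $1 }' "$2" | sort -u > "$gff_names"
  # comm -13 keeps only the lines that are unique to the second list.
  comm -13 "$fa_names" "$gff_names"
  rm -f "$fa_names" "$gff_names"
}

tmp=$(mktemp -d)
printf '>chr1\natgc\n>chr2\natgc\n' > "$tmp/genome.fa"
printf '##gff-version 3\n' > "$tmp/annot.gff3"
printf 'chr1\tGLEAN\tmRNA\t1\t9\t.\t+\t.\tID=a\n' >> "$tmp/annot.gff3"
printf 'CA_chr4_BGI-A2_v1.0\tCuff\texon\t1\t9\t.\t+\t.\tID=b\n' >> "$tmp/annot.gff3"

missing_in_fasta "$tmp/genome.fa" "$tmp/annot.gff3"   # prints CA_chr4_BGI-A2_v1.0
```

Any name printed here is one that a downstream tool would not find in the FASTA file.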

If you want to simply replace the names with chr* then do

sed -e 's/gnl|BGIA2|CA_chr1/chr1/' -e 's/gnl|BGIA2|CA_chr2/chr2/' fasta > new_fasta
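
The commands above list only chr1 and chr2, so one -e expression per chromosome would have to be added for the rest. Note also that the chr1 pattern is a prefix of chr10 to chr13, so those headers are rewritten by the chr1 expression as well (in the first command that gives names such as CA_chr1_BGI-A2_v1.00). A minimal sketch that handles every chromosome with one anchored expression, assuming sed -E is available and each header is exactly >gnl|BGIA2|CA_chrN with nothing after it; the helper names to_short_names and to_gff_names are made up for illustration.

```shell
# Hypothetical helpers for the two naming choices discussed above.
# \| is a literal pipe in extended regular expressions; $ anchors the end of the header.
to_short_names() { sed -E 's/^>gnl\|BGIA2\|CA_(chr[0-9]+)$/>\1/' "$1"; }
to_gff_names()   { sed -E 's/^>gnl\|BGIA2\|(CA_chr[0-9]+)$/>\1_BGI-A2_v1.0/' "$1"; }

tmp=$(mktemp -d)
printf '>gnl|BGIA2|CA_chr1\natgc\n>gnl|BGIA2|CA_chr3\natgc\n>gnl|BGIA2|CA_chr10\natgc\n' > "$tmp/genome.fa"

to_short_names "$tmp/genome.fa" > "$tmp/genome.short.fa"
to_gff_names "$tmp/genome.fa" > "$tmp/genome.long.fa"
grep '^>' "$tmp/genome.short.fa" "$tmp/genome.long.fa"
```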
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by genomax87k

Similarly, when I used the code

sed -e 's/gnl|BGIA2|CA_chr1/CA_chr1_BGI-A2_v1.0/' -e 's/gnl|BGIA2|CA_chr2/CA_chr2_BGI-A2_v1.0/' fasta > new_fasta

it changes only the first chromosome name. I could not find the other names with the grep command.

The one given below does not work for chromosomes 3-9; it changes the names only for the others (I have 13 chromosomes).

sed -e 's/gnl|BGIA2|CA_chr1/chr1/' -e 's/gnl|BGIA2|CA_chr2/chr2/' fasta > new_fasta

I need your kind guidance in this regard.

thanks

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by blooming.daisy33390