Question

changing name of chromosomes in fasta and GFF3 file

2

Entering edit mode

6.0 years ago

blooming.daisy333 ▴ 110

hi, Im newbie to linux and NGS. I have genome.fasta file and transcripts.gff3 file. Both files have different chromosome naming pattern due to which im unable to use them in cufflinks as its giving following error:

Warning: couldn't find fasta record for 'CA_chr13_BGI-A2_v1.0'!
This contig will not be bias corrected.

can someone kindly help me how to equate chromosome naming in both files???

i mean the chromosome name for fasta file should exactly be the same as to that in gff3 file or vice versa. the chromosome names in the fasta file are as follows:

gnl|BGIA2|CA_chr1

gnl|BGIA2|CA_chr2

while in gff3 file these are as follows:

CA_chr4_BGI-A2_v1.0

CA_chr6_BGI-A2_v1.0

it would be more convenient if they are retained to simply as chr1, chr2,chr3 and so on rather than this long naming. further chr1 in fasta file should exactly be the chr1 in gff3 file.

kind help would be highly appreciated.

thanks

next-gen • 7.1k views

ADD COMMENT • link updated 18 months ago by zyndrome • 0 • written 6.0 years ago by blooming.daisy333 ▴ 110

0

Entering edit mode

you can try following. Take a backup of your data and try on sample data first. If you are on mac, sed -i needs output filename. Assuming that all the sequence headers (in fasta) and sequence IDs in gff3 follow the same naming pattern in OP, OP can try following code. Please note that example gff3 format is not correct. Example code needs first column as Sequence IDs are in first column only.

input gff3 with first column only:

$ cat test.gff3 
    ##gff-version 3.2.1
    ##sequence-region chr 1 1497228
    CA_chr4_BGI-A2_v1.0
    CA_chr6_BGI-A2_v1.0

output:

 $ sed  '/#/! s/.*_\(chr[0-9]\+\).*/\1/g' test.gff3 

##gff-version 3.2.1
##sequence-region chr 1 1497228
chr4
chr6

input fasta:

$ cat test.fa 
>gnl|BGIA2|CA_chr1
atgc
>gnl|BGIA2|CA_chr22
atgc
>gnl|BGIA2|CA_chr10
atgc
>gnl|BGIA2|CA_chr13
atgc

output fasta:

$ sed  '/>/ s/.*_\(chr[0-9]\+\)/>\1/g' test.fa 
>chr1
atgc
>chr22
atgc
>chr10
atgc
>chr13
atgc

ADD REPLY • link 6.0 years ago by cpad0112 21k

0

Entering edit mode

thank you so much for the kind help. i ll try it and let you know. kindly explain one thing that whether (chr[1-9] means chromosome 1-9 ??? in that case i think it would work only for first 9 chromosomes while i have 13 chromosome.

ADD REPLY • link 6.0 years ago by blooming.daisy333 ▴ 110

0

Entering edit mode

It is chr[1-9]+ where + matches one or more times of [1-9] (In case of chromosomes, two digits) . However it skips 10. I updated the code and example data.

ADD REPLY • link 6.0 years ago by cpad0112 21k

0

Entering edit mode

Dear cpado, I have tried this code on my gff3 file

sed  '/#/! s/.*_\(chr[0-9]\+\).*/\1/g' myfile.gff3

but there is no change in the naming of the first column (chromosome names are the same as in the parent file i.e

CA_chr4_BGI-A2_v1.0

CA_chr6_BGI-A2_v1.0

i simply placed your code before my file. i could see the terminal output as follows : chr4

chr4

chr1

but no change in first column. and if i append it in new file it give me only first column while al other columns are deleted.

Need your kind guidance plz

ADD REPLY • link 6.0 years ago by blooming.daisy333 ▴ 110

0

Entering edit mode

Could you please post few gff3 records here? For eg. head(file.gff3)

ADD REPLY • link 6.0 years ago by cpad0112 21k

0

Entering edit mode

yes plz see below

##gff-version null
CA_chr4_BGI-A2_v1.0     GLEAN   mRNA    123284514       123288477       0.999991-       .       ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123288376       123288477       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287662       123287826       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287427       123287536       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287129       123287237       .      -1       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123286939       123287051       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123286180       123286330       .      -1       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123284514       123285671       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr9_BGI-A2_v1.0     GLEAN   mRNA    17802711        17803334        1      +.       ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1

ADD REPLY • link 6.0 years ago by blooming.daisy333 ▴ 110

0

Entering edit mode

$ sed  '/#/! s/.*_\(chr[0-9]\+\).*\(\tGL*\)/\1\2/g' test.gtf

Try this let me know. Assumption is that GLEAN is always in second column.

##gff-version   null
chr4    GLEAN   mRNA    123284514   123288477   0.999991-   .   ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
chr4    GLEAN   CDS 123288376   123288477   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287662   123287826   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287427   123287536   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287129   123287237   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123286939   123287051   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123286180   123286330   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123284514   123285671   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr9    GLEAN   mRNA    17802711    17803334    1   +.  ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

thank you so much for the kind help and your precious time. I have tried this. im getting good results for some of the entries while other are unchanged. pasting some of the examples from the output file opened in excel below:;

chr3    GLEAN
chr4    GLEAN
chr4    GLEAN
chr4    GLEAN
chr11   GLEAN
chr11   GLEAN
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff
CA_chr4_BGI-A2_v1.0 Cuff

and

chr2    GLEAN
chr2    GLEAN
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
CA_chr7_BGI-A2_v1.0 Cuff
chr4    GeneWise
chr4    GeneWise
chr4    GeneWise

plz help me to fix it

thanks

ADD REPLY • link 5.9 years ago by blooming.daisy333 ▴ 110

1

Entering edit mode

6.0 years ago

GenoMax 141k

While you could change the names in your fasta file by doing

sed -e 's/gnl|BGIA2|CA_chr1/CA_chr1_BGI-A2_v1.0/' -e 's/gnl|BGIA2|CA_chr2/CA_chr2_BGI-A2_v1.0/' fasta > new_fasta

you will need to samtools reheader the alignment files (if you have already done the alignments). It may be simpler to just do the alignments again after changing the names to ensure that everything remains in sync.

If you want to simply replace the names with chr* then do

sed -e 's/gnl|BGIA2|CA_chr1/chr1/' -e 's/gnl|BGIA2|CA_chr2/chr2/' fasta > new_fasta

ADD COMMENT • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

in the same way while i have used the code

sed -e 's/gnl|BGIA2|CA_chr1/CA_chr1_BGI-A2_v1.0/' -e 's/gnl|BGIA2|CA_chr2/CA_chr2_BGI-A2_v1.0/' fasta > new_fasta

its changing only first chromosome name. I could not find the other names by "grep" command.

while this one (given below) is not working for chromosome 3-9 while changing names only for others (i have 13 chromosome).

sed -e 's/gnl|BGIA2|CA_chr1/chr1/' -e 's/gnl|BGIA2|CA_chr2/chr2/' fasta > new_fasta

need your kind guidance in this regard

thanks

ADD REPLY • link 5.9 years ago by blooming.daisy333 ▴ 110

score 2 · Accepted Answer · 2018-05-08

2

Entering edit mode

5.9 years ago

cpad0112 21k

Try this and let me know (assuming that file has # before headers):

$ awk  '!/#/ {split($1, a, "_"); $1=a[2]}1'  OFS='\t' test.gtf

input gtf:

$ cat test.gtf 
##gff-version   null
CA_chr4_BGI-A2_v1.0 GLEAN   mRNA    123284514   123288477   0.999991-   .   ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123288376   123288477   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123287662   123287826   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123287427   123287536   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN   CDS 123287129   123287237   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 TEMP    CDS 123286939   123287051   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 OUT CDS 123286180   123286330   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 NYCTY   CDS 123284514   123285671   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr9_BGI-A2_v1.0 GLEAN   mRNA    17802711    17803334    1   +.  ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1

output gtf:

$ awk  '!/#/ {split($1, a, "_"); $1=a[2]}1'  OFS='\t' test.gtf
##gff-version   null
chr4    GLEAN   mRNA    123284514   123288477   0.999991-   .   ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
chr4    GLEAN   CDS 123288376   123288477   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287662   123287826   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287427   123287536   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    GLEAN   CDS 123287129   123287237   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    TEMP    CDS 123286939   123287051   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    OUT CDS 123286180   123286330   .   -1  Parent=Cotton_A_18927_BGI-A2_v1.0
chr4    NYCTY   CDS 123284514   123285671   .   -0  Parent=Cotton_A_18927_BGI-A2_v1.0
chr9    GLEAN   mRNA    17802711    17803334    1   +.  ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1

ADD COMMENT • link 5.9 years ago by cpad0112 21k

1

Entering edit mode

it is working absolutely fine now. thank you soooo much for the kind help... can you plz kindly help me to resolve the same issue with fasta file??? i need same chromosome naming in fasta file too. while the suggessions given below are working only for few chromosome headers while not for others. i.e it is changing name of chromosome for some but not for others... kind help would highly be appreciated

awk version is GNU Awk 3.1.7 while gawk is also there

thanks again

ADD REPLY • link 5.9 years ago by blooming.daisy333 ▴ 110

1

Entering edit mode

For fasta file, try this and let me know:

$ awk '/>/ {split($1,a,"_"); $1=">"a[2]}1' test.fa

Assuming that headers are the same format where there is an _ before chr (chromosome) entries. (Eg. >gnl|BGIA2|CA_chr1)

input:

$ cat test.fa 
>gnl|BGIA2|CA_chr1
atgc
>gnl|BGIA2|CA_chr22
atgc
>gnl|BGIA2|CA_chr10
atgc
>gnl|BGIA2|CA_chr13
atgc

output:

$ awk '/>/ {split($1,a,"_"); $1=">"a[2]}1' test.fa 
>chr1
atgc
>chr22
atgc
>chr10
atgc
>chr13
atgc

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

yes, it is working absolutely fine. im extremely thankful to you for your kind help. the effort and the time that you have spent to guide a newbie like me is worth appreciating.

thank you again

ADD REPLY • link 5.9 years ago by blooming.daisy333 ▴ 110

0

Entering edit mode

Np. Good luck with your research.

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

try again and let me know. There was some copy/paste mistake. Check if you are using gawk and type $ awk --version in console, to know the awk version. Please also check copying " (apstrophe) as some times, copying special characters from web creates issues.

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

I have a similar issue but instead of "CA_chr4_BGI-A2_v1.0" I only have the number "4".

Example rows:

1   havana  gene    62948   63887   .   +   .   ID=gene:ENSG00000240361;Name=OR4G11P;biotype=unprocessed_pseudogene;description=olfactory receptor family 4 subfamily G member 11 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:31276];gene_id=ENSG00000240361;logic_name=havana;version=1
1   havana  pseudogene  62948   63887   .   +   .   ID=transcript:ENST00000492842;Parent=gene:ENSG00000240361;Name=OR4G11P-201;biotype=unprocessed_pseudogene;tag=basic;transcript_id=ENST00000492842;transcript_support_level=NA;version=1

I tried:

$ awk '!/#/ {print "chr"$1, $1=a[2]}1'  OFS='\t' my.gff3 > test.gff3

But it returns

chr1
    havana  gene    62948   63887   .   +   .   ID=gene:ENSG00000240361;Name=OR4G11P;biotype=unprocessed_pseudogene;description=olfactory receptor family 4 subfamily G member 11 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:31276];gene_id=ENSG00000240361;logic_name=havana;version=1
chr1
    havana  pseudogene  62948   63887   .   +   .   ID=transcript:ENST00000492842;Parent=gene:ENSG00000240361;Name=OR4G11P-201;biotype=unprocessed_pseudogene;tag=basic;transcript_id=ENST00000492842;transcript_support_level=NA;version=1

I need the return to be a single row instead of the two I am now getting.

Help would be greatly appreciated. Thanks!

ADD REPLY • link 18 months ago by zyndrome • 0