Question: How to rename chromosome names in GTF file?
4 weeks ago, cristian80 wrote:

Hi,

I have a GTF file with the following head:

head celegans.gtf
CHROMOSOME_I    Coding_transcript   exon    4119    4358    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
CHROMOSOME_I    Coding_transcript   exon    5195    5296    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
CHROMOSOME_I    Coding_transcript   exon    6037    6327    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";

However, my FASTA file has the following chromosome names:

grep '>' celegans.fa
>I
>II
>III
>IV
>V
>X
>MtDNA

This discrepancy causes problems in downstream analyses. Does anyone know of a tool or way to rename the chromosome names in my GTF file to correspond to the chromosome names in the FASTA file?

Thanks.

Best, C.

Tags: genome annotation fasta gtf

Try running: sed -e 's/CHROMOSOME_//g' celegans.gtf. This only prints the result and leaves the file unchanged. Check the output and, if it looks right, edit the file in place with: sed -i 's/CHROMOSOME_//g' celegans.gtf

written 4 weeks ago by cpad01122.3k

Hi,

sed -e 's/CHROMOSOME_//g' celegans.gtf

works but the command with the '-i' option gives the following error:

sed: 1: "output/genome/celegans/ ...": invalid command code o

Does '-i' mean 'in-place', i.e. it changes the file directly? I am going to try redirecting the output of the '-e' command to the file itself.

Oops, this erased the content of the file. So the 'return' of the 'sed -e' command is NULL?

Best, C.

modified 4 weeks ago • written 4 weeks ago by cristian80
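
One way to overwrite the file without -i, and without losing it, is to write to a temporary file first and move it into place only if sed succeeds. A minimal sketch, assuming a small made-up stand-in file (demo.gtf and demo.gtf.tmp are hypothetical names, not files from this thread):

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# demo.gtf is a hypothetical two-line stand-in for celegans.gtf.
printf 'CHROMOSOME_I\tCoding_transcript\texon\t4119\t4358\n'  > demo.gtf
printf 'CHROMOSOME_II\tCoding_transcript\texon\t100\t200\n' >> demo.gtf

# Write the edited copy to a temporary file first; replace the
# original only if sed exited successfully.
sed 's/CHROMOSOME_//g' demo.gtf > demo.gtf.tmp && mv demo.gtf.tmp demo.gtf

# Show the first column of the edited file.
cut -f1 demo.gtf
```

Because the original is only replaced after sed has finished, a failed command leaves it intact.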

Yes, unfortunately the file is lost: the shell empties the redirect target before sed gets to read it. Before posting, I tried the command on example data and it worked (I am on Ubuntu with sed v4.2.2). I guess you are on macOS; the sed -i issue there is discussed here, and a workaround is given at the end of that post.

:~/Desktop $ sed -e 's/CHROMOSOME_//g' test.gtf 
head celegans.gtf
I    Coding_transcript   exon    4119    4358    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    5195    5296    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    6037    6327    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";

:~/Desktop $ sed -i 's/CHROMOSOME_//g' test.gtf 

~/Desktop $ cat test.gtf 
head celegans.gtf
I    Coding_transcript   exon    4119    4358    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    5195    5296    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    6037    6327    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";

$ uname -a
Linux genomics 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
modified 4 weeks ago • written 4 weeks ago by cpad01122.3k
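
The macOS error is consistent with BSD sed treating the argument after -i as a backup suffix, so the file path ends up being parsed as the sed script. A sketch of one form that both GNU sed and BSD sed accept, with the backup suffix attached directly to -i; demo.gtf is a hypothetical stand-in file, and .bak is just a conventional suffix choice:

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# demo.gtf is a hypothetical one-line stand-in for celegans.gtf.
printf 'CHROMOSOME_I\tCoding_transcript\texon\t4119\t4358\n' > demo.gtf

# The suffix is attached to -i with no space, which both GNU and BSD sed accept.
# The untouched original is kept as demo.gtf.bak.
sed -i.bak 's/CHROMOSOME_//g' demo.gtf

cut -f1 demo.gtf       # first column of the edited file
cut -f1 demo.gtf.bak   # first column of the backup
```

Keeping the backup also gives a way back if the substitution turns out to be wrong.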

Correct guess, thanks. It works now. So you made use of the fact that the GTF file just had 'CHROMOSOME_' prepended to all my FASTA chromosome names, right? Do you mind explaining this: 's/CHROMOSOME_//g'?

written 4 weeks ago by cristian80

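Correct. The sed substitution syntax is s/old string/new string/, where / is the delimiter between the parts. sed applies the command to every line of the file; the trailing g ("global") makes it replace every match on each line, whereas without it only the first match on each line is replaced. The whole expression is wrapped in quotes so the shell passes it to sed unchanged. In the line above, CHROMOSOME_ is the old string and the new string is empty, so in short the prefix gets removed.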

modified 4 weeks ago • written 4 weeks ago by cpad01122.3k
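
A small sketch of what the g flag changes, using a made-up line with two matches (the variable names are only for illustration). The last variant, anchored with ^, is an addition not mentioned above: it only touches a match at the start of the line, which is where the chromosome name sits in a GTF file.

```shell
# A made-up line containing the pattern twice.
line='CHROMOSOME_I CHROMOSOME_II'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/CHROMOSOME_//')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/CHROMOSOME_//g')

# Anchored with ^: only a match at the very start of the line is replaced.
start_only=$(printf '%s\n' "$line" | sed 's/^CHROMOSOME_//')

printf '%s\n' "$first_only" "$all_matches" "$start_only"
```

For the GTF in the question every line has a single match in the first column, so all three variants give the same result there.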

Maybe you can consolidate your comments into an answer and I can accept it?

written 4 weeks ago by cristian80

You could redirect to a new file and use that; there is no real need for -i.

sed 's/CHROMOSOME_//g' celegans.gtf > celegans.noChr.gtf
modified 4 weeks ago • written 4 weeks ago by bruce.moran320
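
After renaming, it may be worth checking that every sequence name used in the new GTF actually occurs in the FASTA. A rough sketch, assuming tiny made-up stand-in files (demo.fa, demo.noChr.gtf and the two *_names.txt files are hypothetical names):

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for celegans.fa and the renamed GTF.
printf '>I\nACGT\n>II\nACGT\n' > demo.fa
printf 'I\tCoding_transcript\texon\t4119\t4358\n'  > demo.noChr.gtf
printf 'II\tCoding_transcript\texon\t100\t200\n'  >> demo.noChr.gtf

# Names in the FASTA: header lines, minus '>' and anything after the first space.
grep '^>' demo.fa | sed 's/^>//; s/ .*//' | sort -u > fasta_names.txt

# Names in the GTF: first column, skipping '#' comment lines.
grep -v '^#' demo.noChr.gtf | cut -f1 | sort -u > gtf_names.txt

# Names used in the GTF but absent from the FASTA; empty means they all match.
missing=$(comm -23 gtf_names.txt fasta_names.txt)
printf 'missing from FASTA: %s\n' "${missing:-none}"
```

Any name printed on the last line would still need renaming before downstream tools can pair the two files.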