Question: How do I remove an unwanted suffix from the IDs in a GFF3 file?
10 months ago, eennadi wrote:

I would like to find out how to remove "-T" from the IDs in my GFF3 file.

I have pasted a sample below.

For example, I would like ID=C1_00010W_A-T to become ID=C1_00010W_A; in other words, the ID should match the Parent.

CaO19.6115,IPF21113.1,IPF27828.1,orf19.13534,orf19.6115
Ca22chr1A_C_albicans_SC5314 CGD mRNA    4059    4397    .   +   .   ID=C1_00010W_A-T;Parent=C1_00010W_A;Name=C1_00010W_A;Note=%28orf19.6115%29%20Dubious%20open%20reading%20frame;orf_classification=Dubious;Alias=C1_00010W,C1_00010W_B,CaO19.11880,CaO19.13534,CaO19.4402,CaO19.6115,IPF21113.1,IPF27828.1,orf19.13534,orf19.6115
Ca22chr1A_C_albicans_SC5314 CGD exon    4059    4397    .   +   .   ID=C1_00010W_A-T-E1;Parent=C1_00010W_A-T
Ca22chr1A_C_albicans_SC5314 CGD CDS 4059    4397    .   +   0   ID=C1_00010W_A-P;Parent=C1_00010W_A-T;orf_classification=Dubious;parent_feature_type=ORF
Ca22chr1B_C_albicans_SC5314 CGD gene    4059    4397    .   +   .   ID=C1_00010W_B;Name=C1_00010W_B;Note=%28orf19.6115%29%20Dubious%20open%20reading%20frame;orf_classification=Dubious
Ca22chr1B_C_albicans_SC5314 CGD mRNA    4059    4397    .   +   .   ID=C1_00010W_B-T;Parent=C1_00010W_B;Name=C1_00010W_B;Note=%28orf19.6115%29%20Dubious%20open%20reading%20frame;orf_classification=Dubious
Ca22chr1B_C_albicans_SC5314 CGD exon    4059    4397    .   +   .   ID=C1_00010W_B-T-E1;Parent=C1_00010W_B-T
Ca22chr1A_C_albicans_SC5314 CGD mRNA    4409    4720    .   -   .   ID=C1_00020C_A-T;Parent=C1_00020C_A;Name=C1_00020C_A;Note=%28orf19.6114%29%20Protein%20of%20unknown%20function%3B%20transcript%20detected%20on%20high-resolution%20tiling%20arrays;orf_classification=Uncharacterized;Alias=C1_00020C,C1_00020C_B,CAWG_03102,CaO19.13533,CaO19.6114,IPF21135.1,IPF27840.1,orf19.13533,orf19.6114,orf6.6227
Ca22chr1A_C_albicans_SC5314 CGD exon    4409    4720    .   -   .   ID=C1_00020C_A-T-E1;Parent=C1_00020C_A-T
Ca22chr1A_C_albicans_SC5314 CGD CDS 4409    4720    .   -   0   ID=C1_00020C_A-P;Parent=C1_00020C_A-T;orf_classification=Uncharacterized;parent_feature_type=ORF
Ca22chr1B_C_albicans_SC5314 CGD gene    4409    4720    .   -   .   ID=C1_00020C_B;Name=C1_00020C_B;Note=%28orf19.6114%29%20Protein%20of%20unknown%20function%3B%20transcript%20detected%20on%20high-resolution%20tiling%20arrays;orf_classification=Uncharacterized
Ca22chr1B_C_albicans_SC5314 CGD mRNA    4409    4720    .   -   .   ID=C1_00020C_B-T;Parent=C1_00020C_B;Name=C1_00020C_B;Note=%28orf19.6114%29%20Protein%20of%20unknown%20function%3B%20transcript%20detected%20on%20high-resolution%20tiling%20arrays;orf_classification=Uncharacterized
Ca22chr1B_C_albicans_SC5314 CGD exon    4409    4720    .   -   .   ID=C1_00020C_B-T-E1;Parent=C1_00020C_B-T
Ca22chr1B_C_albicans_SC5314 CGD CDS 4409    4720    .   -   0   ID=C1_00020C_B-P;Parent=C1_00020C_B-T;orf_classification=Uncharacterized;parent_feature_type=ORF
Ca22chr1A_C_albicans_SC5314 CGD gene    8597    8908

I'm not sure that would be valid according to the GFF3 spec: shouldn't IDs be unique?
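To make the uniqueness concern concrete: in the sample, the chr1B gene line already has ID=C1_00010W_B, so stripping "-T" from the mRNA ID on the next line would produce the same ID twice. Below is a minimal sketch for listing IDs that occur on more than one line. It assumes a tab-separated file with the attributes in column 9, and `duplicate_ids` is a hypothetical helper name, not part of any GFF3 tool. GFF3 does let a discontinuous feature (commonly a CDS) repeat one ID across several lines, so a hit is something to inspect rather than automatically an error.

```shell
# Hypothetical helper: print every ID value that appears on more than one line.
# Assumes tab-separated GFF3 with attributes in column 9.
duplicate_ids() {
  awk -F'\t' '
    !/^#/ && NF >= 9 {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)  # drop the "ID=" prefix
    }
  ' "$@" | sort | uniq -d
}

# Usage (file name is a placeholder):
#   duplicate_ids my_file.gff3
```

Running it before and after an edit shows whether the edit introduced new collisions.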

Reply written 10 months ago by lieven.sterck

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Thank you!

Reply written 10 months ago by GenoMax
10 months ago, mito wrote:
sed 's:\([ACGT]\)-[ACGT]:\1:g' my_file.gff3

This searches for all occurrences of nucleotide1-nucleotide2 and reduces each one to nucleotide1.

You can use the -i flag for sed to do the replacement in-place.

Edit: It appears that the characters you want to remove are not always nucleotide characters, but they do seem to be always upper case. The following replaces all occurrences of upper_case1-upper_case2 with upper_case1:

sed 's/\([[:upper:]]\)-[[:upper:]]/\1/g' my_file.gff3
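Note that on the sample above the upper-case pattern also rewrites things the question did not ask about: ID=C1_00010W_A-P on the CDS line becomes ID=C1_00010W_A, C1_00010W_A-T-E1 becomes C1_00010W_A-E1, and Parent=C1_00010W_A-T loses its suffix as well. A narrower sketch, assuming a sed that supports `-E` (GNU and BSD sed both do) and using the hypothetical helper name `strip_t_from_id`, touches only an ID attribute whose value ends in "-T":

```shell
# Hypothetical helper: remove a trailing "-T" from the ID attribute only.
# The ID= key must sit at the start of the line or follow whitespace or ";",
# and "-T" must be the very end of the value (followed by ";" or end of line).
strip_t_from_id() {
  sed -E 's/(^|[[:space:];])ID=([^;]+)-T(;|$)/\1ID=\2\3/' "$@"
}

# Usage (file names are placeholders):
#   strip_t_from_id my_file.gff3 > my_file.renamed.gff3
```

This does only what the question literally asks. The exon and CDS lines still carry Parent=...-T, which would then point at an ID that no longer exists, and the renamed mRNA ID becomes identical to its parent gene's ID, so the result may not pass a GFF3 validator.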

It's not referring to nucleotides but to gene/transcript names (e.g. the second one is _B-T), which your first regex would not match.

And: always be careful when using the -i flag!
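One way to act on that caution, sketched under the assumption that your sed accepts a suffix attached to `-i` (GNU and BSD sed both do): keep a backup copy and review the diff before trusting the result. `edit_with_backup` is a hypothetical helper name.

```shell
# Hypothetical helper: apply a sed expression in place, keep the original
# as FILE.bak, and show what changed.
edit_with_backup() {
  expr=$1
  file=$2
  sed -i.bak "$expr" "$file" || return 1
  # diff exits non-zero when the files differ, which is expected here.
  diff "$file.bak" "$file" || true
}

# Usage (file name is a placeholder):
#   edit_with_backup 's/\([[:upper:]]\)-[[:upper:]]/\1/g' my_file.gff3
```

If the diff shows unintended changes, moving my_file.gff3.bak back over my_file.gff3 restores the original.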

Reply written 10 months ago by lieven.sterck