How do I remove unwanted line from GFF3 file?
22 months ago

I would like to find out how to remove the "-T" suffix from the IDs in my GFF3 file.

I have pasted a sample below.

For example, I would like ID=C1_00010W_A-T to become ID=C1_00010W_A; in other words, the ID attribute should match the Parent attribute.

CaO19.6115,IPF21113.1,IPF27828.1,orf19.13534,orf19.6115
Ca22chr1A_C_albicans_SC5314 CGD mRNA    4059    4397    .   +   .   ID=C1_00010W_A-T;Parent=C1_00010W_A;Name=C1_00010W_A;Note=%28orf19.6115%29%20Dubious%20open%20reading%20frame;orf_classification=Dubious;Alias=C1_00010W,C1_00010W_B,CaO19.11880,CaO19.13534,CaO19.4402,CaO19.6115,IPF21113.1,IPF27828.1,orf19.13534,orf19.6115
Ca22chr1A_C_albicans_SC5314 CGD exon    4059    4397    .   +   .   ID=C1_00010W_A-T-E1;Parent=C1_00010W_A-T
Ca22chr1A_C_albicans_SC5314 CGD CDS 4059    4397    .   +   0   ID=C1_00010W_A-P;Parent=C1_00010W_A-T;orf_classification=Dubious;parent_feature_type=ORF
Ca22chr1B_C_albicans_SC5314 CGD gene    4059    4397    .   +   .   ID=C1_00010W_B;Name=C1_00010W_B;Note=%28orf19.6115%29%20Dubious%20open%20reading%20frame;orf_classification=Dubious
Ca22chr1B_C_albicans_SC5314 CGD mRNA    4059    4397    .   +   .   ID=C1_00010W_B-T;Parent=C1_00010W_B;Name=C1_00010W_B;Note=%28orf19.6115%29%20Dubious%20open%20reading%20frame;orf_classification=Dubious
Ca22chr1B_C_albicans_SC5314 CGD exon    4059    4397    .   +   .   ID=C1_00010W_B-T-E1;Parent=C1_00010W_B-T
Ca22chr1A_C_albicans_SC5314 CGD mRNA    4409    4720    .   -   .   ID=C1_00020C_A-T;Parent=C1_00020C_A;Name=C1_00020C_A;Note=%28orf19.6114%29%20Protein%20of%20unknown%20function%3B%20transcript%20detected%20on%20high-resolution%20tiling%20arrays;orf_classification=Uncharacterized;Alias=C1_00020C,C1_00020C_B,CAWG_03102,CaO19.13533,CaO19.6114,IPF21135.1,IPF27840.1,orf19.13533,orf19.6114,orf6.6227
Ca22chr1A_C_albicans_SC5314 CGD exon    4409    4720    .   -   .   ID=C1_00020C_A-T-E1;Parent=C1_00020C_A-T
Ca22chr1A_C_albicans_SC5314 CGD CDS 4409    4720    .   -   0   ID=C1_00020C_A-P;Parent=C1_00020C_A-T;orf_classification=Uncharacterized;parent_feature_type=ORF
Ca22chr1B_C_albicans_SC5314 CGD gene    4409    4720    .   -   .   ID=C1_00020C_B;Name=C1_00020C_B;Note=%28orf19.6114%29%20Protein%20of%20unknown%20function%3B%20transcript%20detected%20on%20high-resolution%20tiling%20arrays;orf_classification=Uncharacterized
Ca22chr1B_C_albicans_SC5314 CGD mRNA    4409    4720    .   -   .   ID=C1_00020C_B-T;Parent=C1_00020C_B;Name=C1_00020C_B;Note=%28orf19.6114%29%20Protein%20of%20unknown%20function%3B%20transcript%20detected%20on%20high-resolution%20tiling%20arrays;orf_classification=Uncharacterized
Ca22chr1B_C_albicans_SC5314 CGD exon    4409    4720    .   -   .   ID=C1_00020C_B-T-E1;Parent=C1_00020C_B-T
Ca22chr1B_C_albicans_SC5314 CGD CDS 4409    4720    .   -   0   ID=C1_00020C_B-P;Parent=C1_00020C_B-T;orf_classification=Uncharacterized;parent_feature_type=ORF
Ca22chr1A_C_albicans_SC5314 CGD gene    8597    8908

genome • 445 views

I'm not sure that would be valid according to the GFF3 spec: IDs should be unique, and after the change each mRNA would have the same ID as its parent gene.


Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

22 months ago
mito ▴ 10
sed 's:\([ACGT]\)-[ACGT]:\1:g' my_file.gff3


This searches for all occurrences of nucleotide1-nucleotide2 and reduces each to nucleotide1.

You can use the -i flag for sed to do the replacement in-place.

edit: It appears that the characters you want to remove are not always nucleotide characters, but they do seem to always be upper-case. The following replaces all occurrences of upper_case1-upper_case2 with upper_case1:

sed 's/\([[:upper:]]\)-[[:upper:]]/\1/g' my_file.gff3
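
If the aim is only to drop a trailing "-T" from the ID attribute itself, a narrower pattern may be safer than matching any upper-case pair. The following is a minimal sketch, not a tested solution for the full file: the function name strip_id_t_suffix and the two input lines are made up for illustration, and it assumes attributes are separated by ";" as in the sample. Note that it deliberately leaves Parent=...-T values in the exon and CDS lines alone, so those references would no longer resolve unless they are handled separately.

```shell
# Hypothetical helper: remove a trailing "-T" from the ID attribute only.
# The "-T" must be followed by ";" or end of line, so child IDs such as
# "...-T-E1" and every Parent= value are left untouched.
strip_id_t_suffix() {
  sed -E 's/(ID=[^;]*)-T(;|$)/\1\2/' "$@"
}

# Two made-up lines modelled on the sample in the question.
printf 'chr\tCGD\tmRNA\t1\t9\t.\t+\t.\tID=C1_00010W_A-T;Parent=C1_00010W_A\n' | strip_id_t_suffix
printf 'chr\tCGD\texon\t1\t9\t.\t+\t.\tID=C1_00010W_A-T-E1;Parent=C1_00010W_A-T\n' | strip_id_t_suffix
```

With a real file, the helper could be called as strip_id_t_suffix my_file.gff3 > fixed.gff3, which writes to a new file rather than touching the original.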


It's not referring to nucleotides but to gene/transcript names (e.g. the second one is _B-T), which already would not fit your first regex.

Also: always be careful when using the -i flag!
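
One way to take the risk out of -i is to give it a backup suffix, so sed saves the original before overwriting it. A small sketch on a throwaway file: the temporary directory, the file name sample.gff3, and the substitution pattern are all assumptions made for the example, not part of the original answer.

```shell
# Work on a throwaway copy in a temp directory (hypothetical file name).
tmpdir=$(mktemp -d)
printf 'ID=C1_00010W_A-T;Parent=C1_00010W_A\n' > "$tmpdir/sample.gff3"

# -i.bak edits in place but first saves the original as sample.gff3.bak.
sed -i.bak -E 's/(ID=[^;]*)-T(;|$)/\1\2/' "$tmpdir/sample.gff3"

# Review what changed before deleting the backup
# (diff exits non-zero when the files differ, hence "|| true").
diff "$tmpdir/sample.gff3.bak" "$tmpdir/sample.gff3" || true
```

Attaching the suffix directly to -i (no space) is the form accepted by both GNU and BSD sed.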