Hi all, I'm new to awk, sed, and grep but I'm thinking they can help me with something.
I have a self-made GTF file that I'm using as a mask file for differential expression analyses (it lists transcripts from my de novo assembly that contain rRNA contamination), but I need to change the final column. The IDs in it must all be unique and in a particular format.
For example, right now I have this:
TRINITY_DN27462_c57_g1 CEH transcript 1 3012 1000 + * gene_id exclude.1
TRINITY_DN27462_c57_g2 CEH transcript 1 1224 1000 + * gene_id exclude.2
TRINITY_DN27462_c57_g3 CEH transcript 1 539 1000 + * gene_id exclude.3
TRINITY_DN27098_c57_g4 CEH transcript 1 350 1000 + * gene_id exclude.4
but I need it to be this:
TRINITY_DN27462_c57_g1 CEH transcript 1 3012 1000 + * gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2 CEH transcript 1 1224 1000 + * gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3 CEH transcript 1 539 1000 + * gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4 CEH transcript 1 350 1000 + * gene_id "exclude.4"; transcript_id "exclude.4.1"
So the final column needs to hold this string of text, with a number that increases by 1 in two places for each row. I have 493 lines in my GTF file, so there has to be an easy way to do this. Can anyone give me some tips/pointers?
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Something like this:
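One possible approach with awk, as a minimal sketch rather than the only way to do it. It assumes the last whitespace-separated field on every line already holds the unique ID (`exclude.1`, `exclude.2`, ...) and that lines have no trailing whitespace or Windows line endings. The function name `fix_last_column` is just a hypothetical helper for illustration. It rewrites only the final field in place, so whatever separators the file uses (tabs or spaces) are left untouched:

```shell
# Hypothetical helper: wrap the last field in quotes and append a
# transcript_id built from the same ID with ".1" added.
fix_last_column() {
    awk '{
        # /[^ \t]+$/ matches the final run of non-blank characters.
        # In the replacement, "&" stands for the matched text (the old ID).
        sub(/[^ \t]+$/, "\"&\"; transcript_id \"&.1\"")
        print
    }' "$@"
}

# Quick check on two lines from the question
printf '%s\n' \
    'TRINITY_DN27462_c57_g1 CEH transcript 1 3012 1000 + * gene_id exclude.1' \
    'TRINITY_DN27462_c57_g2 CEH transcript 1 1224 1000 + * gene_id exclude.2' |
    fix_last_column
```

On a real file you would call it as `fix_last_column mask.gtf > mask.fixed.gtf` (both file names are placeholders). Because the existing IDs are reused, no counter is needed. If the IDs were not already numbered, awk's built-in `NR` (the current line number) could supply the increasing number instead.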