Question: Awk, Sed, Grep to create a sequential numerical text replacement in my gtf file
12 weeks ago, sdbaney wrote:

Hi all, I'm new to awk, sed, and grep but I'm thinking they can help me with something.

I have a self-made gtf file that I'm using as a mask file for differential expression analyses (it lists the transcripts from my de novo assembly that contain rRNA contamination), but I need to change the final column: every entry must be unique and follow a particular format.

For example, right now I have this:

TRINITY_DN27462_c57_g1   CEH   transcript   1   3012   1000   +    *    gene_id exclude.1
TRINITY_DN27462_c57_g2   CEH   transcript   1   1224   1000   +    *    gene_id exclude.2
TRINITY_DN27462_c57_g3   CEH   transcript   1    539    1000   +    *    gene_id exclude.3
TRINITY_DN27098_c57_g4   CEH   transcript   1    350    1000   +    *    gene_id  exclude.4

but I need it to be this:

TRINITY_DN27462_c57_g1   CEH   transcript   1   3012    1000   +    *    gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2   CEH   transcript   1   1224    1000   +    *    gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3   CEH   transcript   1     539    1000   +    *    gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4   CEH   transcript   1     350    1000   +    *    gene_id "exclude.4"; transcript_id "exclude.4.1"

So the final column needs this string of text, with a number that increases by 1 in two places on each row. I have 493 lines in my gtf file, so there has to be an easy way to do this. Can anyone give me some tips/pointers?


Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
written 12 weeks ago by RamRS

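Something like this, which splits each line on whitespace and rebuilds the last column as a single tab-delimited field (so `gene_id` and its value stay separated by a space, not a tab):

$ awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8,$9 " \"" $10 "\"; transcript_id \"" $10 ".1\""}' test.txt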
TRINITY_DN27462_c57_g1  CEH transcript  1   3012    1000    +   *   gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2  CEH transcript  1   1224    1000    +   *   gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3  CEH transcript  1   539 1000    +   *   gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4  CEH transcript  1   350 1000    +   *   gene_id "exclude.4"; transcript_id "exclude.4.1"
written 12 weeks ago by cpad0112
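
If the final column did not already carry a running number, awk's built-in line counter `NR` could generate one. The following is only a sketch under that assumption, not something proposed in the thread: it uses two rows from the question as inline sample input, and the variable names `input`, `numbered`, and `id` are made up for illustration.

```shell
# Two sample rows in the layout from the question (whitespace-delimited).
input='TRINITY_DN27462_c57_g1 CEH transcript 1 3012 1000 + * gene_id exclude.1
TRINITY_DN27462_c57_g2 CEH transcript 1 1224 1000 + * gene_id exclude.2'

# NR is the current line number, so each row gets the next ID in sequence.
# The attribute column is built as one string so it stays a single field.
numbered=$(printf '%s\n' "$input" |
  awk -v OFS='\t' '{
    id = "exclude." NR
    print $1, $2, $3, $4, $5, $6, $7, $8, "gene_id \"" id "\"; transcript_id \"" id ".1\""
  }')

printf '%s\n' "$numbered"
```

On a real file, the same awk program would be given the file name instead of reading from the pipe. Because the IDs come from `NR`, they follow the row order and ignore whatever number was in the original last column.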
12 weeks ago, RamRS wrote:

sed -r 's/gene_id([ ]+)exclude[.]([0-9]+)/gene_id\1"exclude.\2"; transcript_id\1"exclude.\2.1"/'

If you only want a hint: Capture the number after exclude and use it with a backreference, appending .1 where required.
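
The hint can be tried on a single row before running it over the whole file. This is a minimal sketch, assuming a sed that accepts `-E` for extended regular expressions (GNU and BSD sed both do); the variable names `line` and `result` are illustrative only.

```shell
# One sample row from the question, with single spaces between columns.
line='TRINITY_DN27462_c57_g1 CEH transcript 1 3012 1000 + * gene_id exclude.1'

# ([0-9]+) captures the number after "exclude."; \1 reuses it twice.
# The dot is matched by [.] but not captured, so it is written out literally.
result=$(printf '%s\n' "$line" |
  sed -E 's/gene_id +exclude[.]([0-9]+)/gene_id "exclude.\1"; transcript_id "exclude.\1.1"/')

printf '%s\n' "$result"
```

Only the matched part of the line is rewritten, so the first eight columns and their original spacing pass through unchanged.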
