Question: Text file: combining lines together based on value within certain column
16 days ago, jamespdowling wrote:

I posted this on StackOverflow and the guys over there recommended I post it here.

I have a 9 column .BED file containing the yeast genome, with coordinates for all ORFs, and some of the 5' and 3' UTRs.

I'm looking to combine rows within it based on the same gene name in column 9.

In the example below: rows 3, 4, 5 all have YAR014C in column 9 [3' UTR, gene, 5' UTR respectively]

I'd then like to replace the values in columns 4 and 5 (start and end coordinates) with the column 4 value from the original '3UTR' line and the column 5 value from the original '5UTR' line.

i.e. condensing the entire gene into one row that spans both UTRs and the ORF.

Not every gene in the file follows the 3UTR, gene, 5UTR pattern in column 9, so the merge would have to be based on the specific value in column 9 rather than on row number.

Here's a portion of the file:

I   martin  exon    160597  164187  .   -   .   gene_id "YAR009C_ORF";
I   martin  exon    164544  165866  .   -   .   gene_id "YAR010C_ORF";
I   martin  exon    166574  166741  .   -   .   gene_id "YAR014C_3UTR";
I   martin  exon    166742  168871  .   -   .   gene_id "YAR014C_ORF";
I   martin  exon    168872  169022  .   -   .   gene_id "YAR014C_5UTR";
I   martin  exon    170352  170395  .   -   .   gene_id "YAR018C_3UTR";
I   martin  exon    170396  171703  .   -   .   gene_id "YAR018C_ORF";
I   martin  exon    171704  171743  .   -   .   gene_id "YAR018C_5UTR";
I   martin  exon    172136  172210  .   -   .   gene_id "YAR019C_3UTR";
I   martin  exon    172211  175135  .   -   .   gene_id "YAR019C_ORF";
I   martin  exon    176856  177023  .   -   .   gene_id "YAR020C_ORF";
I   martin  exon    179241  179280  .   -   .   gene_id "YAR023C_3UTR";
I   martin  exon    179281  179820  .   -   .   gene_id "YAR023C_ORF";
I   martin  exon    179821  180087  .   -   .   gene_id "YAR023C_5UTR";
I   martin  exon    186512  186853  .   -   .   gene_id "YAR030C_ORF";

So the result I'd like for rows 3,4,5 would be:

I   martin  exon    166574  169022  .   -   .   gene_id "YAR014C";

Thank you for taking the time to look at this!

sequence genome • 102 views
modified 16 days ago by WouterDeCoster • written 16 days ago by jamespdowling

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

written 16 days ago by WouterDeCoster

duplicate: Merging position of all different CDS of a single gene in one line

written 16 days ago by Pierre Lindenbaum

This worked - thank you. Fixed the problem.

written 16 days ago by jamespdowling

nice, close this post please.

written 16 days ago by Pierre Lindenbaum

Check bedtools merge and its options -c and -o. From there on it is only a matter of string splitting to get the correct $9. Can be done with awk. Try it out, best way to learn ;-)

modified 16 days ago • written 16 days ago by ATpoint

What have you tried before posting at SO and here? Can you paste any code that you'd written to attempt to do this?

I'd approach this by storing the coordinates for each gene and outputting the min and max per gene.

written 16 days ago by Eric Lim

Eric Lim, does it apply to genes on both the + and - strands?

written 16 days ago by cpad0112

I don't think it'd make any difference within the context of what the OP asked.

written 16 days ago by Eric Lim
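
For anyone landing here later, the min/max-per-gene idea suggested above could be sketched in Python roughly as follows. This is only a sketch under a few assumptions that are not stated in the thread: the function name `merge_by_gene` and the `GENE_ID` pattern are made up for illustration, column 9 is assumed to always look like `gene_id "NAME_SUFFIX";` with a suffix of `_ORF`, `_3UTR` or `_5UTR` (as in the sample rows), and gene names are assumed to be unique across chromosomes. Because it takes the smallest start and the largest end per gene, it does not need to know the strand.

```python
import re

# Assumption: column 9 looks like  gene_id "YAR014C_3UTR";  and the optional
# feature suffix is one of _ORF, _3UTR or _5UTR, as in the sample rows.
GENE_ID = re.compile(r'gene_id "([^"]+?)(?:_(?:ORF|3UTR|5UTR))?";')


def merge_by_gene(lines):
    """Collapse rows that share a gene name into one row per gene,
    spanning the smallest start (column 4) to the largest end (column 5)."""
    merged = {}  # gene name -> list of 9 fields; dicts keep first-seen order
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on any whitespace but keep column 9 (which contains a space) whole.
        fields = line.rstrip("\n").split(None, 8)
        match = GENE_ID.search(fields[8])
        if match is None:
            raise ValueError("no gene_id in column 9: %r" % line)
        gene = match.group(1)
        start, end = int(fields[3]), int(fields[4])
        if gene not in merged:
            # First row for this gene: copy it, with the suffix stripped from column 9.
            merged[gene] = (fields[:3] + [start, end] + fields[5:8]
                            + ['gene_id "%s";' % gene])
        else:
            row = merged[gene]
            row[3] = min(row[3], start)
            row[4] = max(row[4], end)
    return ["\t".join(str(f) for f in row) for row in merged.values()]


# Rows 3-5 from the question, tab-separated.
rows = [
    'I\tmartin\texon\t166574\t166741\t.\t-\t.\tgene_id "YAR014C_3UTR";',
    'I\tmartin\texon\t166742\t168871\t.\t-\t.\tgene_id "YAR014C_ORF";',
    'I\tmartin\texon\t168872\t169022\t.\t-\t.\tgene_id "YAR014C_5UTR";',
]
for merged_row in merge_by_gene(rows):
    print(merged_row)
```

For the three YAR014C rows this gives a single row running from 166574 to 169022, which is the result asked for in the question. Genes that only have an ORF row pass through unchanged apart from the stripped suffix. To run it on a whole file, pass an open file handle as `lines`.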