Question

Compare elements between rows and columns

3.3 years ago

vinayjrao ▴ 250

Hi,

I have a file containing a list of mutations of some genes, with the format -

rsID             Location                  GeneID      Feature                Exon       NucleotideSubstitution      AminoAcidSubstitution      cDNA      Type
rs541031071      11:47331423-47331423      MYBPC3      ENST00000256993.8      34/34      aaG/aaT                     K/N                        4200      3_prime_UTR
rs541031071      11:47331423-47331423      MYBPC3      ENST00000399249.6      35/35      aaG/aaT                     K/N                        4200      3_prime_UTR
rs541031071      11:47331423-47331423      MYBPC3      ENST00000387238.6      -          aaG/aaT                     K/N                        4200      upstream_gene_variant

Our goal is to identify unique variants, but as shown above, the rows describe the same variant and differ only in a few columns, in this case Feature and Exon. Is there a way to check whether rows 1 to n have the same value in column 1 and, if they do, use the Exon column (column 5) to pick the longest isoform (the one with 35 exons)?

My (hasty) solution to this would be to delete the variable columns (except Exon), grep all rows containing 35 exons, and then remove duplicates, if any. I would like to know if there is a cleaner and more sophisticated way of doing the same.

Thanks in advance,
Vinay

awk shell RNA-Seq • 709 views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 3.3 years ago by vinayjrao ▴ 250

score 2 · Accepted Answer · 2021-01-08

3.3 years ago

5heikki 11k

Perhaps:

sort -t $'\t' -k1,1 -k5,5gr inputFile | sort -t $'\t' -uk1,1 --merge

If it's a really large file, you should also set -S (buffer size) and --parallel (number of threads), and perhaps -T (temporary directory); see man sort.
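A minimal sketch of running the pipeline on the three rows from the question, assuming the file is tab-separated, that GNU sort is available, and that the header row should be kept on top rather than sorted with the data. The temporary file and the `result` variable are only there for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical test input: the header plus the three rows from the question.
input=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  rsID Location GeneID Feature Exon NucleotideSubstitution AminoAcidSubstitution cDNA Type \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000256993.8 34/34 aaG/aaT K/N 4200 3_prime_UTR \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000399249.6 35/35 aaG/aaT K/N 4200 3_prime_UTR \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000387238.6 - aaG/aaT K/N 4200 upstream_gene_variant \
  > "$input"

result=$(
  # Print the header as-is so it is not sorted in with the data rows.
  head -n 1 "$input"
  # Sort by rsID, then by exon count (leading number of column 5) descending,
  # then keep only the first row of each rsID.
  tail -n +2 "$input" \
    | sort -t $'\t' -k1,1 -k5,5gr \
    | sort -t $'\t' -u -k1,1 --merge
)
printf '%s\n' "$result"
rm -f "$input"
```

With -g, an exon value such as 35/35 is compared by its leading number (35), and a non-numeric value such as - sorts last under the reversed order, so the row with the most exons comes first for each rsID.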

ADD COMMENT • link 3.3 years ago by 5heikki 11k

This works perfectly! I was wondering whether the command could be modified so that I also keep one row per Type of variant, without bothering about the Exon column (I have added the last row and the last column to the question to illustrate this).

I tried sort -t $'\t' -k1,1 -k6,6gr inputFile | sort -t $'\t' -uk1,1 --merge and sort -t $'\t' -k1,1 -k6,6gr inputFile | sort -t $'\t' -uk6,6 --merge, both with and without the -u option, but I only retrieve 3_prime_UTR (while I'm trying to retrieve both 3_prime_UTR and upstream_gene_variant). This is not a key requirement, it's just out of curiosity, so please don't put too much time into it.

Thanks again :)

ADD REPLY • link 3.3 years ago by vinayjrao ▴ 250

Consider this mock file:

id1     60      up
id1     50      up
id1     40      down
id2     60      up
id2     90      down
id2     90      down
id3     70      up
id3     71      up
id3     90      down

I want the rows with the highest value from the 2nd column for unique 1st + 3rd column combinations:

awk 'BEGIN{OFS=FS="\t"}{print $1"_"$3,$2}' file
id1_up  60
id1_up  50
id1_down        40
id2_up  60
id2_down        90
id2_down        90
id3_up  70
id3_up  71
id3_down        90

Like this:

awk 'BEGIN{OFS=FS="\t"}{print $1"_"$3,$2}' file | sort -t $'\t' -k1,1 -k2,2gr | sort -t $'\t' -uk1,1 --merge
id1_down        40
id1_up  60
id2_down        90
id2_up  60
id3_down        90
id3_up  71

Almost back to the original format:

awk 'BEGIN{OFS=FS="\t"}{print $1"_"$3,$2}' file | sort -t $'\t' -k1,1 -k2,2gr | sort -t $'\t' -uk1,1 --merge | sed 's/_/\t/'
id1     down    40
id1     up      60
id2     down    90
id2     up      60
id3     down    90
id3     up      71

Back to the original format:

awk 'BEGIN{OFS=FS="\t"}{print $1"_"$3,$2}' file | sort -t $'\t' -k1,1 -k2,2gr | sort -t $'\t' -uk1,1 --merge | sed 's/_/\t/' | awk 'BEGIN{OFS=FS="\t"}{print $1,$3,$2}'
id1     40      down
id1     60      up
id2     90      down
id2     60      up
id3     90      down
id3     71      up

!! NOTE !!

Choose your "join separator" wisely, i.e. know your data. Here sed replaces the 1st occurrence of an underscore with a tab. This works because there is no underscore in the 1st column of my mock data.

!! NOTE !!
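If picking a safe join separator is a concern, a possible alternative is to skip the join entirely and give sort both key columns. This is only a sketch, assuming GNU sort and the same tab-separated mock file (recreated here in a temporary file; the `result` variable is just for illustration):

```shell
#!/usr/bin/env bash
export LC_ALL=C   # byte-wise collation, so the ordering is reproducible

# Recreate the mock file from above (tab-separated, three columns).
file=$(mktemp)
printf '%s\t%s\t%s\n' \
  id1 60 up    id1 50 up    id1 40 down \
  id2 60 up    id2 90 down  id2 90 down \
  id3 70 up    id3 71 up    id3 90 down > "$file"

# 1st sort: order by column 1, then column 3, then column 2 descending.
# 2nd sort: -u with keys 1 and 3 keeps the first row of each (1st, 3rd)
#           combination, i.e. the row with the highest 2nd column.
result=$(sort -t $'\t' -k1,1 -k3,3 -k2,2gr "$file" \
  | sort -t $'\t' -u -k1,1 -k3,3 --merge)
printf '%s\n' "$result"
rm -f "$file"
```

The rows stay in their original column order, so no sed or final awk step is needed to restore the format.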

ADD REPLY • link 3.3 years ago by 5heikki 11k