8.2 years ago
jon.brate
I linearized a FASTA file and counted the length of each sequence. Each line now consists of three tab-separated columns, but column one also has a space in it.
>TCONS_00000098 gene=XLOC_000037 TGTGAACTGTTTGGAATGCCTAGATCATGATGAAGATTTTGGCGGCAAATCACGAACTACCAGATG 66
>TCONS_00000097 gene=XLOC_000037 TGTGAACTGTTTGGAATGCCTAGATCATGATGAAGATTTTGGCGGCAAATCACGAACTACCAGGTTGTGT 70
>TCONS_00000099 gene=XLOC_000037 TGAAGATTTTGGCGGCAA 18
>TCONS_00000100 gene=XLOC_000037 CAGATCGTCAAAAGTTTTTGAAGTTCCTCAAAAGAT 36
>TCONS_00000052 gene=XLOC_000022 AGCATTCG 8
>TCONS_00000025 gene=XLOC_000008 ACCGGTTTGCGTACTGATTTGCGTACTGGTTCGTGTA 37
>TCONS_00000132 gene=XLOC_000046 GTTTTAGTTGTTAGGTCTAACA 22
>TCONS_00000133 gene=XLOC_000046 CTGAGCAGTAACGCGACGCAGATCACTAAAGATCTG 36
I want to extract the longest isoform (TCONS...) of each gene, and I tried to sort the lines first on column 1, and then according to the lengths with the longest on top. I thought this command would work:
cat lengths.txt | sort -t ' ' -k1,1 -k3,3nr > sorted.txt
and it seems to somehow sort TCONS_00000097 right, but that is probably because of its name, not the length.
Output:
>TCONS_00000025 gene=XLOC_000008 ACCGGTTTGCGTACTGATTTGCGTACTGGTTCGTGTA 37
>TCONS_00000052 gene=XLOC_000022 AGCATTCG 8
>TCONS_00000097 gene=XLOC_000037 TGTGAACTGTTTGGAATGCCTAGATCATGATGAAGATTTTGGCGGCAAATCACGAACTACCAGGTTGTGT 70
>TCONS_00000098 gene=XLOC_000037 TGTGAACTGTTTGGAATGCCTAGATCATGATGAAGATTTTGGCGGCAAATCACGAACTACCAGATG 66
>TCONS_00000099 gene=XLOC_000037 TGAAGATTTTGGCGGCAA 18
>TCONS_00000100 gene=XLOC_000037 CAGATCGTCAAAAGTTTTTGAAGTTCCTCAAAAGAT 36
>TCONS_00000132 gene=XLOC_000046 GTTTTAGTTGTTAGGTCTAACA 22
>TCONS_00000133 gene=XLOC_000046 CTGAGCAGTAACGCGACGCAGATCACTAAAGATCTG 36
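For reference, one possible way to get at the longest isoform per gene is sketched below. If the separator given to -t in the command above is a literal space, then on truly tab-separated lines sort sees only two space-delimited fields, so -k3,3nr has nothing to compare and the result is ordered by the TCONS name alone, which would match the output shown. The sketch sidesteps sort keys and lets awk split on tabs, pull the gene=... part out of column one, and keep the line with the largest length in column three. The function name longest_per_gene is made up for illustration, and the code assumes the layout described above (header, sequence, length, separated by tabs, with a single space inside the header).

```shell
# Sketch: keep the longest isoform per gene.
# Assumes three tab-separated columns: header ("ID gene=..."), sequence, length.
# longest_per_gene is a hypothetical helper name, not an existing tool.
longest_per_gene() {
    awk -F'\t' '
        {
            split($1, parts, " ")   # parts[1] = ">TCONS_...", parts[2] = "gene=XLOC_..."
            gene = parts[2]
            len = $3 + 0            # force a numeric (not string) comparison
            # Keep the first line seen for a gene, or any strictly longer one.
            if (!(gene in best) || len > maxlen[gene]) {
                maxlen[gene] = len
                best[gene] = $0
            }
        }
        # awk does not guarantee any order here, hence the sort afterwards.
        END { for (gene in best) print best[gene] }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1
}
```

It could be called as longest_per_gene lengths.txt > longest.txt. When two isoforms of a gene have the same length, the one that appears first in the input is kept.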