Question

Generate hashes for all sequences in a FASTA file

11 months ago

Prangan ▴ 20

Hello!

I am working on novel transcripts assembled from RNA-Seq data using StringTie. However, since StringTie "MSTRG" IDs are not stable across runs, I want to convert each transcript sequence in a FASTA file to a sequence-specific hash, which can then be used as part of the header to identify the transcript. Any and all help is appreciated.

Thanks!

stringtie RNA hash • 576 views

updated 11 months ago by GenoMax 141k • written 11 months ago by Prangan ▴ 20

score 2 · Accepted Answer · 2023-05-17

Linearize the FASTA file, then compute the md5sum of each sequence:

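# Linearize to "header<TAB>sequence", then hash each sequence; splitting on tabs only keeps headers that contain spaces intact.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta | while IFS=$'\t' read -r T S ; do printf '%s\t%s\n' "${T#>}" "$(printf '%s' "${S}" | md5sum | cut -d ' ' -f 1)" ; done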

gi|27592135 45fd3018a37826799cf5ceb93189e62e 
gi|13675786 51e96c6ff43eb067a0204ce2f82e9d92 
gi|13675777 5a4679ce61c2e2110a7f9bec1084e65b 
gi|84131965 71fcfc75699ccf252e0ce65e434a1c24 
gi|66260449 6b148daa7379b2822ca3ef9455b78bc6 
gi|33609016 71034abc928ac29402bc33d724b431bc 
(....)
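
Since the question asks for the hash to become part of the header, the same linearize-and-hash idea can be extended to write a new FASTA file. What follows is only a minimal sketch, not a tested pipeline: the function name hash_fasta_headers and the output layout (">hash original_header") are assumptions made for illustration, and it assumes md5sum from GNU coreutils is available.

```shell
# Sketch: prefix every FASTA header with the MD5 of its (linearized) sequence.
# hash_fasta_headers is a hypothetical helper name, not part of any existing tool.
hash_fasta_headers() {
  tab=$(printf '\t')
  # Linearize: one "header<TAB>sequence" line per record; drop Windows line endings.
  awk '{sub(/\r$/, "")}
       /^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next}
       {printf("%s", $0)}
       END {if (N>0) printf("\n")}' "$1" |
  while IFS="$tab" read -r header seq; do
    # Hash the sequence exactly as written (no trailing newline).
    hash=$(printf '%s' "$seq" | md5sum | cut -d ' ' -f 1)
    # Emit ">hash original_header" followed by the sequence on a single line.
    printf '>%s %s\n%s\n' "$hash" "${header#>}" "$seq"
  done
}
```

Usage would look like: hash_fasta_headers input.fasta > hashed.fasta. The hash depends on the exact characters of the sequence, so the same transcript written in a different case (for example soft-masked lowercase) gets a different hash unless the sequences are normalized first.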