Generate hashes for all sequences in a FASTA file
11 months ago
Prangan ▴ 20

Hello!

I am working on novel transcripts assembled from RNA-Seq data with StringTie. Because StringTie's "MSTRG" IDs are not stable across runs, I want to convert each transcript sequence in a FASTA file to a sequence-specific hash, which can then be used as part of the header to identify the transcript. Any help is appreciated.

Thanks!

stringtie RNA hash • 576 views
11 months ago

Linearize the FASTA file, then compute the md5sum of each sequence:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta | while IFS=$'\t' read -r T S ; do echo -e "${T}" | cut -c2- | tr "\n" "\t" && echo -n "${S}" | md5sum  ; done | sed 's/ -$//'
gi|27592135 45fd3018a37826799cf5ceb93189e62e 
gi|13675786 51e96c6ff43eb067a0204ce2f82e9d92 
gi|13675777 5a4679ce61c2e2110a7f9bec1084e65b 
gi|84131965 71fcfc75699ccf252e0ce65e434a1c24 
gi|66260449 6b148daa7379b2822ca3ef9455b78bc6 
gi|33609016 71034abc928ac29402bc33d724b431bc 
(....)
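
To put the hash into the header itself, as the question asks, the same two steps (linearize, then hash) can be wrapped in a small shell function. This is only a sketch under some assumptions: the name `hash_fasta_headers` and the file names in the usage line are made up for illustration, `md5sum` is assumed to be the GNU coreutils version, and the input is assumed to start with a header line and use Unix line endings.

```shell
# Hypothetical helper (name chosen for illustration): rewrite every FASTA
# header as ">MD5 original_header", where MD5 is the hash of the sequence.
# Assumes awk and GNU coreutils md5sum are on the PATH.
hash_fasta_headers() {
  tab=$(printf '\t')
  # Step 1: linearize, giving one "header<TAB>sequence" record per line.
  awk '/^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next}
       {printf("%s", $0)}
       END {if (N>0) printf("\n")}' "$1" |
  # Step 2: split on the tab only, so spaces in the header are preserved,
  # hash the sequence, and print the record with the hash in front.
  while IFS="$tab" read -r header seq; do
    hash=$(printf '%s' "$seq" | md5sum | cut -d' ' -f1)
    printf '>%s %s\n%s\n' "$hash" "${header#>}" "$seq"
  done
}
```

It would be called as `hash_fasta_headers transcripts.fa > transcripts.hashed.fa`. The sequences come out unwrapped (one line per record). Identical sequences always get the same hash, but the hash is case-sensitive, so soft-masked (lowercase) and uppercase copies of the same transcript will not match unless the case is normalized first.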