ZvvyCode: A Universal Safe Text Encoding Method
Different programming languages use different methods of “escaping” special characters so they can be interpreted as data rather than carrying their usual syntactic meaning. And therein lies a problem for programmers. If you are dealing with multiple programming languages or systems, you sometimes need a way of encoding text so that ALL of the disparate systems it passes through will treat it as data and not mistake some character embedded within it for a signal to terminate the string, start a tag, or the like.
I recently ran up against this issue while writing a web application that used JavaScript, PHP, HTML, and MySQL, and that could also pass bits of text as parameters embedded in a URL. The standard functions for escaping text were not sufficient to ensure that ANY text entered by the user could make it through all parts of the program without tripping something up. So I decided to create a universal text encoding method that works for any text and any combination of programming languages and systems you might happen to be using.
Here were my requirements:
- Must be able to faithfully encode ANY text that’s valid in Unicode and then decode it to be identical to the original.
- After being encoded, the string must consist of nothing but the ASCII alphabetic letters (a-z, A-Z) and digits (0-9), plus a single “escape” character.
- An application must be able to designate ANY Unicode character code as the escape (except the ASCII letters and digits), so as to avoid collisions with whatever languages or systems it needs to be compatible with.
- It should not matter if the chosen escape character happens to occur in the text that is to be encoded.
- Any program must be able to decode the text without having prior knowledge of what character was used as the escape during encoding.
- The encoded text should be basically readable by a human.
I came up with a method that satisfies all of these requirements. I call it ZvvyCode. (That’s two v’s in “Zvvy”. I pronounce it “zivy”.)
The method is actually pretty simple. An encoded string consists only of the ASCII letters and digits (a-z, A-Z, 0-9) plus one special escape character that is designated at the time of encoding and may be anything other than the ASCII letters and digits. An encoded string can be decoded without advance knowledge of the escape character, because the escape character is read from the beginning of the encoded string.
The original ASCII letters and digits are kept in their original form. ALL other characters, including spaces, punctuation, extended ASCII codes, Unicode characters, etc., are each converted to their hexadecimal character code, bounded on each end by the designated escape character. The encoder function accepts the original text and a numeric character code to be used as the escape. The encoded string begins with an 8-character sequence consisting of two instances of the escape character, the letters “zvvy”, and two more instances of the escape character.
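The encoding rules above can be sketched in a few lines of Python. This is only an illustration, not the JavaScript or PHP implementation mentioned in this article: the function name zvvy_encode is made up, and the lowercase, unpadded hex format is an assumption based on the slash example in this article.

```python
import string

# Only these 62 characters pass through unchanged. str.isalnum() is not
# used because it also accepts non-ASCII letters and digits.
ASCII_ALNUM = frozenset(string.ascii_letters + string.digits)


def zvvy_encode(text: str, escape_code: int) -> str:
    """Encode text as ZvvyCode using the character with code escape_code.

    Hypothetical sketch: the name and the exact hex formatting are
    assumptions, not taken from a reference implementation.
    """
    escape = chr(escape_code)
    if escape in ASCII_ALNUM:
        raise ValueError("escape must not be an ASCII letter or digit")

    # 8-character header: two escapes, "zvvy", two escapes.
    parts = [escape * 2, "zvvy", escape * 2]
    for ch in text:
        if ch in ASCII_ALNUM:
            parts.append(ch)  # letters and digits are kept as-is
        else:
            # everything else becomes <escape><hex code><escape>
            parts.append(f"{escape}{ord(ch):x}{escape}")
    return "".join(parts)
```

For instance, zvvy_encode("a b", ord("_")) returns __zvvy__a_20_b. Iterating over a Python string yields whole code points, so characters outside the Basic Multilingual Plane are written as a single hex value rather than as a surrogate pair.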
The decoder only needs to be passed the encoded text. It looks at the first eight characters to determine (1) whether the string is, in fact, a ZvvyCode string, and (2) what character to interpret as the escape character for the decoding process. If the first eight characters do not match the ZvvyCode pattern, then the string is considered not to be ZvvyCode and is returned unchanged. Otherwise, the remainder of the string is decoded into its original form.
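Here is a rough Python sketch of that decoding process. Again, it is an illustration under assumptions rather than the actual implementation: the name zvvy_decode is hypothetical, and raising an error on a malformed body (an escape with no closing partner) is my own choice, since the description above only says what happens when the header is missing.

```python
import string

ASCII_ALNUM = frozenset(string.ascii_letters + string.digits)


def zvvy_decode(encoded: str) -> str:
    """Decode a ZvvyCode string; return the input unchanged if it is not one.

    Hypothetical sketch: the name and the error handling are assumptions.
    """
    if len(encoded) < 8:
        return encoded

    # The first character of the header tells us the escape character.
    escape = encoded[0]
    header = escape * 2 + "zvvy" + escape * 2
    if escape in ASCII_ALNUM or not encoded.startswith(header):
        return encoded  # not ZvvyCode, so pass it through untouched

    out = []
    i = len(header)
    while i < len(encoded):
        ch = encoded[i]
        if ch == escape:
            # Find the closing escape and read the hex code in between.
            end = encoded.find(escape, i + 1)
            if end == -1:
                raise ValueError("unterminated escape sequence")
            out.append(chr(int(encoded[i + 1:end], 16)))
            i = end + 1
        else:
            out.append(ch)  # plain ASCII letter or digit
            i += 1
    return "".join(out)
```

With this sketch, zvvy_decode("__zvvy__a_20_b") returns "a b" without being told that the underscore was the escape, and zvvy_decode("plain text") comes back unchanged because it lacks the header.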
Example: Suppose you choose a slash as the escape character to encode this text:
I said, "Joe's bag is 3/4 full!"
The ZvvyCode encoding would be:
//zvvy//I/20/said/2c//20//22/Joe/27/s/20/bag/20/is/20/3/2f/4/20/full/21//22/
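Note that it doesn’t matter that the original text has a slash in it, even though the slash is to be used as the escape character. The slash is simply converted to /2f/. Where two encoded characters are adjacent, as in /2c//20/ for the comma and the space, the closing escape of one is immediately followed by the opening escape of the next.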
I use the characters “zvvy” simply because it’s a short sequence of letters that is very unlikely to appear in natural text. It’s even less likely that this sequence would naturally appear bounded on each end by two copies of some other character. Hence it’s a pretty safe bet that anything that begins with the 8-character ZvvyCode starting pattern is, in fact, a ZvvyCode string.
I’ve implemented encoders and decoders in both JavaScript and PHP because those are the languages I’ve needed. If you implement ZvvyCode encoders and decoders in any other programming languages, I’d love to hear about it and would be happy to post your versions here.