Mark search hits

A word about internationalization

Supporting the characters of all (digitally encoded) languages requires every software component to be configured properly for the Unicode Transformation Format (UTF). Typically, UTF-8 is used, as it needs just 8 bits to encode single-byte (ASCII) characters and up to 32 bits to encode multi-byte characters (symbols, emoticons). “Supporting” in this context means input, transmission, storage, comparison and output, which is why comprehensive UTF support is quite a challenge, but well worth the effort.
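To make the byte counts concrete, here is a small sketch in Python (the language is chosen for illustration only, and the helper name utf8_length is made up). It shows how many bytes UTF-8 needs for characters from different Unicode ranges.

```python
# Sample characters: ASCII letter, Latin umlaut, currency symbol, emoticon.
samples = ["A", "ü", "€", "😀"]


def utf8_length(char: str) -> int:
    """Return how many bytes UTF-8 uses to encode the given character."""
    return len(char.encode("utf-8"))


for char in samples:
    print(f"{char!r}: {utf8_length(char)} byte(s)")
# 'A': 1 byte(s)
# 'ü': 2 byte(s)
# '€': 3 byte(s)
# '😀': 4 byte(s)
```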

Note that multi-byte characters are not special characters, a distinction that becomes important when it comes to sanitization.
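The distinction can be illustrated with a minimal sketch. It assumes that sanitization means HTML escaping and uses Python's standard html.escape as a stand-in for whatever sanitizer the application really uses; the function name sanitize is hypothetical.

```python
from html import escape


def sanitize(text: str) -> str:
    """Escape HTML special characters (&, <, >, and quotes); leave everything else as-is."""
    return escape(text, quote=True)


# Multi-byte characters pass through unchanged ...
print(sanitize("Zoë 😀"))       # Zoë 😀
# ... whereas special characters are replaced by entities.
print(sanitize('<b>"x"</b>'))   # &lt;b&gt;&quot;x&quot;&lt;/b&gt;
```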

It's all about trust

It may seem contradictory to XSS-sanitize all output and at the same time enable markers that, in order to work, need to be interpreted as HTML format tags. But as sanitizing and marking hits are different actions, there is no real contradiction. It is just a matter of trust: any input is treated as untrustworthy data, whereas the markers inserted by the machine are trusted code.

Any input data is untrustworthy

In the context of search hits, the following pieces of data were or are input:

  • text to scan (was input sometime in the past and persisted)
  • search expression (input now (and potentially cached))

As the text to scan was persisted, it first needs to be retrieved from the persistence layer. As described in the how-to about XSS protection, this text was not sanitized when it was persisted, in order to retain the original input regardless of international or special characters.

In contrast, the search expression has only just been entered and is typically not persisted, but merely cached in memory. There are two possible ways to match the search expression against the text:

  • as both text and search expression have not been input-sanitized, compare them in their original (untrustworthy) format
  • sanitize both text and search expression before comparison

Looking ahead to the hit markers, only the latter is valid, because any untrustworthy data needs to be sanitized before being output, and the markers are inserted into that sanitized output. So the search algorithm searches for the sanitized expression in the sanitized text and memorizes the start and end positions of the matches.
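The search step could be sketched as follows. This is not the application's actual code: it assumes HTML escaping via Python's html.escape as the sanitizer and a simple case-sensitive substring search, and the name find_hits is made up for illustration.

```python
from html import escape


def find_hits(text: str, expression: str) -> list[tuple[int, int]]:
    """Return (start, end) positions of the sanitized expression in the sanitized text."""
    sanitized_text = escape(text)
    sanitized_expr = escape(expression)
    hits: list[tuple[int, int]] = []
    if not sanitized_expr:
        return hits  # an empty expression matches nothing
    start = sanitized_text.find(sanitized_expr)
    while start != -1:
        end = start + len(sanitized_expr)
        hits.append((start, end))  # memorize the match boundaries
        start = sanitized_text.find(sanitized_expr, end)
    return hits


print(find_hits('My aunt is called "Joanne"', '"Joa'))  # [(18, 27)]
```

Note that the positions refer to the sanitized text, not to the original input. In this simplified sketch, a plain expression such as quot would also match inside an entity like &quot;, which a real implementation would have to guard against.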

Hit markers are trusted

If there are matches, the hit marker algorithm inserts HTML font format tags into the sanitized text at the memorized start and end positions. Suppose the text is My aunt is called "Joanne" and the search expression is "Joa. The marked text could then look as follows:

My aunt is called <font class="hit">&quot;Joa</font>nne&quot;

As can be seen, the trusted format tags are interpreted whereas the untrustworthy text input is not.
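A minimal sketch of the marking step, again assuming Python's html.escape as the sanitizer and a case-sensitive substring search. The function name mark_hits and the two tag constants are chosen for illustration only.

```python
from html import escape

# Trusted markup, generated by the machine and never taken from user input.
OPEN_TAG = '<font class="hit">'
CLOSE_TAG = "</font>"


def mark_hits(text: str, expression: str) -> str:
    """Sanitize text and expression, then wrap every match in trusted hit markers."""
    sanitized_text = escape(text)
    sanitized_expr = escape(expression)
    if not sanitized_expr:
        return sanitized_text
    parts = []
    position = 0
    start = sanitized_text.find(sanitized_expr)
    while start != -1:
        end = start + len(sanitized_expr)
        parts.append(sanitized_text[position:start])  # untrusted, already escaped
        parts.append(OPEN_TAG + sanitized_text[start:end] + CLOSE_TAG)
        position = end
        start = sanitized_text.find(sanitized_expr, end)
    parts.append(sanitized_text[position:])
    return "".join(parts)


print(mark_hits('My aunt is called "Joanne"', '"Joa'))
# My aunt is called <font class="hit">&quot;Joa</font>nne&quot;
```

Because the markers are added only after sanitization, a malicious text such as <script>alert(1)</script> stays escaped, and only the inserted font tags reach the browser as markup.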

Example

The following example uses the Housekeeping web application. It performs a complex search using combined search expressions containing multi-byte characters and operators (“||” is interpreted as OR).

A note containing potentially executable characters

Remark: hits are distinguished as global (orange), local (blue), or both global and local (cyan).
As the XSS code demonstrates, the sanitized HTML text is not interpreted, but the hit markers are.
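How a combined expression with “||” might be evaluated can be sketched as follows. This is only an assumption about the behaviour, not the Housekeeping implementation: the sketch splits the expression on “||”, trims surrounding whitespace, sanitizes each term with Python's html.escape, and searches for each term separately. The function names are hypothetical, and the distinction between global and local hits is left out.

```python
from html import escape


def split_or_terms(expression: str) -> list[str]:
    """Split a combined expression on '||' and sanitize each non-empty term."""
    return [escape(term.strip()) for term in expression.split("||") if term.strip()]


def find_or_hits(text: str, expression: str) -> dict[str, list[int]]:
    """Map each sanitized term to the start positions where it occurs in the sanitized text."""
    sanitized_text = escape(text)
    hits: dict[str, list[int]] = {}
    for term in split_or_terms(expression):
        positions = []
        start = sanitized_text.find(term)
        while start != -1:
            positions.append(start)
            start = sanitized_text.find(term, start + len(term))
        if positions:
            hits[term] = positions  # terms without a match are omitted
    return hits


print(find_or_hits("Größe <b>", "ö || <b> || xyz"))
# {'ö': [2], '&lt;b&gt;': [6]}
```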

Summary

The following chart summarizes the procedure:

Mark search hits
