{"id":1024,"date":"2019-04-21T23:44:49","date_gmt":"2019-04-21T23:44:49","guid":{"rendered":"http:\/\/reichartonline.de\/?page_id=1024"},"modified":"2020-05-10T08:55:17","modified_gmt":"2020-05-10T08:55:17","slug":"mark-search-hits","status":"publish","type":"page","link":"https:\/\/reichartonline.de\/?page_id=1024","title":{"rendered":"Mark search hits"},"content":{"rendered":"\n<h4 class=\"wp-block-heading\">A word about internationalization<\/h4>\n\n\n\n<p>Supporting the characters of all (digitally encoded) languages requires proper configuration of all software components with regard to the Unicode Transformation Format (<a href=\"https:\/\/en.wikipedia.org\/wiki\/UTF\">UTF<\/a>). Typically, UTF-8 is used, as it needs just 8 bits to encode ASCII characters and up to 32 bits to encode multi-byte characters (symbols, emoticons). &#8220;Supporting&#8221; in this context means inputting, transmitting, storing, comparing and outputting, which is why comprehensive support of UTF is quite a challenge, but well worth the effort. <br><br>Note that multi-byte characters are not special characters, a distinction that matters when it comes to sanitization.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">It&#8217;s all about trust<\/h4>\n\n\n\n<p>It seems contradictory to <a href=\"https:\/\/reichartonline.de\/?page_id=831\">XSS sanitize<\/a> all output while at the same time enabling markers that must be interpreted as HTML format tags in order to work. 
But as sanitizing and marking hits are different actions, there is no real contradiction. It is just a matter of trust: <strong>any input<\/strong> is treated as <strong>untrustworthy<\/strong> data, whereas the <strong>markers<\/strong> inserted by the machine are <strong>trusted<\/strong> code.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Any input data is untrustworthy<\/h4>\n\n\n\n<p>In the context of search hits, the following pieces of data were or are input:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>text to scan (was input sometime in the past and persisted)<\/li><li>search expression (input now and potentially cached)<\/li><\/ul>\n\n\n\n<p>As the text to scan was persisted, it first needs to be retrieved from the persistence layer. As described in the how-to about XSS protection, this text wasn&#8217;t sanitized on persisting, in order to retain the original input regardless of international or special characters.<br><br>In contrast, the search expression has just been entered and is typically not persisted, only cached in memory. There are two possible ways to match the search expression against the text:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>as neither text nor search expression has been input-sanitized, compare them in their original (untrustworthy) format<\/li><li>sanitize both text and search expression before comparison<\/li><\/ul>\n\n\n\n<p>Looking ahead to the hit markers, only the latter is valid: any untrustworthy data needs to be sanitized before being output, and match positions found in the unsanitized text would no longer line up once sanitization has replaced special characters with longer entities. So the search algorithm searches for the sanitized expression in the sanitized text and records the start and end positions of matches.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Hit markers are trusted<\/h4>\n\n\n\n<p>If there are matches, the hit marker algorithm inserts HTML font format tags into the sanitized text at the recorded start and end positions. 
Suppose the text is <em>My aunt is called &#8220;Joanne&#8221;<\/em> and the search expression is <em>&#8220;Joa<\/em>; the marked text could then look as follows (backslashes inserted to prevent interpretation of special characters).<\/p>\n\n\n\n<p class=\"has-text-color has-vivid-cyan-blue-color\">My aunt is called &lt;font class=&quot;hit&quot;&gt;&amp;\\quot;Joa&lt;\/font&gt;nne&amp;\\quot;<\/p>\n\n\n\n<p>As can be seen, the trusted format tags are interpreted whereas the untrustworthy text input is not.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Example<\/h4>\n\n\n\n<p>The following example uses the <a href=\"https:\/\/reichartonline.de\/?page_id=44\">Housekeeping web application<\/a>. It performs a complex search using combined search expressions containing multi-byte characters and operators (&#8220;||&#8221; is interpreted as OR).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/20190407_Housekeeping_xss_combined_search.png\"><img loading=\"lazy\" decoding=\"async\" width=\"245\" height=\"336\" src=\"http:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/20190407_Housekeeping_xss_combined_search.png\" alt=\"A note containing potentially executable characters\" class=\"wp-image-1017\" srcset=\"https:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/20190407_Housekeeping_xss_combined_search.png 245w, https:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/20190407_Housekeeping_xss_combined_search-219x300.png 219w\" sizes=\"auto, (max-width: 245px) 100vw, 245px\" \/><\/a><figcaption>A note containing potentially executable characters<\/figcaption><\/figure>\n\n\n\n<p>Remark: hits are color-coded as global (orange), local (blue), or both global and local (cyan).<br>As the <a href=\"https:\/\/reichartonline.de\/?page_id=831\">XSS<\/a> code demonstrates, the sanitized HTML text is not interpreted, but the hit markers 
are.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Summary<\/h4>\n\n\n\n<p>The following chart summarizes the procedure.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/MarkHits.png\"><img loading=\"lazy\" decoding=\"async\" width=\"760\" height=\"730\" src=\"http:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/MarkHits.png\" alt=\"Mark search hits\" class=\"wp-image-1023\" srcset=\"https:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/MarkHits.png 760w, https:\/\/reichartonline.de\/content\/wp-content\/uploads\/2019\/04\/MarkHits-300x288.png 300w\" sizes=\"auto, (max-width: 760px) 100vw, 760px\" \/><\/a><figcaption>Mark search hits<\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>A word about internationalization Supporting the characters of all (digitally encoded) languages requires proper configuration of all software components with regard to the Unicode Transformation Format (UTF). Typically, UTF-8 is used, as it needs just 8 bits to encode ASCII characters and up to 32 bits to encode multi-byte characters (symbols, emoticons). 
&#8220;Supporting&#8221; in this context means [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":62,"menu_order":6,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"class_list":["post-1024","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/pages\/1024","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/reichartonline.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1024"}],"version-history":[{"count":37,"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/pages\/1024\/revisions"}],"predecessor-version":[{"id":1138,"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/pages\/1024\/revisions\/1138"}],"up":[{"embeddable":true,"href":"https:\/\/reichartonline.de\/index.php?rest_route=\/wp\/v2\/pages\/62"}],"wp:attachment":[{"href":"https:\/\/reichartonline.de\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}