improve html entity recognition.

1. recognize the new unicode (hexadecimal) character references of the form &#[xX][0-9a-fA-F]+. cf. http://www.unicode.org.
2. be very careful about determining the end of an entity reference. entity names are more restricted than html/xml CNAMEs and contain only [a-zA-Z0-9]; any character outside that set ends the reference. this allows us to recognize "&
" as "&
" as the standard indicates.
3. no longer try substrings of the recognized entity name. this prevents us from mangling common cgi arguments like http://site.com?pie=x (interpreted as http://site.com?πe=x). washingtonpost.com has examples of this.