https://html.spec.whatwg.org/multipage/syntax.html#tokenization
Control and undefined characters, isolated surrogates, and carriage returns are handled according to the standard:
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#preprocessing-the-input-stream
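
As a rough illustration of that preprocessing step, the sketch below follows the current WHATWG wording as I understand it: surrogates, noncharacters, and control characters (other than ASCII whitespace and U+0000 NULL) are reported as parse errors, and newlines are normalized so that no carriage return reaches the tokenizer. The names `classify` and `preprocess` are hypothetical and are not taken from this project's code; the actual implementation may differ.

```python
# Minimal sketch of input-stream preprocessing, assuming the current
# WHATWG definitions. `classify` and `preprocess` are illustrative names.

ASCII_WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def classify(cp):
    """Return the parse-error name for code point `cp`, or None."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate-in-input-stream"
    # Noncharacters: U+FDD0..U+FDEF and every code point ending in FFFE/FFFF.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter-in-input-stream"
    # Controls: C0 controls plus U+007F..U+009F. ASCII whitespace and
    # U+0000 NULL are exempt here (NULL is left to the tokenizer states).
    is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
    if is_control and cp != 0x00 and cp not in ASCII_WHITESPACE:
        return "control-character-in-input-stream"
    return None


def preprocess(text):
    """Return (normalized_text, errors).

    `errors` is a list of (index, error_name) pairs; the indices refer to
    positions in the original, un-normalized text.
    """
    errors = []
    for index, ch in enumerate(text):
        name = classify(ord(ch))
        if name is not None:
            errors.append((index, name))
    # Normalize newlines: CR LF pairs first, then any remaining lone CR.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized, errors
```

For example, `preprocess("a\r\nb\rc")` yields the text `"a\nb\nc"` with an empty error list. Offending characters are only reported, not removed, which matches the current standard's treatment of them as parse errors.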