Download the PHP package wp-php-toolkit/encoding without Composer
On this page you can find all versions of the php package wp-php-toolkit/encoding. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package: encoding
Short description: Encoding component for WordPress.
License: GPL-2.0-or-later
Homepage: https://wordpress.github.io/php-toolkit/reference/encoding.html
Information about the package encoding
slug: encoding
title: Encoding
install: wp-php-toolkit/encoding
see_also:
- html | HTML | Normalize incoming text before HTML tokenization.
- xml | XML | Keep invalid bytes out of XML streams.
- dataliberation | DataLiberation | Clean content before importing it into WordPress.
UTF-8 validation and scrubbing with a pure-PHP fallback when mbstring is unavailable. Detects malformed bytes and replaces them per the Unicode maximal-subpart algorithm.
Why this exists
Every parser in this toolkit eventually has to decide what to do with text bytes. XML rejects malformed UTF-8. JSON encoders and databases may accept the bytes at first and fail later, far from where the bad input entered. CSS, HTML, WXR, and Blueprint validation all need consistent answers about whether a string is well-formed Unicode.
The Encoding component provides the small UTF-8 primitives the rest of the toolkit can share: validate bytes, scrub invalid sequences, scan code points, and detect Unicode noncharacters. When mbstring is available it can delegate to it; when it is not, the component uses its own byte scanner so behavior stays available in restricted PHP environments.
Historically, this became the common foundation for Blueprint validation and CSS/XML processing, replacing ad hoc Unicode helpers with the WordPress core UTF-8 routines used here.
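The delegate-or-fall-back arrangement described above can be sketched roughly as follows. This is an illustration, not the component's actual scanner: `sketch_is_valid_utf8()` and `sketch_scan_is_valid_utf8()` are hypothetical names, and the byte ranges follow the Unicode Standard's table of well-formed UTF-8 sequences.

```php
<?php
// Hypothetical pure-PHP scanner; NOT the component's actual code.
function sketch_scan_is_valid_utf8( string $bytes ): bool {
	$length = strlen( $bytes );
	$i      = 0;
	while ( $i < $length ) {
		$lead = ord( $bytes[ $i ] );
		if ( $lead < 0x80 ) {
			$i++; // ASCII byte.
			continue;
		}

		// Pick the continuation count and the allowed range of the second byte.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			[ $need, $low, $high ] = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $lead ) {
			[ $need, $low, $high ] = array( 2, 0xA0, 0xBF ); // Blocks overlong three-byte forms.
		} elseif ( 0xED === $lead ) {
			[ $need, $low, $high ] = array( 2, 0x80, 0x9F ); // Blocks surrogate halves.
		} elseif ( $lead >= 0xE1 && $lead <= 0xEF ) {
			[ $need, $low, $high ] = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $lead ) {
			[ $need, $low, $high ] = array( 3, 0x90, 0xBF ); // Blocks overlong four-byte forms.
		} elseif ( $lead >= 0xF1 && $lead <= 0xF3 ) {
			[ $need, $low, $high ] = array( 3, 0x80, 0xBF );
		} elseif ( 0xF4 === $lead ) {
			[ $need, $low, $high ] = array( 3, 0x80, 0x8F ); // Blocks code points above U+10FFFF.
		} else {
			return false; // Lone continuation byte, 0xC0, 0xC1, or 0xF5 and above.
		}

		if ( $i + $need >= $length ) {
			return false; // Sequence is cut off by the end of the string.
		}
		for ( $k = 1; $k <= $need; $k++ ) {
			$byte = ord( $bytes[ $i + $k ] );
			if ( $byte < $low || $byte > $high ) {
				return false;
			}
			// Only the second byte has a narrowed range.
			$low  = 0x80;
			$high = 0xBF;
		}
		$i += $need + 1;
	}
	return true;
}

// Hypothetical wrapper: delegate to mbstring when present, else scan bytes.
function sketch_is_valid_utf8( string $bytes ): bool {
	if ( function_exists( 'mb_check_encoding' ) ) {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}
	return sketch_scan_is_valid_utf8( $bytes );
}

var_dump( sketch_is_valid_utf8( "caf\xC3\xA9" ) ); // "café" in UTF-8.
var_dump( sketch_is_valid_utf8( "caf\xE9" ) );     // "é" as a lone ISO-8859-1 byte.
```

Callers only ever see the wrapper, so the result does not depend on which extensions the host has enabled.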
Validating UTF-8 before storing it
wp_is_valid_utf8() rejects overlong sequences, surrogate halves, and stray ISO-8859-1 bytes. Use it as a guard in front of any code path that assumes UTF-8 (database, JSON, XML).
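A minimal sketch of such a guard follows. `wp_is_valid_utf8()` is the function named above; the stand-in definition exists only so the snippet runs without the package loaded, and it relies on PCRE's `u` modifier rather than the component's own logic. `sketch_require_utf8()` is a hypothetical helper.

```php
<?php
// Stand-in so the sketch runs without the package loaded. This is NOT the
// component's implementation; it leans on PCRE's UTF-8 mode instead.
if ( ! function_exists( 'wp_is_valid_utf8' ) ) {
	function wp_is_valid_utf8( string $bytes ): bool {
		return 1 === preg_match( '//u', $bytes );
	}
}

// Hypothetical guard to call before a database write, json_encode(), or XML output.
function sketch_require_utf8( string $value ): string {
	if ( ! wp_is_valid_utf8( $value ) ) {
		throw new InvalidArgumentException( 'Value is not well-formed UTF-8.' );
	}
	return $value;
}

$samples = array(
	'plain ascii'    => 'hello',
	'valid two-byte' => "caf\xC3\xA9",  // "café" in UTF-8.
	'overlong slash' => "\xC0\xAF",     // Overlong encoding of "/".
	'surrogate half' => "\xED\xA0\x80", // U+D800 encoded directly.
	'latin-1 byte'   => "caf\xE9",      // "é" as ISO-8859-1.
);

foreach ( $samples as $label => $bytes ) {
	printf( "%-15s %s\n", $label, wp_is_valid_utf8( $bytes ) ? 'valid' : 'invalid' );
}

try {
	sketch_require_utf8( "caf\xE9" );
} catch ( InvalidArgumentException $e ) {
	echo $e->getMessage(), "\n";
}
```

Throwing at the boundary keeps the failure close to the input that caused it, instead of surfacing later inside the database or serializer.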
Scrubbing invalid bytes with U+FFFD
Scrubbing replaces each ill-formed sequence with the Unicode replacement character, U+FFFD. It is useful right before serializing to XML or JSON, or before sending text to an LLM that will choke on broken bytes.
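The sketch below shows what maximal-subpart replacement looks like in practice. The text does not name the scrubbing function, so `wp_scrub_utf8()` is an assumed name modeled on `wp_is_valid_utf8()`. The regex stand-in is illustrative only and is not the component's code.

```php
<?php
// Stand-in so the sketch runs without the package. The name wp_scrub_utf8()
// is an assumption; the regex below is illustrative, not the component's code.
if ( ! function_exists( 'wp_scrub_utf8' ) ) {
	function wp_scrub_utf8( string $bytes ): string {
		$pattern = '/
			(                                  # Group 1: well-formed text.
			  [\x00-\x7F]+
			| [\xC2-\xDF][\x80-\xBF]
			| \xE0[\xA0-\xBF][\x80-\xBF]
			| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
			| \xED[\x80-\x9F][\x80-\xBF]
			| \xF0[\x90-\xBF][\x80-\xBF]{2}
			| [\xF1-\xF3][\x80-\xBF]{3}
			| \xF4[\x80-\x8F][\x80-\xBF]{2}
			)
			# Otherwise: the longest truncated prefix of a valid sequence
			| \xE0[\xA0-\xBF]
			| [\xE1-\xEC\xEE\xEF][\x80-\xBF]
			| \xED[\x80-\x9F]
			| \xF0[\x90-\xBF][\x80-\xBF]?
			| [\xF1-\xF3][\x80-\xBF]{1,2}
			| \xF4[\x80-\x8F][\x80-\xBF]?
			# or, failing that, any single stray byte.
			| .
		/xs';

		return (string) preg_replace_callback(
			$pattern,
			static function ( array $match ): string {
				// Group 1 is only present when the match was well-formed text.
				return ( isset( $match[1] ) && '' !== $match[1] ) ? $match[1] : "\u{FFFD}";
			},
			$bytes
		);
	}
}

// Byte sequence used in the Unicode Standard's discussion of U+FFFD substitution:
// 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64
$dirty = hex2bin( '61F18080E180C262806380BF64' );
$clean = wp_scrub_utf8( $dirty );

echo bin2hex( $clean ), "\n";
```

A truncated multi-byte sequence such as `F1 80 80` collapses into a single U+FFFD, while unrelated stray bytes each get their own, so the output never hides how many distinct errors were present.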
Detecting noncharacters MySQL/utf8mb4 will reject
Code points such as U+FFFE, U+FFFF, and the U+FDD0–U+FDEF block are valid Unicode, but XML forbids or discourages them (XML 1.0 excludes U+FFFE and U+FFFF outright) and some databases reject them. Check for them before inserting user-submitted content into a strict utf8mb4 column.
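A noncharacter check might look like the sketch below. The function name `wp_has_noncharacters()` is an assumption, since the text does not name it, and the stand-in uses a PCRE character class rather than the component's scanner. It covers U+FDD0 through U+FDEF plus the last two code points of each of the 17 planes.

```php
<?php
// Stand-in so the sketch runs without the package. The name
// wp_has_noncharacters() is an assumption, and so is this implementation.
if ( ! function_exists( 'wp_has_noncharacters' ) ) {
	function wp_has_noncharacters( string $text ): bool {
		static $pattern = null;
		if ( null === $pattern ) {
			$class = '\x{FDD0}-\x{FDEF}';
			// U+FFFE and U+FFFF, then the same two slots in planes 1 through 16.
			for ( $plane = 0; $plane <= 0x10; $plane++ ) {
				$class .= sprintf( '\x{%X}\x{%X}', ( $plane << 16 ) | 0xFFFE, ( $plane << 16 ) | 0xFFFF );
			}
			$pattern = '/[' . $class . ']/u';
		}
		// preg_match() returns false for ill-formed UTF-8, so validate or scrub first.
		return 1 === preg_match( $pattern, $text );
	}
}

// Hypothetical user-submitted comment ending in U+FFFF.
$comment = "Nice post!\u{FFFF}";

if ( wp_has_noncharacters( $comment ) ) {
	echo "Comment contains a Unicode noncharacter; rejecting or stripping it.\n";
}
```

Whether to reject the input or strip the offending code points is a policy decision left to the caller; the check only reports their presence.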
Three-way pipeline: validate, scrub, then check noncharacters
Real-world inputs are messy: an old WXR export, a CSV with mixed encodings, a paste from Word. Combining validation, scrubbing, and a noncharacter check covers the three classes of breakage that bite later.
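The three steps can be chained roughly as follows. Only `wp_is_valid_utf8()` is named in the text. `wp_scrub_utf8()` and `wp_has_noncharacters()` are assumed names, `sketch_clean_text()` is a hypothetical helper, and all three stand-ins are simplified so the snippet runs on its own.

```php
<?php
// Stand-ins so the sketch runs without the package loaded. Only
// wp_is_valid_utf8() is named in the text; the other two names are assumed.
if ( ! function_exists( 'wp_is_valid_utf8' ) ) {
	function wp_is_valid_utf8( string $bytes ): bool {
		return 1 === preg_match( '//u', $bytes );
	}
}
if ( ! function_exists( 'wp_scrub_utf8' ) ) {
	function wp_scrub_utf8( string $bytes ): string {
		// Coarse stand-in: one U+FFFD per stray byte, not per maximal subpart.
		$valid = '[\x00-\x7F]+|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
			. '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
			. '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
			. '|\xF4[\x80-\x8F][\x80-\xBF]{2}';
		return (string) preg_replace_callback(
			'/(' . $valid . ')|./s',
			static function ( array $match ): string {
				return isset( $match[1] ) ? $match[1] : "\u{FFFD}";
			},
			$bytes
		);
	}
}
if ( ! function_exists( 'wp_has_noncharacters' ) ) {
	function wp_has_noncharacters( string $text ): bool {
		$class = '\x{FDD0}-\x{FDEF}';
		for ( $plane = 0; $plane <= 0x10; $plane++ ) {
			$class .= sprintf( '\x{%X}\x{%X}', ( $plane << 16 ) | 0xFFFE, ( $plane << 16 ) | 0xFFFF );
		}
		return 1 === preg_match( '/[' . $class . ']/u', $text );
	}
}

// Hypothetical helper: returns the cleaned text plus a record of what was found.
function sketch_clean_text( string $raw ): array {
	// 1. Validate: is the input well-formed UTF-8 as it stands?
	$was_valid = wp_is_valid_utf8( $raw );

	// 2. Scrub: only touch the bytes when validation failed.
	$text = $was_valid ? $raw : wp_scrub_utf8( $raw );

	// 3. Noncharacters: the text is well-formed now, but may still be unwelcome.
	return array(
		'text'              => $text,
		'was_valid'         => $was_valid,
		'has_noncharacters' => wp_has_noncharacters( $text ),
	);
}

$inputs = array( 'plain', "caf\xE9", "ok\u{FFFE}" );
foreach ( $inputs as $raw ) {
	$result = sketch_clean_text( $raw );
	printf(
		"%s valid=%s noncharacters=%s\n",
		bin2hex( $result['text'] ),
		$result['was_valid'] ? 'yes' : 'no',
		$result['has_noncharacters'] ? 'yes' : 'no'
	);
}
```

Returning the flags alongside the text lets the caller log or reject inputs that needed repair instead of silently accepting them.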
Salvaging a legacy ISO-8859-1 column inside a UTF-8 corpus
Old WordPress databases sometimes mix encodings: most rows are UTF-8 but a few were stored as latin-1. Detect the bad rows with wp_is_valid_utf8() and only re-encode those.
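A sketch of that selective repair, under stated assumptions: `wp_is_valid_utf8()` is the function named above (with a PCRE-based stand-in so the snippet runs alone), while `sketch_latin1_to_utf8()` and the `$rows` array are hypothetical. The converter is written in pure PHP so it does not depend on mbstring or iconv.

```php
<?php
// Stand-in so the sketch runs without the package loaded; not the
// component's implementation.
if ( ! function_exists( 'wp_is_valid_utf8' ) ) {
	function wp_is_valid_utf8( string $bytes ): bool {
		return 1 === preg_match( '//u', $bytes );
	}
}

// Hypothetical converter. Every ISO-8859-1 byte maps to the code point with
// the same number, so bytes 0x80-0xFF become two-byte UTF-8 sequences.
function sketch_latin1_to_utf8( string $bytes ): string {
	return (string) preg_replace_callback(
		'/[\x80-\xFF]/',
		static function ( array $match ): string {
			$byte = ord( $match[0] );
			return chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
		},
		$bytes
	);
}

// Hypothetical rows keyed by ID, standing in for a database result set.
$rows = array(
	1 => "caf\xC3\xA9", // Already UTF-8.
	2 => "caf\xE9",     // Stored as ISO-8859-1.
	3 => 'plain ascii',
);

$repaired = array();
foreach ( $rows as $id => $content ) {
	if ( wp_is_valid_utf8( $content ) ) {
		continue; // Leave well-formed rows untouched.
	}
	$rows[ $id ] = sketch_latin1_to_utf8( $content );
	$repaired[]  = $id;
}

printf( "Re-encoded rows: %s\n", implode( ', ', $repaired ) );
```

Two caveats apply. A latin-1 string can happen to be valid UTF-8 and would slip past the check, and columns labeled latin-1 often hold Windows-1252 text, which this byte-for-byte mapping does not handle for 0x80 through 0x9F.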