Yuriy Zisin Development as a lifestyle

19 Feb 2009

PHP DOMDocument class and HTML entities

The DOMDocument class sometimes behaves strangely. It can drop some entities, such as &ldquo;, and some valid UTF-8 characters (it may also do so for other encodings). This could probably be fixed by supplying your own DTD, but there is a simpler way. Each HTML entity has a numeric code (its Unicode code point, written as a numeric character reference such as &#8220;), so DOMDocument will export your entities correctly if you replace each named entity with the appropriate code. I have a small list of them. Using the following function, you can avoid losing symbols:

function parseEntities($string) {
    // Named entity => numeric character reference (decimal Unicode code point).
    $entities = array(
        "auml" => "&#228;",
        "ouml" => "&#246;",
        "uuml" => "&#252;",
        "szlig" => "&#223;",
        "Auml" => "&#196;",
        "Ouml" => "&#214;",
        "Uuml" => "&#220;",
        "nbsp" => "&#160;",
        "Agrave" => "&#192;",
        "Egrave" => "&#200;",
        "Eacute" => "&#201;",
        "Ecirc" => "&#202;",
        "egrave" => "&#232;",
        "eacute" => "&#233;",
        "ecirc" => "&#234;",
        "agrave" => "&#224;",
        "iuml" => "&#239;",
        "ugrave" => "&#249;",
        "ucirc" => "&#251;",
        "ccedil" => "&#231;",
        "AElig" => "&#198;",
        "aelig" => "&#230;",
        "OElig" => "&#338;",
        "oelig" => "&#339;",
        "angst" => "&#197;",
        "cent" => "&#162;",
        "copy" => "&#169;",
        "Dagger" => "&#8225;",
        "dagger" => "&#8224;",
        "deg" => "&#176;",
        "emsp" => "&#8195;",
        "ensp" => "&#8194;",
        "ETH" => "&#208;",
        "eth" => "&#240;",
        "euro" => "&#8364;",
        "half" => "&#189;",
        "laquo" => "&#171;",
        "ldquo" => "&#8220;",
        "lsquo" => "&#8216;",
        "mdash" => "&#8212;",
        "micro" => "&#181;",
        "middot" => "&#183;",
        "ndash" => "&#8211;",
        "not" => "&#172;",
        "numsp" => "&#8199;",
        "para" => "&#182;",
        "permil" => "&#8240;",
        "puncsp" => "&#8200;",
        "raquo" => "&#187;",
        "rdquo" => "&#8221;",
        "rsquo" => "&#8217;",
        "reg" => "&#174;",
        "sect" => "&#167;",
        "THORN" => "&#222;",
        "thorn" => "&#254;",
        "trade" => "&#8482;"
    );

    foreach ($entities as $ent => $repl) {
        // The trailing semicolon is required: with an optional one, "&not" would
        // also match inside "&notin;" or a query string such as "?a=1&notify=1".
        $string = str_replace('&' . $ent . ';', $repl, $string);
    }

    return $string;
}
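
A minimal usage sketch, assuming the markup is parsed with loadXML(), where named HTML entities are not defined. To keep the snippet runnable on its own, parseEntitiesDemo() is a hypothetical, abbreviated stand-in for the function above with only three entries; the sample markup is also made up for illustration.

```php
<?php
// Abbreviated, hypothetical stand-in for parseEntities() with three entries,
// included only so that this snippet runs on its own.
function parseEntitiesDemo(string $string): string {
    return str_replace(
        ['&ldquo;', '&rdquo;', '&uuml;'],
        ['&#8220;', '&#8221;', '&#252;'],
        $string
    );
}

$markup = '<p>&ldquo;M&uuml;nchen&rdquo;</p>';

$doc = new DOMDocument();
// Convert the named entities to numeric references before parsing.
$doc->loadXML(parseEntitiesDemo($markup));

// DOM strings are UTF-8, so the text node now holds the real characters.
$text = $doc->documentElement->textContent;
echo $text, PHP_EOL;
```

The conversion has to happen before the string reaches DOMDocument; once the parser has dropped or rejected an entity, there is nothing left to repair.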

This list does not contain every entity, but it is easy to add new ones without touching the rest of the code.
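
If maintaining the list by hand becomes tedious, one possible alternative is to let PHP's built-in entity table do the lookup. The sketch below is not the approach from this post: namedToNumeric() is a hypothetical helper, and it assumes PHP 7.2+ with the mbstring extension (for mb_ord()). It leaves the five entities predefined by XML untouched so the markup itself is not altered.

```php
<?php
// Sketch only: namedToNumeric() is a hypothetical helper, not part of the post.
// Assumes PHP 7.2+ with the mbstring extension (for mb_ord()).
function namedToNumeric(string $string): string {
    // Entities predefined by XML stay as they are, so the markup is not broken.
    $keep = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($keep): string {
            if (in_array($m[1], $keep, true)) {
                return $m[0];
            }
            // Resolve the name through PHP's built-in HTML5 entity table.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // unknown name: leave it unchanged
            }
            // Some HTML5 entities expand to more than one code point.
            $out = '';
            foreach (preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY) as $char) {
                $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
            }
            return $out;
        },
        $string
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $string;
}

echo namedToNumeric('<p>&ldquo;caf&eacute;&rdquo; &amp; &bogus;</p>'), PHP_EOL;
// prints: <p>&#8220;caf&#233;&#8221; &amp; &bogus;</p>
```

The hand-written list keeps the function free of extension requirements and makes every replacement explicit; the table-driven variant trades that for broader coverage.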