The DOMDocument class sometimes behaves strangely. It can drop some entities, such as &ldquo;, along with some valid UTF-8 characters (it may do the same for other encodings). This could probably be fixed by supplying your own DTD, but there is a simpler way. Each named HTML entity stands for a specific character code, so DOMDocument will export your entities correctly if you replace them with the corresponding characters before handing the string to it. I keep a small list of them. The following function helps you avoid losing those symbols:
// Replaces named HTML entities with the UTF-8 characters they stand for,
// so that DOMDocument does not drop them.
function parseEntities($string) {
    $entities = array(
        "auml" => "ä",
        "ouml" => "ö",
        "uuml" => "ü",
        "szlig" => "ß",
        "Auml" => "Ä",
        "Ouml" => "Ö",
        "Uuml" => "Ü",
        "nbsp" => "\u{00A0}", // no-break space; "\u{...}" escapes need PHP 7+
        "Agrave" => "À",
        "Egrave" => "È",
        "Eacute" => "É",
        "Ecirc" => "Ê",
        "egrave" => "è",
        "eacute" => "é",
        "ecirc" => "ê",
        "agrave" => "à",
        "iuml" => "ï",
        "ugrave" => "ù",
        "ucirc" => "û",
        "ccedil" => "ç",
        "AElig" => "Æ",
        "aelig" => "æ",
        "OElig" => "Œ",
        "oelig" => "œ",
        "angst" => "Å",
        "cent" => "¢",
        "copy" => "©",
        "Dagger" => "‡",
        "dagger" => "†",
        "deg" => "°",
        "emsp" => "\u{2003}", // em space
        "ensp" => "\u{2002}", // en space
        "ETH" => "Ð",
        "eth" => "ð",
        "euro" => "€",
        "half" => "½",
        "laquo" => "«",
        "ldquo" => "“",
        "lsquo" => "‘",
        "mdash" => "—",
        "micro" => "µ",
        "middot" => "·",
        "ndash" => "–",
        "not" => "¬",
        "numsp" => "\u{2007}", // figure space
        "para" => "¶",
        "permil" => "‰",
        "puncsp" => "\u{2008}", // punctuation space
        "raquo" => "»",
        "rdquo" => "”",
        "rsquo" => "’",
        "reg" => "®",
        "sect" => "§",
        "THORN" => "Þ",
        "thorn" => "þ",
        "trade" => "™"
    );
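    foreach ($entities as $ent => $repl) {
        // The semicolon is required: with an optional one, "&not" would also
        // match inside "&notin;" and "&para" inside a URL such as "?a=1&param=2".
        $string = str_replace('&' . $ent . ';', $repl, $string);
    }
    return $string;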
}
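To show where the function fits, here is a minimal sketch of running the replacement before parsing. It assumes the markup is loaded with loadXML() and is UTF-8; parseEntitiesDemo() is a hypothetical, shortened stand-in for parseEntities() so that the snippet runs on its own, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical, shortened stand-in for the article's parseEntities().
function parseEntitiesDemo(string $string): string
{
    $entities = array(
        'ldquo' => '“',
        'rdquo' => '”',
        'auml'  => 'ä',
    );
    foreach ($entities as $ent => $repl) {
        $string = str_replace('&' . $ent . ';', $repl, $string);
    }
    return $string;
}

// Sample markup (an assumption for this sketch) with three named entities.
$markup = '<p>&ldquo;K&auml;se&rdquo;</p>';

// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Without the replacement, the named entities are undeclared for the XML parser.
$raw = new DOMDocument();
$rawLoaded = $raw->loadXML($markup);
echo $rawLoaded ? 'raw markup: loaded' : 'raw markup: failed to load', "\n";

// With the replacement, the parser only sees ordinary UTF-8 characters.
$fixed = new DOMDocument();
$fixedLoaded = $fixed->loadXML(parseEntitiesDemo($markup));
$text = $fixed->documentElement->textContent;
echo $text, "\n";

libxml_clear_errors();
```

Reading textContent rather than the serialized output keeps the check independent of how saveXML() chooses to escape non-ASCII characters.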
The list in parseEntities() does not contain every entity, but it is easy to introduce new ones without touching any other code.
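If maintaining the list by hand becomes tedious, one possible alternative (not the approach used above) is to derive the map from PHP's built-in HTML 4.01 entity table. This is only a sketch: buildEntityMap() and parseEntitiesFromTable() are hypothetical names, and the table does not cover entities outside HTML 4.01, such as half or numsp from the list above.

```php
<?php
// Hypothetical helper: builds an "&name;" => character map from PHP's
// built-in HTML 4.01 table instead of a hand-written list.
function buildEntityMap(): array
{
    // The table maps characters to entities, e.g. "ä" => "&auml;".
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

    // Flip it so that entities become the keys: "&auml;" => "ä".
    $map = array_flip($table);

    // Leave markup-significant entities alone so the document stays well-formed.
    unset($map['&amp;'], $map['&lt;'], $map['&gt;'], $map['&quot;'], $map['&#039;']);

    return $map;
}

// Hypothetical variant of parseEntities() that uses the generated map.
function parseEntitiesFromTable(string $string): string
{
    // strtr() replaces every key in one pass and never rescans its own output.
    return strtr($string, buildEntityMap());
}

echo parseEntitiesFromTable('<p>Tom &amp; J&eacute;r&ocirc;me &copy; &euro;5</p>'), "\n";
```

Entities missing from the built-in table can still be added to the returned array by hand, in the same way as new entries are added to the list in parseEntities().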
Yuriy Zisin Development DOMDocument, entities, HTML, php, XML