CDATA in XML: a bad idea?
While working on a simple feed parser, I came across some WordPress feeds.
I noticed that WordPress feeds make heavy use of CDATA to encode content. I always figured this was a bad idea if you cannot control what ends up in the XML feed. (Example here.)
Some googling to check that I wasn't just kicking up dust brought me to an xml.com article titled 'Escaped Markup Considered Harmful', which seems to agree with my standpoint for the following reason:
Escaping markup, particularly with CDATA sections, just doesn't work. There are other things that might be wrong that would make the documents not well formed. There are Unicode characters that are forbidden, there are encoding issues for the characters that are allowed, and there are sequences of characters that must be avoided. (e.g., "]]>"). Not to mention the fact that CDATA sections don't nest.
CDATA can't be used to just dump in any type of content that won't work in normal XML sections. You're still obligated to make your data valid Unicode. If anything, it's the opposite: CDATA is more restrictive than normal text, because there's no way to escape the ]]> character sequence inside a section.
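Both failure modes are easy to check against a real parser. The following is a minimal sketch in Python (my choice for illustration, not the PHP used in this post), relying only on the standard `xml.etree.ElementTree` module; the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Markup characters are fine inside a CDATA section...
print(is_well_formed("<item><![CDATA[<b>bold</b> & more]]></item>"))
# ...but a control character such as U+0001 is still forbidden in XML 1.0,
print(is_well_formed("<item><![CDATA[bad \x01 byte]]></item>"))
# and a literal "]]>" ends the section early, leaving a stray "]]>" behind.
print(is_well_formed("<item><![CDATA[a ]]> b]]></item>"))
```

Only the first document should be accepted, which is the point: CDATA suppresses markup interpretation, not the character-level rules of XML.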
Comments
Michael Gauthier •
You encode ]]> inside a CDATA section as ]]]]><![CDATA[>. This ends the current CDATA section and begins a new one. If you use the DOM extension in PHP to generate your XML (as you should), it will escape such things for you automatically. Using the DOM extension can also fix character encoding problems.
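The splitting trick can be written out by hand in a few lines. This is a sketch in Python rather than PHP, and `to_cdata` is a hypothetical helper, not a function of any XML library; it only handles the ]]> sequence and does nothing about forbidden characters or encoding.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    # "]]" closes out one section and ">" opens the next one, so the
    # terminator never appears intact inside a single section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


payload = "if (a[b[0]]> 1) { ... }"
document = "<code>" + to_cdata(payload) + "</code>"

# The parser joins the adjacent sections back into the original text.
assert ET.fromstring(document).text == payload
```

A consumer never sees the split: adjacent CDATA sections are reported as ordinary character data, so the element's text comes back as one string.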
Evert •
The XMLWriter class actually doesn't do this, which is why I hit this problem in the first place. Generally, people seem to use CDATA sections thinking they can just dump in any binary data, which is definitely not true.
So what's the benefit?
Edward Z. Yang •
When I machine-generate XML, I never use CDATA. And because I use DOM/XMLWriter, it's easy to make sure that everything is properly escaped. I suspect CDATA is being used so that people can
However, when I hand-write XML, CDATA sections are extremely useful for doing things like program listings, etc. CDATA means you don't have to find all of your
So, in my opinion, the only reason why you would want to machine-generate a CDATA section would be to help out human readers. If XMLWriter doesn't escape ]]> properly, we've got a bug; report it! :-)
Evert •
Gotcha: http://bugs.php.net/bug.php?id=44619
And while I was at it:
http://bugs.php.net/bug.php?id=44620
Evert •
Bug is considered bogus: http://bugs.php.net/bug.php?id=44619
Michael Gauthier •
Thanks for submitting the bug to PHP.net. It might be worth mentioning on the bug that the DOM extension does escape ]]> in CDATA sections.
Rafagd •
Isn't it easier to escape using
$string = htmlspecialchars($string, ENT_QUOTES);
before inserting the string into the DOM, instead of having to worry about CDATA sections?
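One caveat with escaping by hand before insertion: DOM-style APIs usually escape text content themselves when they serialize, so pre-escaped input tends to end up escaped twice. The sketch below illustrates this in Python, with `xml.etree.ElementTree` and `html.escape` standing in for PHP's DOM extension and `htmlspecialchars`. That mapping is an assumption made for illustration, and `round_trip` is a hypothetical helper.

```python
import html
import xml.etree.ElementTree as ET


def round_trip(text: str) -> str:
    """Store text in an element, serialize it, and parse it back."""
    element = ET.Element("description")
    element.text = text  # the serializer escapes &, < and > itself
    serialized = ET.tostring(element, encoding="unicode")
    return ET.fromstring(serialized).text


raw = "Tom & Jerry <3 ]]>"
# Handing the raw string to the API survives the round trip unchanged.
assert round_trip(raw) == raw
# Escaping first means the reader gets the entities back as literal text.
assert round_trip(html.escape(raw)) == html.escape(raw)
assert round_trip(html.escape(raw)) != raw
```

With an API that escapes on output, passing the raw string is enough, and no CDATA section is needed at all.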