CDATA in XML: a bad idea?

While working on a simple feed parser, I came across some WordPress feeds.

I noticed that WordPress feeds make heavy use of CDATA to encode content. I have always considered this a bad idea when you cannot control what ends up in the XML feed.

Some googling to check that I'm not just kicking up dust brought me to an xml.com article titled 'Escaped Markup Considered Harmful', which seems to agree with my standpoint for the following reason:

Escaping markup, particularly with CDATA sections, just doesn't work. There are other things that might be wrong that would make the documents not well formed. There are Unicode characters that are forbidden, there are encoding issues for the characters that are allowed, and there are sequences of characters that must be avoided. (e.g., "]]>"). Not to mention the fact that CDATA sections don't nest.

CDATA can't be used to just dump in any type of content that wouldn't work in normal XML text. You're still obligated to make your data valid Unicode. In fact, CDATA is the more limited option: it has no escape mechanism at all, so there is no way to escape the ]]> character sequence inside a CDATA section.
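
To make the point concrete, here is a minimal sketch in Python (chosen only because its standard library ships an XML parser; the same holds for any conforming parser). The helper names is_well_formed and wrap_in_cdata are made up for this illustration, and wrap_in_cdata deliberately mimics a careless feed generator rather than any real WordPress code.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def wrap_in_cdata(payload: str) -> str:
    """Naively wrap a payload in a single CDATA section (hypothetical helper)."""
    return "<item><![CDATA[" + payload + "]]></item>"


# Markup characters are fine inside CDATA.
print(is_well_formed(wrap_in_cdata("<b>bold</b> & more")))        # True
# The payload contains "]]>", which ends the section early.
print(is_well_formed(wrap_in_cdata("if (a[b[0]]>c && d<e) {}")))  # False
# A control character that XML 1.0 forbids everywhere.
print(is_well_formed(wrap_in_cdata("bell: \x07")))                # False
```

So CDATA only spares you from escaping < and &. Everything else that can break a document still breaks it.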

Comments

  • Michael Gauthier

    You encode ]]> inside a CDATA section as ]]]]><![CDATA[>. This ends the current CDATA section and begins a new one. If you use the DOM extension in PHP to generate your XML (as you should), it will escape such things for you automatically. Using the DOM extension can also fix character encoding problems.
  • Evert

    The XMLWriter class actually doesn't do this, which is why I hit this problem in the first place. People generally seem to use CDATA sections thinking they can just dump in any binary data, which is definitely not true. So what's the benefit?
  • Edward Z. Yang

    When I machine-generate XML, I never use CDATA. And because I use DOM/XMLWriter, it's easy to make sure that everything is properly escaped. I suspect CDATA is being used so that people can just <?php echo $data ?>. However, when I hand-write XML, CDATA sections are extremely useful for things like program listings: CDATA means you don't have to find all of your <s and escape them. So, in my opinion, the only reason to machine-generate a CDATA section would be to help out human readers. If XMLWriter doesn't escape ]]> properly, we've got a bug; report it! :-)
  • Evert

    Gotcha: http://bugs.php.net/bug.php?id=44619 And while I was at it: http://bugs.php.net/bug.php?id=44620
  • Evert

    Bug is considered bogus: http://bugs.php.net/bug.php?id=44619
  • Michael Gauthier

    Thanks for submitting the bug to PHP.net. It might be worth mentioning on the bug that the DOM extension does escape ]]> in CDATA sections.
  • Rafagd

    Isn't it easier to escape the string with $string = htmlspecialchars($string, ENT_QUOTES); before inserting it into the DOM, instead of having to worry about CDATA sections?
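
The two approaches discussed in this thread, splitting ]]> across two CDATA sections and plain entity escaping, can be compared with a short sketch. It is written in Python rather than PHP purely so it can be run with a stock interpreter. The function name cdata_section is hypothetical, and this is not how the PHP DOM extension or XMLWriter is implemented, only an illustration of the technique Michael describes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def cdata_section(payload: str) -> str:
    """Wrap payload in CDATA, splitting every "]]>" across two sections."""
    # "]]" stays in the current section, ">" opens the next one.
    safe = payload.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


payload = "if (a[b[0]]>c && d<e) {}"

via_cdata = "<item>" + cdata_section(payload) + "</item>"
via_escape = "<item>" + escape(payload) + "</item>"

# Both documents parse, and both give back the original string.
print(ET.fromstring(via_cdata).text == payload)   # True
print(ET.fromstring(via_escape).text == payload)  # True
```

To a parser the two documents carry the same text, which supports Edward's point: the CDATA form mainly helps humans reading the raw XML.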