May 10, 2013

Escaping in iCalendar and vCard

The #1 bug report in my vObject library (a library to parse and create iCalendar and vCard objects in PHP) is that it does a bad job of escaping and un-escaping values.

In particular, it double-escapes certain values, changing things like ; into \\;, and in other cases it’s a bit too liberal in un-escaping.

I got so frustrated with this bug that I’ve spent all week working on a new version of the parser.

Determined to do things right this time, I wanted to make sure I complied with all the relevant standards, in particular:

vCard 2.1
vCard 3.0 (rfc2425, rfc2426)
vCard 4.0 (rfc6350)
iCalendar 2.0 (rfc5545)

When I first wrote vObject, I naively thought that these formats were more or less the same. On the surface it does indeed seem that way; everything seems to follow this basic structure:

BEGIN:VCARD
VERSION:4.0
FN:Evert Pot
END:VCARD

The nuances and slight differences between the specifications are enough to drive a simple person to madness, though.

Just on the topic of escaping values (the part after the :), the specifications have the following to say:

vCard 2.1

vCard 2.1, as well as the other specs, has a concept of ‘compound’ or multi-value properties. An example:

BEGIN:VCARD
VERSION:2.1
N:Pot;Evert;Middle;Dr.;M.D.
END:VCARD

As you can see, the N property has multiple values. Any of these values may also contain a ;, which must be escaped as \;. So we also cannot blindly encode a string and automatically add backslashes to any ; we see.

The semi-colons should only be escaped in the ADR, ORG and N fields, but we can assume that backslashed semi-colons may also appear in other values.
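
A minimal Python sketch of splitting such a compound value, assuming only the rules described above. The function name is made up for illustration, and this is not how vObject itself does it:

```python
from typing import List


def split_compound_21(value: str) -> List[str]:
    """Split a vCard 2.1 compound value on unescaped semi-colons."""
    parts = []
    current = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] == ";":
            # "\;" is a literal semi-colon inside one component.
            current.append(";")
            i += 2
        elif ch == ";":
            # A bare semi-colon ends the current component.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            # Anything else, including a lone backslash, is kept as-is.
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


print(split_compound_21("Pot;Evert;Middle;Dr.;M.D."))
```

Because a backslash is only treated specially when it is directly followed by a ;, values such as Windows paths survive untouched.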

Any property may have parameters. A parameter looks a bit like this:

BEGIN:VCARD
VERSION:2.1
NOTE;ENCODING=QUOTED-PRINTABLE:Handsome guy, for sure..
END:VCARD

A parameter in vCard starts with a ;, has a name and a value. Only the colon may be escaped in parameters, using \:.

If you somehow wanted to encode a real backslash though, there’s no mention of escaping it as a double-backslash.

If you need newlines in any values, quoted-printable encoding must be used. Other specs all encode newlines as \n or \N.
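
As a rough illustration, a quoted-printable value can be decoded with Python's standard quopri module. The helper name and the UTF-8 default are my assumptions here; a real vCard 2.1 property may carry its own CHARSET parameter:

```python
import quopri


def decode_quoted_printable(raw: str, charset: str = "utf-8") -> str:
    """Decode a vCard 2.1 QUOTED-PRINTABLE value into text."""
    # quopri works on bytes; "=0D=0A" turns into a CRLF line break.
    decoded = quopri.decodestring(raw.encode("ascii"))
    # Assumption: the bytes are in `charset` (UTF-8 unless told otherwise).
    return decoded.decode(charset)


note = decode_quoted_printable("Handsome guy,=0D=0Afor sure..")
print(repr(note))
```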

vCard 3.0

rfc2425 says that backslashes (\\), newlines (\N or \n) and commas (\,) must always be escaped, no exceptions. Well, except when the comma is used as a delimiter for multiple values.

rfc2426 adds the semi-colon (\;) to this list, except when it’s used as a delimiter. The semi-colon is used as a delimiter in the N, ADR, GEO and ORG fields. NICKNAME and CATEGORIES use commas.

vCard 3.0 also says that individual parts of ADR and N may contain multiple values themselves, which are in turn split by a comma.
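
To make the nesting concrete, here is a small Python sketch of parsing an N or ADR value under the vCard 3.0 rules as I've summarized them above. All names are hypothetical, and leaving unknown escape sequences alone is my own choice, not something the RFCs prescribe:

```python
import re
from typing import List

# Escape sequences named in rfc2425/rfc2426 and what they stand for.
ESCAPES = {"\\\\": "\\", "\\n": "\n", "\\N": "\n", "\\,": ",", "\\;": ";"}


def tokenize(value: str) -> List[str]:
    # Each token is either a backslash pair ("\;") or a single character.
    return re.findall(r"\\.|.", value, re.DOTALL)


def split_unescaped(value: str, delimiter: str) -> List[str]:
    parts = [""]
    for token in tokenize(value):
        if token == delimiter:
            parts.append("")    # a bare delimiter starts a new part
        else:
            parts[-1] += token  # escape pairs stay intact for now
    return parts


def unescape_text(value: str) -> str:
    # Unknown pairs such as "\x" are kept as-is rather than guessed at.
    return "".join(ESCAPES.get(token, token) for token in tokenize(value))


def parse_structured(value: str) -> List[List[str]]:
    # Split on ";" first, then on "," inside each component, then unescape.
    return [
        [unescape_text(item) for item in split_unescaped(component, ",")]
        for component in split_unescaped(value, ";")
    ]


print(parse_structured("Pot;Evert,Everton;;Dr.;M.D.,Ph\\,D"))
```

The important detail is the order: split first, un-escape last. Un-escaping before splitting would turn \, and \; back into delimiters.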

Quoted-printable is now deprecated, and should no longer be used.

Parameters have also changed. The new rule is that parameters must not contain ;, : or ", unless they are surrounded by double-quotes, in which case only " may not appear. Escaping of the colon character (\:) has disappeared.
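
On the generating side, the parameter rule could look something like this in Python. It's a sketch that covers only the characters named above, and the function name is invented for the example:

```python
def serialize_param_value(value: str) -> str:
    """Serialize one vCard 3.0 parameter value, quoting it when needed."""
    if '"' in value:
        # There is no escape mechanism, so a double quote cannot be written.
        raise ValueError("a double quote cannot appear in a vCard 3.0 parameter")
    if ";" in value or ":" in value:
        # Quoting makes ; and : safe inside the value.
        return '"' + value + '"'
    return value


print("ADR;LABEL=" + serialize_param_value("Suite 5; Floor 2") + ":...")
```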

vCard 4.0

vCard 4 changes the interpretation of 3.0 a bit, and now states that semi-colons may be escaped, depending on the property.

The implication is that we need to maintain lists of properties: whether they support multiple or compound values, and which delimiter they use (, or ;).

Semi-colons are now used by N, ADR, ORG and CLIENTPIDMAP. Commas are used by NICKNAME, RELATED, CATEGORIES and PID.
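
In code, such a list could be as simple as a lookup table. This Python sketch uses only the vCard 4.0 properties listed above; the names are hypothetical, and a real parser would need one table per format and version:

```python
import re
from typing import List, Optional

# Delimiter per vCard 4.0 property, taken from the lists above.
VCARD4_DELIMITERS = {
    "N": ";", "ADR": ";", "ORG": ";", "CLIENTPIDMAP": ";",
    "NICKNAME": ",", "RELATED": ",", "CATEGORIES": ",", "PID": ",",
}


def delimiter_for(property_name: str) -> Optional[str]:
    # None means: treat the value as a single, undivided value.
    return VCARD4_DELIMITERS.get(property_name.upper())


def split_value(property_name: str, value: str) -> List[str]:
    delimiter = delimiter_for(property_name)
    if delimiter is None:
        return [value]
    parts = [""]
    # Tokens are backslash pairs or single characters, so an escaped
    # delimiter ("\,") is never mistaken for a real one.
    for token in re.findall(r"\\.|.", value, re.DOTALL):
        if token == delimiter:
            parts.append("")
        else:
            parts[-1] += token
    return parts


print(split_value("CATEGORIES", "work\\,home,friends"))
print(split_value("FN", "Pot; Evert"))
```

The parts are returned still escaped; un-escaping them is a separate step.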

Even though the spec does say that commas must always be escaped, it appears to violate this rule in its own examples, specifically the example for GEO (which is no longer a compound float value, but a URI).

iCalendar 2.0

iCalendar 2.0 largely follows the same rules as vCard 4.0, but commas and semi-colons must be escaped, unless they are used as a delimiter.

Semi-colons are used as a delimiter in REQUEST-STATUS, RRULE, GEO and EXRULE; every other property uses commas.
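
For generating iCalendar, that boils down to "escape every part, then join with the right delimiter". A hedged Python sketch with made-up names follows. It treats every part as plain text, which is an assumption: non-text values such as the parts of an RRULE would need their own handling.

```python
from typing import Iterable

# Properties that use ; as a delimiter, per the list above.
SEMICOLON_PROPERTIES = {"REQUEST-STATUS", "RRULE", "GEO", "EXRULE"}


def escape_text(value: str) -> str:
    # The backslash goes first, so backslashes added later are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\r\n", "\\n")
        .replace("\n", "\\n")
    )


def serialize_values(property_name: str, parts: Iterable[str]) -> str:
    if property_name.upper() in SEMICOLON_PROPERTIES:
        delimiter = ";"
    else:
        delimiter = ","
    # Assumption: every part is a text value that needs escaping.
    return delimiter.join(escape_text(part) for part in parts)


print(serialize_values("CATEGORIES", ["Work, urgent", "Home"]))
print(serialize_values("REQUEST-STATUS", ["2.0", "Success; really"]))
```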

rfc6868

One major flaw in all the above specs was that it was not possible to encode just any value as a parameter. Newlines are not allowed, and in no case can you encode a double-quote.

rfc6868 updates both iCalendar 2.0 and vCard 4.0 to use caret (^) as an escape character. To write a double quote, use ^', to encode a newline use ^n and to encode a caret, use ^^.
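
The caret scheme is small enough to sketch in a few lines of Python. The function names are mine, and leaving unknown sequences such as ^x untouched is how I read the RFC, so treat that part as an assumption:

```python
import re

# The three sequences from rfc6868 and what they decode to.
CARET_SEQUENCES = {"^^": "^", "^n": "\n", "^'": '"'}


def caret_encode(value: str) -> str:
    # The caret goes first, so the carets we add below are not doubled.
    return (
        value.replace("^", "^^")
        .replace("\r\n", "^n")
        .replace("\n", "^n")
        .replace('"', "^'")
    )


def caret_decode(value: str) -> str:
    # Scanning left to right means "^^n" decodes to "^n", not to "^" + newline.
    return re.sub(r"\^[\^n']", lambda m: CARET_SEQUENCES[m.group(0)], value)


encoded = caret_encode('George "The Man" Herman\n^_^')
print(encoded)
print(caret_decode(encoded) == 'George "The Man" Herman\n^_^')
```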

The hard part

With this information, a simple generic parser is out of the window. Not only will my parser have to be aware of which type of document it’s parsing, it will also have to make individual decisions based on which property it’s parsing.

Researching and listing these rules helped, and I hope it’s also helpful for a future implementor.

It’s important to be strict when generating these formats. But considering the complexity of these rules, it’s extremely likely that other software has bugs when generating them (and it does! a lot!). Any parser needs to be able to handle these mistakes and attempt to make logical decisions based on what the intention likely was.

Comments

Marten • May 10, 2013

the GEO example in RFC6350 doesn't violate the comma rule, because that applies to TEXT values only, but as you correctly state GEO is a URI in vCard 4.0. The same is true for PHOTO and KEY, which are URIs and may contain commas as such.
You have to distinguish between TEXT, TEXT-LIST (a comma separated list of TEXT values), structured text types (semicolon separated TEXT values, like in N and ADR) and non-text types.
iCalendar doesn't know structured text types and doesn't need to escape semicolons (since they are not used as delimiter)
Many text-properties can be text-lists too, so it makes sense to always escape commas.
- Evert • May 10, 2013
  
  Well, if you read :
  http://tools.ietf.org/html/...
  It doesn't explicitly talk about TEXT or TEXT-LIST (unless I'm reading it incorrectly). The comma issue for GEO also appears in the Errata section for that document.
  - Marten • May 10, 2013
    
    it does, see http://tools.ietf.org/html/...
    Read the Errata again, that's a different issue. The Errata mentions the GEO parameter to the ADR property not the GEO property itself. I can't see anything about the GEO property.
    The subject of that Errata is the semantic difference between
    TYPE="cell,voice"
    and
    TYPE=cell,voice
    - Evert • May 10, 2013
      
      You're right, and the Data-Types section does seem to confirm that, but then there's still this sentence in section 3.4:
      Some properties may contain one or more values delimited by a COMMA
      character (U+002C). Therefore, a COMMA character in a value MUST be
      escaped with a BACKSLASH character (U+005C), even for properties that
      don't allow multiple instances (for consistency).
      - Marten • May 10, 2013
        
        True, but I think that makes sense for text values only. Integers, Booleans, floats, dates and times can't contain a comma by definition, so there is nothing to escape.
        But there is no "URI-list" value type. Also URI refers to http://tools.ietf.org/html/... which explicitly allows commas. On the other hand the BNF in RFC 6350 explicitly states that commas in text values must be escaped.
        I think there is more evidence that commas in URIs are ok than the opposite.
        I think that calls for another Errata to clarify this issue (and state either the one or the other).
        
        Evert • May 10, 2013
        
        I guess we can agree that the spec disagrees with itself? ;)
        
        Marten • May 10, 2013
        
        Yes
- Marten • May 10, 2013
  
  I have to correct myself: Of course iCalendar uses structured values and of course semicolons need to be escaped. It's RFC 2425 that doesn't know structured values.
  I guess I should go home and stop working for today ;-)
Marten • May 10, 2013

Btw. the comma in the NOTE of your second vCard 2.1 example is not escaped properly ;-)
- Evert • May 10, 2013
  
  Commas are not escaped in vCard 2.1 :P Only semi-colon, and then only in N, ADR and ORG :)
  - Marten • May 10, 2013
    
    Yes, they are.
    See the FN example in RFC 2426:
    FN:Mr. John Q. Public\, Esq.
    also see the BNF at the end:
    text-value-list = 1*text-value *("," 1*text-value)
    text-value = *(SAFE-CHAR / ":" / DQUOTE / ESCAPED-CHAR)
    ESCAPED-CHAR = "\\" / "\;" / "\," / "\n" / "\N")
    ; \\ encodes \, \n or \N encodes newline
    ; \; encodes ;, \, encodes ,
    without escaping commas you wouldn't be able to distinguish the commas from a list separator
    - Evert • May 10, 2013
      
      Yes, but RFC2426 is not vCard 2.1! It's 3.0.
      Check out:
      http://www.imc.org/pdi/pdip...
      It has very different language around escaping:
      Compound property values are property values that also make use of the Semi-colon, field delimiter to separate positional components of the value. For example, the Name property is made up of the Family Name, Given Name, etc. components. A Semi-colon in a component of a compound property value must be escaped with a Backslash character (ASCII 92).
      - Marten • May 10, 2013
        
        Ah, yea sorry I mixed the RFCs up. Who renders vCard 2.1 these days anyway ;-D
        
        Evert • May 10, 2013
        
        Microsoft ;) Also a lot of 'export' functionality still used vCard 2.1 by default. It's balls, but I get a lot of questions about them..
        
        Marten • May 10, 2013
        
        Ok, who uses Microsoft these days anyway ;-D
        My app handles vCard 2.1 just like vCard 3.0. That means it uses the same parser (that un-escapes all commas) for both versions but it always renders vCard 3.0. So far no one complained about missing slashes in front of his commas ;-)
        
        Evert • May 10, 2013
        
        Yea I think for reading 2.1 you're probably mostly fine with that behavior. It's what I've been doing in the past too, and I haven't gotten complaints about that specifically.