wxJSON internals: C/C++ comments storage

Introduction

Starting from release 0.2, the wxJSON library recognizes and stores C/C++ comments in JSON value objects. Comments are stored in a wxJSONValue's data member which is of type wxArrayString.

An array is used because a value may have more than one comment line. When a JSON value object is created, the comment array is empty.

There are two methods for adding comment lines to a wxJSONValue object:

using member functions
reading them from a JSON text

Adding C/C++ comments using member functions

The wxJSONValue object defines member functions for adding C/C++ comments to a value object. The value object contains an array of strings for storing comments and an enumerator for setting the position of the comment line(s):

comments may be written inline
comments may be written before the value
comments may be written after the value

Note that the comment position is not an array of enums but a single integer data member. The position therefore applies to all comment strings of a value: they all appear inline, before, or after the value they refer to, and you cannot mix positions within one value.

You can add C-style comments and / or C++ comments. When you add a comment string you have to be sure that the string is a correct C/C++ comment. In other words, if you want to add a C++ comment string, the string passed as a parameter to the wxJSONValue::AddComment() function must start with two slash characters and must end with a LF. If the LF character is missing, the function adds it for you. The following code fragment shows some examples:

  wxJSONValue v1( 10 );
  v1.AddComment( "// A C++ comment line\n" );     // this is OK
  v1.AddComment( "// Another C++ comment line" ); // this is OK
  v1.AddComment( "/*  A C-style comment */");     // OK

  wxJSONValue v2( 20 );
  v2.AddComment( "A C++ comment line\n" );     // Error
  v2.AddComment( "/ A C++ comment line\n" );   // Error
  v2.AddComment( "/*** comment **" );          // Error

  // the following is OK: new-line characters may follow
  // the end-comment characters of a C-style comment
  wxJSONValue v3( 30 );
  v3.AddComment( "/*** C comment ***/\n\n\n" );

Note that the function cannot trap all possible errors because the checks that are done by the function are very simple:

for C++ comments, it checks that the string starts with two slash characters and ends with a LF character; the LF character is added automatically if it is missing
for C-style comments, it checks that the string starts with the slash-asterisk character pair and ends with the asterisk-slash characters (trailing LF characters are permitted)
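
The two checks above can be sketched in standard C++ as follows. This is only an illustration of the rules as described here, not the wxJSON source: the function name NormalizeComment() is hypothetical, and std::string stands in for wxString.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the checks described above (not wxJSON code).
// Returns true when 'comment' passes the checks; a missing final LF is
// appended to C++ comments in place. Returns false otherwise.
bool NormalizeComment(std::string& comment)
{
    // A comment needs at least the two opening characters.
    if (comment.size() < 2 || comment[0] != '/') {
        return false;
    }
    if (comment[1] == '/') {
        // C++ comment: append the LF if it is missing.
        if (comment.back() != '\n') {
            comment += '\n';
        }
        return true;
    }
    if (comment[1] == '*') {
        // C-style comment: ignore trailing LF characters, then require "*/".
        // The first character is '/', so a non-LF character always exists.
        std::string::size_type end = comment.find_last_not_of('\n');
        // The shortest valid comment is "/**/", so the closing slash
        // cannot sit before index 3.
        return end >= 3 && comment[end] == '/' && comment[end - 1] == '*';
    }
    return false;
}
```

Because only the first and last characters are inspected, a string such as "// Line 1\nLine 2" passes these checks even though it does not produce a valid comment when written.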

Note that the function accepts the following examples, but if you add those strings to some values and write them to a JSON text stream you end up with an incorrect JSON text.

  // the following is not correct: the AddComment() function only
  // appends the final LF char 
  wxJSONValue v1( 10 );
  v1.AddComment( "// Line 1\nLine 2" );

Writing the above value you get:

  ...
  // Line 1
  Line 2
  10
  ...

You would have to write:

  wxJSONValue v1( 10 );
  v1.AddComment( "// Line 1" );
  v1.AddComment( "// Line 2" );
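
If the comment text comes from a multi-line string, one way to avoid the problem is to split it first and add each line separately. The helper below is a hypothetical sketch in standard C++ (ToCppCommentLines() is not part of wxJSON, and std::string stands in for wxString); each returned string would then be passed to AddComment() on its own.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of wxJSON): turns free text into one
// C++ comment string per input line, each starting with two slashes
// and ending with a LF.
std::vector<std::string> ToCppCommentLines(const std::string& text)
{
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back("// " + line + "\n");
    }
    return lines;
}
```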

Nested C-style comments are not handled by the wxJSON parser:

  wxJSONValue v2( 20 );
  v2.AddComment( "/* comment1 /* comment2 */ */" );

Writing the above value to a JSON text string you get:

  ...
  /* comment1 /* comment2 */ */
  20
  ...

The parser will report an error when it reads the last close-comment characters because when a C-style comment starts, all characters until the first close-comment chars are ignored by the parser.
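
The effect can be illustrated with a simplified scanner written in standard C++. This is a model of the behaviour described above, not the wxJSON parser; the function name SkipCComment() is hypothetical.

```cpp
#include <cassert>
#include <string>

// Simplified model (not wxJSON code): 'start' is the index of the opening
// slash-asterisk pair. Returns the index just past the first close-comment
// pair, or std::string::npos when the comment is never closed.
std::string::size_type SkipCComment(const std::string& text,
                                    std::string::size_type start)
{
    // Start searching after the two opening characters so that the
    // asterisk of the opening pair is not reused as part of the close.
    std::string::size_type close = text.find("*/", start + 2);
    return close == std::string::npos ? close : close + 2;
}
```

For the nested example, the scan stops after "comment2 */", so the remaining " */" is handed back to the parser as ordinary input, which is where the error is reported.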

Using multiple inline comments

Comments may appear inline, after, or before the value they refer to. When using the wxJSONValue::AddComment() function you can mix these positions for different values, for example:

  wxJSONValue root;
  root["key-1"] = "value-1";
  root["key-2"] = "value-2";
  root["key-3"] = "value-3";

  root["key-1"].AddComment( "// comment before", wxJSONVALUE_COMMENT_BEFORE );
  root["key-2"].AddComment( "// comment inline", wxJSONVALUE_COMMENT_INLINE );
  root["key-3"].AddComment( "// comment after", wxJSONVALUE_COMMENT_AFTER );

When you write the above JSON value you get:

{
  // comment before
  "key-1" : "value-1",
  "key-2" : "value-2", // comment inline
  "key-3" : "value-3"
  // comment after
}

This is not really good: you should choose between comments before and comments after. Inline comments may appear in both cases. Note that only the first line of an inline comment is written on the same line as the value it refers to. If you add more than one comment line in inline position, the remaining lines are written after the value, without indentation. Example:

  "key-2" : "value-2", // comment inline (line 1)
// comment inline (line 2)
// comment inline (line 3)
  "key-3" : "value-3"

The output is not very good, so use inline comments only for very short comment strings.

Reading and storing C/C++ comments from JSON text

Comment lines are stored when they are encountered in the JSON input text stream. The parser stores comments in the value they refer to. To determine which value a comment line refers to, the parser follows these rules:

if the comment line and the value appear on the same line, the comment line refers to that value
if the flag wxJSONREADER_COMMENTS_AFTER is not set, a comment line refers to the value that follows the comment line(s)
if the flag wxJSONREADER_COMMENTS_AFTER is set, a comment line refers to the value that appears immediately before the comment line(s)

Below you find some examples:

// comment for root (line 1)
// comment for root (line 2)
{
  "key-1" : "value1",

  // comment before 'key2'
  "key-2" : "value2",

  "key-3" : {
     "key3-1" : "value3-1",
     "key3-2" : "value3-2"   // comment inline for 'key3-2'
  },

  "key-4" : {   // comment inline for 'key4'
  },

  "key-5" : "value-5"
  // if the flag wxJSONREADER_COMMENTS_AFTER is set, this
  // comment would refer to 'key-5'
}

The parser stores the comment in the value it refers to, in the same position in which it was encountered: if a comment is on the same line as the value, it is stored as an inline comment. When the value is written to a JSON text stream, the comment is written in the same position unless you specify a flag that forces the writer to write comments in a particular position.

Choosing the value

How does the parser choose the correct value to store the comment?

The parser uses some data members to store pointers to values:

m_next: this pointer points to the value object that will hold the next value to be read. When comments appear before the value, the comment is added to this value, unless the pointer is NULL. In the example above, when the first two comment lines are read, this pointer points to the root value object.

m_current: this pointer points to the value object that the parser is reading at the moment. It can be NULL if the parser has not yet read a value. In the example above, when the first two comment lines are read, this pointer is NULL. When the first open-object character is read, this pointer points to the root value.

m_lastStored: this pointer points to the value that was last stored in an object or in an array. Comments are added to the value pointed to by this data member when the wxJSONREADER_COMMENTS_AFTER flag is set.

In addition, the parser stores the line number of every comment line that was read and also the line number of the value when it is read: if the line numbers match, the comment is added to the matching value as an inline comment.
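
The selection logic can be summarised with the following standard C++ sketch. It is a model built only from the description above: the types Value and ParserState and the function StoreComment() are hypothetical names, and the real parser may differ in details such as the order in which the pointers are checked.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-ins for the parser state described above (hypothetical
// names; only the three pointer roles and the line-number rule come from
// the text).
enum class Position { Before, Inline, After };

struct Value {
    int line = -1;   // line of the last character read for this value
    Position position = Position::Before;
    std::vector<std::string> comments;
};

struct ParserState {
    Value* current = nullptr;     // plays the role of m_current
    Value* next = nullptr;        // plays the role of m_next
    Value* lastStored = nullptr;  // plays the role of m_lastStored
    bool commentsAfter = false;   // models wxJSONREADER_COMMENTS_AFTER
};

// Stores 'comment', read on line 'commentLine', in the value it refers to.
// Returns the chosen value, or nullptr when the comment is lost.
Value* StoreComment(ParserState& s, const std::string& comment, int commentLine)
{
    // Same line as a known value: store as an inline comment.
    for (Value* v : { s.current, s.next, s.lastStored }) {
        if (v != nullptr && v->line == commentLine) {
            v->comments.push_back(comment);
            v->position = Position::Inline;
            return v;
        }
    }
    // Otherwise use the value before or after the comment, as the flag says.
    Value* target = s.commentsAfter ? s.lastStored : s.next;
    if (target != nullptr) {
        target->comments.push_back(comment);
        target->position = s.commentsAfter ? Position::After : Position::Before;
    }
    return target;
}
```

A single position field per value mirrors the integer data member described earlier: the last comment stored decides the position for all comments of that value.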

A complex example

Below you find a more complex example and the explanation of every operation done by the parser when the comment lines are encountered. The numbers in the first two columns are the line numbers of the JSON text input stream. The comment flag is BEFORE:

 1 // comment for root (line 1)
 2 // comment for root (line 2)
 3 {
 4   "key-1" : "value1",
 5
 6   // comment before 'key2'
 7   "key-2" : "value2",
 8   // comment before 'key3' (1)
 9   // comment before 'key3' (2)
10
11   "key-3" : {
12      "key3-1" : "value3-1",
13
14      // comment before key3-2
15      "key3-2" : "value3-2"
16   },
17
18   "key-4" : {   // comment inline key4
19      // this comment does not refer to anything (19)
20   }
21
22   "key-5" : [ // comment inline key5
23
24      // comment before item 5-1
25      "item5-1",
26      "item5-2", // comment inline 5-2
27      "item5-3"  // comment inline 5-3
28
29      // this comment does not refer to anything (29)
30   ],
31
32   "key-6"
33      :        // comment inline key-6
34        "value",
35
36   "key-7" : {
37      "key-7-1" : "value-7-1"
38   },        // comment inline key-7
39
40   "key-8"     // comment inline key-8(1)
41      :        // comment inline key-8(2)
42       value,  // comment inline key-8(3)  // ERROR: string-value not quoted
43
44   "key-8"     // comment inline key-8(1)
45      :        // comment inline key-8(2)
46      "value", // comment inline key-8(3)
47
48   "key-9" : {
49      "key9-1" : 91,
50      "key9-2" : 92
51   }
52
53
54   "key-10" : [
55   ]            // comment inline key-10
56
57   // this comment does not refer to anything (57)
58 }
59 // this comment does not refer to anything
60 // if COMMENT_BEFORE.
61
62 This non-JSON text is ignored by the parser because
63 it appears after the top-level close-object character
64

The detailed explanation of the example

 1 // comment for root (1)
 2 // comment for root (2)
 3 {

  m_current    = NULL
  m_next       = the 'root' value
  m_lastStored = NULL

The first comment is read in the GetStart() function, which searches for the first '{' (open-object) or '[' (open-array) character. When the two comment lines are read, the pointers contain the values listed above.

Because the flag is COMMENT_BEFORE the comment is added to the value pointed to by m_next.

 3 {
 4   "key-1" : "value1",
 5
 6   // comment before 'key2'
 7   "key-2" : "value2",

  m_current    = NULL
  m_next       = the temporary value which will contain 'key-2'
  m_lastStored = key-1

This example is similar to the one above. The top-level start-object character causes the parser to enter the DoRead() function, which has the root value object as the parent parameter. Every value is stored in a temporary wxJSONValue object created on the stack, which is kept in the variable named value. The parser stores the value when the comma character or the close-object character is encountered.

The comment line is added to the value pointed to by m_next.

 8   // comment before 'key3' (1)
 9   // comment before 'key3' (2)
10
11   "key-3" : {

  m_current    = NULL
  m_next       = the temporary value which will contain 'key-3'
  m_lastStored = key-2

This is the same as above. The only difference is that there are two comment lines but they are all added to the value pointed to by m_next. You can add as many lines as you want.

11   "key-3" : {
12      "key3-1" : "value3-1",
13
14      // comment before key3-2
15      "key3-2" : "value3-2"
16   },

  m_current    = NULL
  m_next       = the 'value' which will contain 'key3-2'
  m_lastStored = key3-1

This is very similar to the above cases.

17
18   "key-4" : {   // comment inline key4
19      // this comment does not refer to anything
20   }
21

  m_current    = key-4 (the parent)
  m_next       = temporary value object
  m_lastStored = NULL

When the two comments are read, the value of the pointer data members is the same for both comments but they are stored in a different value object. When the open-object character is read on line 18, the parser sets the m_current data member equal to the temporary value object on the stack which is the 'parent' of all values read until the close-object character. Then, the DoRead() function is called recursively, which allocates another temporary value object on the stack.

We have to examine the two comment lines differently:

the comment on line 18 is on the same line as the value pointed to by m_current so it will be added to that value as an inline comment.

the comment on line 19 cannot be added as an inline comment, so, because the comment flag is BEFORE, it is added to the value object pointed to by m_next: this comment would refer to the first value that has key-4 as the parent.

Because key-4 is empty, no value is added to it and the temporary value object created on the stack by the DoRead() function is destroyed when the function returns: the close-object character on line 20 causes the function to return. The comment on line 19, stored in the temporary value object, is lost because that object was not added to the parent.
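
The way such a comment disappears can be modelled with a few lines of standard C++. This is a deliberately simplified sketch, not the DoRead() implementation: Event, Object and ReadObject() are hypothetical names, and a member is reduced to its key plus the comments stored with it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Comments = std::vector<std::string>;
using Object = std::map<std::string, Comments>;  // key -> comments kept with its value

// One item read between the open-object and close-object characters.
struct Event {
    enum Kind { Comment, Member } kind;
    std::string text;  // the comment text, or the member's key
};

// Hypothetical model of reading the content of one object.
void ReadObject(Object& parent, const std::vector<Event>& events)
{
    Comments pending;  // comments held by the temporary value on the stack
    for (const Event& e : events) {
        if (e.kind == Event::Comment) {
            pending.push_back(e.text);  // a "before" comment for the next member
        } else {
            parent[e.text] = pending;   // storing the member keeps its comments
            pending.clear();            // fresh temporary for the next member
        }
    }
    // Anything still in 'pending' belonged to a temporary that was never
    // stored in 'parent', so it is dropped when the function returns.
}
```

With a single comment and no member, as in key-4, the parent stays empty and the comment is gone once the function returns.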

25      "item5-1",
26      "item5-2", // comment inline 5-2

  m_current    = NULL 
  m_next       = temporary 'value' which will get 'item5-3' string
  m_lastStored = 'item5-2'

When the parser reads the comma character on line 26, it stores the temporary value object in the parent value object and sets the m_lastStored data member to point to this last stored value. The m_next pointer points to the temporary value object on the stack, which will get the item5-3 string value. The parser checks the line number of the comment line and searches for a matching line number in the value objects pointed to by all three pointers, provided that they are not NULL. A matching line number is found in the value pointed to by m_lastStored, so the comment is added to that value as an inline comment.

27      "item5-3"  // comment inline 5-3
28
29      // this comment does not refer to anything
30   ],

  m_current    = temporary 'value' which contains 'item5-3' string 
  m_next       = NULL
  m_lastStored = 'item5-2'

The value that holds the item5-3 string is not yet stored when the parser reads the comment on line 27. This is because the temporary value created on the stack is stored when the comma character or the close-array character is encountered. The comment is added to the value pointed to by m_current because the line numbers are equal.

The comment on line 29 is lost because it is added to the temporary value object which is not stored. See the same situation on line 19.

31
32   "key-6"
33      :        // comment inline key-6
34        "value",
35

  m_current    = 'value' which contains 'key-6'
  m_next       = NULL
  m_lastStored = 'key-5'

The comment is added to the value pointed to by m_current because the line numbers are equal. The parser updates the line number of a value each time it reads a character that belongs to a value: the colon character is one of these characters. Note that the comma is also a character that belongs to a value.

35
36   "key-7" : {
37      "key-7-1" : "value-7-1"
38   },        // comment inline key-7

  m_current    = NULL
  m_next       = 'value' which will contain 'key-8'
  m_lastStored = 'key-7'

The comma character on line 38 causes the parser to store the value key-7 in the parent object (the root object). The comment is added to the value pointed to by m_lastStored because the line numbers are equal. The comment's position is inline.

39
40   "key-8"     // comment inline key-8(1)
41      :        // comment inline key-8(2)
42       value,  // comment inline key-8(3)     // ERROR: value not quoted
43

Lines 40 and 41
  m_current    = the temporary 'value' object
  m_next       = NULL
  m_lastStored = NULL

Line 42
  m_current    = NULL
  m_next       = the temporary 'value' object
  m_lastStored = NULL

This is a very particular situation that causes the parser to behave incorrectly. Note that the incorrect behaviour is a consequence of a syntax error, so it cannot be considered a bug.

The two comments on lines 40 and 41 are added to the value object pointed to by m_current and this is OK. When the value is stored (if it can be), the comments refer to the correct value as inline comments.

The parser reads the value string but reports an error: a string must be quoted. As a consequence, the temporary value object has a key but it does not have a value. Next, the parser reads the comma character and calls the StoreValue() function, which should add the temporary value object to the parent object. But the value cannot be added to the parent because it does not have a valid value. The consequence is that the m_lastStored pointer is set to NULL because no value was actually stored. Had the value been stored, the comment on line 42 would have been correctly added to the value pointed to by m_lastStored because the line numbers are equal.

But when the parser reads the comment on line 42, the m_lastStored pointer is NULL, so the comment is added to the value pointed to by m_next because the comment flag is BEFORE. The incorrect behaviour is that the value of key key-8 (see below) will have four comment lines and not three as we expect.

-----

43
44   "key-8"     // comment inline key-8(1-1)
45      :        // comment inline key-8(1-2)
46      "value", // comment inline key-8(1-3)
47

  m_current    = NULL
  m_next       = 'value' which is empty
  m_lastStored = key-8

See the description of the previous item. The difference is that this value is correctly added to the parent. Unexpectedly, it has four comment lines instead of three.

53
54   "key-10" : [
55   ]            // comment inline key-10
56

  m_current    = temporary 'value' object
  m_next       = NULL
  m_lastStored = NULL

The comment is added to the value pointed to by m_current because the line numbers match. For an object-type value, the line number is updated when the parser reads the open-object and close-object characters; in this example the same happens for the close-array character on line 55.

56
57   // this comment does not refer to anything
58 }

  m_current    = the temporary 'value' object which contains 'key-10'
  m_next       = NULL
  m_lastStored = 'key-9'

The comment is lost because it should be stored in the value object pointed to by m_next, but this pointer is NULL so the comment cannot be stored.

58 }
59 // this comment is not stored in the root value
60 // if COMMENT_BEFORE. It should be if COMMENT_AFTER.
61
62 This non-JSON text is ignored by the parser because
63 it appears after the top-level close-object character
64

  m_current    = NULL
  m_next       = NULL
  m_lastStored = the root value

Because the wxJSONReader::DoRead() function returned on line 58, all text after the top-level close-object or close-array character is ignored by the parser. For COMMENTS_BEFORE this is OK, but if the flag were COMMENT_AFTER, the two comments on lines 59 and 60 should be stored in the root value: this does not happen, for now. This is not a bug but a deliberate design choice: when the top-level wxJSONReader::DoRead() function returns, the parsing process ends.

So, if you want to store comments for the root value, never use the wxJSONREADER_COMMENTS_AFTER flag and put comments before the top-level start-object or start-array character.