Prefix parameters don't mix well with comments

Okay, this is really easy to implement, the question is just: what behaviour do we want?

It's worth understanding, that:

If you currently start any line with a [ (modulo leading whitespace) and do not end it with a ] (modulo trailing whitespace), that line is entirely ignored.
You cannot write [myprefix] # comment
You cannot write [myprefix] key = value

In its current form, this bug report is about the 2nd issue (which I've also been bitten by in the past). In addressing it, one could address the 3rd issue as well, though.

To do either requires the following change in the logic of the code: rather than check if there's a ] at the end of the line, we just have to check if there's a ] somewhere on that line. Then extract the prefix. Once that's done, one can either ignore anything else on the line (which would trivially solve the 2nd issue) or actually try to make sense of water comes after the ] (which would also take care of the 3rd point).

A look at the code tells me that whoever wrote it wanted it the parser be really lax: it will essentially never throw() (it only does if you provide the same key twice). So I don't see a point in deviating too much from the slightly confusing behaviour that is the 1st issue, in putting too much effort into checking that we're not given something evil like [myprefix # ], or even into giving helpful error messages: The current code does nothing of the sort (and I don't want to propose a complete rewrite at this point).

In summary, I would like to propose that we:

Either we carry out this change (edit: moved to !246 (merged)), which addresses the 2nd and 3rd point..

Or we carry out this change (edit: moved to !247 (closed)), which only addresses the 2nd point.

mentioned in commit a02303d1

mentioned in commit 9bc6e140

mentioned in merge request !246 (merged)

mentioned in merge request !247 (closed)

@pipping I think you overestimate how much thought the author invested when he wrote that code. To me it looks more like "I'll quickly hack together an .ini-parser that can handle the config files I throw at it; it does not have to be perfect". So some things get checked, and others don't, even though they really should.

!246 (merged) is OK as a quick fix. !247 (closed) perpetuates confusing syntax in config files and thus should be avoided.

Really, what we should do is:

strip anything starting with # from lines (as we already do for lines not starting with [)
trim whitespace from the result
if what remains is empty, ignore the line
if what remains is of the form [*], set a new prefix
if what remains is of the form *=* (leftmost =), set the key with the current prefix
otherwise, report an error.

OK, if we would do that, that would break quoted multiline values...

Well. I wouldn't mind writing a more robust and stricter parser either. But we'd probably end up with a few upset users who'd been relying on something that was never meant to work. I've looked a bit closer now and the logic that the parser employs is absolutely insane. These are some of my favourites:

Anything that comes after the character # is treated as a comment and ignored.
This is not true when you're setting a prefix. Then the entire prefix line is ignored.
But it is true on assignment lines. Even inside quotes!
Of course, in such a case, you first see an opening quote, then some text, and then you ignore the closing quote. So now you're in multiline mode, surprise!, and you have a closing quote on one of the following lines!
In multiline mode, comments are not ignored! So you could actually use multiline mode if you wanted to store a comment character in a string. Just be aware that comments behave differently on the first line of multline mode than everywhere else.
Also note that in multiline mode, newlines are preserved. Except for blank lines at the start. Those are special. They are all ignored.

Here's a bit of code that shows the behaviour I've described:

    std::stringstream s;
    s << "\
[prefix1] # comment\n\
key = value1\n\
\n\
[prefix2]\n\
key = value2\n\
\n\
[]\n\
k1 = v#ignored\n\
k2 = \"quoted text and comment # ignored\n\
# continue to the next line\n\
# and so on\"\n\
\n\
k3 = \"\n\
\n\
newlines are preserved\n\
\n\
except\n\
\n\
\n\
for those two in the very beginning\"\n\
";

    Dune::ParameterTree c;
    Dune::ParameterTreeParser::readINITree(s, c);

    // This fails
    // std::cout << c.get<std::string>("prefix1.key") << std::endl;

    // Prints "value2"
    std::cout << c.get<std::string>("prefix2.key") << std::endl;

    // Prints "v"
    std::cout << c.get<std::string>("k1") << std::endl;

    // Prints "quoted text and comment
    //         # continue to the next line
    //         # and so on
    std::cout << c.get<std::string>("k2") << std::endl;

    // Print "newlines are preserved
    //
    //        except
    //
    //
    //        for those two in the very beginning"
    std::cout << c.get<std::string>("k3") << std::endl;

I think things would be much easier if one had not allowed the lazy alternative foo = bar buz to foo = "bar buz". Maybe one could propose a new config format format for dune 3.0, have both the old and the new in 2.6, and switch to the new one completely in 3.0. Either way, I just lost my hope of getting anything with the current parser "right" while also remaining backward-compatible.

mentioned in merge request !248 (closed)

See !248 (closed) for my attempt on this.

@pipping: +1 for @joe's comment on the intentions of the author: Don't expect any deep thought behind this. The original "parser" is about 12 years old and was written with 6 months of C++ experience. While the code was extended and cleaned up a lot, the basic syntax (and thus the lax behaviour of the parser) essentially stayed "compatible".

@pipping: A clean syntax and a robust parser would be very welcome. The compatibility issue could be easily handled by providing the old and the new parser in one release. However, I'd then prefer to not try to fix the old parser. Instead one should start by writing up the exact syntax (including corner cases).

@carsten.graeser: Right. If we provide a new format anyway, there's absolutely no reason to extend the old parser. When you said "start by writing up the exact syntax", you're referring to a new parser only, I take it? I'll get on that then.

@pipping: Yes, the old one is already "specified by implementation" ;-). Let's try to avoid this for the new one.

I'll drop here quickly three issues that I have/had with config files.

The config files are a list while the ParameterTree has a tree structure. The nestedness is only represented via dot-separated scopes.
Parameter subsets needed at two different places need to be copied by hand.
The ParameterTree report() is far less easily readable than my own original config and is therefore inconvenient to reuse.

Maybe it is worth considering some XML-like format? (Not saying that I am necessarily a fan of that for this application.)

@max.kahnt:

Re 1.: I don't see the issue. This is similar to how e.g. git handles its config files. The best way to access the information programatically is definitely a tree structure. There are several ways to write it down in a config file and some extended ini style is used by a lot of programs, so should be familiar to lots of people.

Re 2.: I'm not quite sure what exactly you want here?

Re 3.: IMO report() is intended for debugging, not generally for writing actual config files. If you want something to be as easily readable as your original config, you need to invest a lot. You need to preserve comments, whitespace, etc. There are some systems that kind of do that: git does something like that with its config files, the YAML (or was it TOML?) spec has some provisions for that, if I remember correctly. But that would certainly not be something we do ourselves in dune, we'd need an external lib for that, meaning an additional dependency.

Re XML: No. Just no. XML is not human readable. If you want something fancy, look at YAML or TOML, or heck, python or json.

@carsten.graeser What is the plan here? I remember something about YAML from the developer meeting, but I don't remember the context, and I can't find anything in the meeting minutes.

I doubt that it is a good idea to introduce yet another config file format. So either we get the existing format somewhat sane (because we already have that). Or we adopt some existing format, but that almost certainly means dependency on another lib, because config file formats are certainly not our area of expertise.

@max.kahnt: I support @joe here. XML is hard to read, hard to write, and its attributes don't map well to ParametrerTree. I'd prefer fixing our ini-like format. As a standardized alternative I'd prefer json, because it's very simple and readable.

Luckily the parser is independent of the data structure. So one can easily add parsers for other formats. However, this would add dependencies since we should definitely avoid writing our own json, xml, ... parser.

@joe: I never prososed YAML. JSON would be nice and I vaguely remember that someone (not me) found a header-only parser with a very liberal licence.

@joe I don't plan on proposing a very different config file format. Just basically what we have, without the rough edges, and with sane error handling. Something where a migration would either require no effort at all or be trivial.

It's worth noting that not all dependencies are created equal in this context: even if one chose not to write a parser by hand, the alternative would be to go through a parser generator. The generated code could be checked into git and only those that touch the parser would have to have it installed.

I'd like to stress that getting this type of parser right is actually really easy. The language at hand can be tackled by the simplest of parsers (an LL(1) parser) as far as I can tell. The code will be comprehensible.

@carsten.graeser Well, let me know. If you'd actually much rather have json than ini-file-format, I don't need to waste any more effort on this.

I had no intention to actually push XML or alike. I just thought it'd be nice to mention what I think could improve convenience of usage aside from a general reliability revision as proposed in the first place.

My 3 bullet points were also motivated by the idea that there might already exist syntax concept, implementations and additional infrastructure around configuration files that does a lot more than any Dune developer is willing to provide.

I did not want to re-start a discussion that might have already taken place at a developer meeting.

@pipping: I would not want to replace our hierarchical ini-extension, because it's in my opinion still more suited for the task of a simple parameter file. But if we additionally add a standardized format, then I'd like to have JSON.

It's maybe also worth mentioning again, that the boost property tree library does essentially the same as our ParameterTree. It contains parsers for XML, JSON, INI (plain, MS-style), and (ta-da!) for their own invention: the INFO-format.

@carsten.graeser I see. Great. So since the ini format is not going away then, I've continued to work on this, mostly driven by the desire to not have dune-common depend on boost.

I think now is a good time to present what I currently have: A code that I believe to work, i.e. do what I wanted it to do. The complexity of the code is currently not very high. The main method (that's basically all there is) fits into ~100 LOC and I don't think it's the kind of code that makes your head spin. So if you think that this is worth pursuing (which I hope), then I'd write down a proper spec of what exactly the language is that this parser supports and the spec could be discussed.

For now, I'll leave you with the code that you can throw things at. (edit: Code moved to !249).

@carsten.graeser One header only library for JSon is called cJSON

An other one which looks very promising is https://github.com/nlohmann/json which provides a very C++-11 style interface.

@pipping I might not the crucial commenter, but I believe it's worth the effort despite the JSON discussion going on in parallel.

@max.kahnt Thanks. I've turned it into !249 now. This form might make it easier to compare the old and the new parser (they can be tested on the same input). And thanks to CI, unit tests are also run.

@pipping Your parser is pretty radical, there seem to be a number of differences to the current parser, can you maybe summarize those?

@joe The language that the parser accepts is: Update: Reply moved to !249.

From a compatibility PoV:

sounds like more C-like quote handling. Probably not very much of a compatibility problem (while using the quote inside quoted string before was possible, it was strange already).
disallows multi-word unquoted values. Will probably break things.
severe restrictions of what can appear in a key/section name. I know I used + and - in the past when I needed to encode floating point values into the key names.

Oh yeah, spaces in key/section names are no longer allowed. (Although, if they are to be allowed, multiple should probably be collapsed into single spaces anyway, unless quoted.)

@pipping Don't take my previous comments as an immediate request to implement that all that stuff, I'm just trying to figure out how big the breakage would be. I see no way around some breakage in any case.

@joe Thank you for your feedback. It's greatly appreciated!

Adding [+-*=] and so forth back to the the prefixes and keys is entirely straight-forward. I just hadn't expected anyone to want or even already be doing that. But yes, I like to use Lisp style names with hyphens, too, since they're more readable than camlCase or underscores, and the only reason you cannot in a language like C is the inherent parsing ambiguity which does not apply here.
Yes, disallowing multi-word unquoted values was intentional. They're just too much of a mess w.r.t. whitespace. key = a b was different from key = a b (two spaces). So it did care about whitespace to some extent but of course then it discarded anything that came before the a or after the b. I think that asking the user for key = 'a b' or key = "a b" instead is reasonable.
So previously one could write [my prefix]? But one could not write long key name = value, correct? Hm. This feels very similar to the simple strings then: surrounding whitespace would be ignored but whitespace between words would be preserved (and collapsed?)?

While I see the 2nd point as the key backward-incompatibility, there are probably many others that I'm just not aware of yet.

@pipping

= should not be used in key or section names. It was impossible to put in key names before. It maybe was possible to use in section names, but anyone who did that had it coming.
Multi-word unquoted values are nice -- and used -- for specifying vectors. E.g. grid.upper = 2 2 2 to specify the upper coordinate for a structured grid. The parser for vector values just uses whitespace to separate elements, so multiple whitespace characters are OK there. It would, of course work just fine if you put quotes around the values, but users have to modify their existing config files, and now their vectors of numbers suddenly look like strings.
You could write [my prefix], and it had the same meaning as [ my prefix ], but was different from [my prefix] (two spaces). You could also write long key name = value, which would be different from long key name = value (again, two spaces). I'm not quite sure whether anyone used whitespace in key/section names though. Again, anyone who used more than one space character in a row in names had it coming.

@joe

I agree. But then I'm also a bit surprised that you haven't been burned by the special role of the dot when you were encod[ing] floating point values into the key names ;)
I think this is the key point. Everything is a string in these ini files (with both parsers -- just because of the way parameter trees work). Strings can later be converted to something else (once you call get). But everything is read in as a string. So you could read from a variable in different ways -- maybe even try to interpret it as one type first and then try another type. Or just extend get and have your parametertree support things like length = 3km or length = '3 km' through a template specialisation for a boost::units type. The handling of unquoted strings by the current parser, meanwhile, tries to hide the fact that everything is a string (pretending that you can feed it vectors and it will understand) at the cost of making things a tad inconsistent and weird. I personally prefer consistency here and think the cost of extra quotes is worth it. But it would be easy to reproduce the behaviour of the old parser if one wanted.
This is freaky stuff. Apparently, you could use a=b as a prefix. Which would give you a fully specified key name like a=b.c for a key c that you could not use without the [a=b] syntax: a=b.c = ... would always be parsed as an assignment to a. You could also do weird things like a b = c = d which would be parsed as an assignment of c = d to the variable a b. I find all of that rather pathological.

I think I replaced . by _, but I could not do that with + and -, since then I wouldn't be able to distinguish them anymore.
Sure, everything is a string at some point. Still, noone proposes to drop unquoted values altogether and force people to write integer = '3' for consistency instead of integer = 3. It's OK that internally there are different strings that yield the same value when parsed as a particular type. For instance, 0, 00, +0, -0, 0 (space-zero-space) all should result in the same value when parsed as an integer (but not when parsed as a float, even though they still compare equal). The thing with consistency is: there can be multiple mutually incompatible ways for things to work, each with its own internal consistency. Then consistency can't help you in deciding which one is better.
= is wrong in key/section names (unless someone implements quoting for them), but in values it is totally OK, quoted or not, IMO. E.g. target.cxxflags = -std=c++20

@joe Regarding your 2nd point, I think we're talking about different things now. I didn't mean "everything is a string" as "1 is a string representation of the successor of zero." -- i.e., not in a philosophical way. I was referring to the fact that to get anything out of a parametertree (by first reading from a file/stream/argv), you go through two steps: (1) you extract a string from the source, (2) you interpret that string as an instance of a certain type.

The second step will often ignore whitespace. But that doesn't mean that the first one can be sloppy with it. You can always choose std::string as your type and then all whitespace should be preserved in the second step, and thus in the first.

@joe After a bit more time of contemplation, I'm now starting to get annoyed by my own inability to compromise. At the end of the day, there's no point in forbidding a syntax that people have been using, and want to continue to use, when I'm not forced to use it myself.

So, before I just mimick the behaviour of key = a b c of the old parser, one final proposal (I'd like to move forward here after all): If whitespace is not a concern in such strings, maybe it would make sense to normalise them. So that both key = a b c and key = a b c (four spaces) are parsed as the same string, namely a b c. I find it very unlikely that this would lead to breakage for anyone, but it would stress the point that in these assignments, whitespace is not what the user cares about: For that, you have single- and double-quoted strings.

@pipping Maybe we should cut down on the latency and use a phone. I could not reach you just now, will try again tomorrow.

OK, let me try to summarize:

ParameterTreeParser follows two conflicting paradigms with regards to quoting and comments. This leads to inconsistent behaviour.
- Paradigm 1: if a value starts and ends with the same quote character the actual value is the one without the quotes.
- Paradigm 2: everything on a line starting with the first # is ignored as a comment.
- Problem: key = 'quoted ' # value ' # comment': what is the value?
- Current status: value is quoted (with a space at the end), giving Paradigm 2 priority.
At some point multiline values were introduced on top of the above mess. Multiline values follow Paradigm 2 on the first line but Paradigm 1 on subsequent line.
Errors are ignored.
The parser allows weird names that conflict with the rest of the syntax, e.g. [ a=b ] as a section or c[d]e = f as a key.
There was a problem with multiple newlines inside multiline string not being preserved (which I did not understand yet)

We had a few suggestions how to fix the above issues:

Re 1. and 2.:
- (a) Disallow comments on the same line as quoted values. This is basically following Paradigm 1 with all the consequences.
- (b) Properly parse quoted strings, including embedded escaped quotes. This is basically replacing Paradigm 1.
Re 3.:
- (c) Report errors in parsing (This seems to be universally agreed now).
Re 4.:
- (d) Disallow just syntax characters ([]=#'") in names.
- (e) Allow only whitelisted characters in names (currently alphanumeric + ._)
Re 5.:
- (f) Handle newlines properly in multi-line values

We had suggestions for further improvements:

(g) Unquoted values must be single words (seems to be abandoned).
(h) Collapse multiple whitespaces into single whitespaces in multi-word unquoted values.
(i) Collapse multiple whitespaces into single whitespaces in names (implies allowing spaces in the first place, contrary to (e)).
(j) Disallowing multiple . in names (and probably, . at the start or end of a fully qualified key name).
(k) Allow a key assignment on the same line as a section head: [a] b = c

Patches:

!246 (merged) Makes [section] # comment legal, which would have been silently ignored before, but there are still other errors that are ignored and the other issues are not addressed.
!247 (closed) Same as above, but also implement (k).
!248 (closed) Implements (a) + (c) + (d) + (f), I believe.
!249 Implements (b) + (c) + (e) + (f) + (g).

Thank you! That's a marvellous summary. The 5th issue you mentioned was that

key = '

value1

value2'

would turn into value1\n\nvalue2 rather than \n\nvalue1\nvalue2. And then there's of course also the 6th issue, which this report was originally about, that a comment would not be allowed on a line where a prefix was set. Finally, I think (k) can also be considered abandoned, and a cross between (d) and (e) that makes sense to me might be worth including, where there a blacklist as well but whitespace characters are automatically on it, too.

@pipping So the thing I was unclear about: Do you want to create a new parser with a saner syntax, or do you want to try and fix the old one's syntax?

And I'm going to close !248 (closed). It implements (a), but (b) is probably preferable, as it allows

key = 'value' # comment

Would you mind adapting !249 such that

+ and - are legal in names
spaces are legal in names (with or without collapsing)
unquoted values can be multi-word (with or without whitespace collapsing)
unquoted values can contain anything but '"#\ (forbidding '"\ allows extending quoting later, shell-style)
there are no two parser versions (no #ifdef to select between old and new, that leads to dead code)

@joe I've just pushed to !249. I've already addressed the following points in your five-point list: the 1st, 3rd, and 4th. The 2nd will take me a bit to implement, just like the 3rd took a bit. But I'll get around to it tomorrow, I'm sure. Finally, for the 5th: I did not mean that part of the MR as something that was meant to stay. For now, it still serves the purpose that you can easily check if the old and new parser behave any differently on a given input by extending the parametertreetest.cc: the resulting test parametertreetest and parametertreetest-new give you the old and new answer, respectively. The actual testing of the new parser I'm doing in iniparsertest.cc.

As for your earlier question

Do you want to create a new parser with a saner syntax, or do you want to try and fix the old one's syntax?

My answer to that changed a bit over time: I expected less objection to dropping unquoted strings with whitespace and I did not think that adding another format like JSON would be an option. So with that, my impression is that the current direction takes us to a parser for a format that is so close to the original that it'll hardly be viewed as an improvement feature-wise; it should thus try to break relatively little and instead hope to score bonus points with error reporting and thus through the parser rather than the format. So the answer is I'm trying to fix the old one's syntax, I reckon.

@pipping Let me know when you have addressed the second point, then I'll take another look. The fifth point is only there as a TODO reminder before actually merging.

@joe I've pushed two more commits to !249 now that permit whitespace in keys and prefixes, thereby addressing your 2nd point. (update: I've addressed the 5th and last point by now, too, dropped the WIP marker and cleaned up the history of the branch a bit, but not too much).

It might be worth noting that this whitespace handling (in unquoted strings, keys, and prefixes) makes the parser quite a bit longer and harder to understand/maintain and that's not surprising because the language it accepts is no longer LL(1) or LL(k) for any k: The interpretation of an arbitrarily long chunk of whitespace only becomes clear once one knows if what comes next terminates the prefix/key/string; this is reflected by the introduction of variables names like potentialChunkEnd for which it is not immediately clear if they're actually the end of the chunk in question.

Update: I've now switched from an approach of getting the tokens exactly right to one where I greedily eat all the trailing whitespace and discard it if necessary. That makes the code dramatically simpler.

@carsten.graeser To follow up on a conversation that we had the other day: I think supporting an additional, hierarchical, standardised format is a great idea. And JSON may seem like a good contender for that. At the same time, JSON is not as standardised as one might think. The link I had in mind when I mentioned that was Parsing JSON is a Minefield. The table of parsing results pretty much says it all.

I Have the impression, that TOML is quite close to what we currently try to achieve... https://github.com/toml-lang/toml/

@christi If by that you mean "TOML would make a good candidate for an additional format one could support, just like/instead of JSON", I find it a bit too early to tell because as the spec says

Be warned, this spec is still changing a lot. Until it's marked as 1.0, you should assume that it is unstable and act accordingly.

The good news is that the authors decided to version the spec so that an issue like the underspecification of JSON could actually be fixed, should it be found to apply to TOML as well.

If you meant your comment in the sense of "TOML is very close to the ini format we currently have" I'm afraid I have to disagree. They're quite different, to the point that of the five assignments in the example file from the current parametertreetest, namely

x1 = 1 # comment
x2 = hallo
x3 = no
array = 1   2 3 4 5\t6 7 8

[Foo]
peng = ligapokal

only x1 = 1 would be valid TOML.

I still think that supporting additionally standardized format would be nice but essentially a different issue.

One thing that came up when I discussed with @pipping: Having (almost) escape-sequence free raw multi-line values would be very nice. This e.g. allows to pass small snippets of script code in a painless way. But I'd not object to a cleaned up syntax that does not support it now. This could also be added later with an additional syntax (@pipping proposed opening and closing triple-quotes).

OK, this seems to have turned from a "Patch the holes in the ParameterTreeParser" into a general "New ParameterTreeParser" discussion. Which I also discussion with @christi some time ago. I'll try to briefly summarize what we discussed:

The ParameterTreeParser currently supports two things: named values and named sections. The ParameterTree puts them into a hierarchical structure, where each section contains named values and named subsection, and the name of the section is considered a prefix to the names of its children. In principle the ParameterTreeParser and the data structure used by the ParameterTree allow for a section to have the same name as a value, but the ParameterTree artificially prohibits that (which may be important for moves to other config formats).

Most config formats out there support three things, inspired by the data types of typical scripting languages: scalars, dicts, and arrays. The scalars would correspond to the values in the 'named values', the dicts would correspond to the sections in the 'named sections', but the ParameterTree has nothing corresponding to arrays¹.

Adding support for arrays to the ParameterTree would imply that we can now have 'things' without names in that ParameterTree data structure. That suggests something like²

using Scalar = std::string;
using Array = std::vector<Data>;
using Dict = std::map<std::string, Data>;
using Data = std::variant<Scalar, Array, Dict>;

There are also other things that we may want to support, @christi mention "symlinks", and I think at least YAML supports sets. Symlinks should be easy enough to add to the data structure (though maybe not to a self-written parser), sets may be a bit more tricky, because you need to worry about uniqueness based on values, not names or indices, and there is not really a way to address parts of a subtree that is a set member. You can easily access members of an array that is part of a dict, one possible syntax would be auto value = config_cast<int>(params.get("array").get(23)). But how what could you use in place of get when params would be a set, not a dict?

Parsing a FieldVector or similar out of a named value is supported, currently, but that is not an array for the purpose of the config file format, just a particular type of scalar.
Except that you can't write it quite like that, of course, you need to have a forward declaration somewhere in there.

closed via merge request !246 (merged)

mentioned in commit 97229ab6

Prefix parameters don't mix well with comments

Designs

Child items ...

Activity