[parametertree] Separate parser and parametertree

mentioned in merge request !580 (closed)

I disagree with the notion that the interpretation of the value strings is part of the parametertree (or as you put in in #137, "that one particular kind of values is also an array" in the status quo). The type of a value is entirely up to the user of the parametertree. The parametertree provides a few convenience conversions, some of them being to an array or vector, but the user is free to roll his own conversion if the convenience conversions don't suit him.

That is important to understand to not break backward compatibility more than we intend. This applies in particular to "properly define the syntax for the different data types in the ini-format and the argv-format". A syntax would mean you now have to start your program for instance with --filename.template='"000000"' (i.e. extra quoting) in order to not have the template interpreted as a number, where you are unable to recover the number of digits. Or you now have to insert quotes around values in all your ini files that happen to contain strings resembling numbers, or have internal space, or happen to be surrounded by some kind of parenthesis.

It is perfectly possible to support both parsers with and without type information. Just remember the type information when it is available, and remember that you have an untyped string when type information is not available.

Also, it is perfectly possible to implement e.g. a json parser for the current data structure. It would have some limitation and would need to make some assumptions, of course, but it is nevertheless possible.

To be clear: I still welcome this proposal in principle, because it lifts the limitation on representable recursive data structures, and also because of reasons that @christi did not mention:

There can be an opaque data type (different from untyped values, and different from typed values). This is useful for passing e.g. numpy arrays from python to C++ as part of a larger parametertree without serializing a pointer to a string and parsing it back.
Type information, when available, can be used to detect errors.

Indeed you have to be very precise by what you mean by parsing. Regarding the data structure and its parsing helpers we have:

The parameter tree is a simple hierarchic key-value store where values are always of the type std::string.
There's no parsing of files in the data structure ParameterTree.
Parsing of input data (ini-like file, options, ...) is provided by ParameterTreeParser which fills the tree. It's purpose is to translate the raw input into a ParameterTree. Hence this does not know about numbers and array. For convenience this supports trimming, quoting (to avoid trimming), and comments.
The only way to access the values in a tree is tree[key].

Additionally there's some utility functionality (for confusion implemented in ParameterTree::Parser) that support the user to query the std::string value and interpret it as T in a single pass using tree.get<T>(key). Notice that you can interpret the same value in many ways. However, all of these are conversions that don't change the actual value.

If you ini-file contains a line

a = " 000 "
b = 000

the parser does the string processing to derive the two different string values ' 000 ' (escaping needed due to markdown) and '000' for a and b as you can observe using operator[]. Now you can start to interpret these strings as various types using get<T>(key). When interpreting these as numbers, you get trimming and 000 is condensed to a single value. Interpreting as std::strings keeps the 000 but also trims such that tree["a"]!=tree["b"] while tree["a"]!=tree.get<std::string>("a") and tree.get<std::string>("a")==tree.get<std::string>("b").

If you intend to change all of this it first means that we need to change the stored value type e.g. to some variant. It will be very hard to integrate these into operator[] and/or get without breaking compatibility. Also assigning values would easily lead to confusion. Suppose you implement some new method for setting values that converts integer type values to the internally supported integer type (if possible). Then we would get:

char c = 49;
tree.setInt("a",c);
tree.getInt("a"); // this should be 49
tree["b"] = c;
tree.get<int>("b"); // this is 1

In principle think that a key-value store supporting more types is a good idea. But trying to integrate this into the current ParameterTree such that it plays nicely with the current behaviour is IMHO close to impossible and we'd better use a separate class for this.

But before designing a new data structure, inventing a new parameter file syntax, and implementing a corresponding parser I'd strongly suggest to have a look for existing implementations. We already reinvented the wheel once by not using Boost::PropertyTree.

So I would modify the suggestions as follows:

Remove possibility to have subtrees of the same name as keys (this is currently enforced when creating a subtree, but not when creating a key)
Nodes in the tree should be able to hold different data types:
- what json supports:
  - objects (a.k.a. subtrees)
  - array (of nodes, a.k.a. arbitrarily typed members)
  - strings
  - numbers (json makes no distinction between floating-point and integers...)
  - booleans
  - a null value
- untyped strings (what we currently support)
- maybe opaque data, possibly dynamically preserving type information (along the lines of std::any)
Keep the value parser in the ParameterTree, but only use it for untyped strings. (File parsers with type information will have their own convention of value representation anyway)
Mess with the syntax of ini files as little as possible. That family of file formats has historically no support for typed values, don't surprise users by introducing them.
The argv parser on the other hand may benefit from supporting some form of inline json. Though only in addition to untyped arguments. It could also benefit from the ability to specify array elements in paths in addition of "object" elements.

@joe's suggestion tries to stay as clean as possible. But it feels like proving a new implementation next to the old one inside of a single class. This would lead to the ambiguities outlined above. Starting from scratch in a new class will reduce confusion significantly.

If we adopt the JSON types we're in a similar situation as before. Either we extend the list by 'integer' dropping JSON compatibility or we always store floating point numbers and provide helpers to interpret them as integer as we currently do with our string values.

I'm surprised to learn we trim in pt.get<std::string>()... And I'm the one that apparently implemented that. What was I thinking...

There is one reason why @christi wants to avoid multiple parameter tree implementations living side-by-side: The effort to make Dune objects such as solvers, preconditioners, etc. configurable by constructing them with a parameter tree of some sort. Supporting multiple implementations would be quite the effort...

Why do you have to support both versions? I suspect that the feature to interpret a single value as different types is rarely used on purpose. Hence a new implementation supporting separate types could very well replace the old one in the long run. And you new configuration interface can be based on the old one.

If this is really needed, one could implement a conversion that parses a ParameterTree into the new structure by interpreting the string values like a file parser would do.

@carsten.graeser is right that supporting the current operator[] interface would be a problem.

on const ParameterTree, you'd need to convert the value to a std::string, which is doable (at least for those value types we care about).
on mutable ParameterTree, you'd need to return a proxy, or a reference to the internal node object.
- assignment to that proxy or node object will create an untyped value.
- However, implicit conversion of that proxy or node object interacts badly with auto, see autoCopy()

@carsten.graeser The argv parser would become inconvenient to use (by the invoker of the program) if it is forced to supply type information, so

either the data structure has to support untyped data
or the programmer needs to supply the argv parser with some sort of schema (same for any other parser that has no type information).

A schema would be difficult to construct, as it can depend on a lot of things. Different solvers may differ in their configuration options. The user may have inactive configuration sections. When considering the internal references proposal, one subtree may be defined in a location not expected by the schema to be referenced by other sections.

A schema would also be needed to convert an untyped ParameterTree in it's current form into a fully typed data structure.

I'm sorry but introducing proxy types and whatever tricks to force the new implementation into the old one is really not the way to go. Instead of being guided by the old implementation and by what could be integrated with this, we should do a new clear design from scratch. If this can be integrated into the old implementation painlessly (which I doubt for the given reasons) that's fine. Otherwise let's stay clean and forget about the old implementation.

For the record: Boost::PropertyTree allows for user-provided value types. Combining this approach with variant and some helper functions may lead to a very clear, flexible, and extensible design. In the simplest case you could do:

// FancyValue is variant<std::string, int, ...> + syntactic candy
template<class Value=FancyValue>
class FancyTreetree;

FancyTree tree;

tree["a"] = std::string("0");

tree["a"].get<std::string>(); // OK
tree["a"].get<int>();         // exception 
tree["a"].interpret<int>();   // OK, convert to int

tree["a"] = int(0);

tree["a"].get<std::string>();       // exception
tree["a"].get<int>();               // OK
tree["a"].interpret<std::string>(); // OK, convert to string

argv has nothing to do with the data structure, but with utility functions that allow to parse it mapping to the data structure we decided on.

But if you decide on a data structure with only fully typed values, your utility functions need to have a schema to be able to convert the output of the argv parser into that data structure.

Again, the data structure should not know the parser. It's the other way around. For sure we need to define how the parser maps to the data structure. But that should not be part of the data structure itself.

I don't follow? Are we still discussing whether the data structure should support untyped values? I don't see how such support implies the data structure knows anything about the parser.

I did not discuss the question of untyped data at all. I just argued that the fact that a parser needs to be provided a schema to interpret raw data is unrelated to the data structure itself and that we should first define the latter.

Well, but the parser only needs to interpret raw (or untyped) data if the data structure cannot transport untyped data. And whether or not the data structure supports transporting untyped data is very much a matter of how we define it.

So by defining our data structure we make requirements on our parser. And those requirements can mean that certain parsers become so inconvenient to implement or to use that they become useless. I don't see how we could discuss those aspects of parsers and data structure in isolation.

Looking at Boost Property Tree:

Nodes have a value and a list of key/value children. Children are ordered and keys can appear multiple times.

The ptree convenience type supports string values only. To support typed values we could use basic_ptree<std::string, FancyValue> where we implement FancyValue.

There is a json parser, but that converts any scalar into a string. Arrays are represented by children with empty keys.

That's not far from our own implementation. Attached you can find a prototype FancyTree (no, this name is not intended to stay) that I quickly hacked based on ParameterTree: fancytree.hh It basically does:

Switch value type from std::string to a template parameter with some variant<...> as default.
Drop the get<...> conversion mechanism
Introduce a new member function tree.get<T>(key) for convenience which essentially forwards to the more idiomatic but more verbose Std::get<T>(tree[key]).
The current conversion mechanism could be kept separately if it's really needed.

This was straight forward and seems to work as is. I like that this is extensible and avoids home-grown type-erasure in favour of a variant based solution. @christi: Does this go in the direction you had in mind?

While doing this I fixed many rough edges of the current ParameterTree:

Avoid the extra vectors storing the keys. We can simply transform the map using transformedRange()
Checking for keys and the get/sub methods where implemented in a very inefficiently and confusing way.
The handling of concurrent value and subtree was strange. There was no rigorous protection from creating a value and a subtree with the same key. On the other hand, asking if a value with given key exists leads to an exception, if there's a subtree instead of simply returning false

I could port most of these fixes to ParameterTree, but I'd not like to mess up with the current WIP MR.

@carsten.graeser your FancyTree goes in a similar direction as my own experiments. There are a few differences...

I didn't introduce a new function name to interpret the data in a different way. Currently this is always part of the get method and that was the way I was following.
I'm quite convinced that a slightly different get semantic (like the interpret in FancyTree) would work well with existing code, but surely we can run into some corner cases. The thing is that we might provide this fallback for a transition period.
I don't get @joe's concern regarding the argv parser. If the tree hold properly typed data all value parsing/interpretation has to be done in the argv-parser and that parser then has to decide which strings to interpret as a float and which not. This is then part of the definition of our parser.

I'm currently not sure about the need for a separate interpret. As we can't be sure that the user writes his double value properly, w will always use interpret instead of get, which means we can just drop get and the interpret is the new get:

[solver1]
...
damping = 1

[solver2]
...
damping = 0.7

just as a simple example.

In my prototype interpret is not implemented. I'm not convinced that always interpreting the value is a good idea. In this case I hardly see a change to the current ParameterTree. If the retrieval method converts to the desired type anyway, then the stored type is just a detail and using std::string is perfectly fine.

I'd rather say that we only need access to the real value and should drop the conversion. If you want a double, then store a double. This also avoids some nasty problems with the interpretation mentioned earlier.

Regarding your example configuration: We'd surely need to define a syntax which allows to be precise. But interpreting 1 as integer type while 1.0 is interpreted as floating point should not surprise anyone using C++.

interpreting 1.0 as double and 1 as int is not surprising, but also C++ allows me to implicitly convert 1 to 1.0, which the get method in my prototype does and which is in some way currently done by quite a few people in the ini files I've seen.

That was the reason to include the conversion if and only if C++ supports an implicit conversion and which is close to current use cases.

Surely we can decide to drop that feature, but that is then yet an other discussion.

Regarding the "no benefit" arguement. The mapping from string to value is currently part of the ParameterTree which imho is not a good choice. The obvious example which fails are vectors. We write

foo = 1 2 3 4

and read this as

auto f = pt.template get<std::array<int,4>>("foo")

No this requires the value to be stored as "1 2 3 4". Most file formats will have some other representation, like

foo = 1, 2, 3, 4

or

foo = (1, 2, 3, 4)

This is the first reason why I consider including this interpretation as part of the parsing and thus want to move it to the parser.

OK, only doing implicit conversions is a different thing. A drawback to this is, that get then has to return a temporary instead of a reference. Hence you cannot modify (not reassign) values like, tree.get<int>("i")++. But maybe that's OK, because you can still use Std::get<T>(tree["i"]) for this.

Here's a modification of my prototype allowing for implicit conversions. You mainly have to guard you visitor using std::enable_if<std::is_convertible<...>> and provide a throwing fallback overload.

fancytree.hh

The implicit conversion will unfortunately not help a lot with arrays since you cannot implicitly convert std::vector to std::array, Dune::FieldVector, or whatever you want to fill. Maybe this needs another utility.

That is true. My interpretation there was that we can fill vectors (vectors are more at the abstraction level of what a tree has internally) and cast the value type of the vector, if it is supported.

Regarding modification...

The get method was no option for modification, as it internally did the conversion.
I think in most setups we didn't modify the tree anyway and
I was considering the operator[] as some internal interface to setup. But you are right. Some users might want to do so. In this case we introduce a new way to set the values. With the get we offer an interface which is more in the spirit of a getter-setter interface and we might introduce explicit set methods, using only the fundamental data-types used in the variant.

This surely still leave us with the transition problem.

Perhaps we can just have a quick poll at on the mailing-list how many people are actually using the operator[] to set values?!

For the record: I do.

I totally missed the snippet @christi mentioned in the MR. Here's the link to Christians prototype of variant-like values: https://gitlab.dune-project.org/snippets/42

Both approaches are very similar.

Here's a version that also allows to convert containers. You can e.g. convert a std::vector<double> to a Dune::FieldVector<double,dim>. This is purely concept based and no types are fixed. The convertVariant() helper and the concepts could also be factored out.

@carsten.graeser I couldn't find your version that works with containers. Did you forget to upload?

@christi I'm sorry. Here it is:fancytree.hh. If the stored value cannot be converted implicitly to the requested type T, then two types of container conversion are attempted. The first one (for dynamically sized containers) uses T t; std::copy(begin,end,std::back_inserter(t)); while the other one (for statically sized containers) uses T t; std::copy(begin,end,t.begin(), t.end());

I would like to bring this topic up again. As @carsten.graeser mentioned above it is probably a good idea to check for existing implementations. I recently stumbled across yaml-cpp that seem to be like something we would look for. Yaml is a broadly used, well specified format (yaml.org) and supports aliasing as discussed in #137. yaml-cpp also supports to define custom types by specializing conversion classes, so we can have something like

YAML::Node node = YAML::Load("{vec: [3.14159, 0, 1]}");
FieldVector<double, 3> = node["vec"].as<FieldVector<double, 3>();

I think it would also be possible to convert ParameterTrees into YAML::Nodes to have a smooth transition.

I would have some questions regarding yaml-cpp...

is it (with reasonable effort) possible to implement your own readers and feed the YAML representation?
does yaml-cpp support the yaml anchor/reference syntax?
would it be possible to store large opaque objects in the tree?
is necessary to support our old file format, the python integration (!1080). Refrerence would be helpful to easy the definition of solvers in a parameter tree and 3. is something we use in duneuro (with some additional helper code) and it would be nice to have all feature supported by a single dune-common implementation.

mentioned in merge request !249

changed milestone to %ParameterTree overhault

[parametertree] Separate parser and parametertree

Designs

Child items 0

Activity