WIP: A generic string-to-number conversion function

This has problems, some of which I explained here a minute ago, but here we go for completeness:

It is locale-dependent. This is a problem because external libraries may set the locale before your quadrature rules are initialized. (We actually ran into that problem in the wild under similar circumstances, which is the reason for parametertreelocaletest).
It does no error checking. Together with the locale dependency this is a recipe for disaster. You already explained the case where 0 is the error result, but atof and friends will also happily ignore junk at the end. In the wrong locale (e.g. de), everything after a . will be silently ignored, which can lead to slightly-off-but-not-totally-wrong results.

You do not need this functionality to solve the existing problems in the quadrature rules; I have outlined three simple fixes there that should solve them.

I do not buy the performance argument for the purpose of the quadrature rules, because the quadrature rules are cached anyway, so the time needed for their initialization is set-up time.

Despite that, I think something like this would be nice to have, and it may actually solve not-yet-encountered problems with extended-precision types, such as types that have no string constructors, or only broken (e.g. locale-dependent) ones, and which we cannot fix for whatever reason.

Regarding the naming:

lexicalCast() implies locale-dependency, so should be avoided
fromString() also implies locale-dependency, due to its symmetry with to_string()

What about calling setlocale before and after the numeric conversion function, i.e.

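#include <clocale>
#include <string>

template<typename T>
T lexicalCast(const char* str)
{
  // Copy the name: the pointer returned by setlocale may be invalidated by the next call.
  const std::string oldLocale = std::setlocale(LC_NUMERIC, nullptr);
  std::setlocale(LC_NUMERIC, "C");
  T x = Impl::LexicalCast<T>::eval(str);
  // Restore the previous numeric locale before returning the result.
  std::setlocale(LC_NUMERIC, oldLocale.c_str());
  return x;
}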

I'm not sure how expensive these changes of the locale setting are.

A string constructor for, e.g., Float128 is also possible. But we may not have access to a type we want to use (i.e. we cannot add such a constructor). In that case, a non-intrusive utility would be helpful.

setlocale() changes global state. This makes it inherently non-threadsafe.

One of the reasons the parametertree uses stringstreams for that sort of thing is that we can imbue the C/POSIX/classic locale on just that stream object.
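As a minimal sketch of that point (the helper name readClassic is made up for illustration and is not the parametertree code), imbuing the classic locale on a single stream looks roughly like this:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: read a double from `s` in C/POSIX format.
// Only this stream object is affected; the global locale is left untouched.
inline bool readClassic(const std::string& s, double& value)
{
  std::istringstream stream(s);
  stream.imbue(std::locale::classic()); // per-stream locale, no global state
  stream >> value;
  return !stream.fail();                // false if no number could be read
}
```

Note that this sketch does not yet check for trailing junk; it only shows that the locale choice is local to the stream.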

The only other facility in C++14 I know of that does not require touching the global locale is num_get from <locale>, although that already gets you into virtual function call territory. Performance is probably still better than a stringstream. It should be able to replace atof and friends, but I never quite figured out how to use it, or how usable the error checking is.

OK, I have never looked into locale settings and do not really have experience with this. I just noticed that the implementation of Float128 also uses a locale-dependent conversion (it is the only one available), and the implementation of operator>> simply ignores this. Maybe there should at least be a warning if the locale is not compatible with "C" for numbers.

True, that is a problem:

#include <locale>
#include <iostream>

#include <quadmath.h>

int main() {
  std::locale::global(std::locale("de_DE.UTF8"));
  std::cout << double(strtoflt128("2.3", NULL)) << std::endl;
}

Output:

set -ex; g++ -lquadmath -o check test.cc; ./check
+ g++ -lquadmath -o check test.cc
+ ./check
2

I guess for Float128 (or other extended-precision types that aren't used that often) we can mostly ignore the locale problem. Though we should still check that the conversion consumed all non-space characters from the input, and complain loudly if it did not. That way, if someone runs into the problem at least it won't fail silently.

For the standard floating point types we should handle locales correctly, though.

Just for completeness: GMPField expects the number to be in the format of the current locale and otherwise throws an error, i.e.

std::setlocale(LC_NUMERIC, "de_DE.UTF-8");
GMPField<32> x1("0,12345"); // OK
GMPField<32> x2("0.12345"); // Exception

The proposed mpfr backend for GMPField, in contrast, accepts either the decimal point of the current locale or the period. Thus the "C" format also works for reading values, i.e.

std::setlocale(LC_NUMERIC, "de_DE.UTF-8");
mpfr::mpreal x3("0.12345"); // OK
mpfr::mpreal x4("0,12345"); // OK

The only other locale-independent string-to-number conversion is from_chars in C++17. But we cannot use it yet.

Thus, we have two options:

backporting from_chars from c++17 to c++14
Using std::stringstream as a workaround
A tricky workaround would be to replace . by the decimal point of the current locale before calling strtod(). But one can probably find a locale where this approach breaks down.
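A rough sketch of that workaround, assuming the decimal point reported by localeconv() is all that differs from the C format (the function name is made up, and there is no error checking here):

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: take a number in C format and rewrite its '.' to the
// decimal point of the currently active LC_NUMERIC locale before strtod.
inline double strtodWithLocalePoint(std::string s)
{
  // localeconv() describes the current C locale; it is not thread-safe.
  const std::string point = std::localeconv()->decimal_point;
  const std::size_t pos = s.find('.');
  if (pos != std::string::npos)
    s.replace(pos, 1, point);
  return std::strtod(s.c_str(), nullptr);
}
```

Locales with grouping characters or other surprises are not handled, which is exactly the kind of case where this may go wrong.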

The <locale> function std::num_get is used internally by stringstream for the parsing of the text.

And you cannot (automatically) use from_chars() for custom types such as GMPField. So I'd say make the conversion locale-independent on a best-effort basis:

If some locale-independent conversion method exists that we can use under the hood, we use that, otherwise
we use a locale-dependent conversion function, but make sure not to accept junk after any successfully parsed initial part.

Three things need to come together for locale-dependent conversion to be a problem:

numeric types without locale-independent conversion
a locale with a numeric format that is incompatible with the standard format set in the environment of the running program
a library (such as Python's matplotlib with Qt output) that automatically sets the locale from the environment.

This does not happen often, and if it happens there is an easy workaround: start the program with LC_NUMERIC=C set in the environment. It's just important that the problem does not go unnoticed, so we need to ensure that the conversion complains in some way if the input string does not match the expected format.

Of course, if the user or programmer really wants to use a problematic locale, there is nothing we can do. That is just unsupported.

I do not consider the alternative (really implementing the parsing ourselves down to the nitty-gritty details) an option.

Ok, I thought about two options:

Use strtol and similar functions, perform a default conversion and check whether the whole string is parsed and no error flags are set. Not really locale-independent, but gives an error if values cannot be interpreted correctly with the current locale.
Use std::stringstream like in ParameterTree::Parser<T>::parse(...) with stream.imbue(std::locale::classic()) for all std::is_arithmetic types and a fallback to a string constructor or user-defined overload otherwise. This is locale-independent, at least for standard types.
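The second option could be sketched roughly as follows. The namespace and the function names are hypothetical, and the user-defined overload part is left out; only the dispatch on std::is_arithmetic and the string-constructor fallback are shown:

```cpp
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>

namespace sketch { // hypothetical namespace, for illustration only

// Arithmetic types: parse through a stream imbued with the classic locale.
template<class T>
T fromStringImpl(const std::string& s, std::true_type)
{
  std::istringstream stream(s);
  stream.imbue(std::locale::classic());
  T value{};
  stream >> value;
  if (stream.fail())
    throw std::invalid_argument("cannot convert '" + s + "'");
  stream >> std::ws; // skip trailing whitespace
  if (!stream.eof())
    throw std::invalid_argument("trailing characters in '" + s + "'");
  return value;
}

// Everything else: fall back to a constructor taking a string.
template<class T>
T fromStringImpl(const std::string& s, std::false_type)
{
  return T(s);
}

// Entry point: tag dispatch on std::is_arithmetic<T>.
template<class T>
T fromString(const std::string& s)
{
  return fromStringImpl<T>(s, std::is_arithmetic<T>{});
}

} // namespace sketch
```

With this, sketch::fromString<double>("1.5 ") yields 1.5 regardless of the global locale, while "1,5" is rejected because the comma is left unparsed. Character types are also arithmetic and would be read as single characters, so a real implementation would probably want to treat them separately.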

The second option could also be wrapped into a backport of std::from_chars

strtol etc. should be reasonable as a backend for the built-in integer and floating point types.

pro: It accepts numbers in the C locale format regardless of currently set locale (I did not know that), and provides ways to check for trailing unparsed content and overflows.
It will silently skip initial whitespace (which is OK).
Trailing whitespace will end up as unparsed content. The frontend must either handle it explicitly, or must reject it at the cost of being inconsistent with initial whitespace handling.
con: it also accepts integers in the current locale in addition to the C locale, so strings can be accepted unexpectedly. This means someone can e.g. create configuration files in his native locale that work for him, but that will cause errors when distributed. I don't really consider this a no-no for parsing strings containing individual numbers (but it would be problematic for parsing strings containing multiple numbers, as , is a common separator and might be mistaken for part of a number).
con: needs individual implementations for each built-in type, or at least for signed, unsigned and floating. Also not a no-no.
pro: probably faster than going through the standard streams, since there's no need to chase vptr-pointers.
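A minimal sketch of such a frontend for double, assuming we want to allow trailing whitespace for consistency with the skipped leading whitespace (the function name is made up and this is not the code from the merge request):

```cpp
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: strtod with checks for unparsed content and range errors.
inline double strToDouble(const char* str)
{
  char* end = nullptr;
  errno = 0;
  const double value = std::strtod(str, &end);
  if (end == str)
    throw std::invalid_argument(std::string("no number found in '") + str + "'");
  if (errno == ERANGE)
    throw std::out_of_range(std::string("value out of range in '") + str + "'");
  // Leading whitespace was skipped by strtod; skip trailing whitespace as well.
  while (std::isspace(static_cast<unsigned char>(*end)))
    ++end;
  // Anything left over is junk and must not be ignored silently.
  if (*end != '\0')
    throw std::invalid_argument(std::string("trailing characters in '") + str + "'");
  return value;
}
```

In the C locale, "  1.0" and "1.0  " are accepted, while "1,0" and "1.0__" raise an exception instead of silently yielding 1.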

added 1 commit

7818611e - lexical cast with additional test of errors during conversion

Let me know when you consider your code ready for review.

added 1 commit

0a16b9bc - allow trailing whitespaces in lexicalCast

added 1 commit

258bb39a - Cleanup of quadmath lexical-cast conversion

I think a working version based on the strtoXX functions is now implemented and can be reviewed.

Specialization for each standard arithmetic type
Make two tests: 1. all non-space characters are consumed during conversion and 2. no out-of-range error occurred
By default: fallback to string constructor
Specialization for Float128, FieldVector<T,1>, FieldMatrix<T,1,1>
Out-of-Range check by examining the errno (set by strtoXX functions) and by comparison with numeric_limits<T>::max/min for manual number conversion
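For types without a dedicated strtoXX function, the combined range check might look like the following sketch for int. The name strToInt is hypothetical, and unlike the implementation described above this sketch does not accept trailing whitespace:

```cpp
#include <cerrno>
#include <cstdlib>
#include <limits>
#include <stdexcept>

// Hypothetical helper: strtol yields a long, so narrowing to int needs an
// explicit comparison with numeric_limits in addition to the errno check.
inline int strToInt(const char* str)
{
  char* end = nullptr;
  errno = 0;
  const long value = std::strtol(str, &end, 10);
  if (end == str || *end != '\0')
    throw std::invalid_argument("not a valid integer");
  if (errno == ERANGE
      || value < std::numeric_limits<int>::min()
      || value > std::numeric_limits<int>::max())
    throw std::out_of_range("integer out of range");
  return static_cast<int>(value);
}
```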

Example: (using locale C)

lexicalCast<double>("  1.0"); // OK
lexicalCast<double>("1.0  "); // OK
lexicalCast<double>("1,0"); // ERROR
lexicalCast<double>("1.0__"); // ERROR

Performance: slightly faster than stream-based conversion (factor ~2-3)

added 1 commit

d8791278 - removed second template parameter in LexicalCastImpl in favour of auto

I have an issue with the name lexicalCast, as that implies locale-dependent format. Your implementation will always accept C/POSIX-locale formatted numbers, but it may or may not accept numbers in the current locale. One of your proposed names was strToNumber similar to strtof -- which is kind of appropriate since your implementation now closely mirrors the behaviour of strtof. However I'd modify the name to strTo, since the target type is now implied by a template parameter -- usage is then strTo<double>("0.1"), strTo<unsigned>("42") etc.

added 1 commit

16b668bd - renamed lexicalCast to strTo
