WIP: A generic string-to-number conversion function
Summary
Add a facility to cast from a character sequence to any number type, implemented in terms of std library functions like atoi() and strtof(), but parametrized with the concrete target type for ease of use in generic code.
Motivation
There is no generic string-to-number conversion function in the std library, except for the stream-based route via std::istringstream. But the latter is quite slow compared to type-specific functions like strtof() (about a factor of 10 slower) and needs about 3 lines of code compared to a direct conversion function. Other libraries, like boost, add generic functions for string-to-number conversion that are much faster than the intrinsics. Here a very simple approach is used to get a generic converter: a helper function is specialized on the concrete number type and calls the type-specific std function.
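A minimal sketch of the approach described above, not the code of this merge request: a helper class template is specialized per number type and forwards to the matching std function. The names lexicalCast and Impl::LexicalCast are taken as placeholders; the set of supported types is an assumption, and no error checking is done at this stage.

```cpp
#include <cassert>
#include <cstdlib>

namespace Impl {
  // Primary template is only declared, so unsupported types fail at compile time.
  template <class T> struct LexicalCast;

  template <> struct LexicalCast<int> {
    static int eval(const char* str) { return std::atoi(str); }
  };
  template <> struct LexicalCast<long> {
    static long eval(const char* str) { return std::strtol(str, nullptr, 10); }
  };
  template <> struct LexicalCast<float> {
    static float eval(const char* str) { return std::strtof(str, nullptr); }
  };
  template <> struct LexicalCast<double> {
    static double eval(const char* str) { return std::strtod(str, nullptr); }
  };
  template <> struct LexicalCast<long double> {
    static long double eval(const char* str) { return std::strtold(str, nullptr); }
  };
}

// Generic front end: the caller only names the target type.
template <class T>
T lexicalCast(const char* str)
{
  assert(str != nullptr);
  return Impl::LexicalCast<T>::eval(str);
}
```

In generic code the call then reads lexicalCast<T>("0.5") for any T that has a specialization.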
Application
I have in mind the conversion of quadrature points and weights in dune-geometry from a string representation of the number, to allow high-precision data to be used.
Discussion
Locale independent implementation
When using std library functions like std::strtof, these are locale dependent, meaning that e.g. the decimal point may differ between systems, leading to different behavior of the code. Do we need locale-independent conversion, e.g. for parsing ini files or for converting stored quadrature values? Or is the user responsible for setting the correct locale in their main program?
- Locale-independent conversion is currently only possible with std::stringstream. With C++17 we may have from_chars, but there is currently no working implementation in any std library.
- Locale-dependent conversion can be done with any strtof-like function. Additionally, a check can be implemented that throws an error if the string cannot be parsed correctly in the current locale setting.
Alternatives
The simplest alternative is std::istringstream, but as explained above, it has the drawback of slow performance and is not so easy to use. A second alternative is ParameterTree::Parser<T>::parse(...), but it is a private implementation detail of the ParameterTree class and uses std::stringstream internally.
Using these std library functions has the drawback of little error safety. For atoi, for example, the default behavior is: if the converted value falls out of range of the corresponding return type, the return value is undefined; if no conversion can be performed, 0 is returned. On the other hand it gives high conversion performance, which is what is intended here.
Naming
Alternative feature names could be considered:
- lexicalCast (like boost::lexical_cast)
- cast
- toNumber (inversion of std::to_string)
- toArithmetic
- fromString
- fromChars (similar to std::from_chars but with different signature)
- parseString
- parse (like ParameterTree::Parser<T>::parse)
- strToNumber (like std::strtof)
TODO
- Replace the test that uses random numbers with a deterministic test of corner cases
Activity
This has problems, some of which I explained here a minute ago, but here we go for completeness:
- It is locale-dependent. This is a problem because external libraries may set the locale before your quadrature rules are initialized. (We actually ran into that problem in the wild under similar circumstances, which is the reason for parametertreelocaletest).
- It does no error checking. Together with the locale dependency this is a recipe for disaster. You already explained the case where 0 is the error result, but atof and friends will also happily ignore junk at the end. In the wrong locale (e.g. de), everything after a "." will be silently ignored, which can lead to slightly-off-but-not-totally-wrong results.
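To illustrate the point about junk at the end: unlike atof, the strtod family reports where parsing stopped, so trailing junk can be detected. The following is only a sketch; the function name parsesCompletely is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>

// Returns true only if strtod converted something and stopped exactly
// at the end of the string, i.e. no junk was silently ignored.
bool parsesCompletely(const char* str)
{
  assert(str != nullptr);
  char* end = nullptr;
  std::strtod(str, &end);   // the converted value is not needed here
  return end != str && *end == '\0';
}
```

With this check, an input such as "2.3" read under a locale that uses "," as decimal separator is reported as incompletely parsed, since parsing stops at the ".", instead of silently yielding 2. Trailing whitespace is also reported as unparsed in this simple form.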
You do not need this functionality to solve the existing problems in the quadrature rules, I've outlined there three simple fixes that should solve them.
I do not buy the performance argument for the purpose of the quadrature rules, because the quadrature rules are cached anyway, so the time needed for their initialization is set-up time.
Despite that, I think something like this would be nice to have, and may actually solve not-yet-encountered problems with extended-precision types, such as types that do not have string constructors or broken (e.g. locale-dependent) string constructors, and which we cannot fix for whatever reason.
Regarding the naming:
- lexicalCast() implies locale-dependency, so should be avoided
- fromString() also implies locale-dependency, due to its symmetry with to_string()
What about using setlocale before and after the numeric conversion function? I.e.

    #include <clocale>
    #include <string>

    template<typename T>
    T lexicalCast(const char* str)
    {
      // copy the name: the pointer returned by setlocale may be
      // invalidated by the next call to setlocale
      std::string oldLocale = std::setlocale(LC_NUMERIC, nullptr);
      std::setlocale(LC_NUMERIC, "C");
      T x = Impl::LexicalCast<T>::eval(str);
      std::setlocale(LC_NUMERIC, oldLocale.c_str());
      return x;
    }

I'm not sure how expensive these changes of the locale setting are.
A string constructor for, e.g., Float128, is also possible. But, maybe we do not have access to a type we want to use (i.e. we can not add such a constructor). Then, a non-intrusive utility would be helpful.
setlocale() changes global state. This makes it inherently non-threadsafe.
One of the reasons the parametertree uses stringstreams for that sort of thing is that we can imbue the C/POSIX/classic locale on just that stream object.
The only other facility in C++14 I know of that does not require touching the global locale is num_get from <locale>, though already that gets you into virtual function call territory. Performance is probably still better than a stringstream, though. It should be able to replace atof and friends, though I never quite figured out how to use it, and how usable the error checking is.
OK, I never looked at the locale setting and stuff and do not really have experience with this. I just noticed that the implementation of Float128 also uses a locale-dependent conversion (there is only this one available). And the implementation of the operator>> simply ignores this. Maybe there should at least be a warning if the locale is not compatible with "C" for numbers.
True, that is a problem:
    #include <locale>
    #include <iostream>
    #include <quadmath.h>

    int main()
    {
      std::locale::global(std::locale("de_DE.UTF8"));
      std::cout << double(strtoflt128("2.3", NULL)) << std::endl;
    }
Output:
    set -ex; g++ -lquadmath -o check test.cc; ./check
    + g++ -lquadmath -o check test.cc
    + ./check
    2
I guess for Float128 (or other extended-precision types that aren't used that often) we can mostly ignore the locale problem. Though we should still check that the conversion consumed all non-space characters from the input, and complain loudly if it did not. That way, if someone runs into the problem at least it won't fail silently.
For the standard floating point types we should handle locales correctly, though.
Just for completeness: GMPField expects the number to be in the correct format (corresponding to the current locale) and otherwise throws an error, i.e.

    std::setlocale(LC_NUMERIC, "de_DE.UTF-8");
    GMPField<32> x1("0,12345"); // OK
    GMPField<32> x2("0.12345"); // Exception
The proposed mpfr backend for GMPField, in contrast, has the additional feature that the decimal point is either the one from the current locale or the period. Thus the "C" format also works for reading the values, i.e.

    std::setlocale(LC_NUMERIC, "de_DE.UTF-8");
    mpfr::mpreal x3("0.12345"); // OK
    mpfr::mpreal x4("0,12345"); // OK
The only other locale-independent string-to-number conversion in C++17 is from_chars. But we cannot use it yet.
Thus, we have two options:
- backporting from_chars from C++17 to C++14
- using std::stringstream as a workaround
- A tricky workaround would be to replace "." by the decimal point of the current locale before calling strtod(). But one can probably find a locale where this approach fails.
The <locale> facet std::num_get is used internally by stringstream for the parsing of the text.
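For illustration only, the "tricky workaround" could be sketched as below. The function name strtodFromCFormat is hypothetical, and the sketch shares the weakness mentioned above: it relies on localeconv() reflecting the locale that strtod actually uses, and it still accepts input that is already written in the current locale's format.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Expects a number in "C" format, rewrites the first '.' to the decimal
// point of the current C locale and then lets strtod parse the result.
double strtodFromCFormat(const char* str)
{
  assert(str != nullptr);
  std::string text(str);
  const std::string point = std::localeconv()->decimal_point;
  const std::string::size_type pos = text.find('.');
  if (pos != std::string::npos)
    text.replace(pos, 1, point);

  char* end = nullptr;
  const double value = std::strtod(text.c_str(), &end);
  if (end == text.c_str() || *end != '\0')
    throw std::invalid_argument("cannot convert \"" + std::string(str) + "\"");
  return value;
}
```

Because localeconv() and the locale itself are global state, this variant is no more thread-safe than the setlocale approach.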
And you cannot (automatically) use from_chars() for custom types such as GMPField. So I'd say make the conversion locale-independent on a best-effort basis:
- If some locale-independent conversion method exists that we can use under the hood, we use that, otherwise
- we use a locale-dependent conversion function, but make sure not to accept junk after any successfully parsed initial part.
Three things need to come together for locale-dependent conversion to be a problem:
- numeric types without locale-independent conversion
- a locale with a numeric format that is incompatible with the standard format set in the environment of the running program
- a library (such as Python's matplotlib with qt output) that automatically sets the locale from the environment.
This does not happen often, and if it happens there is an easy workaround: start the program with LC_NUMERIC=C set in the environment. It's just important that the problem does not go unnoticed, so we need to ensure that the conversion complains in some way if the input string does not match the expected format.
Of course, if the user or programmer really wants to use a problematic locale, there is nothing we can do. That is just unsupported.
I do not consider the alternative (really implementing the parsing ourselves down to the nitty-gritty details) an option.
Ok, I thought about two options:
- Use strtol and similar functions, perform a default conversion and check whether the whole string is parsed and no error flags are set. Not really locale-independent, but gives an error if values cannot be interpreted correctly with the current locale.
- Use std::stringstream like in ParameterTree::Parser<T>::parse(...) with stream.imbue(std::locale::classic()) for all std::is_arithmetic types and a fallback to a string constructor or user-defined overload otherwise. This is locale-independent, at least for standard types.
The second option could also be wrapped into a backport of std::from_chars.
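A sketch of the stream-based option, assuming a hypothetical function name parseClassic; this is not the ParameterTree implementation, only the same idea of imbuing the classic locale on a local stream object so that the global locale is never touched.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

template <class T>
T parseClassic(const char* str)
{
  assert(str != nullptr);
  std::istringstream stream{std::string(str)};
  stream.imbue(std::locale::classic());   // affects only this stream object

  T value;
  stream >> value;
  if (stream.fail())
    throw std::invalid_argument("cannot convert \"" + std::string(str) + "\"");

  // Accept trailing whitespace, reject any other unparsed characters.
  if (!stream.eof() && !(stream >> std::ws).eof())
    throw std::invalid_argument("trailing characters in \"" + std::string(str) + "\"");
  return value;
}
```

The fallback to a string constructor for non-arithmetic types is left out here to keep the sketch short.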
strtol etc. should be reasonable as a backend for the built-in integer and floating point types.
- pro: It accepts numbers in the C locale format regardless of the currently set locale (I did not know that), and provides ways to check for trailing unparsed content and overflows.
- It will silently skip initial whitespace (which is OK).
- Trailing whitespace will end up as unparsed content. The frontend must either handle it explicitly, or must reject it at the cost of being inconsistent with initial whitespace handling.
- con: it also accepts integers in the current locale in addition to the C locale, so it can lead to strings being accepted unexpectedly. This means someone can e.g. create configuration files in his native locale that work for him, but that will cause errors when distributed. I don't really consider this a no-no for parsing strings containing individual numbers (but it would be problematic for parsing strings containing multiple numbers, as "," is a common separator and might be mistaken for part of a number).
- con: needs individual implementations for each built-in type, or at least for signed, unsigned and floating. Also not a no-no.
- pro: probably faster than going through the standard streams, since there's no need to chase vptr-pointers.
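The whitespace handling discussed in the list above could look roughly like the following for double. This is a sketch under assumptions: the name strToDouble and the choice of exception types are not from the merge request. The front end skips trailing whitespace explicitly so that it is treated the same way as the leading whitespace that strtod skips on its own.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <stdexcept>

double strToDouble(const char* str)
{
  assert(str != nullptr);
  char* end = nullptr;
  errno = 0;
  const double value = std::strtod(str, &end);  // skips leading whitespace itself

  if (end == str)
    throw std::invalid_argument("no number found");
  if (errno == ERANGE)
    throw std::out_of_range("number out of range");

  // Skip trailing whitespace explicitly, for symmetry with the leading side.
  while (std::isspace(static_cast<unsigned char>(*end)))
    ++end;
  if (*end != '\0')
    throw std::invalid_argument("unparsed trailing characters");
  return value;
}
```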
added 1 commit
- 7818611e - lexical cast with additional test of errors during conversion
I think a working version with strtoXX functions is now implemented and can be reviewed.
- Specialization for each standard arithmetic type
- Make two tests: 1. all non-space characters are consumed during conversion and 2. no out-of-range error occurred
- By default: fallback to string constructor
- Specialization for Float128, FieldVector<T,1>, FieldMatrix<T,1,1>
- Out-of-range check by examining errno (set by strtoXX functions) and by comparison with numeric_limits<T>::max/min for manual number conversion
Example: (using locale C)
    lexicalCast<double>(" 1.0");   // OK
    lexicalCast<double>("1.0 ");   // OK
    lexicalCast<double>("1,0");    // ERROR
    lexicalCast<double>("1.0__");  // ERROR
Performance: slightly faster than stream-based conversion (factor ~2-3)
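The two-stage out-of-range check mentioned in the list above (errno from strtoXX plus a comparison against numeric_limits for types that have no strtoXX function of their own) might be sketched as follows. The name strToSigned is hypothetical and the code is not taken from the commit.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <limits>
#include <stdexcept>
#include <type_traits>

// strtol yields long; for narrower signed types the range has to be
// checked by hand after the conversion.
template <class T>
T strToSigned(const char* str)
{
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value
                && sizeof(T) <= sizeof(long),
                "only signed integer types not wider than long");
  assert(str != nullptr);

  char* end = nullptr;
  errno = 0;
  const long value = std::strtol(str, &end, 10);

  if (end == str || *end != '\0')
    throw std::invalid_argument("not a complete integer");
  if (errno == ERANGE)                       // does not even fit into long
    throw std::out_of_range("value does not fit into long");
  if (value < std::numeric_limits<T>::min() || value > std::numeric_limits<T>::max())
    throw std::out_of_range("value does not fit into target type");
  return static_cast<T>(value);
}
```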
Edited by Simon Praetorius
added 1 commit
- d8791278 - removed second template parameter in LexicalCastImpl in favour of auto
- I have an issue with the name lexicalCast, as that implies a locale-dependent format. Your implementation will always accept C/POSIX-locale formatted numbers, but it may or may not accept numbers in the current locale. One of your proposed names was strToNumber, similar to strtof, which is kind of appropriate since your implementation now closely mirrors the behaviour of strtof. However, I'd modify the name to strTo, since the target type is now implied by a template parameter. Usage is then strTo<double>("0.1"), strTo<unsigned>("42") etc.
- Resolved by Simon Praetorius