
WIP: A generic string-to-number conversion function

Summary

Add a facility to cast from a character sequence to any number type, implemented in terms of std library functions like atoi() and strtof() but parametrized with the concrete target type for ease of use in generic code.

Motivation

There is no generic string-to-number conversion function in the std library, except for going through std::stringstream. But the latter is quite slow compared to type-specific functions like strtof() (about a factor of 10 slower) and needs about 3 lines of code compared to a direct conversion function. Other libraries, like boost, add generic string-to-number conversion functions that are much faster than the intrinsics. Here a very simple approach is used to get a generic converter, by simply specializing a helper function on the concrete number type and calling the type-specific std function.
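The approach described above might look roughly like the following sketch. The names `StrToImpl` and `strToSketch` are hypothetical and not taken from this merge request, and error handling is deliberately left out here.

```cpp
#include <cstdlib>

// Hypothetical helper: one specialization per target type,
// each forwarding to the matching type-specific std function.
template <class T>
struct StrToImpl;  // primary template left undefined on purpose

template <>
struct StrToImpl<float> {
  static float eval(const char* str) { return std::strtof(str, nullptr); }
};

template <>
struct StrToImpl<double> {
  static double eval(const char* str) { return std::strtod(str, nullptr); }
};

template <>
struct StrToImpl<long> {
  static long eval(const char* str) { return std::strtol(str, nullptr, 10); }
};

// Generic entry point, usable in templated code as strToSketch<T>(s).
template <class T>
T strToSketch(const char* str) {
  return StrToImpl<T>::eval(str);
}
```

In generic code the target type is then just a template argument, e.g. `strToSketch<double>("0.5")`. A type without a specialization fails at compile time because the primary template is undefined.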

Application

I have in mind the conversion of quadrature points and weights in dune-geometry from a string representation of the number, to allow high-precision data to be used.

Discussion

Locale independent implementation

When using std library functions like std::strtof, the result is locale dependent, meaning that e.g. the decimal separator may differ between systems, leading to different behavior of the code. Do we need locale-independent conversion, e.g. for parsing ini files or for converting stored quadrature values? Or is the user responsible for setting the correct locale in their main program?

  • Locale independent conversion is currently only possible with std::stringstream. With C++17 we may have from_chars, but there is currently no working implementation in any std library.
  • Locale dependent implementation can be done with any strtof-like function. Additionally, a check can be implemented that throws an error if the string cannot be parsed correctly in the current locale setting.

Alternatives

The simplest alternative is std::stringstream, but as explained above, it is slow and not so easy to use. A second alternative is ParameterTree::Parser<T>::parse(...), but it is a private implementation detail of the ParameterTree class and uses std::stringstream internally.

The usage of these std library functions has the drawback of little error safety, e.g. for atoi the default behavior is: If the converted value falls out of range of the corresponding return type, the return value is undefined. If no conversion can be performed, 0 is returned. On the other hand, it gives high conversion performance, which is the intent here.

Naming

Alternative feature names could be considered:

  • lexicalCast (like boost::lexical_cast)
  • cast
  • toNumber (inversion of std::to_string)
  • toArithmetic
  • fromString
  • fromChars (similar to std::from_chars but with different signature)
  • parseString
  • parse (like ParameterTree::Parser<T>::parse)
  • strToNumber (like std::strtof)

TODO

  • Replace test that uses random number, with a deterministic test of corner cases
Edited by Christian Engwer

Activity

  • This has problems, some of which I explained here a minute ago, but here we go for completeness:

    • It is locale-dependent. This is a problem because external libraries may set the locale before your quadrature rules are initialized. (We actually ran into that problem in the wild under similar circumstances, which is the reason for parametertreelocaletest).
    • It does no error checking. Together with the locale dependency this is a recipe for disaster. You already explained the case where 0 is the error-result, but atof and friends will also happily ignore junk at the end. In the wrong locale (i.e. de), everything after a . will be silently ignored, which can lead to slightly-off-but-not-totally-wrong results.

    You do not need this functionality to solve the existing problems in the quadrature rules, I've outlined there three simple fixes that should solve them.

    I do not buy the performance argument for the purpose of the quadrature rules, because the quadrature rules are cached anyway, so the time needed for their initialization is set-up time.

    Despite that, I think something like this would be nice to have, and may actually solve not-yet-encountered problems with extended-precision types, such as types that do not have string constructors or broken (e.g. locale-dependent) string constructors, and which we cannot fix for whatever reason.

    Regarding the naming:

    • lexicalCast() implies locale-dependency, so should be avoided
    • fromString() also implies locale-dependency, due to its symmetry with to_string()
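  • What about calling setlocale before and after the numeric conversion function, i.e.

    #include <clocale>
    #include <string>

    template<typename T>
    T lexicalCast(const char* str)
    {
      // copy the name: the pointer returned by setlocale may be invalidated by the next call
      const std::string oldLocale = std::setlocale(LC_NUMERIC, nullptr);
      std::setlocale(LC_NUMERIC, "C");
      T x = Impl::LexicalCast<T>::eval(str);
      std::setlocale(LC_NUMERIC, oldLocale.c_str());
      return x;
    }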

    I'm not sure how performance-expensive these changes of the locale setting are.

    A string constructor for, e.g., Float128, is also possible. But, maybe we do not have access to a type we want to use (i.e. we can not add such a constructor). Then, a non-intrusive utility would be helpful.

  • setlocale() changes global state. This makes it inherently non-threadsafe.

    One of the reasons the parametertree uses stringstreams for that sort of thing is that we can imbue the C/POSIX/classic locale on just that stream object.

    The only other facility in C++14 I know of that does not require touching the global locale is num_get from <locale>, though already that gets you into virtual function call territory. Performance is probably still better than a stringstream, though. It should be able to replace atof and friends, though I never quite figured out how to use it, and how usable the error checking is.

  • OK, I never looked at the locale setting and related things and do not really have experience with this. I just noticed that the implementation of Float128 also uses a locale-dependent conversion (there is only this one available), and the implementation of operator>> simply ignores this. Maybe there should at least be a warning if the locale is not compatible with "C" for numbers.

  • True, that is a problem:

    #include <locale>
    #include <iostream>
    
    #include <quadmath.h>
    
    int main() {
      std::locale::global(std::locale("de_DE.UTF8"));
      std::cout << double(strtoflt128("2.3", NULL)) << std::endl;
    }

    Output:

    set -ex; g++ -lquadmath -o check test.cc; ./check
    + g++ -lquadmath -o check test.cc
    + ./check
    2
  • I guess for Float128 (or other extended-precision types that aren't used that often) we can mostly ignore the locale problem. Though we should still check that the conversion consumed all non-space characters from the input, and complain loudly if it did not. That way, if someone runs into the problem at least it won't fail silently.
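A check of this kind could be sketched as follows for double. The function name `parseDoubleChecked` and the choice of exception type are assumptions made for illustration, not the code of this merge request.

```cpp
#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a double and throw if anything but
// whitespace remains after the parsed number.
double parseDoubleChecked(const char* str) {
  char* end = nullptr;
  const double value = std::strtod(str, &end);  // skips leading whitespace itself
  if (end == str)
    throw std::invalid_argument(std::string("no number found in '") + str + "'");
  // skip trailing whitespace, then require the end of the string
  while (*end != '\0' && std::isspace(static_cast<unsigned char>(*end)))
    ++end;
  if (*end != '\0')
    throw std::invalid_argument(std::string("unparsed characters in '") + str + "'");
  return value;
}
```

With a locale whose decimal separator is a comma, the input "2.3" from the example above would stop after the 2 and leave ".3" unparsed, so the function would throw instead of silently returning 2.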

    For the standard floating point types we should handle locales correctly, though

  • Just for completeness: GMPField expects the number to be in the correct format (corresponding to the current locale) otherwise throws an error, i.e.

    std::setlocale(LC_NUMERIC, "de_DE.UTF-8");
    GMPField<32> x1("0,12345"); // OK
    GMPField<32> x2("0.12345"); // Exception

    While the proposed mpfr backend for GMPField has the additional feature that the decimal point is either the one from the current locale or the period. Thus also the "C" format works for reading the values, i.e.

    std::setlocale(LC_NUMERIC, "de_DE.UTF-8");
    mpfr::mpreal x3("0.12345"); // OK
    mpfr::mpreal x4("0,12345"); // OK

    The only other locale-independent string-to-number conversion in C++17 is from_chars. But we cannot use it yet.

    Thus, we have three options:

    1. Backporting from_chars from C++17 to C++14
    2. Using std::stringstream as a workaround
    3. A tricky workaround would be to replace . by the decimal separator of the current locale before calling strtod(). But one can probably find a locale where this approach breaks.

    The <locale> function std::num_get is used internally by stringstream for the parsing of the text.

  • And you cannot (automatically) use from_chars() for custom types such as GMPField. So I'd say make the conversion locale-independent on a best-effort basis:

    • If some locale-independent conversion method exists that we can use under the hood, we use that, otherwise
    • we use a locale-dependent conversion function, but make sure not to accept junk after any successfully parsed initial part.

    Three things need to come together for locale-dependent conversion to be a problem:

    • numeric types without locale-independent conversion
    • a locale with a numeric format that is incompatible with the standard format set in the environment of the running program
    • a library (such as Python's matplotlib with Qt output) that automatically sets the locale from the environment.

    This does not happen often, and if it happens there is an easy workaround: start the program with LC_NUMERIC=C set in the environment. It's just important that the problem does not go unnoticed, so we need to ensure that the conversion complains in some way if the input string does not match the expected format.

    Of course, if the user or programmer really wants to use a problematic locale, there is nothing we can do. That is just unsupported.

    I do not consider the alternative (really implementing the parsing ourselves down to the nitty-gritty details) an option.

  • Ok, I thought about two options:

    1. Use strtol and similar functions, perform a default conversion and check whether the whole string is parsed and no error flags are set. Not really locale-independent, but gives an error if values cannot be interpreted correctly with the current locale.
    2. Use std::stringstream like in ParameterTree::Parser<T>::parse(...) with stream.imbue(std::locale::classic()) for all std::is_arithmetic types and a fallback to a string constructor or user-defined overload otherwise. This is locale-independent, at least for standard types.

    The second option could also be wrapped into a backport of std::from_chars
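A minimal sketch of the second option, restricted to arithmetic types, might look like this. The name `parseClassic` is hypothetical, and the fallback to a string constructor or user-defined overload is not shown.

```cpp
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical name; sketch of the stream-based option for arithmetic types only.
template <class T>
T parseClassic(const std::string& str) {
  static_assert(std::is_arithmetic<T>::value,
                "this sketch covers arithmetic types only");
  std::istringstream stream(str);
  stream.imbue(std::locale::classic());  // affects this stream only, not global state
  T value{};
  stream >> value;
  if (stream.fail())
    throw std::invalid_argument("cannot parse '" + str + "'");
  stream >> std::ws;  // allow trailing whitespace
  if (!stream.eof())
    throw std::invalid_argument("trailing characters in '" + str + "'");
  return value;
}
```

Because the classic locale is imbued on the local stream object, the global locale is neither read nor modified. Note that for character types operator>> extracts a single character rather than a number, so a real implementation would need to treat those separately.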

  • strtol etc. should be reasonable as a backend for the built-in integer and floating point types.

    • pro: It accepts numbers in the C locale format regardless of currently set locale (I did not know that), and provides ways to check for trailing unparsed content and overflows.
    • It will silently skip initial whitespace (which is OK).
    • Trailing whitespace will end up as unparsed content. The frontend must either handle it explicitly, or must reject it at the cost of being inconsistent with initial whitespace handling.
    • con: it also accepts integers in the current locale in addition to the C locale, so it can lead to strings being accepted unexpectedly. This means someone can e.g. create configuration files in their native locale that work for them, but that will cause errors when distributed. I don't really consider this a no-no for parsing strings containing individual numbers (but it would be problematic for parsing strings containing multiple numbers, as , is a common separator and might be mistaken for part of a number).
    • con: needs individual implementations for each built-in type, or at least for signed, unsigned and floating. Also not a no-no.
    • pro: probably faster than going through the standard streams, since there's no need to chase vptr-pointers.
  • added 1 commit

    • 7818611e - lexical cast with additional test of errors during conversion


  • Let me know when you consider your code ready for review.

  • added 1 commit

    • 0a16b9bc - allow trailing whitespaces in lexicalCast


  • added 1 commit

    • 258bb39a - Cleanup of quadmath lexical-cast conversion


  • I think, now a working version with strtoXX functions is implemented and can be reviewed.

    • Specialization for each standard arithmetic type
    • Make two tests: 1. all non-space characters are consumed during conversion and 2. no out-of-range error occurred
    • By default: fallback to string constructor
    • Specialization for Float128, FieldVector<T,1>, FieldMatrix<T,1,1>
    • Out-of-Range check by examining the errno (set by strtoXX functions) and by comparison with numeric_limits<T>::max/min for manual number conversion

    Example: (using locale C)

    lexicalCast<double>("  1.0"); // OK
    lexicalCast<double>("1.0  "); // OK
    lexicalCast<double>("1,0"); // ERROR
    lexicalCast<double>("1.0__"); // ERROR

    Performance: slightly faster than stream-based conversion (factor ~2-3)
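The out-of-range check for a type without its own strtoXX function could be sketched like this for int. The function name `parseIntChecked` and the exception types are assumptions made for illustration. The check for trailing characters is omitted to keep the sketch short.

```cpp
#include <cerrno>
#include <cstdlib>
#include <limits>
#include <stdexcept>

// Hypothetical helper: convert to int via strtol, since there is no strtoi.
int parseIntChecked(const char* str) {
  char* end = nullptr;
  errno = 0;  // strtol reports overflow of long by setting errno to ERANGE
  const long value = std::strtol(str, &end, 10);
  if (end == str)
    throw std::invalid_argument("no digits found");
  if (errno == ERANGE)
    throw std::out_of_range("value does not fit into long");
  // manual narrowing check, since long may be wider than int
  if (value < std::numeric_limits<int>::min() ||
      value > std::numeric_limits<int>::max())
    throw std::out_of_range("value does not fit into int");
  return static_cast<int>(value);
}
```

Resetting errno before the call matters, because strtol only sets it on failure and never clears a stale value left by an earlier call.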

    Edited by Simon Praetorius
  • added 1 commit

    • d8791278 - removed second template parameter in LexicalCastImpl in favour of auto


    1. I have an issue with the name lexicalCast, as that implies locale-dependent format. Your implementation will always accept C/POSIX-locale formatted numbers, but it may or may not accept numbers in the current locale. One of your proposed names was strToNumber similar to strtof -- which is kind of appropriate since your implementation now closely mirrors the behaviour of strtof. However I'd modify the name to strTo, since the target type is now implied by a template parameter -- usage is then strTo<double>("0.1"), strTo<unsigned>("42") etc.
  • Jö Fahlke
  • added 1 commit

    • 16b668bd - renamed lexicalCast to strTo


  • Jö Fahlke