Would you also be opposed to an optional argument?
bignum = int("1e23", scientific=True)
Yes, it would break that code. I'm pretty surprised anyone would do that, but what do I know?
as for:
so int('1e23') would not be equal to float('1e23')
In [13]: int(1e23)
Out[13]: 99999999999999991611392
That's actually one of the points of this whole idea: what if you want the correct integer value for 1e23? It's why just doing int(float(a_string)) isn't a good solution.
Anyway, it seems there's no chance of making this change without a flag, but if any of the core devs think it's worth pursuing with a flag, I can write the idea out more formally.
How else are people currently supposed to convert numeric strings into ints for preference and fall back on floats if the string cannot be interpreted as an int?
And yes, I've done that. It's an easy way to accept string input and have "123" convert to an int and "123.0" or "123e0" convert to a float.
if you want the correct integer value for 1e23
Since 1e23 is syntax for a Binary64 base 2 float, the correct value for it is 99999999999999991611392.0.
If you expect differently, then maybe you should be working with Decimal instead of float? Just a thought to chew on.
We should keep in mind the difference between low level and high level APIs. Low level APIs should be strict. High level APIs can afford to be a bit more flexible. int() is a low level API which is moderately strict in what it accepts:
- it ignores leading and trailing whitespace;
- it accepts underscores as grouping characters;
- it also accepts non-ASCII digits (but maybe it shouldn't? too late to change it now!);
- otherwise it requires exactly the same format as int literals.
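A quick illustration of those rules (behaviour of CPython 3.6+, where underscore grouping was added; the examples are mine, not from the thread):

```python
# What int() currently accepts and rejects for strings:
assert int("  42  ") == 42                # leading/trailing whitespace ignored
assert int("1_000_000") == 1000000        # underscores as grouping characters
assert int("\u0661\u0662\u0663") == 123   # ARABIC-INDIC digits one, two, three

# Scientific notation is rejected, same as in an int literal:
try:
    int("1e3")
except ValueError:
    pass
```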
I propose instead that people come up with their own high level parse_int() function that can be as clever as they like in what it accepts.
- Want to accept a leading unicode MINUS SIGN? Sure, why not?
- Want to enforce better/stricter handling of non-ASCII digits? Go for it.
- Want to treat "123450000e-3" as the exact int 123450? Sure thing!
- Want to support other versions of scientific notation? Knock yourself out.
I've seen versions of parse_int that accept "O" as a misspelled digit 0, and "l" as 1. I've seen versions that truncate the string at the first non-digit instead of raising an error, or that accept a string suffix coding what base to use, e.g. "1022#3" for 35 in ternary.
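One possible sketch of such a high level parser, using Decimal so that strings like "123450000e-3" round-trip exactly (the name parse_int and its exact behaviour are illustrative, not a proposal):

```python
from decimal import Decimal, InvalidOperation

def parse_int(s):
    """Parse a string as an exact integer, accepting scientific notation
    as long as the value is exactly integral."""
    try:
        value = Decimal(s)
    except InvalidOperation:
        raise ValueError(f"not a number: {s!r}")
    if value != value.to_integral_value():
        raise ValueError(f"not an integer: {s!r}")
    return int(value)
```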
The point is, there are many options here. Design your own high level parse_int function, use it, find out what works and what doesn't, prove that it is useful, and then maybe it can be added to the stdlib or even turned into a builtin. Don't mess with the low level int() API, which is already a bit too clever for my liking:
>>> int("\u0669\u07C2\u09EB\u0B67\u0BED\u0CEA\u0ED9\u1812")
92517492
Since 1e23 is syntax for a Binary64 base 2 float, the correct value for it is 99999999999999991611392.0.
In Python, and many computer languages, arguably yes. But in user input, CSV files, arbitrary text files, even JSON, not so much.
And even in code as a literal, I doubt many programmers specifically want 99999999999999991611392.0. Rather they want the closest binary floating point number to 10^23, and fully understand that it may not be exactly that value.
It seems in this thread that there is some confusion between literals in code, and strings that come from arbitrary other sources.
When the text comes from other sources, whoever or whatever is producing it may or may not be thinking in terms of floating point vs integers, or decimal vs binary floating point, and probably does not know what non-integer form the values will ultimately be stored in: single precision, double precision, binary, decimal?
Even JSON, which is designed to be written and read by software, does not make the distinction between integers and fractional values, nor is it defined as binary floating point. In practice, that's how it's mostly used, but that's not the spec.
And if you want to use parsing the string to determine type, how do you distinguish between float and Decimal? There is literally no way to presently write this:
try:
    # see if it's an integer
    value = int(input_string)
except ValueError:
    # see if it's a binary float
    value = float(input_string)
except ValueError:
    # it must be a decimal
    value = Decimal(input_string)
except ValueError:
    # invalid
    print(f"Input: {input_string} is not a valid number")
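For what it's worth, the closest legal version of that chain uses sequential try blocks rather than chained excepts. A sketch (and note that the Decimal branch is effectively unreachable, since float() accepts essentially every string Decimal() does, which is exactly the point being made):

```python
from decimal import Decimal, InvalidOperation

def parse_number(input_string):
    try:
        # see if it's an integer
        return int(input_string)
    except ValueError:
        pass
    try:
        # see if it's a binary float
        return float(input_string)
    except ValueError:
        pass
    try:
        # it must be a decimal (unreachable in practice)
        return Decimal(input_string)
    except InvalidOperation:
        raise ValueError(f"Input: {input_string} is not a valid number")
```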
And Python is pretty good about mingling ints and floats: in most cases, if code interpreted a string such as "1e23" as an int, it would either get cast to a float later on in computation, or remain an int and give slightly more precise results (perhaps at the expense of computation speed).
And for, say, array.array or numpy arrays, where you can't mingle types, the types need to be determined or specified some other way: what size integer? single or double precision float? Who knows from reading the text string.
I have written a LOT of code that parsed text files, and I always determine the type by how it's going to be used in the code, not by the form it is in the text file. E.g. if it's a wind speed, it's a float; if it's the number of measurements taken, it's an int, whether or not the file has "12" as a wind speed or "3.0" as the number of measurements taken.
So this boils down to what is the point of integer string parsing?
Anyway, there is no need to come to any sort of consensus here. Changing how int parses strings is backward-incompatible, and that's a no-no, no matter what I personally think about the impact.
So the only option is a new flag or a separate function, as Guido suggested way back at the beginning of this thread.
Personally, it's of less interest to me with a flag, so I have no reason to push for this, but if any core devs are interested, I'd be glad to help out.
If you expect differently, then maybe you should be working with Decimal instead of float? Just a thought to chew on.
Actually, this is a good point. I really should start using Decimal in my string parsing code; it's rarely performance intensive. I may just put something like this in my personal toolbox:
In [52]: for s in input_strings:
...: val = Decimal(s)
...: if val%1 == 0:
...: val = int(val)
...: print(f"'{s}' is an integer: {val}")
...: else:
...: print(f"'{s}' is not an integer")
...:
'123' is an integer: 123
'1.23' is not an integer
'0.123' is not an integer
'1.23e10' is an integer: 12300000000
'1.2345e3' is not an integer
Even JSON, which is designed to be written and read by software, does not make the distinction between integers and fractional values, nor is it defined as binary floating point. In practice, that's how it's mostly used, but that's not the spec.
Thatâs because JSON deliberately does not define any semantics, only syntax. Which means that parsing the same JSON file with different libraries can result in different behaviour, such as whether ints and floats are even different.
Anyway, there is no need to come to any sort of consensus here. Changing how int parses strings is backward-incompatible, and that's a no-no, no matter what I personally think about the impact.
I think it's more important to match expectations than to keep backwards-compatibility. The question of this thread is what those expectations are.
I would say that experienced programmers are saying that e-notation only represents floats, but scientists (and other users) can expect lossless representation of ints when provided.
It also seems some programmers expect int to have a narrow scope (which I agree with), but don't forget it also uses the __int__ and __index__ protocols for type conversion.
I think relying on the backwards-compatibility argument is toxic and should be supported with more substantial reasoning (e.g. the cost doesn't outweigh the benefit, as programs rely on this inaccuracy).
But this works:
>>> eval("1e23".replace("e", "0 ** "))
100000000000000000000000
scientists (and other users) can expect lossless representation of ints when provided
Other users can use Java. In Java 17, 1e23 == Math.pow(10, 23) is true. I don't think many scientists use Java, but who knows?
(PS: not really suggesting to move to Java. Really wanting to improve Python)
Other users can use Java. In Java 17, 1e23 == Math.pow(10, 23) is true. I don't think many scientists use Java, but who knows?
I haven't used Java in years, but doesn't Math.pow() coerce its arguments to primitive double and return a primitive double? That's equivalent to Python's math.pow() function, which returns a float, which is not to be confused with builtin pow().
I haven't used Java in years, but doesn't Math.pow() coerce its arguments to primitive double and return a primitive double?
Yes, I was wrong. Indeed:
BigDecimal a = new BigDecimal(1e23);
System.out.println(a.toBigInteger());
prints 99999999999999991611392 anyway. You have to pass the string "1e23" to have an exact BigInteger; that's the equivalent of using decimal in Python:
>>> int(decimal.Decimal('1e23'))
100000000000000000000000
I think it's more important to match expectations than to keep backwards-compatibility.
Well, yes and no. I said "it's a no-no" primarily because, in my experience and in this thread, I think the core devs wouldn't want to break backwards compatibility in this case.
As for the importance of matching expectations: I think "true division" is a great example of that, but that change wasn't made until py3 without a __future__ import, and this is not worth a __future__ import.
So I think changing the behavior of `int(a_string)` is off the table without a flag, unless a core dev is going to advocate otherwise.
I'm not sure I understand your code example. Could you explain what its use/purpose would be, realistically, in the context of a program?
Do you mean this one?
try:
x = int(s)
except ValueError:
x = float(s)
That is a way to parse a string to decide if it is an integer or a float (or invalid). First you try to make an integer; if that fails, then assume it's a float. If that fails, it's not a valid numeric string.
In [5]: def what_is_it(s):
...: try:
...: x = int(s)
...: print("It's an integer")
...: except ValueError:
...: try:
...: x = float(s)
...: print("It's a float")
...: except ValueError:
...: print("it's invalid")
...:
In [6]: what_is_it("345")
It's an integer
In [7]: what_is_it("3.45")
It's a float
In [8]: what_is_it("3.45.45")
it's invalid
In [9]: what_is_it("1e5")
It's a float
The proposed change would return "It's an integer" for that last example.
However, while it would change the results of that code, would that lead to an error? Likely not. In real code, this is usually not just to identify the string, but to actually create a numeric type from it, and Python is dynamic enough that using an int in place of a float for an integer value will work almost exactly the same.
But you can bet that someone, somewhere, is counting on that behavior.
I understand what the 4-line snippet does, mechanically speaking. What I'm confused about is @storchaka's claim that this proposal would "break code". Their snippet must be a fragment of a larger program with some purpose. What program is that? In what situation would it be a problem to proceed with an int, instead of the corresponding float, when an int or float is expected? Could @storchaka please clarify?
It is not the cases where you expect either an int or a float that will break. It is the cases where you are expecting an int, but receive a float, or a float, but receive an int.
In this specific case, the problem is that if your numeric strings are supposed to be digits "12345", then any non-digit will fail: "123g5" is an error, and correctly so. Right? But what if the g is an e instead? Instead of correctly failing, it will return an unexpected value, 12300000.
You might answer that such an example is unlikely to happen by accident, and besides, that's no worse than a data corruption error where the digit 4 is turned into a digit 9. True.
But there are other pieces of code that will break: data validation code, for starters. Code that checks whether a string is numeric using (for example) string.isdigit() will no longer correctly identify integer strings. Neither will regex tests.
>>> "123e5".isdigit()
False
This case is different from the case of (say) int("0x2EF3", 0), because there, in order for the "x" to be accepted as part of the numeric string, you have to give an explicit additional argument to switch to that special behaviour. It doesn't just automatically happen. You have to opt in, in which case presumably you know what you are doing and have taken steps to make it work.
I was on a product that got bitten in production by the change that allowed underscores in int() parsing. It had some pretty big consequences: invalid data (123_2_3) was suddenly, silently parsed as valid! This was in fintech, so you can imagine our surprise and unhappiness.
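For context, this is the Python 3.6 behaviour (PEP 515) that caused the surprise above: single underscores between digits became legal in int() string parsing, so a string a validator might consider malformed now parses silently (illustrative example, not from the incident):

```python
# Single underscores between digits are accepted since Python 3.6:
assert int("123_2_3") == 12323

# Doubled, leading, or trailing underscores are still rejected:
for bad in ("123__23", "_123", "123_"):
    try:
        int(bad)
    except ValueError:
        pass
```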
I am very strongly -1 on this proposal for this reason.
It is not the cases where you expect either an int or a float that will break.
My question was about storchaka's example.
But thanks for your comment.
Code that checks whether a string is numeric using (for example) string.isdigit() will no longer correctly identify integer strings. Neither will regex tests.
>>> "123e5".isdigit()
False
It already doesn't. Per the docs:
Optionally, the string can be preceded by + or - (with no space in between), have leading zeros, be surrounded by whitespace, and have single underscores interspersed between digits.
Example: int(' -1_2 ')
It sounds like you'd want to use something stricter than the int function in that case, anyway.
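Such a stricter validator is easy to sketch with a regex that rejects whitespace, underscores, and non-ASCII digits (strict_int is a hypothetical name, not anything in the stdlib):

```python
import re

# Optional sign plus one or more ASCII digits; nothing else.
_STRICT_INT = re.compile(r"[+-]?[0-9]+")

def strict_int(s):
    # fullmatch requires the entire string to match, so whitespace,
    # underscores, and non-ASCII digits are all rejected
    if not _STRICT_INT.fullmatch(s):
        raise ValueError(f"not a strict integer string: {s!r}")
    return int(s)
```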