Would you also be opposed to an optional argument?
bignum = int("1e23", scientific=True)
Yes, it would break that code. I'm pretty surprised anyone would do that, but what do I know?
as for:
so int('1e23') would not be equal to float('1e23')
In [13]: int(1e23)
Out[13]: 99999999999999991611392
That's actually one of the points of this whole idea: what if you want the correct integer value for 1e23? It's why just doing int(float(a_string)) isn't a good solution.
Anyway, it seems there's no chance of making this change without a flag, but if any of the core devs think it's worth pursuing with a flag, I can write the idea out more formally.
How else are people currently supposed to convert numeric strings into ints for preference and fall back on floats if the string cannot be interpreted as an int?
And yes, I've done that. It's an easy way to accept string input and have "123" convert to an int and "123.0" or "123e0" convert to a float.
if you want the correct integer value for 1e23
Since 1e23 is syntax for a Binary64 base 2 float, the correct value for it is 99999999999999991611392.0.
If you expect differently, then maybe you should be working with Decimal instead of float? Just a thought to chew on.
We should keep in mind the difference between low level and high level APIs. Low level APIs should be strict. High level APIs can afford to be a bit more flexible. int() is a low level API which is moderately strict in what it accepts:
- it ignores leading and trailing whitespace;
- it accepts underscores as grouping characters;
- it also accepts non-ASCII digits (but maybe it shouldn't? too late to change it now!);
- otherwise it requires exactly the same format as int literals.
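A quick illustration of those rules (behaviour of CPython 3.6+, where underscore grouping was added; the examples are mine, not from the thread):

```python
# What int() currently accepts and rejects for strings:
assert int("  42  ") == 42                # leading/trailing whitespace ignored
assert int("1_000_000") == 1000000        # underscores as grouping characters
assert int("\u0661\u0662\u0663") == 123   # ARABIC-INDIC digits one, two, three

# Scientific notation is rejected, same as in an int literal:
try:
    int("1e3")
except ValueError:
    pass
```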
I propose instead that people come up with their own high level parse_int() function that can be as clever as they like in what it accepts.
- Want to accept a leading unicode MINUS SIGN? Sure, why not?
- Want to enforce better/stricter handling of non-ASCII digits? Go for it.
- Want to treat "123450000e-3" as the exact int 123450? Sure thing!
- Want to support other versions of scientific notation? Knock yourself out.
I've seen versions of parse_int that accept "O" as a misspelled digit 0, and "l" as 1. I've seen versions that truncate the string at the first non-digit instead of raising an error, or that accept a string suffix coding what base to use, e.g. "1022#3" for 35 in ternary.
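One possible sketch of such a high level parser, using Decimal so that strings like "123450000e-3" round-trip exactly (the name parse_int and its exact behaviour are illustrative, not a proposal):

```python
from decimal import Decimal, InvalidOperation

def parse_int(s):
    """Parse a string as an exact integer, accepting scientific notation
    as long as the value is exactly integral."""
    try:
        value = Decimal(s)
    except InvalidOperation:
        raise ValueError(f"not a number: {s!r}")
    if value != value.to_integral_value():
        raise ValueError(f"not an integer: {s!r}")
    return int(value)
```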
The point is, there are many options here. Design your own high level parse_int function, use it, find out what works and what doesn't, prove that it is useful, and then maybe it can be added to the stdlib or even turned into a builtin. Don't mess with the low level int() API, which is already a bit too clever for my liking:
>>> int("\u0669\u07C2\u09EB\u0B67\u0BED\u0CEA\u0ED9\u1812")
92517492
Since 1e23 is syntax for a Binary64 base 2 float, the correct value for it is 99999999999999991611392.0.
In Python, and many computer languages, arguably yes. But in user input, CSV files, arbitrary text files, even JSON, not so much.
And even in code as a literal, I doubt many programmers specifically want 99999999999999991611392.0. Rather they want the closest binary floating point number to 10^23, and fully understand that it may not be exactly that value.
It seems in this thread that there is some confusion between literals in code, and strings that come from arbitrary other sources.
When the text comes from other sources, whoever or whatever is producing it may or may not be thinking in terms of floating point vs integers, or decimal vs binary floating point, and probably does not know what non-integer form the values will ultimately be stored in: single precision, double precision, binary, decimal?
Even JSON, which is designed to be written and read by software, does not make the distinction between integers and fractional values, nor is it defined as binary floating point. In practice, that's how it's mostly used, but that's not the spec.
And if you want to use parsing the string to determine type, how do you distinguish between float and Decimal? There is literally no way to presently write this:
try:
    # see if it's an integer
    value = int(input_string)
except ValueError:
    # see if it's a binary float
    value = float(input_string)
except ValueError:
    # it must be a decimal
    value = Decimal(input_string)
except ValueError:
    # invalid
    print(f"Input: {input_string} is not a valid number")
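For what it's worth, the closest legal version of that chain uses sequential try blocks rather than chained excepts. A sketch (and note that the Decimal branch is effectively unreachable, since float() accepts essentially every string Decimal() does, which is exactly the point being made):

```python
from decimal import Decimal, InvalidOperation

def parse_number(input_string):
    try:
        # see if it's an integer
        return int(input_string)
    except ValueError:
        pass
    try:
        # see if it's a binary float
        return float(input_string)
    except ValueError:
        pass
    try:
        # it must be a decimal (unreachable in practice)
        return Decimal(input_string)
    except InvalidOperation:
        raise ValueError(f"Input: {input_string} is not a valid number")
```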
And Python is pretty good about mingling ints and floats: in most cases, if code interpreted a string such as "1e23" as an int, it would either get cast to a float later on in computation, or remain an int and give slightly more precise results (perhaps at the expense of computation speed).
And for, say, array.array or numpy arrays, where you can't mingle types, the types need to be determined or specified some other way: what size integer? single or double precision float? Who knows from reading the text string.
I have written a LOT of code that parsed text files, and I always determine the type by how it's going to be used in the code, not by the form it is in the text file. E.g. if it's a wind speed, it's a float; if it's the number of measurements taken, it's an int, whether or not the file has "12" as a wind speed or "3.0" as the number of measurements taken.
So this boils down to what is the point of integer string parsing?
Anyway, there is no need to come to any sort of consensus here. Changing how int parses strings is backward-incompatible, and that's a no-no, no matter what I personally think about the impact.
So the only option is a new flag or a separate function, as Guido suggested way back at the beginning of this thread.
Personally, it's of less interest to me with a flag, so I have no reason to push for this, but if any core devs are interested, I'd be glad to help out.
If you expect differently, then maybe you should be working with Decimal instead of float? Just a thought to chew on.
Actually, this is a good point. I really should start using Decimal in my string parsing code; it's rarely performance intensive. I may just put something like this in my personal toolbox:
In [52]: for s in input_strings:
...: val = Decimal(s)
...: if val%1 == 0:
...: val = int(val)
...: print(f"'{s}' is an integer: {val}")
...: else:
...: print(f"'{s}' is not an integer")
...:
'123' is an integer: 123
'1.23' is not an integer
'0.123' is not an integer
'1.23e10' is an integer: 12300000000
'1.2345e3' is not an integer
Even JSON, which is designed to be written and read by software, does not make the distinction between integers and fractional values, nor is it defined as binary floating point. In practice, that's how it's mostly used, but that's not the spec.
Thatâs because JSON deliberately does not define any semantics, only syntax. Which means that parsing the same JSON file with different libraries can result in different behaviour, such as whether ints and floats are even different.
Anyway, there is no need to come to any sort of consensus here. Changing how int parses strings is backward-incompatible, and that's a no-no, no matter what I personally think about the impact.
I think it's more important to match expectations than to keep backwards-compatibility. The question of this thread is what those expectations are.
I would say that experienced programmers are saying that e-notation only represents floats, but scientists (and other users) can expect lossless representation of ints when provided.
It also seems some programmers expect int to have a narrow scope (which I agree with), but don't forget it also uses the __int__ and __index__ protocols for type conversion.
I think relying on the backwards-compatibility argument is toxic and should be supported with more substantial reasoning (e.g. the cost doesn't outweigh the benefit, as programs rely on this inaccuracy).
But this works:
>>> eval("1e23".replace("e", "0 ** "))
100000000000000000000000
scientists (and other users) can expect lossless representation of ints when provided
Other users can use Java. In Java 17, 1e23 == Math.pow(10, 23) is true. I don't think many scientists use Java, but who knows?
(PS: not really suggesting to move to Java. Really wanting to improve Python)
Other users can use Java. In Java 17, 1e23 == Math.pow(10, 23) is true. I don't think many scientists use Java, but who knows?
I haven't used Java in years, but doesn't Math.pow() coerce its arguments to primitive double and return a primitive double? That's equivalent to Python's math.pow() function, which returns a float, which is not to be confused with builtin pow().
I haven't used Java in years, but doesn't Math.pow() coerce its arguments to primitive double and return a primitive double?
Yes, I was wrong. Indeed:
BigDecimal a = new BigDecimal(1e23);
System.out.println(a.toBigInteger());
prints 99999999999999991611392 anyway. You have to pass the string "1e23" to have an exact BigInteger; that's the equivalent of using decimal in Python:
>>> int(decimal.Decimal('1e23'))
100000000000000000000000
I think it's more important to match expectations than to keep backwards-compatibility.
Well, yes and no. I said "it's a no-no" primarily because, in my experience and in this thread, I think the core devs wouldn't want to break backwards compatibility in this case.
As for the importance of matching expectations: I think "true division" is a great example of that, but that change wasn't made until py3 without a __future__ import, and this is not worth a __future__ import.
So I think changing the behavior of `int(a_string)` is off the table without a flag, unless a core dev is going to advocate otherwise.
I'm not sure I understand your code example. Could you explain what its use/purpose would be, realistically, in the context of a program?
Do you mean this one?
try:
x = int(s)
except ValueError:
x = float(s)
That is a way to parse a string to decide if it is an integer or a float (or invalid). First you try to make an integer; if that fails, then assume it's a float. If that fails, it's not a valid numeric string.
In [5]: def what_is_it(s):
...: try:
...: x = int(s)
...: print("It's an integer")
...: except ValueError:
...: try:
...: x = float(s)
...: print("It's a float")
...: except ValueError:
...: print("it's invalid")
...:
In [6]: what_is_it("345")
It's an integer
In [7]: what_is_it("3.45")
It's a float
In [8]: what_is_it("3.45.45")
it's invalid
In [9]: what_is_it("1e5")
It's a float
The proposed change would return "It's an integer" for that last example.
However, while it would change the results of that code, would that lead to an error? Likely not. In real code, this is usually not just to identify the string, but to actually create a numeric type from it, and Python is dynamic enough that using an int in place of a float for an integer value will work almost exactly the same.
But you can bet that someone, somewhere, is counting on that behavior.
I understand what the 4-line snippet does, mechanically speaking. What I'm confused about is @storchaka's claim that this proposal would "break code". Their snippet must be a fragment of a larger program with some purpose. What program is that? In what situation would it be a problem to proceed with an int, instead of the corresponding float, when an int or float is expected? Could @storchaka please clarify?
It is not the cases where you expect either an int or a float that will break. It is the cases where you are expecting an int, but receive a float, or a float, but receive an int.
In this specific case, the problem is that if your numeric strings are supposed to be digits "12345", then any non-digit will fail: "123g5" is an error, and correctly so. Right? But what if the g is an e instead? Instead of correctly failing, it will return an unexpected value, 12300000.
You might answer that such an example is unlikely to happen by accident, and besides, that's no worse than a data corruption error where the digit 4 is turned into a digit 9. True.
But there are other pieces of code that will break: data validation code, for starters. Code that checks whether a string is numeric using (for example) string.isdigit() will no longer correctly identify integer strings. Neither will regex tests.
>>> "123e5".isdigit()
False
This case is different from the case of (say) int("0x2EF3", 0), because there, in order for the "x" to be accepted as part of the numeric string, you have to give an explicit additional argument to switch to that special behaviour. It doesn't just automatically happen. You have to opt in, in which case presumably you know what you are doing and have taken steps to make it work.
I was on a product that got bitten in production by the change that allowed underscores in int() parsing. It had some pretty big consequences: invalid data (123_2_3) was suddenly, silently parsed as valid! This was in fintech, so you can imagine our surprise and unhappiness.
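For context, this is the Python 3.6 behaviour (PEP 515) that caused the surprise above: single underscores between digits became legal in int() string parsing, so a string a validator might consider malformed now parses silently (illustrative example, not from the incident):

```python
# Single underscores between digits are accepted since Python 3.6:
assert int("123_2_3") == 12323

# Doubled, leading, or trailing underscores are still rejected:
for bad in ("123__23", "_123", "123_"):
    try:
        int(bad)
    except ValueError:
        pass
```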
I am very strongly -1 on this proposal for this reason.
It is not the cases where you expect either an int or a float that will break.
My question was about storchaka's example.
But thanks for your comment.
Code that checks whether a string is numeric using (for example) string.isdigit() will no longer correctly identify integer strings. Neither will regex tests.
>>> "123e5".isdigit()
False
It already doesn't. Per the docs:
Optionally, the string can be preceded by + or - (with no space in between), have leading zeros, be surrounded by whitespace, and have single underscores interspersed between digits.
Example: int(' -1_2 ')
It sounds like you'd want to use something stricter than the int function in that case, anyway.
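Such a stricter validator is easy to sketch with a regex that rejects whitespace, underscores, and non-ASCII digits (strict_int is a hypothetical name, not anything in the stdlib):

```python
import re

# Optional sign plus one or more ASCII digits; nothing else.
_STRICT_INT = re.compile(r"[+-]?[0-9]+")

def strict_int(s):
    # fullmatch requires the entire string to match, so whitespace,
    # underscores, and non-ASCII digits are all rejected
    if not _STRICT_INT.fullmatch(s):
        raise ValueError(f"not a strict integer string: {s!r}")
    return int(s)
```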