Need Advice / direction / help with "confidence scoring" or a "weighted scoring"

dmj120 · July 19, 2024, 10:40pm

I am new to Python, very new; and am slowly learning, especially now that Excel allows Python scripts… shouldn’t have stopped after 2 weeks. First, the right-side preview is pretty cool.

I am going to try to explain as best as I can, but since I’m unsure…hopefully it is understandable.

I am creating a scoring (of sorts) to get a “confidence level” (high, medium, low). These confidence levels will let me know how reliable expense totals are for another list (roughly 45k rows). The primary factors are dm lbr, dm parts, sm lbr, and sm parts. Plus, a few “modifiers” (or maybe variables), which depending if true or not needs to adjust the scoring.

Currently, I only have two results for the for primary factors, “>0” and “0” (ie has an expense or has $0 expense - this is due to or current data. Eventually, I will start looking at specific amounts to better identify our confidence, which will probably necessitate a couple more levels.

The below screenshot shows a truncated view of all the permutations. Columns E:H are the ‘modifiers’ I mentioned. So… first, cols E:F affect C:D - if sm lbr/parts are not requried or needed, the score should be slightly less if cols C and D are “0”, and and slightly if C:D are “>0” - but if E:F are “yes” then the score needs to be reduced by some factor because we know it’s needed, but we don’t have any data.

Columns G:H are the same but affect cols A and B.

From what I’ve read and heard, this is possible with Pandas, NumPy, and SciPy. If all this makes sense, please help.

I don’t know how/where to start, and any direction will help me piece this together -hopefully his weekend- so I can start pricing things based upon these confidence levels.

hansgeunsmeyer · July 19, 2024, 11:50pm

I’m trying to understand your question, but find it pretty difficult: (1) Can you explain what you mean by “how reliable the totals are”? Reliable in reference to what? Are the numbers based on some statistical sampling or does “reliable” mean something else here? (2) If possible, can you explain what the headers mean?
(3) It’s possible that none of the first two questions are relevant; I can imagine that you are merely looking to implement a kind of logical relation between the A:D columns and the E:H columns. Is that what you tried to suggest? If so, are you able to suggest a formula for that relation (treating “no” as 0 or as some small constant or as -1 and “yes” as 1 or as some larger constant)? Given a formula, it usually is straightforward to give a Pandas/Polars implementation.

dmj120 · July 20, 2024, 12:20am

Sorry it’s difficult to understand. I’ve tried several different approaches I think have confused myself.

I have a workbook of summary of expenditures across the company, which summarizes over 650k work orders down to about 45k model numbers. This workbook uses that data to price an item based upon expenses (labor time and parts) plus an applied margin.

Header names:

dm lbr = a tech’s time spent on repairs
dm parts = the amount of any parts used
sm lbr = time spent performing preventative maintenance (PM)
sm parts = any parts used during preventative maintenance (PM)
PM Req’d = does the model require a PM
PM parts needed = are PM parts needed to complete a PM
DM lbr between dvc hi/lo = does the average repair hours for that model fall within the min/max hours for the device — a device can have 100s of models
SM lbr between dvc hi/lo = does the average PM hours for that model fall within the min/max hours for the device — a device can have 100s of models

‘how reliable the totals are’ is looking at how reliable the total hours (repair/PM) and parts (repair/PM) are. Example: if a model should have a PM, and the total ‘sm lbr’ is 0, then that data point is unreliable - several reasons why it could, but at a macro level, if the model should have a PM we should have hours logged against it.

The end result of this will allow me to price-out an item based upon historical data, but only we I can reasonably say the data supports each pricing component (lbr/parts)

# my current excel formula used to calculate the score
IF(D6="unkn","0",
IF(AND(D6="no",E6="no"),SWITCH(AJ6,">0",4,-12)+SWITCH(AK6,">0",4,-4)+SWITCH(AL6,">0",4,4)+SWITCH(AM6,">0",4,4),
IF(AND(D6="yes",E6="yes"),SWITCH(AJ6,">0",4,-4)+SWITCH(AK6,">0",4,-4)+SWITCH(AL6,">0",4,0)+SWITCH(AM6,">0",2,0),
IF(AND(D6="yes",E6="no"),SWITCH(AJ6,">0",4,-4)+SWITCH(AK6,">0",4,-4)+SWITCH(AL6,">0",4,0)+SWITCH(AM6,">0",2,2),
IF(AND(D6="no",E6="yes"),SWITCH(AJ6,">0",4,-4)+SWITCH(AK6,">0",4,-4)+SWITCH(AL6,">0",4,4)+SWITCH(AM6,">0",2,2),"Need PM info"
)))))

# the excel formula that creates all the permutations (the excel screenshot in my first post
LET(A,B4:C5,B,ROWS(A),C,COLUMNS(A),D,B^C,E,UNIQUE(MAKEARRAY(D,C,LAMBDA(rw,cl,INDEX(IF(A="","",A),MOD(CEILING(rw/(D/(B^cl)),1)-1,B)+1,cl)))),FILTER(E,MMULT(--(E<>""),SEQUENCE(C,,,0))=C))

hansgeunsmeyer · July 20, 2024, 1:41am

Thanks, this clarifies a lot! I don’t have time right away to suggest how to do this - but will try to return to it tomorrow - unless someone else wants to step in earlier. It’s in principle pretty simple (assuming you can import pandas).

hansgeunsmeyer · July 20, 2024, 4:10pm

Ok. One more question. I see that your current Exel formula for the score does not refer to any columns in your example. Can I assume that D6 and E6 correspond to E and F (in the example dataframe), and that AJ6:AM6 correspond to A:D?

I didn’t see a response, so for now I will assume that the answer to my last question is “yes”. I trust that the code below still illustrates how to solve your problem if the answer is “no”.

I’ll also assume you have Python 3.11 or later and pandas 2.2.1 with the ability to read/write Exel files (on my Mac this required an extra python package install: openpyxl). My sample code can be used as stand-alone python code. I’ll try to keep the code as simple as possible. It’s definitely possible to optimize it and make it much more efficient, but this may not be worth the trouble for small files (< 1M rows). The current code may also make it easier for you to later modify it. I’ll be using simple letters ‘A’ to ‘H’ as column headers (you may need to replace those by ‘dm lbr’, ‘dm prts’ etc).

The Exel formula:

IF(E="unkn","0",
IF(AND(E="no",F="no"),SWITCH(A,">0",4,-12)+SWITCH(B,">0",4,-4)+SWITCH(C,">0",4,4)+SWITCH(D,">0",4,4),
IF(AND(E="yes",F="yes"),SWITCH(A,">0",4,-4)+SWITCH(B,">0",4,-4)+SWITCH(C,">0",4,0)+SWITCH(D,">0",2,0),
IF(AND(E="yes",F="no"),SWITCH(A,">0",4,-4)+SWITCH(B,">0",4,-4)+SWITCH(C,">0",4,0)+SWITCH(D,">0",2,2),
IF(AND(E="no",F="yes"),SWITCH(A,">0",4,-4)+SWITCH(B,">0",4,-4)+SWITCH(C,">0",4,4)+SWITCH(D,">0",2,2),"Need PM info"
)))))

can be directly translated into a function that operates on the rows of a pandas dataframe:

def confidence_score(row):
    match (row['E'], row['F']):
        case ("unkn", _):            
            score == 0
        case ("no", "no"):
            a = 4 if row['A'] == ">0" else -12
            b = 4 if row['B'] == ">0" else -4
            c = 4  # equivalent to:  c = 4 if row['C'] == ">0" else 4
            d = 4  # equivalent to: d = 4 if row['D'] == ">0" else 4 # d = 4
            score = a + b + c + d
        case ("yes", "yes"):
            a = 4 if row['A'] == ">0" else -4
            b = 4 if row['B'] == ">0" else -4
            c = 4 if row['C'] == ">0" else 0
            d = 2 if row['D'] == ">0" else 0
            score = a + b + c + d
        case ("yes", "no"):
            a = 4 if row['A'] == ">0" else -4
            b = 4 if row['B'] == ">0" else -4
            c = 4 if row['C'] == ">0" else 0
            d = 2 if row['D'] == ">0" else 2
            score = a + b + c + d
        case ("no", "yes"):
            a = 4 if row['A'] == ">0" else -4
            b = 4 if row['B'] == ">0" else -4
            c = 4  #  c = 4 if row['C'] == ">0" else 4
            d = 2  # d = 2 if row['D'] == ">0" else 2
            score = a + b + c + d
        case _:
            score = None  #  rather than "Need PM info"    
    return score

Note that I only replaced the default case (case _) output by None instead of a string. I did this in order to make sure that the datatype associated with the newly created column is a numerical datatype, rather than a generic ‘object’. From your question I can see you know Exel very well (definitely better than I do), so it should be straightforward to match up your formula with the python code. The statement x = y if condition else z is equivalent to the Exel SWITCH statement. The match ...: case ... block is a nicer way to represent messier if ... elif ... elif ... else ... blocks.

This function can be used in a script as follows:

import itertools as it
import pandas as pd

# this recreates your example dataframe in pandas
# the only difference is that I'm using simplified column header names
# 'itertools' is only used for creating the artifical sample data

columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
data = list(it.product(['>0', '0'], ['>0', '0'], 
                       ['>0', '0'], ['>0', '0'], 
                       ['no', 'yes'], ['no', 'yes'], 
                       ['no', 'yes'], ['no', 'yes']))

df = pd.DataFrame(data, columns=columns)

# to instead read an actual Exel file, you run: 
# df = pd.read_excel("data.xlsx")

# copy-paste the above `confidence_score` function definition here 

# create new column with the confidence score
df['I'] = df.apply(confidence_score, axis=1)

# print to screen
# print(df)

# save again as Exel file
df.to_excel("data.xlsx")

To make this easier to use for actual data, I suggest (1) make sure to explicitly only use one particular data type in each column; (2) rather than using strings “no” and “yes”, use either booleans or integers (0 and 1). In the code below I’ve made columns A:D all take integer values.
These modifications create clear and clean(er) code. They also make storage and processing a bit more efficient and make it easier to later modify the code. With those modifications, the complete script could be sth like this:

import itertools as it
import pandas as pd
columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
data = list(it.product([0, 1, 2], [0, 3, 7], [0, 1, 2], [0, 3, 7],
                       [True, False], [True, False], [True, False], [True, False]))
df = pd.DataFrame(data, columns=columns)

def confidence_score(row):
    match (row['E'], row['F']):
        case (False, False):
            a = 4 if row['A'] > 0 else -12
            b = 4 if row['B'] > 0 else -4
            c = 4  # c = 4 if row['C'] > 0 else 4
            d = 4  # d = 4 if row['D'] > 0 else 4
            score = a + b + c + d
        case (True, True):
            a = 4 if row['A'] > 0 else -4
            b = 4 if row['B'] > 0 else -4
            c = 4 if row['C'] > 0 else 0
            d = 2 if row['D'] > 0 else 0
            score = a + b + c + d
        case (True, False):
            a = 4 if row['A'] > 0 else -4
            b = 4 if row['B'] > 0 else -4
            c = 4 if row['C'] > 0 else 0
            d = 2 if row['D'] > 0 else 2
            score = a + b + c + d
        case (False, True):
            a = 4 if row['A'] > 0 else -4
            b = 4 if row['B'] > 0 else -4
            c = 4  # c = 4 if row['C'] > 0 else 4
            d = 2  # d = 2 if row['D'] > 0 else 2
            score = a + b + c + d
        case _:
            score = None
    return score

df['I'] = df.apply(confidence_score, axis=1)

(Note that I made yet one more modification: I subsumed the “unkn” case under the default case. So any “unkn” value in E will also lead to a None (null) value in the score. Not sure if that is wanted, it just seemed a little strange to me to give a confidence score of 0 if E==‘unkn’.)

dmj120 · July 21, 2024, 2:09am

@hansgeunsmeyer thank you for the code and explanations. I will play around with this and see what happens, lol.

I’m fairly proficient with excel, but now that I’m getting more into analytics, I really have to learn this.