Help with error on Python Pipeline

Jacques · September 29, 2023, 12:32pm

I wrote the code below for a pipeline that processes data in a dataframe. On execution I get this error:
ValueError: not enough values to unpack (expected 3, got 2)

I suspect that the error is caused by the FunctionTransformer, but I cannot figure out what the issue is. Can anyone help me find the error?
The code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.cluster import KMeans

def funct(X):
  X.Age.replace({'<35':0,'>35':1},inplace=True)
  X.Accessibility.replace({'No':0,'Yes':1},inplace=True)
  X.MentalHealth.replace({'No':0,'Yes':1},inplace=True)
  X.MainBranch.replace({'NotDev':0,'Dev':1},inplace=True)
  X.YearsCode = np.sqrt(X.YearsCode)
  X.YearsCodePro = np.sqrt(X.YearsCodePro)
  X.PreviousSalary = np.sqrt(X.PreviousSalary)
  X.ComputerSkills = np.sqrt(X.ComputerSkills)
  X.Country = pd.util.hash_pandas_object(X.Country)
  X.HaveWorkedWith = pd.util.hash_pandas_object(X.HaveWorkedWith)

data = pd.read_csv('stackoverflow_full.csv')

data.info()

X = data.drop('Employed',axis=True)
y = data['Employed'].copy()

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

X_train, X_eval, y_train, y_eval = train_test_split(X_train,y_train,test_size=0.01,random_state=42)

#Columns to process
all_cols = list(data.columns)
all_cols.remove('Unnamed: 0')
drop_cols = ['Unnamed: 0']
onehot_cols = ['Gender']
ordinal_cols = ['EdLevel']
impute_cols = ['HaveWorkedWith']

#Instantiate Transformers to process columns
func_transformer = FunctionTransformer(func= funct)
onehot_transformer = Pipeline(steps=[('onehot encode',OneHotEncoder(handle_unknown='ignore'))],verbose=True)
ordinal_transformer = Pipeline(steps=[('ordinal encode',OrdinalEncoder())],verbose=True)
impute_transformer = Pipeline(steps=[('imputing',SimpleImputer(strategy='most_frequent'))],verbose=True)
scaling_transformer = Pipeline(steps=[('scaling',StandardScaler())],verbose=True)

preprocessing = ColumnTransformer(transformers=[('drop cols','drop',drop_cols),('funcT',func_transformer),('onehot',onehot_transformer,onehot_cols),('ordinal',ordinal_transformer,ordinal_cols),('impute',impute_transformer,impute_cols),('scale',scaling_transformer,all_cols)],verbose=True)

model = Pipeline(steps=[('preprocessing',preprocessing),('clustering',KMeans(n_clusters=2))],verbose=True)

model.fit_transform(X_train,y_train)

hansgeunsmeyer · September 29, 2023, 12:44pm

Can you please post the complete stack trace? That can help to more quickly narrow down what/where the error is.

Jacques · September 29, 2023, 12:47pm

ValueError                                Traceback (most recent call last)
<ipython-input-45-20aa569525ea> in <cell line: 1>()
----> 1 model.fit_transform(X_train,y_train)

6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    435         """
    436         fit_params_steps = self._check_fit_params(**fit_params)
--> 437         Xt = self._fit(X, y, **fit_params_steps)
    438 
    439         last_step = self._final_estimator

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    357                 cloned_transformer = clone(transformer)
    358             # Fit or load from cache the current transformer
--> 359             X, fitted_transformer = fit_transform_one_cached(
    360                 cloned_transformer,
    361                 X,

/usr/local/lib/python3.10/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    351 
    352     def __call__(self, *args, **kwargs):
--> 353         return self.func(*args, **kwargs)
    354 
    355     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891     with _print_elapsed_time(message_clsname, message):
    892         if hasattr(transformer, "fit_transform"):
--> 893             res = transformer.fit_transform(X, y, **fit_params)
    894         else:
    895             res = transformer.fit(X, y, **fit_params).transform(X)

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    138     @wraps(f)
    139     def wrapped(self, X, *args, **kwargs):
--> 140         data_to_wrap = f(self, X, *args, **kwargs)
    141         if isinstance(data_to_wrap, tuple):
    142             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    721         # set n_features_in_ attribute
    722         self._check_n_features(X, reset=True)
--> 723         self._validate_transformers()
    724         self._validate_column_callables(X)
    725         self._validate_remainder(X)

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in _validate_transformers(self)
    396             return
    397 
--> 398         names, transformers, _ = zip(*self.transformers)
    399 
    400         # validate names

ValueError: not enough values to unpack (expected 3, got 2)

hansgeunsmeyer · September 29, 2023, 1:04pm

Thanks - I may have a look later - don’t have much time now.

Btw, totally unrelated to your problem, but this line is a bit “jarring”:

X = data.drop('Employed',axis=True)

axis should be either the integer 1 or the string “columns” here - it makes the code nicer. In terms of styling, it’s also nicer to always put a space after a comma, which really does make it easier to read. (Linters will also frown upon your styling.)
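
To illustrate the axis point, here is a minimal sketch on a made-up three-row frame. Only the column name Employed comes from the post; the data is a placeholder. As far as I can tell, axis=True only works because True compares equal to 1 in Python, so the explicit spellings below are the clearer choice:

```python
import pandas as pd

# Toy data for illustration only; this is not the real stackoverflow_full.csv.
data = pd.DataFrame({"Age": ["<35", ">35", "<35"], "Employed": [1, 0, 1]})

# Three equivalent, explicit ways to drop a column.
X_by_axis = data.drop("Employed", axis=1)
X_by_name = data.drop("Employed", axis="columns")
X_by_kw = data.drop(columns="Employed")
```

All three return a new frame without the Employed column and leave data itself untouched.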

Jacques · September 29, 2023, 1:08pm

Hi Hans

That was a typo, and I’ve noted the suggestion about the space after the comma. Thanks for taking the time to help.

hansgeunsmeyer · September 29, 2023, 1:26pm

No prob. I don’t have direct experience with using these pipelines (just toyed with them quite a while ago). But this should not be too hard to debug for yourself. The stack trace unfortunately doesn’t tell you in which transform the error is located, so it isn’t that much help. But to debug, have you tried to

- run everything after first commenting out use of the funct transform - just to make sure the mechanics of the rest are ok
- apply funct directly to the training data, to verify it works as expected
- double-check that a transforming function can act in place (?!), since your funct is not returning anything
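
On the last point, a minimal sketch of what seems to happen when the wrapped function has no return statement. The data is made up, and in_place and returning are hypothetical names; as far as I understand it, FunctionTransformer simply hands back whatever the function returns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def in_place(X):
    # Mutates X but has no return statement, so the call evaluates to None.
    X["YearsCode"] = np.sqrt(X["YearsCode"])

def returning(X):
    # Works on a copy and hands the transformed frame back to the caller.
    X = X.copy()
    X["YearsCode"] = np.sqrt(X["YearsCode"])
    return X

df = pd.DataFrame({"YearsCode": [4.0, 9.0, 16.0]})  # toy data

out_in_place = FunctionTransformer(func=in_place).fit_transform(df.copy())
out_returning = FunctionTransformer(func=returning).fit_transform(df.copy())
```

With the first version the next pipeline step would receive None instead of data, so a function passed to FunctionTransformer should return its result.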

Jacques · September 29, 2023, 3:22pm

Yes, tested all of the above and it works.
Because the trace doesn’t tell me where in the pipeline the failure is I’m lost.
Is there a different way to debug in Python? I find the trace mostly obscure and insufficient.

hansgeunsmeyer · September 29, 2023, 3:59pm

I have often felt the same frustration. Still the stack trace does give some vital, perhaps usable info.

One way to proceed is to step through the code in a Python debugger and identify which object is causing the error.

$ python -m pdb --help

If you’re not familiar with that, it may have a bit of a learning curve but is not too difficult to use… (it has built-in help, including help on each of the built-in debug commands).
There is a pretty decent tutorial: “Python Debugging With Pdb” on Real Python.
(In the distant past I’ve also used a GUI for Python debugging on Windows, but I’m not familiar with any current ones, so cannot give any hints for that. VSCode might make debugging easier.)
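
If stepping through interactively is not convenient, a rough non-interactive sketch of the same idea is to walk the traceback to its innermost frame and look at the local variables there. The helper innermost_locals and the toy validate function are hypothetical names made up for this illustration; in a notebook, %debug or pdb.pm() lets you do the same inspection interactively after an exception.

```python
import traceback

def innermost_locals(exc):
    """Return the local variables of the frame in which exc was raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # follow the chain to the deepest frame
        tb = tb.tb_next
    return tb.tb_frame.f_locals

def validate(entries):
    # Toy stand-in for library code that unpacks 3-item entries.
    names, objs, cols = zip(*entries)
    return names

try:
    validate([("a", "drop", ["x"]), ("b", "oops")])
except ValueError as exc:
    found = innermost_locals(exc)
    summary = traceback.format_exception_only(type(exc), exc)[-1].strip()
```

Here found["entries"] holds the list that was being unpacked when the error occurred, which is usually enough to spot the malformed entry.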

Another way: consider line 398 of _column_transformer.py together with the error message. It is a ValueError: the unpacking on line 398 expects 3 objects, but only 2 are given. This suggests that something is wrong with self.transformers. That list needs to contain tuples (or sub-lists etc.) of exactly 3 items each, but at least one of them has only 2.
Also, this is coming (apparently) from the ColumnTransformer. Ok, then, without knowing anything more about underlying code, consider the way you defined that…
First try out my earlier suggestion: Remove your custom transformer from the ColumnTransformer to verify that the rest is OK. (Something else might go wrong, but as long as it happens later, that’s fine.) If so, then consider if you passed that in correctly.
There are other, quick-and-dirty ways to debug, but I think the second one will let you proceed…
Also take another look at the docs for ColumnTransformer (notes about the transformers argument).
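
The reasoning above can be reproduced without scikit-learn at all. This is a minimal sketch, assuming a transformers list shaped like the one in the question; plain strings stand in for the real estimator objects:

```python
# Hypothetical stand-ins for the transformers list; strings replace real estimators.
transformers = [
    ("drop cols", "drop", ["Unnamed: 0"]),
    ("funcT", "func_transformer"),  # only 2 items: the column list is missing
    ("onehot", "onehot_transformer", ["Gender"]),
]

try:
    # Same unpacking pattern as line 398 in the stack trace.
    names, objs, _ = zip(*transformers)
    message = None
except ValueError as exc:
    message = str(exc)

# Locate every entry that does not have exactly 3 items.
bad = [(i, entry) for i, entry in enumerate(transformers) if len(entry) != 3]
```

zip stops at the shortest entry, so it yields only two tuples and the three-way unpacking fails. The bad list points straight at the offending entry.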

(Not knowing any of these APIs, I do wonder, can the custom function be a function defined on a whole dataframe?? Should it not be a function that is only defined on a scalar or series? The example in the FunctionTransformer doc also suggests this. The fit function of the class will work on (certain cols of) a dataframe, but that does not imply that the defining function should/can take a dataframe.)

Jacques · September 30, 2023, 12:41pm

Found the source of the error. I misunderstood the implementation of a Function Transformer. It wraps a function which gets applied to the given dataframe columns. One passes the given columns to the transformer and it returns the new column values. In my implementation I assumed the wrapped function had unrestricted access to the entire dataframe - not true.
The second issue caused the error I saw. To use a transformer in a ColumnTransformer, you pass it 3 items: 1. a name for the step, which can be used to identify it. 2. the transformer to be executed at this step. 3. the columns of the dataframe to which the transformer must be applied.

I did not specify this third parameter for the Function Transformer and this caused the error. Once I corrected this I got a second error which indicated that the Function Transformer is broken.
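
For anyone landing here later, a minimal sketch of the corrected pattern. It assumes made-up toy data, and sqrt_cols is a hypothetical helper name; only the column names are taken from the question:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def sqrt_cols(X):
    # Receives only the columns selected below and must return the result.
    return np.sqrt(X)

# Toy data for illustration only.
df = pd.DataFrame({
    "YearsCode": [4.0, 9.0, 16.0],
    "PreviousSalary": [100.0, 400.0, 900.0],
    "Gender": ["Man", "Woman", "Man"],
})

preprocessing = ColumnTransformer(
    transformers=[
        # Each entry is a 3-tuple: (name, transformer, columns).
        ("sqrt", FunctionTransformer(func=sqrt_cols), ["YearsCode", "PreviousSalary"]),
    ],
    remainder="drop",  # columns not listed (here Gender) are dropped
)

out = preprocessing.fit_transform(df)
```

The function only ever sees the two listed columns, and the result contains one output column per input column.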

So, my question is answered and now I need to rethink my approach. But I guess that’s all in a day’s work.

I will certainly have a look at pdb, thank you for the tip.