I am trying to create a function (theta) that computes the angular distance (in radians) between a data point and a random centroid. I am running into two main issues that I hope to get some feedback on.
The first is that when I add the line "df = preprocessing.normalize(ds, axis = 0)", I get an AttributeError, as shown below.

I have tried converting the features to float, but that does not solve the issue. When I take that line out and rename the ds variable to df, the code runs, but it does not do what I want.
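One possible cause, offered as a sketch rather than a confirmed diagnosis: scikit-learn's preprocessing.normalize returns a NumPy array, not a DataFrame, so later calls to DataFrame methods such as df.apply would raise an AttributeError. Note also that axis=0 rescales each column (feature), while axis=1 rescales each row (data point) to unit length, which is what placing points on the unit sphere would need. The example below uses made-up values in place of realestate.csv and normalizes each row with NumPy and pandas only, so the result stays a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scaled data (ds); these values are made up.
ds = pd.DataFrame(
    {
        "distance_to_station": [1.0, 4.0, 10.0],
        "latitude": [2.0, 5.0, 1.0],
        "longitude": [3.0, 1.0, 7.0],
    }
)

# L2 norm of each row (one norm per data point).
row_norms = np.linalg.norm(ds.to_numpy(), axis=1)

# Divide every row by its norm; the result is still a DataFrame,
# so methods such as .apply and .sample remain available.
df = ds.div(row_norms, axis=0)

print(type(df))
print(np.linalg.norm(df.to_numpy(), axis=1))  # each row now has length 1
```

If the scikit-learn call is kept instead, wrapping its result with pd.DataFrame(result, columns=ds.columns, index=ds.index) should have the same effect.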
The second issue is writing the distance function that computes the angle between two vectors (the first being a point from the data set, the other a centroid).
When I tested the expression assigned to the distances variable on two (1x3) vectors, it gave the correct answer. However, I am having trouble turning it into a function that runs between the random centroids and the data set. My goal is to find, for each data point, the smallest theta across the three centroids.
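To make the goal concrete, here is a minimal sketch of one way the angle computation could be vectorized. It is not taken from any library, and it assumes one data point per row and one centroid per row (the centroids built in the code below are laid out one per column, so they would need transposing first). The sample points and centroids are made up for illustration.

```python
import numpy as np
import pandas as pd

def theta(points, centroids):
    """Angle in radians between every row of `points` and every row of `centroids`."""
    P = np.asarray(points, dtype=float)
    C = np.asarray(centroids, dtype=float)
    # Norms are taken per row, not over the whole matrix.
    p_norms = np.linalg.norm(P, axis=1)[:, None]
    c_norms = np.linalg.norm(C, axis=1)[None, :]
    cosines = (P @ C.T) / (p_norms * c_norms)
    # Clip to guard against rounding slightly outside [-1, 1].
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    index = points.index if isinstance(points, pd.DataFrame) else None
    return pd.DataFrame(angles, index=index)

# Made-up example data: three points and three centroids in 3-D.
points = pd.DataFrame(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    columns=["x", "y", "z"],
)
centroids = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])

angles = theta(points, centroids)   # one row per point, one column per centroid
nearest = angles.idxmin(axis=1)     # column label of the closest centroid
smallest = angles.min(axis=1)       # the smallest theta for each point
print(angles)
print(nearest)
```

The per-row norms matter here: np.linalg.norm called on a whole 2-D array returns a single matrix norm, not one length per vector.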
If anyone is able to point me to some helpful resources or give me some tips, I would greatly appreciate it. If anything needs to be clarified on my end, please let me know.
import numpy as np
import pandas as pd
from sklearn import preprocessing
housing = pd.read_csv('realestate.csv')
features = ['distance_to_station','latitude','longitude']
housing = housing.dropna()
data = housing[features].copy()
# Data scaled (ds): each feature rescaled to the range 1 to 10
ds = ((data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))) * 9 + 1
# Intended to normalize the data onto the unit sphere
df = preprocessing.normalize(ds, axis=0)
k = 3
def random_centroids(df, k):
    centroids = []
    for i in range(k):
        # Draw one random value from each feature column
        centroid = df.apply(lambda x: float(x.sample()))
        centroids.append(centroid)
    return pd.concat(centroids, axis=1)
centroids = random_centroids(df,k)
print(centroids)
def theta(df, centroids):
    distances = np.arccos((np.dot(df, centroids)) / (np.linalg.norm(df) * np.linalg.norm(centroids)))
    return distances.idxmin(axis=1)
distances = centroids.apply(lambda x: np.arccos(np.dot(df, x) / (np.linalg.norm(df) * np.linalg.norm(x))))
print(distances)