Issues with clustering a dataset

MalTom24 · July 10, 2024, 7:56pm

I am trying to create a function (theta) that will compute the distance (in radians) between a data point, and a random centroid. I am running into two main issues that I hope I can get some feedback on.

The first is that when I add the line “df = preprocessing.normalize(ds, axis = 0)”, I get an AttributeError.

I have tried converting the features to floats, but that does not solve the issue. When I take that line out and pass ds wherever df is used, the code runs, but not in the way I intend.

The second issue I wanted guidance on is creating the distance function that computes the angle between two vectors (the first vector being a point from the data set, the other being a centroid).

When I tested the equation from the distances variable on two (1x3) vectors, it gave me the correct answer. However, I am having trouble turning it into a function that runs between the random centroids and the dataset. My goal is to find the smallest theta between each data point and the three centroids.

If anyone is able to point me to some helpful resources or give me some tips, I would greatly appreciate it. If anything needs to be clarified on my end, please let me know.

import numpy as np
import pandas as pd

housing = pd.read_csv('realestate.csv')
features = ['distance_to_station','latitude','longitude']
housing = housing.dropna()
data = housing[features].copy()
ds = ((data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))) * 9 + 1
#DATA SCALED (DS)
df = preprocessing.normalize(ds, axis = 0)
#Normalized to the unit sphere
k = 3
def random_centroids(df,k):
    centroids = []
    for i in range(k):
        centroid = df.apply(lambda x:float(x.sample()))
        centroids.append(centroid)
    return pd.concat(centroids, axis=1)
centroids = random_centroids(df,k)
print(centroids)
def theta(df, centroids):
    distances =np.arccos((np.dot(df, centroids)) / (np.linalg.norm(df) * np.linalg.norm(centroids)))
    return distances.idxmin(axis=1)
distances =centroids.apply(lambda x:np.arccos(np.dot(df, x) / (np.linalg.norm(df) * np.linalg.norm(x))))
print(distances)

kknechtel · July 10, 2024, 10:40pm

There’s nothing in your code that shows a definition for preprocessing, so there’s no way we can diagnose the code of preprocessing.normalize.

But the error message is pretty clear: the result you got back was a NumPy array, so you can’t use it like a Pandas DataFrame. It doesn’t have an apply method.
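To make that concrete, here is a minimal sketch of what happens and one way around it. The small `ds` frame is a made-up stand-in for the scaled housing data, and `axis=1` is an assumption on my part: it scales each row (each data point) to unit length, whereas `axis=0` as in the original post scales each column instead.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the scaled housing features (values invented).
ds = pd.DataFrame(
    {
        "distance_to_station": [1.0, 4.0, 10.0],
        "latitude": [2.0, 5.0, 9.0],
        "longitude": [3.0, 6.0, 1.0],
    }
)

# normalize() returns a plain NumPy array, so the column labels and
# DataFrame methods such as .apply are no longer available.
arr = preprocessing.normalize(ds, axis=1)  # axis=1: each row gets unit length
print(type(arr).__name__, hasattr(arr, "apply"))

# Wrap the array in a DataFrame again, reusing the original labels.
df = pd.DataFrame(arr, columns=ds.columns, index=ds.index)

# Each row should now have a Euclidean norm of 1.
print(df.apply(np.linalg.norm, axis=1))
```

Passing `columns=ds.columns` and `index=ds.index` keeps the feature names, which a bare `pd.DataFrame(arr)` would replace with 0, 1, 2.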

onePythonUser · July 10, 2024, 11:35pm

Hi,

Radians measure angle, not distance. One radian is approximately 57.3 degrees (exactly 180/π).

Are you referring to arclength? Arclength is the distance along the arc of a circle: `s` = θ × r, with θ in radians.

First, consider the units of the variables when calculating the arclength:

`s` units -> meters (distance along the arc of the circle)
`θ` units -> angle  (sweep angle between two points)
`r` units -> meters (radius of the circle or implied circle by the sweep)

Formula for calculating the arclength depending on units used for θ.

If angle θ in radians:

`s` = θ × r 

If angle θ in degrees:

`s` = θ × (π/180) × r = degrees × (radians/degree) × meters = meters

As you can see, the arclength is in meters and θ is in either degrees or radians, depending on the unit preferred for calculating the arclength.
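As a quick check of the two formulas, here is a small sketch. The function name `arc_length` and its `degrees` flag are my own, only for illustration.

```python
import math

def arc_length(theta, r, degrees=False):
    """Return s = theta * r; theta is in radians unless degrees=True."""
    if degrees:
        theta = math.radians(theta)  # multiply by pi/180
    return theta * r

# A quarter turn on a circle of radius 2 m, given in both units.
s_rad = arc_length(math.pi / 2, 2.0)
s_deg = arc_length(90, 2.0, degrees=True)
print(s_rad, s_deg)
```

Both calls should give the same length, since 90 degrees and π/2 radians describe the same sweep.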

MalTom24 · July 11, 2024, 12:06am

Hey Karl, thank you for the reply. I did forget to include the line “from sklearn import preprocessing” in the code I posted. I will edit the post to include it.

Understood, after normalizing the data, I added the line below and now I don’t get an issue when running it. Thank you for your explanation!

df = pd.DataFrame(df)

MalTom24 · July 11, 2024, 12:09am

Yeah my terminology was off when typing this haha. The reason I meant distance was since I am trying to deal with a unit sphere, the radius will be one, so the arc length between the two points will be equal to the angle between them.
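In case it helps anyone following along, below is a rough sketch of how the angle to every centroid could be computed in one go. It is not the code from my project: the arrays are invented, and I am assuming points and centroids are stored as rows. One thing I noticed is that `np.linalg.norm(df)` without an `axis` argument returns a single norm for the whole table, so the sketch takes the norm of each row instead.

```python
import numpy as np

def theta(points, centroids):
    """Angle in radians between every row of points and every row of centroids."""
    # Scale each row to unit length so the dot product equals cos(angle).
    p = points / np.linalg.norm(points, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    # Clip to guard against rounding that lands slightly outside [-1, 1].
    cosines = np.clip(p @ c.T, -1.0, 1.0)
    return np.arccos(cosines)

# Invented example data: four points and three centroids in 3-D.
points = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
centroids = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

angles = theta(points, centroids)  # shape (n_points, n_centroids)
labels = angles.argmin(axis=1)     # index of the closest centroid per point
print(labels)
```

On the unit sphere each entry of `angles` is also the arc length between the point and the centroid, so `argmin` along each row picks the nearest centroid.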

onePythonUser · July 11, 2024, 12:33am

Fair enough. Be sure to use the correct formula for your calculations. In another post about a different problem, a user was using the wrong units (i.e., the wrong mathematical relationship), so his results did not match the expected values.