Help a noob out

I’m new to Python and coding in general, but I tried my hand at writing a prediction algorithm for, drum roll please, winning the lottery :)). Unfortunately I’ve found myself stuck and I can’t figure out why: even though it doesn’t show any errors before running, it still doesn’t output any predictions.
Could someone take a look and explain in dumb, dumb terms what I’m doing wrong and if/how I can fix it?

from itertools import combinations
from collections import Counter
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, chi2
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import Sequence




class DataSequence(Sequence):
    def __init__(self, X, y, batch_size):
        self.X = X
        self.y = y
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.X) / self.batch_size)

    def __getitem__(self, idx):
        batch_X = self.X[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.array(batch_X), np.array(batch_y)



# Historical draw data (only a few of the real draws shown here)
draws = [
    {'date': '06-07-2023', 'numbers': [2, 13, 43, 12, 42, 9]},
    {'date': '02-07-2023', 'numbers': [42, 1, 6, 34, 45, 17]},
    {'date': '29-06-2023', 'numbers': [39, 11, 14, 21, 42, 3]},
]

# Sort the draws based on the date in ascending order
draws.sort(key=lambda x: datetime.strptime(x['date'], '%d-%m-%Y'))

# Flatten the draws into a single list
all_numbers = [number for draw in draws for number in draw['numbers']]

# Count the occurrences of each number
number_counts = Counter(all_numbers)

# Find the most common numbers
most_common_numbers = number_counts.most_common()


# Function to generate features from draws
def generate_features(draws):
    features = []
    dates = []

    for draw in draws:
        # Convert date string to datetime object
        date = datetime.strptime(draw['date'], '%d-%m-%Y')

        # Feature 1: Days since the last draw
        if len(dates) > 0:
            days_since_last_draw = (date - dates[-1]).days
        else:
            days_since_last_draw = 0
        features.append(days_since_last_draw)

        # Feature 2: Sum of Numbers
        features.append(sum(draw['numbers']))

        # Feature 3: Odd/Even Numbers Ratio
        odd_count = len([num for num in draw['numbers'] if num % 2 == 1])
        even_count = len([num for num in draw['numbers'] if num % 2 == 0])
        features.append(odd_count / even_count if even_count > 0 else 1)

        # Feature 4: Consecutive Numbers Count
        consecutive_count = sum(
            1 for i in range(len(draw['numbers']) - 1) if draw['numbers'][i] + 1 == draw['numbers'][i + 1])
        features.append(consecutive_count)

        # Feature 5: Number Frequency
        for number in range(1, 46):
            features.append(number_counts[number])

        # Feature 6: Number Sums
        number_sums = [sum(combination) for combination in combinations(draw['numbers'], 2)]
        features.extend(number_sums)

        # Additional Features
        # Feature 7: Month of the draw
        features.append(date.month)

        # Feature 8: Day of the week (Monday=0 to Sunday=6)
        features.append(date.weekday())

        # Feature 9: Prime Numbers Count
        prime_count = len(
            [num for num in draw['numbers'] if num in [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]])
        features.append(prime_count)

        # Feature 10: Fibonacci Numbers Count
        fibonacci_count = len([num for num in draw['numbers'] if num in [1, 2, 3, 5, 8, 13, 21, 34]])
        features.append(fibonacci_count)

        # Feature 11: Mean of Numbers
        features.append(np.mean(draw['numbers']))

        # Feature 12: Median of Numbers
        features.append(np.median(draw['numbers']))

        # Feature 13: Standard Deviation of Numbers
        features.append(np.std(draw['numbers']))

        dates.append(date)

    return features


# Generate features and target labels for each draw
X = []
y = []

for i in range(4, len(draws)):
    features = generate_features(draws[i - 4:i])
    X.append(features)
    y.append(draws[i]['numbers'][-1])  # Target label is the last number in the draw

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-Test Split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Function to generate LSTM features from draws
def generate_features_lstm(draws):
    lstm_features = []
    dates = []

    for draw in draws:
        # Convert date string to datetime object
        date = datetime.strptime(draw['date'], '%d-%m-%Y')

        # Feature 1: Days since the last draw
        if len(dates) > 0:
            days_since_last_draw = (date - dates[-1]).days
        else:
            days_since_last_draw = 0
        lstm_features.append(days_since_last_draw)

        # Feature 2: Sum of Numbers
        lstm_features.append(sum(draw['numbers']))

        # Feature 3: Odd/Even Numbers Ratio
        odd_count = len([num for num in draw['numbers'] if num % 2 == 1])
        even_count = len([num for num in draw['numbers'] if num % 2 == 0])
        ratio = odd_count / even_count if even_count > 0 else 1
        lstm_features.append(ratio)

        # Feature 4: Consecutive Numbers Count
        consecutive_count = sum(
            1 for i in range(len(draw['numbers']) - 1) if draw['numbers'][i] + 1 == draw['numbers'][i + 1])
        lstm_features.append(consecutive_count)

        # Feature 5: Number Frequency
        number_counts = Counter(draw['numbers'])
        for number in range(1, 46):
            lstm_features.append(number_counts[number])

        # Feature 6: Number Sums
        number_sums = [sum(combination) for combination in combinations(draw['numbers'], 2)]
        lstm_features.extend(number_sums)

        # Feature 7: Month of the draw
        month = date.month
        lstm_features.append(month)

        # Feature 8: Day of the week (Monday=0 to Sunday=6)
        day_of_week = date.weekday()
        lstm_features.append(day_of_week)

        # Feature 9: Prime Numbers Count
        prime_count = len(
            [num for num in draw['numbers'] if num in [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]])
        lstm_features.append(prime_count)

        # Feature 10: Fibonacci Numbers Count
        fibonacci_count = len([num for num in draw['numbers'] if num in [1, 2, 3, 5, 8, 13, 21, 34]])
        lstm_features.append(fibonacci_count)

        # Feature 11: Mean of Numbers
        mean = np.mean(draw['numbers'])
        lstm_features.append(mean)

        # Feature 12: Median of Numbers
        median = np.median(draw['numbers'])
        lstm_features.append(median)

        # Feature 13: Standard Deviation of Numbers
        std_deviation = np.std(draw['numbers'])
        lstm_features.append(std_deviation)

        dates.append(date)

    return lstm_features


# Generate LSTM features and target labels for each draw
X_lstm = []
y_lstm = []
dates = []

for i in range(4, len(draws)):
    features_lstm = generate_features_lstm(draws[i - 4:i])
    X_lstm.append(features_lstm)
    y_lstm.append(draws[i]['numbers'][-1])  # Target label is the last number in the draw
    dates.append(datetime.strptime(draws[i]['date'], '%d-%m-%Y'))

# Convert LSTM features and target labels to numpy arrays
X_train_lstm = np.array(X_lstm)
y_train_lstm = np.array(y_lstm)

print("Shape of X_train_lstm:", X_train_lstm.shape)
print("Shape of y_train_lstm:", y_train_lstm.shape)

# Reshape input data for LSTM
X_train_lstm = np.reshape(X_train_lstm, (X_train_lstm.shape[0], X_train_lstm.shape[1], 1))

# Generate LSTM features and target labels for the test set
X_lstm_test = []
y_lstm_test = []

for i in range(len(draws) - 4, len(draws)):
    features_lstm = generate_features_lstm(draws[i - 4:i])  # same 4-draw window as training
    X_lstm_test.append(features_lstm)
    y_lstm_test.append(draws[i]['numbers'][-1])  # Target label is the last number in the draw

# Convert LSTM features and target labels to numpy arrays
X_test_lstm = np.array(X_lstm_test)
y_test_lstm = np.array(y_lstm_test)

# Reshape input data for LSTM
X_test_lstm = np.reshape(X_test_lstm, (X_test_lstm.shape[0], X_test_lstm.shape[1], 1))

# Check the shapes
print("Shape of X_test_lstm:", X_test_lstm.shape)
print("Shape of y_test_lstm:", y_test_lstm.shape)

# Create an LSTM model
model_lstm = Sequential()
model_lstm.add(LSTM(units=64, input_shape=(X_train_lstm.shape[1], X_train_lstm.shape[2])))
model_lstm.add(Dense(units=1))
model_lstm.compile(optimizer=Adam(), loss='mse')

# Create an instance of DataSequence for training
train_sequence = DataSequence(X_train_lstm, y_train_lstm, batch_size=32)

# Train the LSTM model
model_lstm.fit(train_sequence, epochs=10)


# Predict using the LSTM model
y_pred_lstm = model_lstm.predict(X_test_lstm)
# Evaluate the LSTM model
accuracy_lstm = accuracy_score(y_test, np.round(y_pred_lstm))
print("LSTM Accuracy:", accuracy_lstm)

# Reshape LSTM predictions for compatibility with other models
y_pred_lstm = y_pred_lstm.flatten().tolist()

# Combine LSTM predictions with original features
X_train_combined = np.concatenate((X_train, np.array(y_pred_lstm[:-len(X_test)]).reshape(-1, 1)), axis=1)
X_test_combined = np.concatenate((X_test, np.array(y_pred_lstm[-len(X_test):]).reshape(-1, 1)), axis=1)

# Model Selection and Hyperparameter Tuning (including Random Forest and Gradient Boosting)
models = {
    'Random Forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 5, 10]
        }
    },
    'Gradient Boosting': {
        'model': GradientBoostingClassifier(),
        'params': {
            'n_estimators': [100, 200, 300],
            'learning_rate': [0.1, 0.01, 0.001]
        }
    }
}

best_models = {}

for model_name, model_info in models.items():
    print("Performing Grid Search for", model_name)
    model = model_info['model']
    params = model_info['params']
    grid_search = GridSearchCV(model, params, cv=5)
    grid_search.fit(X_train_combined, y_train)
    best_model = grid_search.best_estimator_
    best_models[model_name] = best_model
    print("Best Parameters:", grid_search.best_params_)
    print("Best Score:", grid_search.best_score_)

# Evaluate Models on Test Set
for model_name, model in best_models.items():
    model.fit(X_train_combined, y_train)
    y_pred = model.predict(X_test_combined)
    accuracy = accuracy_score(y_test, y_pred)
    print(model_name, "Accuracy:", accuracy)

# Calculate the probability based on previous draws
total_draws = len(draws)
probability_previous = {number: count / total_draws for number, count in most_common_numbers}

# Calculate the probability based on models' predictions
probability_models = {}
for model_name, model in best_models.items():
    predicted_probabilities = model.predict_proba(X_test_combined)
    for i, draw in enumerate(X_test_combined):
        predicted_number = int(y_pred[i])
        num_classes = len(model.classes_)
        if predicted_number < num_classes:
            if predicted_number in probability_models:
                probability_models[predicted_number] += predicted_probabilities[i][predicted_number]
            else:
                probability_models[predicted_number] = predicted_probabilities[i][predicted_number]

# Combine the probabilities from previous draws and models
combined_probability = {number: probability_previous.get(number, 0) + probability_models.get(number, 0)
                        for number in range(1, 46)}

# Sort the combined probability dictionary by values in descending order
sorted_combined_probability = sorted(combined_probability.items(), key=lambda x: x[1], reverse=True)

# Print the most probable draws
print("Most Probable Draws:")
for draw, prob in sorted_combined_probability:
    print(draw, "Probability:", prob)

# Generate the bar plot
x_labels = [str(draw[0]) for draw in sorted_combined_probability]
y_values_previous = [probability_previous.get(draw[0], 0) for draw in sorted_combined_probability]
y_values_models = [probability_models.get(draw[0], 0) for draw in sorted_combined_probability]

fig, ax = plt.subplots()
ax.bar(x_labels, y_values_previous, label='Previous Draws', alpha=0.5)
ax.bar(x_labels, y_values_models, label='Models', alpha=0.5)
ax.set_xlabel('Number')
ax.set_ylabel('Probability')
ax.set_title('Probability Distribution')
ax.legend()
plt.xticks(rotation=90)
plt.show()

Here’s the code; the historical data extends far beyond the few draws shown there.

Lotteries are tested to be purely random; there is no way to predict a lottery.
Outputting nothing is the correct answer :slight_smile:

How do you run the code?
What exactly does it output?

Thank you, Barry.

I understand that a lottery draw can’t be predicted in theory, but I’m under the assumption that over time patterns emerge; I was thinking more along the “narrowing it down” line.
At least from observing the results, sometimes the appearance of a group of numbers seems to determine a couple of numbers in the next draw. Maybe it’s just the mind looking for patterns that aren’t there. Still, I’d like to test this hunch against real future results. It’s good practice anyway.
I’d be so grateful if you could take a look. I don’t have much knowledge in the field and I’ve been trying to wrap my brain around the error for a couple of days now, and I suspect the problem might not be the code but the brain, though the code is kind of bad too: the loss rates are immense so far.

I’m running the code in PyCharm; as for the output so far, here it is:

Shape of X_train_lstm: (1814, 284)
Shape of y_train_lstm: (1814,)
Shape of X_test_lstm: (4, 284, 1)
Shape of y_test_lstm: (4,)
Shape of X_test_lstm: (4, 284, 1)
Shape of y_test_lstm: (4,)
Shape of X_test_lstm: (4, 284, 1)
Shape of y_test_lstm: (4,)
2023-07-09 09:20:54.189948: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/2
227/227 [==============================] - 110s 362ms/step - loss: 135.2188
Epoch 2/2
227/227 [==============================] - 82s 362ms/step - loss: 98.0582
1/1 [==============================] - 5s 5s/step
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\Try.py", line 2093, in <module>
    accuracy_lstm = accuracy_score(y_test, np.round(y_pred_lstm))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\PycharmProjects\pythonProject24\venv\Lib\site-packages\sklearn\utils\_param_validation.py", line 211, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\PycharmProjects\pythonProject24\venv\Lib\site-packages\sklearn\metrics\_classification.py", line 220, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\PycharmProjects\pythonProject24\venv\Lib\site-packages\sklearn\metrics\_classification.py", line 84, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Users\Administrator\PycharmProjects\pythonProject24\venv\Lib\site-packages\sklearn\utils\validation.py", line 409, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [363, 4]
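
If I’m reading the traceback right, y_test has 363 rows because it comes from the earlier train_test_split, while the LSTM only made 4 test predictions. My guess at a fix (no idea if it’s the right one) was to score the LSTM against its own test labels instead:

# Compare the 4 LSTM predictions with their own 4 test labels rather than
# the tabular split's y_test; rounding the regression output to the nearest
# integer is a crude "did it hit the number" check, but at least the lengths agree.
accuracy_lstm = accuracy_score(y_test_lstm, np.round(y_pred_lstm).astype(int).ravel())
print("LSTM Accuracy:", accuracy_lstm)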

Managed to get bogged down in it even further.

Shape of X_train_lstm: (1814, 284)
Shape of y_train_lstm: (1814,)
Shape of X_test_lstm: (4, 284, 1)
Shape of y_test_lstm: (4,)
Shape of X_test_lstm: (4, 284, 1)
Shape of y_test_lstm: (4,)
Shape of X_test_lstm: (4, 284, 1)
Shape of y_test_lstm: (4,)
2023-07-09 10:42:18.965508: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/4
29/29 [==============================] - 45s 719ms/step - loss: 662.1057
Epoch 2/4
29/29 [==============================] - 21s 731ms/step - loss: 658.7585
Epoch 3/4
29/29 [==============================] - 21s 739ms/step - loss: 655.7486
Epoch 4/4
29/29 [==============================] - 21s 719ms/step - loss: 653.2798
1/1 [==============================] - 5s 5s/step
1/1 [==============================] - 0s 196ms/step
Shape of y_pred_lstm: (4, 1)
Shape of X_test_lstm: (4, 284, 1)
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\Try.py", line 2101, in <module>
    X_train_combined = np.concatenate((X_train_lstm, y_train_lstm.reshape(-1, 1)), axis=1)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 2 dimension(s)

Process finished with exit code 1
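
This one I half-understand: after the reshape, X_train_lstm is 3-D (samples, timesteps, 1), so it can’t be concatenated with a 2-D column. Flattening it back to two dimensions gets past the error, though I’m not sure it’s what I actually want:

# Collapse the (samples, timesteps, 1) LSTM input back into a 2-D matrix
# so both arrays passed to np.concatenate have the same number of dimensions.
X_train_flat = X_train_lstm.reshape(X_train_lstm.shape[0], -1)  # (1814, 284)
X_train_combined = np.concatenate((X_train_flat, y_train_lstm.reshape(-1, 1)), axis=1)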

In a word, no. The process is random.
The previous outcomes have no mechanism to change future outcomes.

When you flip a coin it’s a 50% chance of heads or tails.
The coin does not have any memory of the previous results.
Each toss of the coin is always 50%.
It is valid to get 100 heads in a row with a fair coin.
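
You can check that with a few lines of Python; the frequency of heads immediately after a head stays at about 50%:

# Simulate a million fair coin flips: the chance of heads right after a
# head is still ~0.5, because the coin has no memory of previous results.
import random

flips = [random.random() < 0.5 for _ in range(1_000_000)]
after_heads = [b for a, b in zip(flips, flips[1:]) if a]
print(sum(after_heads) / len(after_heads))  # prints something close to 0.5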

As for the errors, I do not know the libraries you are using,
but the error messages look helpful enough to guide you to fixes.


If they do, the lottery operators are making a horrible mistake that will ultimately cost them a lot of money. This HAS been known to be the case with roulette wheels (they’re physical objects and subject to wear and damage), and when it’s discovered by a member of the public before the casino finds out about it, they can win big - it’s happened before. But if that’s the sort of imperfection you’re looking for, don’t bother with prediction algorithms, just collect raw data on the frequency of different numbers showing up.

Now, if the lottery were based on a low-grade PRNG, then you might be onto something; it’s definitely possible for sequences to occur in those, although with a good quality PRNG like the Mersenne Twister, you might be staring at lottery results for the next few millennia without gaining enough useful data to predict anything. But if, as will be the case with any major lottery, it’s based on something as random and unpredictable as the operators can manage, you’re wasting your time looking for sequences.

As an example, I’ll use the random number picker from this Viva La Dirt League video, or if you prefer, the similar device used for the Daily Planet sweepstakes in the third (?) Superman movie, since it’s simple and not going to get too deep into the weeds. The machine consists of a large number of numbered balls and a means of picking one of them at random. If the balls are all identical in mass, diameter, and surface friction, and the machine is properly turned before a selection is made, they should all be equally likely to show up. However, if one ball is a bit lighter than the others, it’s more likely to ride up on top of the other balls, and will be less likely to be selected. But it’s less likely by the same amount on every pull of a number. You won’t learn anything from seeing that the device chose 57, then 28; the chances of the number 40 coming up are the same as they’d be if other numbers had come up instead.

Which is a very long-winded way of saying: Patterns MAY emerge, but if they do, they’ll be visible on a mundane bar graph :slight_smile:
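
Drawing that graph takes no machine learning at all; something like this, reusing the draws list from your post, would do:

# Tally how often each ball has appeared; a physically biased ball would
# show up as a bar that stays taller (or shorter) than the rest over time.
from collections import Counter
import matplotlib.pyplot as plt

counts = Counter(n for draw in draws for n in draw['numbers'])
balls = list(range(1, 46))
plt.bar(balls, [counts[b] for b in balls])
plt.xlabel('Ball number')
plt.ylabel('Times drawn')
plt.show()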


Please don’t delete your posts, it just forces us to click on the pencil to go see what you wrote (and doesn’t ACTUALLY delete anything) :slight_smile:

Are you aware that lotteries do not use the same set of equipment for each draw?
The UK lottery has multiple sets of balls and multiple ball-picker machines.
The choice of equipment is also randomised.

I wasn’t, but the fact that the model nearly hit 3 matches on the first try only makes me believe even more that something’s rotten in the state of Denmark. Guess time will tell, maybe I’ll be sending a bottle of champagne your way pretty soon :smiley:.

Maybe we shouldn’t be dunking on Corneliu’s hypothesis, just examine the
code? Then the OP can run it against various things (PRNGs from various
libraries, historic data from various lotteries) and evaluate them?

Just because we expect there to be no patterns from the serious
lotteries doesn’t mean that they shouldn’t be subject to scrutiny. After
all, do we not all run unit and/or regression tests?

Yes: we should point out why many systems have no patterns (or should
have no patterns) to discern.

Yes: Corneliu’s supposition that there will be patterns emerging may be
false for good lottery type systems.

No: we shouldn’t obsess over that (beyond mentioning it) to the point of
offering no help with the code!
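
For example, a synthetic history drawn from a known-good PRNG makes a
useful null case: the pipeline ought to find nothing in it. A sketch,
assuming 6 distinct balls from 1 to 45 as in the posted data:

# Fake draw history from Python's Mersenne Twister; any "predictor" should
# do no better than chance here, which makes it a handy sanity check.
import random
from datetime import date, timedelta

random.seed(1)
start = date(2020, 1, 1)
draws = [
    {'date': (start + timedelta(days=3 * i)).strftime('%d-%m-%Y'),
     'numbers': random.sample(range(1, 46), 6)}
    for i in range(2000)
]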

Cheers,
Cameron Simpson cs@cskk.id.au
