Loops in Python

Hello,
I’m French, sorry for the mistakes.

I have a fairly substantial job to do in Python, but unfortunately this is not my field and I’m completely stuck.
I’ve created a database of 500 plant products. There’s a lot of information, but the most important thing is their protein and amino acid composition (in g/100g of plant product). I want to create a blend of plant-based products that is equal to or better than (but still as close as possible to) my reference, whey.
To do this, I want to use Python and loops, as in a Monte Carlo simulation. I want my program to go through all possible plant mixtures, with a step size of 10g and a maximum weight of 300g of the same product in the mixture. To give an example of what the loop should look like: 0g soy + 10g wheat, then 0g soy + 20g wheat, then 0g soy + 30g wheat, etc. up to the maximum weight, then 10g soy + 0g wheat, then 20g soy + 0g wheat, etc…

No matter how hard I look, I can’t get Python to understand this command. Here’s the beginning of the code (after importing the libraries and my data, of course) that I thought was right but isn’t:
‘’’

Maximum weight of plant product

p_max = 301 #grammes

Step

pas = 10 #grammes

Loops :

for poids in range(0, p_max, pas):
poids_g = poids / 100.0 # Convert weight to grams

# Calculate the amino acid and protein composition of the blend for the selected weight
melange = np.sum(veg_data * poids_g, axis=0)

if np.all(melange < ref_data):
        continue
if np.all(melange == ref_data):
        break
if np.all(melange > ref_data):
        break        

if melange is not None :
# Calculate the distance to the Whey using the square root of the mean square error
difference = melange - ref_data
mse = np.sqrt(np.mean(difference ** 2))
print(“Distance par rapport à la Whey :”, mse)
print()

# Convert melange to DataFrame
melange_df = pd.DataFrame([melange], columns=veg_data.columns)
# Add a "Type" column to the melange_df DataFrame to identify the type of data (blend or whey).
melange_df.insert(0, 'Type', 'Melange')
# Insert a "Type" column at the beginning of ref_data to identify the type of data (blend or whey).
ref_data.insert(0, 'Type', 'Whey')
# Concatenate DataFrames melange_df and ref_data
resultat_final = pd.concat([melange_df, ref_data], ignore_index=True)
# Display final table
print(resultat_final)

print("Minimum quantity of each plant product used :", poids_g * 100, "g")

else:
print(“No satisfactory mix found.”)
‘’’
I didn’t mention it above, but given the size of the database, I’ve started this code for just two products. I figure that if the code works for 2 products, it might work for 500, although I’ll probably need a more powerful machine to run the program.
If you have any ideas or questions, I’d love to hear from you. I really want to do this job… Thanks for your help!
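On the idea of scaling from 2 products to 500: a quick count suggests that an exhaustive loop cannot get there, whatever the machine. With weights from 0 to 300 g in 10 g steps, each product has 31 possible values, so the number of mixtures is 31 raised to the number of products. A minimal sketch of that arithmetic (the function name `count_mixtures` is made up for the illustration):

```python
def count_mixtures(n_products, p_max=300, step=10):
    """Number of grid points when every product takes 0, step, ..., p_max grams."""
    levels = p_max // step + 1  # 31 levels for 0..300 g in 10 g steps
    return levels ** n_products

two = count_mixtures(2)
five_hundred = count_mixtures(500)

print(two)                     # 961 mixtures for two products
print(len(str(five_hundred)))  # 746, the number of decimal digits for 500 products
```

Two products are cheap to enumerate; 500 products give a count with hundreds of digits, so some form of optimisation is needed instead of enumeration.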

Your code isn’t formatted correctly; please edit your post to put all your code within triple backticks like so:

```
Code goes here
```

You’ll probably need to restore the indentation too.

# Maximum weight of plant product
p_max = 301  # grams

# Step
pas = 10  # grams

# Loops :
for poids in range(0, p_max, pas):
    poids_g = poids / 100.0  # Grams -> multiples of 100 g (compositions are per 100 g)

    # Calculate the amino acid and protein composition of the blend for the selected weight
    melange = np.sum(veg_data * poids_g, axis=0)

    if np.all(melange < ref_data):
        continue
    if np.all(melange == ref_data):
        break
    if np.all(melange > ref_data):
        break

if melange is not None :
    # Calculate the distance to the Whey using the square root of the mean square error
    difference = melange - ref_data
    mse = np.sqrt(np.mean(difference ** 2))
    print("Distance from whey:", mse)
    print()

    # Convert melange to DataFrame
    melange_df = pd.DataFrame([melange], columns=veg_data.columns)
    # Add a "Type" column to the melange_df DataFrame to identify the type of data (blend or whey).
    melange_df.insert(0, 'Type', 'Melange')
    # Insert a "Type" column at the beginning of ref_data to identify the type of data (blend or whey).
    ref_data.insert(0, 'Type', 'Whey')
    # Concatenate DataFrames melange_df and ref_data
    resultat_final = pd.concat([melange_df, ref_data], ignore_index=True)
    # Display final table
    print(resultat_final)

    print("Minimum quantity of each plant product used :", poids_g * 100, "g")
else:
    print("No satisfactory mix found.")

I have another version of the code. I don’t know which of the two is better, so here it is too:

import numpy as np
import pandas as pd
from itertools import product
# Load the whey data
whey_data = pd.read_csv("MoyWhey.csv", delimiter=",", decimal=",")

# Load the plant ingredient data
veg_data = pd.read_csv("MoyLim3.csv", delimiter=";", decimal=",")
veg_data.iloc[:, 1:] = veg_data.iloc[:, 1:].replace({',': '.'}, regex=True)

ref_data = whey_data.drop(columns=['Titre'])
veg_data = veg_data.drop(columns=['Titre'])
# Maximum weight of each plant product
weights_max = 1000  # grams
# Step
step = 10  # grams

# Function that checks whether the blend's composition meets the criteria
def is_satisfactory(melange, ref_data):
    return np.all(melange >= ref_data)

# Initialise the best distance, the best mixture and the best weights
best_distance = float('inf')
best_mixture = None
best_weights_g = None

# Loop over every combination of product weights
for weights in product(range(0, weights_max, step), repeat=len(veg_data)):
    weights_g = np.array(weights) / 100.0  # Grams -> multiples of 100 g (compositions are per 100 g)
    mixture = np.sum(veg_data.values * weights_g[:, np.newaxis], axis=0)

    if is_satisfactory(mixture, ref_data):
        # Distance to whey, measured as the root mean square error (RMSE)
        difference = mixture - ref_data
        mse = np.sqrt(np.mean(difference ** 2))

        if mse < best_distance:
            best_distance = mse
            best_mixture = mixture
            best_weights_g = weights_g


# Report whether a satisfactory mixture was found
if best_mixture is not None:
    print("Distance from whey:", best_distance)
    print()

    # Convert the best mixture to a DataFrame
    mixture_df = pd.DataFrame([best_mixture], columns=veg_data.columns)
    mixture_df.insert(0, 'Type', 'Blend')

    # Insert a "Type" column at the beginning of ref_data to identify the type of data (blend or whey)
    ref_data_with_type = ref_data.copy()
    ref_data_with_type.insert(0, 'Type', 'Whey')

    # Concatenate the DataFrames
    final_result = pd.concat([mixture_df, ref_data_with_type], ignore_index=True)
    print(final_result)

    print("Minimum quantity of each plant product used:", best_weights_g * 100, "g")
else:
    print("No satisfactory mix found.")

Are you trying to set ONE variable many times? That is what the code above does. As a result, poids_g ends up holding only one value, namely the last one calculated.
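To illustrate the point about the variable being overwritten: a name assigned inside a `for` loop keeps only its last value once the loop has finished, so anything needed afterwards has to be stored on each pass. A minimal sketch using the same `p_max` and `pas` values as the posted code (the list name `all_poids` is made up for the example):

```python
p_max = 301  # grams (exclusive upper bound for range)
pas = 10     # grams

all_poids = []  # one entry per iteration
for poids in range(0, p_max, pas):
    poids_g = poids / 100.0  # overwritten on every pass
    all_poids.append(poids_g)

print(poids_g)         # 3.0, only the last value survives the loop
print(len(all_poids))  # 31 values were kept in the list
```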

Can you give a (simplified and perhaps fake) example of your input data? And how do you want to apply the selection criterion? (It’s not really clear to me what that is.)


When I run print(whey_data) and print(veg_data), I get this:

   Titre  PROT  PHE  MET   LEU  ILE  LYS  VAL  THR  TRP  HIS  GLY  ARG  \
0    Whey  91.0  2.8  2.3  10.4  7.5  9.2  6.2  7.5  1.3  1.5  1.5  2.0   

   ASP/ASN  ALA  CYS  GLU/GLN  PRO  SER  TYR  
0     10.7  5.5  2.3     19.1  6.6  5.3  2.8  
                  Titre   PROT   PHE   MET   LEU   ILE   LYS   VAL   THR  \
0    Brown Rice Isolate  77.11  3.47  1.75  4.87  2.59  1.73  3.42  2.20   
1  Field Pea Concentrat  70.00  3.27  0.60  4.86  2.44  5.68  3.59  2.18   

    TRP   HIS   GLY   ARG  ASP/ASN   ALA   CYS  GLU/GLN   PRO   SER   TYR  
0  0.88  1.33  2.66  4.83     5.27  3.39  1.32    10.56  2.60  2.98  3.79  
1  0.36  2.37  3.15  9.95     8.11  2.89  0.63    11.47  3.01  3.47  2.35  

So you can see the composition of my reference, whey, and of two of the 500 plant-based products.
With just these two products, for example, I want a loop that builds all possible mixtures of the two ingredients. A mixture can contain up to 1000 grams of the same product (p_max = 1000), in steps of 10 grams (step = 10). The aim is to find the mix whose composition is closest to that of my reference, whey.
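For two products, the exhaustive loop described here could look roughly like the sketch below. It is only an illustration under some assumptions: the numbers are typed in from the printed tables instead of being read from the CSV files, "closest" is taken to mean the smallest root mean square error over all 19 columns, and no "equal to or better than" constraint is applied. The names `veg`, `levels` and `best_grams` are made up for the example.

```python
import numpy as np
from itertools import product

# Values copied from the printed tables (g per 100 g of product).
whey = np.array([91.0, 2.8, 2.3, 10.4, 7.5, 9.2, 6.2, 7.5, 1.3, 1.5,
                 1.5, 2.0, 10.7, 5.5, 2.3, 19.1, 6.6, 5.3, 2.8])
veg = np.array([
    [77.11, 3.47, 1.75, 4.87, 2.59, 1.73, 3.42, 2.20, 0.88, 1.33,
     2.66, 4.83, 5.27, 3.39, 1.32, 10.56, 2.60, 2.98, 3.79],  # brown rice isolate
    [70.00, 3.27, 0.60, 4.86, 2.44, 5.68, 3.59, 2.18, 0.36, 2.37,
     3.15, 9.95, 8.11, 2.89, 0.63, 11.47, 3.01, 3.47, 2.35],  # field pea concentrate
])

p_max, step = 1000, 10  # grams
levels = range(0, p_max + 1, step)  # 0, 10, ..., 1000 g

best_rmse, best_grams = float("inf"), None
for grams in product(levels, repeat=len(veg)):
    # Compositions are per 100 g, so scale each product by grams / 100.
    mixture = (np.array(grams) / 100.0) @ veg
    rmse = np.sqrt(np.mean((mixture - whey) ** 2))
    if rmse < best_rmse:
        best_rmse, best_grams = rmse, grams

print("best weights (g):", best_grams)
print("RMSE:", best_rmse)
```

Each product gets its own weight through `itertools.product`, so every pair of weights is visited once (101 x 101 combinations here).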

In most amino acid categories, your source materials are fairly similar to each other, and both very different from whey. “Complementary proteins” as described by vegetarians simply cannot reach the quality level of complete animal proteins. While they may meet the WHO’s (fairly meagre, in my fairly well-researched opinion) requirements for “complete” protein, they do so less efficiently. Aside from that, in particular, the provided numbers for two sample vegetable proteins don’t look as if they can meaningfully complement each other.

I do not mean to insult anyone; but before attempting to debug code that attempts to solve an equation, it’s important to ascertain whether a solution is actually expected.


So you are looking for a linear regression (a least-squares approximation, minimizing |Ax - b|**2, where b is your whey vector and A is the matrix whose columns are two or more product vectors)? If so, then with two (or more) non-whey rows as in your example, you could do:

from io import StringIO
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = """Titre,PROT,PHE,MET,LEU,ILE,LYS,VAL,THR,TRP,HIS,GLY,ARG,ASP/ASN,ALA,CYS,GLU/GLN,PRO,SER,TYR
Whey,91.0,2.8,2.3,10.4,7.5,9.2,6.2,7.5,1.3,1.5,1.5,2.0,10.7,5.5,2.3,19.1,6.6,5.3,2.8
Rice,77.11,3.47,1.75,4.87,2.59,1.73,3.42,2.20,0.88,1.33,2.66,4.83,5.27,3.39,1.32,10.56,2.60,2.98,3.79
Pea,70.00,3.27,0.60,4.86,2.44,5.68,3.59,2.18,0.36,2.37,3.15,9.95,8.11,2.89,0.63,11.47,3.01,3.47,2.35
"""
df = pd.read_csv(StringIO(data))
titre = df.Titre
df.drop("Titre", axis=1, inplace=True)
whey = df.values[0]
A = df.values[1:].T
reg = LinearRegression().fit(A, whey)

#  reg.coef_
# array([0.93485556, 0.25152967])  
# reg.intercept_
# 1.7025666673079467
# reg.score(A, whey)
# 0.9780993079026414  - not too bad?

>>> whey
array([91. ,  2.8,  2.3, 10.4,  7.5,  9.2,  6.2,  7.5,  1.3,  1.5,  1.5,
        2. , 10.7,  5.5,  2.3, 19.1,  6.6,  5.3,  2.8])
>>> reg.predict(A)
array([91.39635599,  5.76901749,  3.4894817 ,  7.47774745,  4.73757497,
        4.74855534,  5.80276421,  4.30758359,  2.61579024,  3.54204989,
        4.98160093,  8.72063928,  8.66916112,  5.59864777,  3.0950397 ,
       14.45968673,  4.89029544,  5.3612442 ,  5.83676397])
# so, that's as good as you can get with a linear combination

Anyway, if I understood the problem correctly, then there is no need for any steps or looping - at least not for this part.
You might want to loop on different combinations of the other non-whey products, but you could also feed those in when you define A (if you allow combinations of more than two products). (I would start with simply loading all your products into A, then examine the coefficients, drop the ones with the smallest coefficients and see if the result is still good enough).
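One detail worth keeping in mind about the regression above: `LinearRegression` fits an intercept by default, so its prediction is `A @ coef_ + intercept_` and not a pure blend `Ax`. As a hedged sketch, the intercept-free problem `|Ax - b|**2` can be solved directly with `numpy.linalg.lstsq`, using the same rice and pea numbers (the variable names are illustrative, and nothing here prevents negative coefficients):

```python
import numpy as np

whey = np.array([91.0, 2.8, 2.3, 10.4, 7.5, 9.2, 6.2, 7.5, 1.3, 1.5,
                 1.5, 2.0, 10.7, 5.5, 2.3, 19.1, 6.6, 5.3, 2.8])
rice = [77.11, 3.47, 1.75, 4.87, 2.59, 1.73, 3.42, 2.20, 0.88, 1.33,
        2.66, 4.83, 5.27, 3.39, 1.32, 10.56, 2.60, 2.98, 3.79]
pea = [70.00, 3.27, 0.60, 4.86, 2.44, 5.68, 3.59, 2.18, 0.36, 2.37,
       3.15, 9.95, 8.11, 2.89, 0.63, 11.47, 3.01, 3.47, 2.35]

# One column per product, one row per nutrient: shape (19, 2).
A = np.array([rice, pea]).T

# Least-squares solution of A @ x = whey, with no intercept term.
x, _, rank, _ = np.linalg.lstsq(A, whey, rcond=None)
blend = A @ x

print("coefficients (units of 100 g):", x)
print("grams of each product:", 100 * x)
print("RMSE:", np.sqrt(np.mean((blend - whey) ** 2)))
```

Because the compositions are given per 100 g, a coefficient of 1.0 corresponds to 100 g of that product.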

I don’t know anything about proteins, so I will leave that part of the discussion to you and Karl :)

You’re oversimplifying the subject. I invite you to find out more before jumping to conclusions. Don’t hesitate to ask me for scientific articles if you wish!


I will try this, but I’m sure I won’t build arrays for all of my plant products; it would be a waste of time.

Of course, but using the full matrix containing all the non-whey products will show you in one step the best possible solution that any combination can achieve (best possible given that you need a linear combination, and given the particular “closeness” metric). Once you have that, you can check whether a smaller combination of two (or more) products would also be “good enough” by dropping the products that have the smallest coefficients (reg.coef_) and redoing the regression.
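The fit, drop and refit procedure described here might be sketched as follows. This is only an illustration: the six "products" and the reference are random made-up numbers, not real compositions; plain least squares without an intercept stands in for the regression; and "smallest coefficient" is read as smallest absolute value. The helper name `fit_rmse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nutrients, n_products = 19, 6
A = rng.uniform(0.5, 10.0, size=(n_nutrients, n_products))  # made-up products (columns)
b = rng.uniform(0.5, 10.0, size=n_nutrients)                # made-up reference

def fit_rmse(cols):
    """Least-squares fit using only the given product columns."""
    x, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
    err = np.sqrt(np.mean((A[:, cols] @ x - b) ** 2))
    return err, x

keep = list(range(n_products))
history = []
while len(keep) > 2:
    err, x = fit_rmse(keep)
    history.append((list(keep), err))
    # Drop the product whose coefficient is smallest in absolute value.
    keep.pop(int(np.argmin(np.abs(x))))
err, x = fit_rmse(keep)
history.append((list(keep), err))

for cols, err in history:
    print(cols, round(float(err), 3))
```

The error can only stay the same or grow as products are removed, so the printed list shows how much closeness is given up at each step.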


I think you need positive=True


@alicederyn – I think you’re right; I forgot about the context. @barau you need to initialize LinearRegression as LinearRegression(positive=True).
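The reason `positive=True` matters in this context is that a blend cannot contain a negative amount of a product, and an unconstrained regression is free to return negative coefficients. A minimal sketch with the same rice and pea numbers; `fit_intercept=False` is an extra assumption added here so that the prediction is a pure blend, it is not part of the suggestion above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

whey = np.array([91.0, 2.8, 2.3, 10.4, 7.5, 9.2, 6.2, 7.5, 1.3, 1.5,
                 1.5, 2.0, 10.7, 5.5, 2.3, 19.1, 6.6, 5.3, 2.8])
rice = [77.11, 3.47, 1.75, 4.87, 2.59, 1.73, 3.42, 2.20, 0.88, 1.33,
        2.66, 4.83, 5.27, 3.39, 1.32, 10.56, 2.60, 2.98, 3.79]
pea = [70.00, 3.27, 0.60, 4.86, 2.44, 5.68, 3.59, 2.18, 0.36, 2.37,
       3.15, 9.95, 8.11, 2.89, 0.63, 11.47, 3.01, 3.47, 2.35]
A = np.array([rice, pea]).T  # shape (19, 2): nutrients x products

# positive=True constrains every coefficient to be >= 0.
# fit_intercept=False (an assumption of this sketch) removes the constant term.
reg = LinearRegression(positive=True, fit_intercept=False).fit(A, whey)

print("grams of each product:", 100 * reg.coef_)
print("RMSE:", np.sqrt(np.mean((reg.predict(A) - whey) ** 2)))
```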
