Hi,
I need to create family names from product names among each family code.
Okay, so what do you want to ask us about this task?
I guess you have in mind some kind of rule that tells you, how the family name should be calculated from that input. If you don’t, then I don’t know how we can tell you a rule - you’re the one who needs the names, so you’re the one who should know what they need to look like.
Assuming you have a rule, what problem did you run into when trying to write the code for it?
I want to create a rule creating family name form product names with common strings
What kind of rule did you have in mind?
Please provide:
Family Code | Family Name | Product Code | Product Name |
---|---|---|---|
171525 | Enamel Plus HFO Stain Flow | 146150 | Enamel Plus HFO Stain Flow dkl-braun |
171525 | Enamel Plus HFO Stain Flow | 146376 | Enamel Plus HFO Stain Flow braun |
171525 | Enamel Plus HFO Stain Flow | 171525 | Enamel Plus HFO Stain Flow weiĂź |
171525 | Enamel Plus HFO Stain Flow | 195671 | Enamel Plus HFO Stain Flow orange |
197901 | Premise Unidose | 192095 | Premise Unidose Clear!Dose 0,2g |
197901 | Premise Unidose | 198534 | Premise unidose Dentin A3!Pack 20x0,2g |
197901 | Premise Unidose | 202508 | Premise Unidose A4!Pack 20x0,2g |
197901 | Premise Unidose | 211820 | Premise Unidose C3!Pack 20x0,2g |
197901 | Premise Unidose | 211822 | Premise Unidose C4!Pack 20x0,2g |
197901 | Premise Unidose | 212156 | Premise Unidose Dentin A 3,5!Pa 20x0,2g |
197901 | Premise Unidose | 217124 | Premise Unidose grau!Pack 20x0,2g |
197901 | Premise Unidose | 264542 | Premise Unidose Ref Dentine A4!P 20X0,2g |
197901 | Premise Unidose | 264544 | Premise Unidose Ref Dentine B2!P 20X0,2g |
197901 | Premise Unidose | 264547 | Premise Unidose C1!Pack 20X0,2g |
197901 | Premise Unidose | 264554 | Premise Unidose Refill Transl Clear |
197901 | Premise Unidose | 195585 | Premise Enamel Unidose A3,5!Pack 20x0,2g |
197901 | Premise Unidose | 197899 | Premise unidose enamel A 3!Pack 20x0,2g |
197901 | Premise Unidose | 197901 | Premise enamel unidose A 1!Pa 20x0,2g |
197901 | Premise Unidose | 197914 | Premise unidose enamel A 2!Pack 20x0,2g |
217499 | Premise Spritze Schmelz | 197900 | Premise Spritze Schmelz A2 |
217499 | Premise Spritze Schmelz | 197902 | Premise Syringe Ref Dentin A 2!Spri 4g |
217499 | Premise Spritze Schmelz | 211907 | Premise Spritze Transluzent amber |
217499 | Premise Spritze Schmelz | 212382 | Premise Spritze Dentin A3 |
217499 | Premise Spritze Schmelz | 217499 | Premise Spritze Schmelz A1 |
217499 | Premise Spritze Schmelz | 217888 | Premise Spritze Schmelz A3.5 |
217499 | Premise Spritze Schmelz | 264518 | PREMISE SYRINGE REF DENTIN A3.5!SPRI 4G |
217499 | Premise Spritze Schmelz | 264519 | Premise Spritze Dentin A4 |
217499 | Premise Spritze Schmelz | 264525 | Premise Spritze Schmelz A4 |
217499 | Premise Spritze Schmelz | 264526 | Premise Spritze Schmelz B1 |
217499 | Premise Spritze Schmelz | 264527 | Premise Spritze Schmelz B2 |
217499 | Premise Spritze Schmelz | 264530 | Premise Spritze Schmelz C3 |
217499 | Premise Spritze Schmelz | 264535 | Premise Syringe Ref Enamel XL2!Spri 4g |
217499 | Premise Spritze Schmelz | 264538 | Premise Spritze Transluzent klar |
217499 | Premise Spritze Schmelz | 194464 | Premise Spritze Schmelz A3 |
I’m a Excel user and I noticed that there is function to use Python to cope with such issue.
I tried to generate such code in ChatGPT but without succes.
I have just noticed that it would be nice that code will choose the MOST OFTEN string from product names
You have yet to specify what your desired output looks like.
In any case, what you are asking for is possible, but not trivial. It sounds like you are looking for someone to write you a Python program from scratch. You are unlikely to find someone who will do that for free.
I would suggest you either learn Python yourself (start here), or pay someone to do it for you.
Desired output is in second column.
Source for family names is in “Product Name” column.
I see. That may in fact not be possible.
For example, consider:
Product Name: Premise Syringe Ref Dentin A 2!Spri 4g
Family Name: Premise Spritze Schmelz
There is no way to derive the family name from just the product name, because the family name contains “Schmelz” while the the product name does not.
It would not be possible as a simple, deterministic function (for instance a regex based function) that takes as input only a product name and spits out a family name. But that doesn’t imply it’s totally impossible. Also – even if you have a function that fails for a small number of inputs, as long as it runs ok for the majority of inputs, you could decide that this is “good-enough”.
One solution might be to use a LSH (localilty sensitive hash) library, like Annoy or Faiss. Those libs support setting up a table of words or phrases, and then searching for nearest neighbors of target phrases (which do not have to be in the table). The definition of “nearest neighbor” is also up to the user. (For instance it could be based on surface similarity of the spellings of the phrases, or on entirely different aspects.) So, the procedure would be like:
It’s not so difficult to code this - but does require some familiarity with Python. Also requires some familiarity with word embeddings. (There are lots of tutorials for this. Looking for embeddings used to model poetry or to capture just letter bigrams seems most appropriate! The most simple, somewhat quirky but surprisingly effective approach might be the one used in Alison Parrish’s repo).
A simplistic variant (less effective, but perhaps still useful):
def find_candidate_families(product_names: list[str], maxlen: int = 3) -> set[str]:
# dumb function to initialize candidate family names
# output should be vetted to see if they make sense
candidates = set()
for name in product_names:
tokens = name.split()
for i in range(1, maxlen + 1):
head = " ".join(tokens[:i]
candidates.add(head)
return candidates
def map_product_to_family(product_name: str, family_names: set[str], maxlen: int = 3) -> str:
# try to find the longest match in a given set of family names
# uses the set of family names returned by the previous function
tokens = product_name.split()
# if family name is always shorter than product,
# then we only need search up to len(tokens) - 1
# otherwise, the next line needs to be commented out
maxlen = min(maxlen, len(tokens) - 1)
for i in range(maxlen, 0, -1):
candidate = " ".join(tokens[:i])
if candidate in family_names:
return candidate
return "UNKNOWN"
I assume that by this you mean the most frequent common prefix. Given a subset of similar product names, the family name would correspond to the most frequent common prefix of those names.
Below is a sketch of code you could use to do something like that. I’ll try to keep this as simple as possible, but to actually use it you may still need to tweak it. The input data is also a bit messy (capitalization issues) which I’m ignoring here.
from collections import Counter
def count_prefixes(products: list[str], ntokens: int = 7) -> dict[str, int]:
# get a count of all prefixes that consist of ntokens tokens
c = Counter()
for p in products:
tokens = p.split()
prefix = " ".join(tokens[:nokens])
c.update([prefix])
return c
def get_families(products: list[str], maxlen: int = 7) -> dict[str, int]:
# returns dictionary containing possible family names
res = dict()
for i in range(maxlen, 0, -1):
c = count_prefixes(products, i)
res.update(c)
# filter out items that were only seen once
families = dict((k, v) for (k, v) in res.items() if v > 1)
return families
def map_product_to_family(product: str, families: dict[str, int]) -> str:
# try to map a product to a family
# takes dict of candidate families as input
# tries to find the longest matching prefix
tokens = product.split()
for i in range(len(tokens) - 1, 0, -1): # from len(tokens) - 1 downto 0
prefix = " ".join(tokens[:i])
if prefix in families:
return prefix
return "UNKNOWN"