Common Sring - creation family names from product names


I need to create family names from product names among each family code.

Okay, so what do you want to ask us about this task?

I guess you have in mind some kind of rule that tells you, how the family name should be calculated from that input. If you don’t, then I don’t know how we can tell you a rule - you’re the one who needs the names, so you’re the one who should know what they need to look like.

Assuming you have a rule, what problem did you run into when trying to write the code for it?

1 Like

I want to create a rule creating family name form product names with common strings

What kind of rule did you have in mind?

Please provide:

  1. Example input and desired output - in text form, not as an image
  2. The code you have written to attempt to do this
Family Code Family Name Product Code Product Name
171525 Enamel Plus HFO Stain Flow 146150 Enamel Plus HFO Stain Flow dkl-braun
171525 Enamel Plus HFO Stain Flow 146376 Enamel Plus HFO Stain Flow braun
171525 Enamel Plus HFO Stain Flow 171525 Enamel Plus HFO Stain Flow weiß
171525 Enamel Plus HFO Stain Flow 195671 Enamel Plus HFO Stain Flow orange
197901 Premise Unidose 192095 Premise Unidose Clear!Dose 0,2g
197901 Premise Unidose 198534 Premise unidose Dentin A3!Pack 20x0,2g
197901 Premise Unidose 202508 Premise Unidose A4!Pack 20x0,2g
197901 Premise Unidose 211820 Premise Unidose C3!Pack 20x0,2g
197901 Premise Unidose 211822 Premise Unidose C4!Pack 20x0,2g
197901 Premise Unidose 212156 Premise Unidose Dentin A 3,5!Pa 20x0,2g
197901 Premise Unidose 217124 Premise Unidose grau!Pack 20x0,2g
197901 Premise Unidose 264542 Premise Unidose Ref Dentine A4!P 20X0,2g
197901 Premise Unidose 264544 Premise Unidose Ref Dentine B2!P 20X0,2g
197901 Premise Unidose 264547 Premise Unidose C1!Pack 20X0,2g
197901 Premise Unidose 264554 Premise Unidose Refill Transl Clear
197901 Premise Unidose 195585 Premise Enamel Unidose A3,5!Pack 20x0,2g
197901 Premise Unidose 197899 Premise unidose enamel A 3!Pack 20x0,2g
197901 Premise Unidose 197901 Premise enamel unidose A 1!Pa 20x0,2g
197901 Premise Unidose 197914 Premise unidose enamel A 2!Pack 20x0,2g
217499 Premise Spritze Schmelz 197900 Premise Spritze Schmelz A2
217499 Premise Spritze Schmelz 197902 Premise Syringe Ref Dentin A 2!Spri 4g
217499 Premise Spritze Schmelz 211907 Premise Spritze Transluzent amber
217499 Premise Spritze Schmelz 212382 Premise Spritze Dentin A3
217499 Premise Spritze Schmelz 217499 Premise Spritze Schmelz A1
217499 Premise Spritze Schmelz 217888 Premise Spritze Schmelz A3.5
217499 Premise Spritze Schmelz 264518 PREMISE SYRINGE REF DENTIN A3.5!SPRI 4G
217499 Premise Spritze Schmelz 264519 Premise Spritze Dentin A4
217499 Premise Spritze Schmelz 264525 Premise Spritze Schmelz A4
217499 Premise Spritze Schmelz 264526 Premise Spritze Schmelz B1
217499 Premise Spritze Schmelz 264527 Premise Spritze Schmelz B2
217499 Premise Spritze Schmelz 264530 Premise Spritze Schmelz C3
217499 Premise Spritze Schmelz 264535 Premise Syringe Ref Enamel XL2!Spri 4g
217499 Premise Spritze Schmelz 264538 Premise Spritze Transluzent klar
217499 Premise Spritze Schmelz 194464 Premise Spritze Schmelz A3

I’m a Excel user and I noticed that there is function to use Python to cope with such issue.
I tried to generate such code in ChatGPT but without succes.

I have just noticed that it would be nice that code will choose the MOST OFTEN string from product names

You have yet to specify what your desired output looks like.

In any case, what you are asking for is possible, but not trivial. It sounds like you are looking for someone to write you a Python program from scratch. You are unlikely to find someone who will do that for free.

I would suggest you either learn Python yourself (start here), or pay someone to do it for you.

1 Like

Desired output is in second column.
Source for family names is in “Product Name” column.

1 Like

I see. That may in fact not be possible.

For example, consider:

Product Name: Premise Syringe Ref Dentin A 2!Spri 4g
Family Name: Premise Spritze Schmelz

There is no way to derive the family name from just the product name, because the family name contains “Schmelz” while the the product name does not.

It would not be possible as a simple, deterministic function (for instance a regex based function) that takes as input only a product name and spits out a family name. But that doesn’t imply it’s totally impossible. Also – even if you have a function that fails for a small number of inputs, as long as it runs ok for the majority of inputs, you could decide that this is “good-enough”.

One solution might be to use a LSH (localilty sensitive hash) library, like Annoy or Faiss. Those libs support setting up a table of words or phrases, and then searching for nearest neighbors of target phrases (which do not have to be in the table). The definition of “nearest neighbor” is also up to the user. (For instance it could be based on surface similarity of the spellings of the phrases, or on entirely different aspects.) So, the procedure would be like:

  • In advance set up the list of possible family names, or derive a candidate list, just by taking the first N words of the product name. You need to vet this list to make sure it makes sense.
  • Dump all the product names into the LSH table
  • For each product name, iterate over the family names and query the LSH to see which candidate family comes closest to an entry in the table.
    (Or add the family names to the table too, then query back product names, and in the output filter out all product names and only keep closest family names. That could be more efficient than iteration.)

It’s not so difficult to code this - but does require some familiarity with Python. Also requires some familiarity with word embeddings. (There are lots of tutorials for this. Looking for embeddings used to model poetry or to capture just letter bigrams seems most appropriate! The most simple, somewhat quirky but surprisingly effective approach might be the one used in Alison Parrish’s repo).

A simplistic variant (less effective, but perhaps still useful):

  • Make a list of candidate family names. Vet them.
  • Iterate over the product names and select the family name that matches the max number of words of the start of the product name.
def find_candidate_families(product_names: list[str], maxlen: int = 3) -> set[str]:
    # dumb function to initialize candidate family names
    # output should be vetted to see if they make sense
    candidates = set()
    for name in product_names:
        tokens = name.split()
        for i in range(1, maxlen + 1):
             head = " ".join(tokens[:i]
    return candidates 

def map_product_to_family(product_name: str, family_names: set[str], maxlen: int = 3) -> str:
    # try to find the longest match in a given set of family names
    # uses the set of family names returned by the previous function
    tokens = product_name.split()
    # if family name is always shorter than product, 
    # then we only need search up to len(tokens) - 1
    # otherwise, the next line needs to be commented out
    maxlen = min(maxlen, len(tokens) - 1)  
    for i in range(maxlen, 0, -1):
        candidate = " ".join(tokens[:i])
        if candidate in family_names:
             return candidate
    return "UNKNOWN" 
1 Like

I assume that by this you mean the most frequent common prefix. Given a subset of similar product names, the family name would correspond to the most frequent common prefix of those names.

Below is a sketch of code you could use to do something like that. I’ll try to keep this as simple as possible, but to actually use it you may still need to tweak it. The input data is also a bit messy (capitalization issues) which I’m ignoring here.

from collections import Counter

def count_prefixes(products: list[str], ntokens: int = 7) -> dict[str, int]:
    # get a count of all prefixes that consist of ntokens tokens
    c = Counter()
    for p in products:
        tokens = p.split()
        prefix = " ".join(tokens[:nokens])
    return c

def get_families(products: list[str], maxlen: int = 7) -> dict[str, int]:
    # returns dictionary containing possible family names
    res = dict()
    for i in range(maxlen, 0, -1):
        c = count_prefixes(products, i)
    # filter out items that were only seen once
    families = dict((k, v) for (k, v) in res.items() if v > 1)
    return families

def map_product_to_family(product: str, families: dict[str, int]) -> str:
    # try to map a product to a family
    # takes dict of candidate families as input
    # tries to find the longest matching prefix
    tokens = product.split()
    for i in range(len(tokens) - 1, 0, -1):  # from len(tokens) - 1 downto 0
        prefix = " ".join(tokens[:i])
        if prefix in families:
            return prefix
    return "UNKNOWN"
1 Like