Common Sring - creation family names from product names

jagger3 · February 12, 2024, 10:25am

Hi,

I need to create family names from product names among each family code.

kknechtel · February 12, 2024, 10:32am

Okay, so what do you want to ask us about this task?

I guess you have in mind some kind of rule that tells you, how the family name should be calculated from that input. If you don’t, then I don’t know how we can tell you a rule - you’re the one who needs the names, so you’re the one who should know what they need to look like.

Assuming you have a rule, what problem did you run into when trying to write the code for it?

jagger3 · February 12, 2024, 10:33am

I want to create a rule creating family name form product names with common strings

abessman · February 12, 2024, 10:48am

What kind of rule did you have in mind?

Please provide:

Example input and desired output - in text form, not as an image
The code you have written to attempt to do this

jagger3 · February 12, 2024, 11:11am

Family Code	Family Name	Product Code	Product Name
171525	Enamel Plus HFO Stain Flow	146150	Enamel Plus HFO Stain Flow dkl-braun
171525	Enamel Plus HFO Stain Flow	146376	Enamel Plus HFO Stain Flow braun
171525	Enamel Plus HFO Stain Flow	171525	Enamel Plus HFO Stain Flow weiß
171525	Enamel Plus HFO Stain Flow	195671	Enamel Plus HFO Stain Flow orange
197901	Premise Unidose	192095	Premise Unidose Clear!Dose 0,2g
197901	Premise Unidose	198534	Premise unidose Dentin A3!Pack 20x0,2g
197901	Premise Unidose	202508	Premise Unidose A4!Pack 20x0,2g
197901	Premise Unidose	211820	Premise Unidose C3!Pack 20x0,2g
197901	Premise Unidose	211822	Premise Unidose C4!Pack 20x0,2g
197901	Premise Unidose	212156	Premise Unidose Dentin A 3,5!Pa 20x0,2g
197901	Premise Unidose	217124	Premise Unidose grau!Pack 20x0,2g
197901	Premise Unidose	264542	Premise Unidose Ref Dentine A4!P 20X0,2g
197901	Premise Unidose	264544	Premise Unidose Ref Dentine B2!P 20X0,2g
197901	Premise Unidose	264547	Premise Unidose C1!Pack 20X0,2g
197901	Premise Unidose	264554	Premise Unidose Refill Transl Clear
197901	Premise Unidose	195585	Premise Enamel Unidose A3,5!Pack 20x0,2g
197901	Premise Unidose	197899	Premise unidose enamel A 3!Pack 20x0,2g
197901	Premise Unidose	197901	Premise enamel unidose A 1!Pa 20x0,2g
197901	Premise Unidose	197914	Premise unidose enamel A 2!Pack 20x0,2g
217499	Premise Spritze Schmelz	197900	Premise Spritze Schmelz A2
217499	Premise Spritze Schmelz	197902	Premise Syringe Ref Dentin A 2!Spri 4g
217499	Premise Spritze Schmelz	211907	Premise Spritze Transluzent amber
217499	Premise Spritze Schmelz	212382	Premise Spritze Dentin A3
217499	Premise Spritze Schmelz	217499	Premise Spritze Schmelz A1
217499	Premise Spritze Schmelz	217888	Premise Spritze Schmelz A3.5
217499	Premise Spritze Schmelz	264518	PREMISE SYRINGE REF DENTIN A3.5!SPRI 4G
217499	Premise Spritze Schmelz	264519	Premise Spritze Dentin A4
217499	Premise Spritze Schmelz	264525	Premise Spritze Schmelz A4
217499	Premise Spritze Schmelz	264526	Premise Spritze Schmelz B1
217499	Premise Spritze Schmelz	264527	Premise Spritze Schmelz B2
217499	Premise Spritze Schmelz	264530	Premise Spritze Schmelz C3
217499	Premise Spritze Schmelz	264535	Premise Syringe Ref Enamel XL2!Spri 4g
217499	Premise Spritze Schmelz	264538	Premise Spritze Transluzent klar
217499	Premise Spritze Schmelz	194464	Premise Spritze Schmelz A3

jagger3 · February 12, 2024, 11:13am

I’m a Excel user and I noticed that there is function to use Python to cope with such issue.
I tried to generate such code in ChatGPT but without succes.

jagger3 · February 12, 2024, 11:16am

I have just noticed that it would be nice that code will choose the MOST OFTEN string from product names

abessman · February 12, 2024, 11:25am

You have yet to specify what your desired output looks like.

In any case, what you are asking for is possible, but not trivial. It sounds like you are looking for someone to write you a Python program from scratch. You are unlikely to find someone who will do that for free.

I would suggest you either learn Python yourself (start here), or pay someone to do it for you.

jagger3 · February 12, 2024, 11:28am

Desired output is in second column.
Source for family names is in “Product Name” column.

abessman · February 12, 2024, 11:55am

I see. That may in fact not be possible.

For example, consider:

Product Name: Premise Syringe Ref Dentin A 2!Spri 4g
Family Name: Premise Spritze Schmelz

There is no way to derive the family name from just the product name, because the family name contains “Schmelz” while the the product name does not.

hansgeunsmeyer · February 12, 2024, 2:30pm

It would not be possible as a simple, deterministic function (for instance a regex based function) that takes as input only a product name and spits out a family name. But that doesn’t imply it’s totally impossible. Also – even if you have a function that fails for a small number of inputs, as long as it runs ok for the majority of inputs, you could decide that this is “good-enough”.

One solution might be to use a LSH (localilty sensitive hash) library, like Annoy or Faiss. Those libs support setting up a table of words or phrases, and then searching for nearest neighbors of target phrases (which do not have to be in the table). The definition of “nearest neighbor” is also up to the user. (For instance it could be based on surface similarity of the spellings of the phrases, or on entirely different aspects.) So, the procedure would be like:

In advance set up the list of possible family names, or derive a candidate list, just by taking the first N words of the product name. You need to vet this list to make sure it makes sense.
Dump all the product names into the LSH table
For each product name, iterate over the family names and query the LSH to see which candidate family comes closest to an entry in the table.
(Or add the family names to the table too, then query back product names, and in the output filter out all product names and only keep closest family names. That could be more efficient than iteration.)

It’s not so difficult to code this - but does require some familiarity with Python. Also requires some familiarity with word embeddings. (There are lots of tutorials for this. Looking for embeddings used to model poetry or to capture just letter bigrams seems most appropriate! The most simple, somewhat quirky but surprisingly effective approach might be the one used in Alison Parrish’s repo).

A simplistic variant (less effective, but perhaps still useful):

Make a list of candidate family names. Vet them.
Iterate over the product names and select the family name that matches the max number of words of the start of the product name.

def find_candidate_families(product_names: list[str], maxlen: int = 3) -> set[str]:
    # dumb function to initialize candidate family names
    # output should be vetted to see if they make sense
    candidates = set()
    for name in product_names:
        tokens = name.split()
        for i in range(1, maxlen + 1):
             head = " ".join(tokens[:i]
             candidates.add(head)
    return candidates 

def map_product_to_family(product_name: str, family_names: set[str], maxlen: int = 3) -> str:
    # try to find the longest match in a given set of family names
    # uses the set of family names returned by the previous function
    tokens = product_name.split()
    # if family name is always shorter than product, 
    # then we only need search up to len(tokens) - 1
    # otherwise, the next line needs to be commented out
    maxlen = min(maxlen, len(tokens) - 1)  
    for i in range(maxlen, 0, -1):
        candidate = " ".join(tokens[:i])
        if candidate in family_names:
             return candidate
    return "UNKNOWN"

hansgeunsmeyer · February 12, 2024, 4:21pm

I assume that by this you mean the most frequent common prefix. Given a subset of similar product names, the family name would correspond to the most frequent common prefix of those names.

Below is a sketch of code you could use to do something like that. I’ll try to keep this as simple as possible, but to actually use it you may still need to tweak it. The input data is also a bit messy (capitalization issues) which I’m ignoring here.

from collections import Counter

def count_prefixes(products: list[str], ntokens: int = 7) -> dict[str, int]:
    # get a count of all prefixes that consist of ntokens tokens
    c = Counter()
    for p in products:
        tokens = p.split()
        prefix = " ".join(tokens[:nokens])
        c.update([prefix])
    return c

def get_families(products: list[str], maxlen: int = 7) -> dict[str, int]:
    # returns dictionary containing possible family names
    res = dict()
    for i in range(maxlen, 0, -1):
        c = count_prefixes(products, i)
        res.update(c)
    # filter out items that were only seen once
    families = dict((k, v) for (k, v) in res.items() if v > 1)
    return families

def map_product_to_family(product: str, families: dict[str, int]) -> str:
    # try to map a product to a family
    # takes dict of candidate families as input
    # tries to find the longest matching prefix
    tokens = product.split()
    for i in range(len(tokens) - 1, 0, -1):  # from len(tokens) - 1 downto 0
        prefix = " ".join(tokens[:i])
        if prefix in families:
            return prefix
    return "UNKNOWN"

Topic		Replies	Views
How to use number dataset file as an input for max_pairwise_product problem in python programming? Python Help help	0	259	February 3, 2021
Looking for suitable data clustering Python Help	0	283	February 16, 2021
Please Rate My Coding and Answers Python Help	0	410	June 15, 2023
Loops on Python Python Help help	13	286	April 10, 2024
Rewriting python code Python Help	4	1203	March 16, 2023

Common Sring - creation family names from product names

Related Topics