How to perform the advanced encoding and manipulation described below

We want to get the data into the shape shown under “Data Structure required”, starting from “Data Available”. It is basically a term-document matrix: every unique name present in the Name columns becomes its own column. For each row, the columns of the names present in that row are set to 1 and all other columns are set to 0, as shown in the XLSX file. Looking forward to a solution.

That is a nasty data structure to create in Python.

The easiest way, it seems to me, is to iterate over the available data and build a dictionary with the name as the key and a zero-based sequence number as the value, counting as you go. Of course, if a name already exists, don’t add it a second time.

Afterwards, for each “row” in your matrix, create a list of zeros whose length equals the count, and then set an item to one if its name occurs in the row. So for the third line, look up “lin”, “kin” and “rin” in the dictionary and fill the associated columns (the index is the dictionary value) with a 1. Append the list to your result list.
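The two steps above could be sketched roughly like this. The sample `rows` are made up for illustration (only the third row with “lin”, “kin” and “rin” comes from the description), and the sketch assumes each row has already been split into a list of names:

```python
# Hypothetical sample data: each row is a list of names, already split out.
rows = [
    ["lin"],
    ["kin", "rin"],
    ["lin", "kin", "rin"],
]

# Step 1: map each unique name to a zero-based column index,
# in order of first appearance. Existing names are not added twice.
column_index = {}
for row in rows:
    for name in row:
        if name not in column_index:
            column_index[name] = len(column_index)

# Step 2: for each row, start with all zeros and set a 1 in the
# column of every name that occurs in that row.
matrix = []
for row in rows:
    encoded = [0] * len(column_index)
    for name in row:
        encoded[column_index[name]] = 1
    matrix.append(encoded)

header = list(column_index)  # column names, in column order
print(header)
for encoded in matrix:
    print(encoded)
```

Since dictionaries keep insertion order, `header` lists the names in the same order as the columns of `matrix`.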

As you can see, this is a rather roundabout process. If you know more about how the result will be used, better structures are conceivable.

If you are only interested in whether a combination exists, create a set of (row name, attribute) pairs. When someone wants to know whether a “row” has a certain attribute, look up (row, attribute name) in the set; return 1 if it exists, 0 if it doesn’t. To make the meaning clearer, return True or False respectively.
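A minimal sketch of the set-based lookup, assuming rows are identified by their zero-based position (the sample data and the helper name `has_attribute` are hypothetical):

```python
# Hypothetical sample data: each row is a list of names.
rows = [
    ["lin"],
    ["kin", "rin"],
    ["lin", "kin", "rin"],
]

# One (row id, name) pair for every name that occurs in a row.
pairs = {(row_id, name) for row_id, row in enumerate(rows) for name in row}


def has_attribute(row_id, name):
    """Return True if the given row contains the given name."""
    return (row_id, name) in pairs


print(has_attribute(2, "kin"))
print(has_attribute(0, "rin"))
```

If the 1/0 form is needed after all, `int(has_attribute(row_id, name))` converts the result.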

If you want to know which rows contain a specific attribute, create a multidict with the attribute as the key and the rows that have that attribute as the values.
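Python has no built-in multidict, but a `collections.defaultdict` of sets behaves like one. A rough sketch, again with made-up sample data and rows identified by their zero-based position:

```python
from collections import defaultdict

# Hypothetical sample data: each row is a list of names.
rows = [
    ["lin"],
    ["kin", "rin"],
    ["lin", "kin", "rin"],
]

# Map each name to the set of row ids that contain it.
rows_by_name = defaultdict(set)
for row_id, row in enumerate(rows):
    for name in row:
        rows_by_name[name].add(row_id)

print(sorted(rows_by_name["kin"]))
# Use .get() for names that may be absent, so the lookup
# does not create an empty entry in the defaultdict.
print(rows_by_name.get("zin", set()))
```

Using sets as values means a name repeated within one row is still recorded only once for that row.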
