Processing CSV type file and creating new output using Python

I have a file ‘linked_policies.txt’ that looks like the following (small sample of data):

0000001G|11111111
0000002G|10000018
0000002G|10000320
0000002G|10000337
0000002G|10000343
101359B|10000018
101359B|16023380
504529A|10000018
504529A|15856008
504529A|16007139
504529A|16007151
504529A|16526483
620667G|16526483
0000003G|22222222
0000003G|33333333

The first column represents ‘policy’ and second column represents ‘customer’.

Policy 0000001G is not linked to any other policies as it contains only one customer number not linked to any other policies.

Similarly, Policy 0000003G is not linked to any other policies as it contains two customer numbers not linked to any other policies.

All of the other policies (0000002G, 101359B, 504529A and 620667G) are considered to be linked to each other because for instance policy 0000002G is directly linked to policies 101359B and 504529A (as they all share the same customer 10000018) but policy 0000002G is also indirectly linked to policy 620667G because even though they do not share a customer number directly between them, policy 620667G shares customer 16526483 with policy 504529A so therefore all are linked.

I want to process the input file and end up with a new ‘output_file.txt’ whereby for each policy, the policy is listed in the first column and in the second column (space delimited) the list of associated policy/ies are listed (including the policy in question itself). For example, from the sample input file following processing, file ‘output_file.txt’ would look as follows:

0000001G 0000001G
0000002G 0000002G|101359B|504529A|620667G
101359B 101359B|0000002G|504529A|620667G
504529A 504529A|0000002G|101359B|620667G
0000003G 0000003G

First column shows each unique policy number (from the input file) and the second column shows all of the policy/ies that this policy is directly or indirectly linked to.

Looking for a way for this to be done using Python (noting that the Python version available to me is earlier than version 3.6)

My (Python) script “linked_policies.py” attempt is as follows:

# Read the input data
with open('linked_policies.txt', 'r') as f:
    lines = f.readlines()

# Create a dictionary to store the customers and their associated policies
customer_policies = {}

# Iterate through the lines
for line in lines:
    # Split the line into policy and customer
    policy, customer = line.strip().split('|')

    # Add the customer to the customer_policies dictionary
    if customer not in customer_policies:
        customer_policies[customer] = set()
    customer_policies[customer].add(policy)

# Create a dictionary to store the linked policies
linked_policies = {}

# Link policies based on shared customers
for customer, policies in customer_policies.items():
    for policy in policies:
        if policy not in linked_policies:
            linked_policies[policy] = set(policies)
        else:
            linked_policies[policy].update(policies)

# Write the output to a file
with open('output_file.txt', 'w') as f:
    for policy, linked_policies_set in linked_policies.items():
        linked_policies_str = '|'.join(sorted(linked_policies_set))
        f.write('{} {}\n'.format(policy, linked_policies_str))

…when I run this:

python ./linked_policies.py

…the resulting output file “output_file.txt” looks like this:

620667G 504529A|620667G
504529A 0000002G|101359B|504529A|620667G
0000003G 0000003G
0000002G 0000002G|101359B|504529A
101359B 0000002G|101359B|504529A
0000001G 0000001G

The second, third and sixth lines are all correct.

The other lines are incorrect (first, fourth and fifth lines) as they are not correctly showing all of the linked policies.

For a given policy, its associated policy shares a customer either with that policy, or with one of its (already) associated policies (policy association is transitive).

Is it correct to say the single policy customers don’t matter for this calculation? So after they’re all removed, the ones that are left (customers named on mutiple policies) define 1 or more edges in a Graph (the nodes/vertices of which are the policies).

I’d look for a Graph analysis library to compute distinct connected sub graphs.

2 Likes

To ensure all linked policies are accurately captured, you’ll need to implement a more comprehensive method to explore the transitive connections between policies. One effective approach is to use a union-find (disjoint-set) data structure to manage the linkage of policies. This will help in efficiently managing and merging sets of linked policies.

Here’s a revised version of your script that incorporates this approach:


class UnionFind:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, item):
        if self.parent[item] != item:
            self.parent[item] = self.find(self.parent[item])
        return self.parent[item]

    def union(self, set1, set2):
        root1 = self.find(set1)
        root2 = self.find(set2)

        if root1 != root2:
            if self.rank[root1] > self.rank[root2]:
                self.parent[root2] = root1
            elif self.rank[root1] < self.rank[root2]:
                self.parent[root1] = root2
            else:
                self.parent[root2] = root1
                self.rank[root1] += 1

    def add(self, item):
        if item not in self.parent:
            self.parent[item] = item
            self.rank[item] = 0

# Read the input data
with open('linked_policies.txt', 'r') as f:
    lines = f.readlines()

# Initialize Union-Find structure
uf = UnionFind()

# Create a dictionary to store the customers and their associated policies
customer_policies = {}

# Iterate through the lines
for line in lines:
    # Split the line into policy and customer
    policy, customer = line.strip().split('|')

    # Add policies to union-find structure
    uf.add(policy)
    
    # Add the customer to the customer_policies dictionary
    if customer not in customer_policies:
        customer_policies[customer] = set()
    customer_policies[customer].add(policy)

# Link policies based on shared customers
for customer, policies in customer_policies.items():
    policies = list(policies)
    for i in range(len(policies)):
        for j in range(i + 1, len(policies)):
            uf.union(policies[i], policies[j])

# Group policies by their root parent
linked_policies = {}
for policy in uf.parent:
    root = uf.find(policy)
    if root not in linked_policies:
        linked_policies[root] = set()
    linked_policies[root].add(policy)

# Write the output to a file
with open('output_file.txt', 'w') as f:
    for root, policies in linked_policies.items():
        policies_str = '|'.join(sorted(policies))
        for policy in sorted(policies):
            f.write('{} {}\n'.format(policy, policies_str))

Explanation

  1. Union-Find Class: This class helps efficiently manage the merging and finding of policy sets.
  • find: Finds the root of the set containing the item.
  • union: Merges two sets.
  • add: Adds a new item to the union-find structure.
  1. Initialize Union-Find: Create an instance of the UnionFind class.
  2. Parse Input and Populate Union-Find:
  • Read each line, split into policy and customer.
  • Add each policy to the union-find structure.
  • Maintain a dictionary of customers and associated policies.
  1. Union Operations for Linked Policies:
  • For each customer, link all associated policies using the union operation.
  1. Group Policies by Root Parent:
  • After all union operations, group policies by their root parent.
  1. Write Output:
  • Write each policy and its linked policies to the output file.

This approach ensures that all policies directly or indirectly linked through shared customers are correctly grouped together.