Why is SET keeping duplicates?

Mechanix2Go · July 4, 2022, 12:31pm

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32



punc = "~`!#$%^&*()'+=?<>|/\t \\,{}[]:;"  # \t is a tab, \\ is a literal backslash


with open("in.txt", "r") as intxt:
    with open("out.txt", "w") as outtxt:
        adds = set()
        k = ""
        for line in intxt:
            for c in line:
                if c in punc or ord(c) > 127:
                    k += " "
                else:
                    k += c
        adds.add(k)
        for a in adds:
            outtxt.write(a)
print(adds)

# output
# {'ABC\nDEF\nABC\n'}

vbrozik · July 4, 2022, 1:37pm

Your code does not show up correctly because it is formatted as Markdown here.

Could you please edit your post and put triple backticks around the lines with the code?

```
# Your code will be here.
```

If you cannot find them on your keyboard, copy them from my post or use the “Preformatted text” </> button of the forum editor.

Mechanix2Go · July 4, 2022, 1:53pm

Thank you for your help. I will do what you said.

steven.daprano · July 4, 2022, 1:49pm

Your set has a single element, a string of length 12: 'ABC\nDEF\nABC\n'.

There are no duplicates, because there is only one element.

Of course the string element itself can include the same letter more than once:

print( {'moon', 'sun', 'sun', 'apple', 'banana'} )

# Output is {'moon', 'sun', 'banana', 'apple'} (the order of the elements may vary)
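
To make the difference concrete, here is a minimal sketch. The sample string is the one from the posted output; the names `one_element` and `per_line` are made up for illustration, and the sketch assumes the goal was to deduplicate whole lines, which the original post does not say.

```python
text = 'ABC\nDEF\nABC\n'

# What the posted code builds: a set holding the whole string as one element.
one_element = {text}

# Assumed intent: one element per line, so repeated lines collapse.
per_line = set(text.splitlines())

print(len(one_element))  # 1
print(len(per_line))     # 2
print(sorted(per_line))  # ['ABC', 'DEF']
```

A set only removes duplicates among its elements; it never looks inside an element.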

If you explain what output you expected, maybe we can help you.

Mechanix2Go · July 4, 2022, 2:03pm

Thank you for your help.
I’ll work on it.

vbrozik · July 4, 2022, 2:07pm

Inside the for loops you append the characters to the string k.

When both the loops are finished, you add the whole constructed string to the set as a single element:

        adds.add(k)

So the set adds contains only the single string.

I am guessing that you want to have the individual characters in the set adds, right? There are two basic possibilities.

a) You do not need the string k → then instead of k += something do:

adds.add(something)

b) You need k → then instead of adds.add(k) create the set just at the end by filling it from the individual characters of k:

adds = set(k)
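
A small sketch of both options side by side may help. The sample text is assumed (taken from the posted output), and `adds_a` and `adds_b` are illustrative names, not part of the original program.

```python
text = "ABC\nDEF\nABC\n"  # assumed sample input

# Option a: no string k, add each character to the set as it is read.
adds_a = set()
for c in text:
    adds_a.add(c)

# Option b: build k first, then create the set from its characters.
k = ""
for c in text:
    k += c
adds_b = set(k)

print(adds_a == adds_b)  # True
print(sorted(adds_a))    # ['\n', 'A', 'B', 'C', 'D', 'E', 'F']
```

Both routes give the same set of characters; option b is only worth it if k is needed for something else.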

Please show us the result.

Mechanix2Go · July 4, 2022, 3:28pm

Thank you. I’ll get on it. You can tell I’m new at this.

mlgtechuser · July 5, 2022, 6:09am

Hi @Mechanix2Go. I find that messing around with strings is always a good place to launch into general-purpose programming languages. Python is especially fun for string manipulation!

I see from the title that there are duplicates in a set and that those duplicates are presumably undesirable.
Some sleuthing of your code block reveals adds = set() and a bracketed output set, {'ABC\nDEF\nABC\n'}, with duplicates, so this is probably the culprit. I love puzzles, but I prefer not to guess at what a poster is trying to accomplish and would like help with, even when there are plenty of clues, as in this case. Here are some useful guidelines to help the community here provide help:

Paste some input data.
Paste enough output data to show the undesired result and its context.
Show the output you expected as a corrected version of the output you’re currently getting.

steven.daprano · July 5, 2022, 8:56am

Your analysis is not quite accurate.

Try this to get a better idea of what is going on:

aset = set()
for c in ('a', 'b', 'c', 'a', 'b', 'c'):
    # No need to check if the element is already in the set.
    aset.add(c)

print(aset)

astring = ''
for c in ('a', 'b', 'c', 'a', 'b', 'c'):
    astring += c

aset.add(astring)
print(aset)

Mechanix2Go · July 7, 2022, 5:50pm

Thank you, I will.

Mechanix2Go · July 14, 2022, 3:18pm

Thank you for your help here.
I started out to extract email addresses by stripping punctuation except @._-
I thought to use a set to eliminate dups.
That’s when I got wrapped around the axle.
This is working:


import re


punc = "~`!#$%^&*()'+=?<>|/\t \\,{}[]:;"  # \t is a tab, \\ is a literal backslash

with open("in.txt", "r") as intxt:
    with open("out.txt", "w") as outtxt:
        k = ""
        for line in intxt:
            for c in line:
                if c in punc or ord(c) > 127:
                    k += " "
                else:
                    k += c
        while "  " in k:
            k = re.sub("  ", " ", k)
        k = re.sub(" ", "\n", k)
        outtxt.write(k)


with open("out.txt", "r") as src:
    with open("email.txt", "w") as dest:
        a = set()
        for s in src:
            if "@" in s:
                a.add(s)
        for e in a:
            dest.write(e)

vbrozik · July 14, 2022, 5:16pm

It is great that you got it working. Some suggestions:

I see that you started to use regexes. You can simplify the first part greatly by defining the characters to include (instead of the characters to exclude) and by using re.findall().

import re

test_string = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
"""

re.findall(r'\w+', test_string)  # extracts alphanumeric words into a list
re.findall(r'[-@.\w]+', test_string)  # rough extraction of email addresses

Later you can improve the regex to extract only parts which contain @. You can find regexes to match email addresses on the internet. This will eliminate a lot of processing, make the program more robust and simplify the rest of the code.
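
As a rough sketch of what such an improvement might look like, the example below compares the rough pattern with a stricter one. The stricter pattern and the name `EMAIL_RE` are assumptions made for illustration; it is a simplification, not a complete email address matcher.

```python
import re

# Hypothetical, simplified pattern: local part, "@", then a dotted domain.
# It does not cover every valid address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

sample = "Contact alice@example.com, or bob@example.org. Also alice@example.com again."

rough = re.findall(r"[-@.\w]+", sample)  # rough pattern: every token
strict = EMAIL_RE.findall(sample)        # only tokens shaped like an address

print(rough)
print(sorted(set(strict)))  # ['alice@example.com', 'bob@example.org']
```

The rough pattern returns ordinary words too and keeps a sentence-ending dot, as in 'bob@example.org.', while the stricter one returns only address-shaped matches.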
If you are creating the file out.txt just for the second part of the program, create a list instead. It will be much faster and it will not leave residues on your file system.
It is better to move the code which does not need to be inside a context manager out of it. Then you do not occupy the resources when they are not needed. For example the second part can be reworked like this:

a = set()
with open("out.txt", "r") as src:
    for s in src:
        if "@" in s:  # This condition can be removed after improving the regex.
            a.add(s)
with open("email.txt", "w") as dest:
    for e in a:
        dest.write(e)

As a by-product the code is less indented and easier to read.
Try to use descriptive variable names. This improves the code readability a lot. Good comments further improve the ability to understand the code. For example:

emails = set()
with open("out.txt", "r") as source_file:
    for word in source_file:  # The file contains one word per line.
        if "@" in word:
            emails.add(word)
with open("email.txt", "w") as destination_file:
    for email in emails:
        destination_file.write(email)  # email is already terminated by a newline
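
Putting the suggestions together, the whole job could be sketched without the intermediate out.txt, keeping the words in a list in memory. The function name `extract_emails`, the sample text, and the use of a temporary directory are assumptions made so that the example runs on its own; the regex is the rough one from above.

```python
import re
import tempfile
from pathlib import Path


def extract_emails(text):
    """Return the unique tokens that contain "@" (rough extraction)."""
    words = re.findall(r"[-@.\w]+", text)  # list in memory, no out.txt
    return {word for word in words if "@" in word}


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    source_path = Path(tmp) / "in.txt"
    destination_path = Path(tmp) / "email.txt"
    source_path.write_text(
        "Write to alice@example.com; or <bob@example.org>\nalice@example.com\n",
        encoding="utf-8",
    )

    emails = extract_emails(source_path.read_text(encoding="utf-8"))
    # Sorting makes the output order predictable; a set has no fixed order.
    destination_path.write_text("\n".join(sorted(emails)) + "\n", encoding="utf-8")
    result = destination_path.read_text(encoding="utf-8")

print(result)
```

Unlike the original character filter, `\w` also matches non-ASCII letters; passing `flags=re.ASCII` to `re.findall()` restricts it to ASCII if that matters.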
