Punctuation problems

I have an unknown String of multiple words which may, or may not, include punctuation.

For this assignment I’ve been tasked with converting it to pig-latin. I’ve figured out how to transform the words correctly, but my code fails when there is punctuation involved.

I’ve found several ways to remove punctuation from a string, but what I’m actually trying to do is pull the punctuation out of the original string (not 100% sure if strip() would work), modify the words, and then add the original punctuation back into the new string at the appropriate location. (Since every word is followed by “ay”, that location would shift by two index positions.)

I cannot assume that the punctuation will always be at the end of the string so simple slicing won’t work.

This is the code I have so far:

def pig_it(text):
    S_strings = text.split()
    S_list = []

    for x in S_strings:
        S_list.append(x[1:] + x[0] + 'ay')

    return ' '.join(S_list)

My initial split keeps the punctuation attached at the correct location, so maybe working with only those elements of the split string would be better? I’m just not sure if I’m approaching this the correct way… should I be pulling out the punctuation instead and storing it to be added back later, two index positions higher?

I would consider looking into a regular expression, particularly re.split(). Maybe a pattern that splits on non-word characters? Since this is an assignment, I’ll let you dig a little deeper rather than spelling it out. Hopefully this helps.


I’d suggest using regex and re.sub to match the words and replace them with the new word. (re.sub will accept a function as the replacement.)


If I’m understanding what I read correctly,

w = re.compile(r'\w+') 
w.split(text) 
# This should only split the text into alphanumeric characters which would not include punctuation.

p = re.compile(r'\W')
# I think this would locate any punctuation in the string.  

My concern is losing the index position of the punctuation when splitting, since it has to be added back later, after the “ay”. Possibly use enumerate to find the position and then re-insert it at the new, shifted index position?

Also, what if there are contractions in the text? I would need the RE to allow for the apostrophe (').
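
One possible way to keep contractions together, as a rough sketch: treat a word as a run of word characters optionally joined by an apostrophe to more word characters. The pattern and the name WORD_RE are illustrative assumptions, not part of the assignment.

```python
import re

# Hypothetical pattern: word characters, optionally joined by a straight
# or curly apostrophe, so "can't" stays a single token.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")

words = WORD_RE.findall("I can't, won't stop!")
print(words)  # ['I', "can't", "won't", 'stop']
```

An apostrophe that is not between word characters (a closing quote, for example) is still left out of the match.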

You would use an inverse word pattern to select anything not a word.

>>> import re
>>> re.split(r'[^\w]+', 'this, that, another')
['this', 'that', 'another']

What @MRAB suggests will actually target each word for replacement, as opposed to splitting the string into a list of words. I originally made a suggestion to go along with what you already had. Using re.sub would require you to adjust your logic, but it is a great solution as well.


Here’s a simple example that converts words to titlecase:

def change_word(match):
    word = match.group()
    return word.title()

text = 'hello, world!'
changed_text = re.sub(r'\w+', change_word, text)
print('Original:', text)
print('Changed :', changed_text)

By Isaac Muse via Discussions on Python.org at 14Sep2022 20:53:

You would use an inverse word pattern to select anything not a word.

>>> import re
>>> re.split(r'[^\w]+', 'this, that, another')
['this', 'that', 'another']

BTW, you can spell [^\w]+ as \W+, which is more readable.

Cheers,
Cameron Simpson cs@cskk.id.au

Yeah, I know, I thought about that, but sometimes I think people might overlook the casing, so I think sometimes [^\w]+ can actually be more readable. This is a subjective argument though, so :person_shrugging:.

Fair enough.

Regular expressions require such precision that I tend to feel that people should expect to have to read them really carefully anyway :frowning:

So I tend to go for succinct if possible; at least there’s less stuff to parse.


OK so now I’m really confused lol… I tried the \W, which I understood was the same as the code below, but neither of them identifies and prints only the non-alphanumeric characters… i.e. punctuation…

import re
test = 'whats up doc?'
punc = "[^a-zA-Z0-9]"
print(re.split(punc, test))

['whats', 'up', 'doc', '']

I would expect it to give me something more like:
[' ', ' ', ' ', '?']

I want to try to work out a solution for this the way I’m envisioning it. I’m probably pushing time through Scott instead of Scott through time, but I have to try it my way lol.

import string
test = 'whats, up doc?'

for x in test:
    if x in string.punctuation:
        print('punctuation ' + x)

This gets me closer:

punctuation ,
punctuation ?

However I need to tie in an index point somehow to the output.
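
A minimal sketch of one way to tie an index to each punctuation character, staying with the loop above: enumerate() yields (index, character) pairs. The name positions is just an illustrative choice.

```python
import string

test = 'whats, up doc?'

positions = []
# enumerate() pairs each character with its index in the string.
for i, ch in enumerate(test):
    if ch in string.punctuation:
        positions.append((i, ch))

print(positions)  # [(5, ','), (13, '?')]
```

Note that these indexes refer to the original string, so they shift once the words change length.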

By Brad Westermann via Discussions on Python.org at 15Sep2022 01:26:

OK so now I’m really confused lol… I tried the \W, which I understood
was the same as the code below, but neither of them identifies and prints
only the non-alphanumeric characters… i.e. punctuation…

import re
test = 'whats up doc?'
punc = "[^a-zA-Z0-9]"
print(re.split(punc, test))

['whats', 'up', 'doc', '']

I would expect it to give me something more like:
[' ', ' ', ' ', '?']

With .split() you’re specifying a regexp for the text which
separates the strings of interest. You probably want .findall().

[…]

However I need to tie in an index point somehow to the output.

.findall() returns the matched strings; the closely related .finditer()
yields match objects, which describe what part of the string they matched.

Cheers,
Cameron Simpson cs@cskk.id.au


Unfortunately, I now realize you want to preserve all the punctuation, not just return all the words in pig latin. I honestly didn’t look closely enough. Splitting is really not going to be the way forward for you. I’m probably responsible for confusing you further :sweat_smile:.

I also see now that you are really wanting to work through this, so maybe this will help instead of confuse :upside_down_face: . Regex, when it returns a match object, allows you to get start and end indexes of your match. You can use this to find words, punctuation, whatever.

import re

pattern = re.compile(r'\w+')

text = "whats, up doc?"

for m in pattern.finditer(text):
    print(text[m.start():m.end()])

whats
up
doc

Now, re.sub basically does all of this for you: it finds all the words, sends each one to a function to alter it, and then replaces the original word with your new word. It’s the quickest and easiest solution, but if you want to work through it manually, you can, using what I showed above.
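
For the manual route, here is a rough sketch of how the start and end indexes could be used to rebuild the string. The function names pig_word and pig_it_finditer are made up for illustration, and the word rule is the simple "first letter to the end" one from the original post.

```python
import re

def pig_word(word):
    # Simple rule from the original post: first letter to the end, plus 'ay'.
    return word[1:] + word[0] + 'ay'

def pig_it_finditer(text):
    pieces = []
    last_end = 0
    for m in re.finditer(r'\w+', text):
        pieces.append(text[last_end:m.start()])  # punctuation/spaces before this word
        pieces.append(pig_word(m.group()))
        last_end = m.end()
    pieces.append(text[last_end:])  # whatever follows the last word
    return ''.join(pieces)

print(pig_it_finditer("whats, up doc?"))  # hatsway, puay ocday?
```

Because everything between the matches is copied over untouched, the punctuation never needs its index recomputed.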

As Cameron replied, split() uses the regex as a separator.[1] I recommend looking at the documentation of re.split() again:

There you will find that you can get both the separated strings and the separators (interleaved) if you enclose the separator regex in a capture group. A basic capture group is marked just by parentheses ().

Try the code below, which is just your code above cleaned up and with the capture group added:

import re

test_text = 'whats up doc?'
punctuation_re = r'(\W+)'
print(re.split(punctuation_re, test_text))
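
As a sketch of where this can lead (not the only way to do it): with the capture group, words land at even indexes and separators at odd indexes, so the words can be changed in place and everything joined back together. The word rule below is the simple one from the original post.

```python
import re

test_text = 'whats up doc?'
pieces = re.split(r'(\W+)', test_text)
print(pieces)  # ['whats', ' ', 'up', ' ', 'doc', '?', '']

# Words sit at even indexes, separators at odd indexes.
for i in range(0, len(pieces), 2):
    word = pieces[i]
    if word:  # the split can yield empty strings at the edges
        pieces[i] = word[1:] + word[0] + 'ay'

result = ''.join(pieces)
print(result)  # hatsway puay ocday?
```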

Anyway, I think the straightforward solution should use re.sub(). The code was already shown by Matthew. You just need to adjust it for your particular needs.


A good exercise would be to implement the solution in all three ways: re.sub(), re.split() and re.finditer().


  1. …but IMHO the suggested findall() will make the code more complicated. ↩︎

This is a good example of something I learned many years ago in a computer science course: writing code should be the last thing you do when solving a problem.

First you need to work out how to solve the problem. Only then should you start writing code.

Since pig latin operates a word at a time, start by splitting the string into words, piggify each word individually, and reassemble.

When piggifying the words, the rules are:

  1. If the word begins with a vowel (AEIOU and sometimes Y) just append ‘yay’, ‘hay’ or ‘way’ to the end.
  2. If the word begins with at least one consonant (any letter not a vowel), move the entire cluster of consonants to the end of the word and follow with ‘ay’.
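
The two rules above could be sketched roughly as follows. This is only one reading of them: it picks 'way' for rule 1, ignores the "sometimes Y" case entirely, and assumes the word is non-empty and contains only letters. The name piggify is illustrative.

```python
VOWELS = 'aeiou'

def piggify(word):
    # Rule 1: word starts with a vowel -> append 'way' (one of the listed options).
    if word[0].lower() in VOWELS:
        return word + 'way'
    # Rule 2: move the leading consonant cluster to the end, then add 'ay'.
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            return word[i:] + word[:i] + 'ay'
    # No vowel at all (e.g. "hmm"): an arbitrary choice, just append 'ay'.
    return word + 'ay'

print(piggify('smile'))  # ilesmay
print(piggify('apple'))  # appleway
```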

For the purposes of this, we can treat “punctuation” as any character that is not a letter. To deal with punctuation, there are only three cases to consider:

  1. The word begins with punctuation: split the word into (punctuation + rest of word), piggify the rest of the word, then join them back together.
  2. The word ends with punctuation: do the same.
  3. The punctuation is in the middle of the word.

Only the third case is tricky. If you are using rule 1 above, you don’t need to do anything special with the punctuation. Just append ‘yay’ as usual.

When using rule 2, you have to decide whether the punctuation counts as
part of the cluster or not. E.g.:

smile -> ilesmay
s-mile -> miles-ay  # keep the 's-' as the cluster
s-mile -> iles-may  # keep the 's-m' as the cluster

Which you do is up to you.
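
Cases 1 and 2 can be handled together by peeling the non-letters off both ends of a token. Here is a small sketch, assuming tokens come from str.split() (so they contain no whitespace) and that "letter" means ASCII A-Z; EDGES_RE and split_edges are hypothetical names.

```python
import re

# Leading non-letters, the core of the word, trailing non-letters.
EDGES_RE = re.compile(r'^([^A-Za-z]*)(.*?)([^A-Za-z]*)$')

def split_edges(token):
    lead, core, trail = EDGES_RE.match(token).groups()
    return lead, core, trail

print(split_edges('"hello!"'))  # ('"', 'hello', '!"')
print(split_edges('s-mile,'))   # ('', 's-mile', ',')
```

The core can then be piggified and glued back between lead and trail; punctuation in the middle (case 3) stays inside the core for you to deal with as you see fit.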

A few more comments.

An alternative to rule 1 is to treat it the same as rule 2, that is, move the vowel + consonant(s) to the end of the word and follow with ‘ay’.

It is sometimes tricky to decide when Y is acting as a vowel but I think a simple rule that is mostly correct is:

  • If the word ends with Y, it is a vowel (e.g. “bay”, “lazy”).
  • If the Y is between two consonants, it is a vowel (“cycle”, “myth”).
  • If the word starts with Y and is followed by a consonant, it is a vowel (“Yvonne”).
  • If the word starts with Y and is followed by a vowel, it is a consonant (or semi-vowel to be technical), (e.g. “yacht”).
  • Anything else, your guess is as good as mine :slight_smile:
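
A rough sketch of that heuristic in code. The function name y_is_vowel is made up, any character that is not one of AEIOU counts as a consonant here, and the final "anything else" branch returns False purely as an arbitrary default.

```python
def y_is_vowel(word, i):
    """Return True if the 'y' at index i of word acts as a vowel (rough heuristic)."""
    vowels = 'aeiou'
    w = word.lower()
    if i == len(w) - 1:            # word ends with Y: "bay", "lazy"
        return True
    next_is_vowel = w[i + 1] in vowels
    if i == 0:                     # word starts with Y: "Yvonne" vs "yacht"
        return not next_is_vowel
    prev_is_vowel = w[i - 1] in vowels
    if not prev_is_vowel and not next_is_vowel:  # between consonants: "myth"
        return True
    return False                   # anything else: arbitrary default

print(y_is_vowel('myth', 1))   # True
print(y_is_vowel('yacht', 0))  # False
```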

I would give you 1000 hearts if I could! This is exactly the way my brain has been trying to do this! I write out in a sentence what I need to do and then try to figure out how to code each section of the sentence. Since coding is technically a “language”, this would be my first secondary language and from my miserable year of Spanish one in high school I remember the pain of noun/verb order being backwards from the way I think. So naturally one of my challenges is making sure that my sentence is looking at the problem in the proper order.

I wasn’t seeing the problem this way, and now it makes sense why everyone was suggesting the re.sub() to make this happen. I was thinking of the problem like two rows of teeth with the upper being the split words and the lower being the punctuation and just trying to line them up correctly to fit back together after all the words were modified.

Ok it’s going to take me some time to play with it and make it work but seeing things from a different perspective you hadn’t considered before is one of the things I love about this journey!

I would really appreciate any recommendations of additional textbooks or classes that might be helpful in expanding my conceptual thinking as it would be relevant to coding. I never took calculus but I know functions are a part of that… not sure how relevant that is to coding.

In coding, a function is a subtask, usually named. In your first post you wrote ‘pig_it’; that’s a function.

Well I worked out the majority of the problem:

import re

test = "What would you do for a Klondike bar?"

puncp = r'(\W$)'
punc = re.findall(puncp, test)

wordP = r'(\w+)'
words = re.findall(wordP, test)

Pigs = []

for x in words:
    Pigs.append(x[1:] + x[0] + 'ay')

Merge = ' '.join(Pigs + punc)

print(Merge)

hatWay ouldway ouyay oday orfay aay londikeKay arbay ?

Now I have to work out how to remove the space between the punc and the last Pigs item only, without affecting the other words.

My Logic:
I would need to attempt to extract and modify it separately before the string concatenates and then add that as a third variable in the .join()

Or try to work out how to re.sub the modifications back into the original string (or a new variable of the original string so the original is not corrupted by my incompetence LOL.)

I have to say my brain is barely comprehending list comprehensions, so the prospect of figuring out how to mesh the RE results into another list is daunting.

I’m assuming I would replace “Pigs.append” with the re.sub call, since it still needs to iterate through the “words”, but I’m not really clear on the correct format for plugging the result into something, for example:

for x in words:
    re.sub(rawstring match criteria , what to replace it with (x[1:] + x[0] + 'ay'))

ugh… I need to go clean the blood out of my ears now…lol

You’re overthinking it. I posted an example for title-casing the words 4 days ago; just change the title-casing to piggifying.


You hardcoded the program to handle punctuation only at the very end of the text. It also ignores the original spaces and creates new ones. You have chosen the least suitable function of the three suggested: re.sub(), re.split(), re.finditer() (which is only slightly different from re.findall()).

As Matthew said, you just need to play with his code and slightly modify it.
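
For reference, a minimal sketch of what that modification might look like, keeping the simple "first letter to the end" rule from the original post. The helper name pig_word is illustrative; re.sub calls it once per match and leaves everything between the matches untouched.

```python
import re

def pig_word(match):
    word = match.group()
    # Simple rule from the original post: first letter to the end, plus 'ay'.
    return word[1:] + word[0] + 'ay'

def pig_it(text):
    # Only the \w+ matches are replaced; spaces and punctuation stay in place.
    return re.sub(r'\w+', pig_word, text)

print(pig_it("What would you do for a Klondike bar?"))
# hatWay ouldway ouyay oday orfay aay londikeKay arbay?
```

There is no space before the question mark now, because the original separators are never split off in the first place.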