Need ideas for a program that finds keywords!

Helllloooo new friends!

I have recently been hired as a software engineer. Wild, honestly. I have a training project to work on that requires me to write a program that finds and highlights keywords in a document.

The trouble is, I’m on a very secure network and I am limited in the libraries that I can access. I have tried manually installing the packages, but it doesn’t seem to be working. Maybe I’m not doing it right.

Any and all ideas on how to work around this are so welcome! I’m new to programming, and I could use some tips!

I’d probably use regex (the re module) to search for the keywords.
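Here's a minimal sketch of what that could look like with only the standard library. The function names and the sample keywords are made up for illustration, and it assumes a non-empty keyword list whose entries start and end with a letter or digit (otherwise `\b` won't behave the way you want).

```python
import re

def build_keyword_pattern(keywords):
    # Longest first, so "top secret" wins over "secret" at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps characters like "." or "+" in a keyword from acting as regex syntax.
    alternation = "|".join(re.escape(k) for k in ordered)
    # \b restricts matches to whole words; IGNORECASE matches "Secret" and "SECRET" too.
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

def find_keywords(text, keywords):
    """Return (start, end, matched_text) for every keyword hit in text."""
    pattern = build_keyword_pattern(keywords)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

hits = find_keywords("This Secret memo is Top Secret.", ["secret", "top secret"])
print(hits)
```

Keeping the start/end offsets around is handy later, because the highlighting step needs to know exactly where each match sits.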


Trying to “work around” your employer’s security is a good way to get fired and maybe even have charges filed against you. Don’t do it.

Instead, go to your supervisor and say “I need to use these external libraries to solve the problem. Can you give me permission to download them? Using those libraries will save 20 or 30 hours on this task.” (Or whatever your estimate is.)

If you get permission, then you will probably need to talk to the system administrators managing the network, so they can do it for you.

Depending on the company you work for, this might take five minutes, or five months. I have done work for businesses that had a legal obligation to do a full security and legal/copyright audit of any third-party software they used, with potential million-dollar fines if they got it wrong. You cannot rush those.

I’ve also worked for companies whose attitude to downloading third-party software was “try not to install any malware, m’kay?”

If you can’t get permission, or it will take too long, then you may need to reimplement some or all of the functionality you need using only the tools available to you.

That might be only the standard library, or any other third-party libraries or in-house libraries already approved for use.

If you explain in more detail what you are trying to do, we may be able to guide you.

  1. Which packages were you trying to install?
  2. What sort of keywords are you looking for?
  3. Where are you looking for them? What sort of documents?
  4. What are you supposed to do when you find them?

I definitely plan to ask, but like you said, I don’t know how long the process of getting those would take. I’ve just been really spoiled as a newbie with all of the libraries at my fingertips.

I’ve tried yake and nltk and a few others, but those are the two I want the most. I need to be able to search .doc, .csv, .pptx, and .html.

They want them highlighted/color-coded based on their classification level. Honestly, I’m just starting with a way to find the keywords before I get to the highlighting.
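For the color-coding part, one possible approach is to produce an HTML copy of the text with each keyword wrapped in a colored `<mark>`. This is only a sketch: the keyword-to-level table, the level names, and the colors below are all invented placeholders, and it works on plain text rather than on an existing HTML document.

```python
import html
import re

# Hypothetical tables; the real keywords, levels, and colors come from your employer.
# Keys in LEVELS must be lowercase because matches are lowercased before lookup.
LEVELS = {"alpha": "restricted", "bravo": "confidential"}
COLORS = {"restricted": "#ffd54f", "confidential": "#ff8a65"}

def highlight_as_html(text, levels=LEVELS, colors=COLORS):
    """Return text as HTML with each keyword wrapped in a colored <mark>."""
    ordered = sorted(levels, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE
    )
    parts, last = [], 0
    for m in pattern.finditer(text):
        # Escape the untouched text between matches so "&" and "<" stay literal.
        parts.append(html.escape(text[last:m.start()]))
        color = colors[levels[m.group(0).lower()]]
        parts.append(
            f'<mark style="background:{color}">{html.escape(m.group(0))}</mark>'
        )
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)

print(highlight_as_html("Alpha & bravo"))
```

Highlighting inside an existing .html file is harder than this, since you have to avoid touching tags and attributes, but the matching logic stays the same.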

I’ll give it a shot! Thanks!

If you have not looked into the structure of the Microsoft Office documents yet, I have to warn you that they are pretty complicated. Working with these formats would require a huge amount of work if you do not use existing libraries. In addition, .doc (unlike .docx) is a proprietary binary format. Microsoft has published a specification for it, but it is long and hard to implement.
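To give a feel for the structure: a .pptx file is a ZIP archive of XML parts, so reading the text out is possible with the standard library alone. The sketch below assumes slide text sits in `a:t` elements inside `ppt/slides/slideN.xml`, which is how the format normally stores it. The function name is made up, and it ignores notes, charts, and the real slide order defined elsewhere in the package. Writing highlights back without damaging the file is the much harder half.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs in Office Open XML files.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")

def pptx_text(file):
    """Yield (slide_file_number, text_run) pairs from a .pptx path or file object."""
    with zipfile.ZipFile(file) as archive:
        slides = []
        for name in archive.namelist():
            m = SLIDE_RE.match(name)
            if m:
                slides.append((int(m.group(1)), name))
        # Sort by the number in the file name; this is not guaranteed to be
        # the presentation order, which lives in the package relationships.
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            for node in root.iter(A_NS + "t"):
                # One visible word can be split across several runs,
                # so a keyword may not appear whole in a single run.
                if node.text:
                    yield number, node.text
```

Usage would be something like `for number, text in pptx_text("deck.pptx"): ...`, where `deck.pptx` is a placeholder name.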

IMHO it does not make sense to reinvent the wheel instead of using existing libraries. The only exception is if you are willing to invest a huge amount of time (or your employer is willing to pay for it) and treat it as a very large exercise with little practical use. Code written without established, maintained libraries would also be much harder to maintain.

.csv is a text format. There is no formatting (for color-highlighting) available there.
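Since a .csv can't carry colors, one option is to report where each keyword was found instead. A small sketch using the standard `csv` module follows; the function name and the sample data are invented for illustration.

```python
import csv
import io
import re

def find_in_csv(lines, keywords):
    """Return (row, column, matched_text) for each keyword hit, 1-based."""
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE
    )
    hits = []
    # csv.reader handles quoted cells, so a comma inside quotes stays in one cell.
    for row_no, row in enumerate(csv.reader(lines), start=1):
        for col_no, cell in enumerate(row, start=1):
            for m in pattern.finditer(cell):
                hits.append((row_no, col_no, m.group(0)))
    return hits

sample = io.StringIO('id,note\n1,"alpha, then bravo"\n2,nothing here\n')
csv_hits = find_in_csv(sample, ["alpha", "bravo"])
print(csv_hits)
```

For a real file you would pass `open(path, newline="")` instead of the `StringIO` object.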


I’ve definitely looked into it, which is a big reason I reached out to the internet. It seems like a nearly impossible task. The good news is that I have a lot of time to complete this. Lol

I completely agree. Really hoping it’s a quick process to get some libraries. Any particular recs before I reach out to my IT dept?

  1. Explain the situation. In a normal company during the onboarding phase there should be frequent communication with the new employee.
  2. Start designing the application to be independent of the file formats. One way could be to have an iterator which provides chunks of normalized text content from the input file. Decide how to define the chunks, for example whether you need to match text that crosses chunk boundaries. Then you will need a function (or a method) that lets you highlight parts of the current (and previous?) chunk.
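A rough sketch of the iterator idea in point 2, using only the standard library. All names here (`Chunk`, `csv_chunks`, `html_chunks`, `iter_chunks`) are hypothetical, the readers take the file content as a string to keep the example short, and it does not try to solve matches that cross chunk boundaries.

```python
import csv
import io
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Iterator

@dataclass
class Chunk:
    location: str  # where the text came from, e.g. "row 2, col 1"
    text: str      # normalized text to search

class _TextCollector(HTMLParser):
    """Collects non-empty text nodes (this also picks up script/style content)."""
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        if data.strip():
            self.pieces.append(data.strip())

def csv_chunks(source: str) -> Iterator[Chunk]:
    # One chunk per cell.
    for r, row in enumerate(csv.reader(io.StringIO(source)), start=1):
        for c, cell in enumerate(row, start=1):
            yield Chunk(f"row {r}, col {c}", cell)

def html_chunks(source: str) -> Iterator[Chunk]:
    # One chunk per text node.
    parser = _TextCollector()
    parser.feed(source)
    parser.close()
    for i, piece in enumerate(parser.pieces, start=1):
        yield Chunk(f"text node {i}", piece)

# Adding a format means adding one reader here; the search code never changes.
READERS = {".csv": csv_chunks, ".html": html_chunks}

def iter_chunks(extension: str, source: str) -> Iterator[Chunk]:
    return READERS[extension.lower()](source)

for chunk in iter_chunks(".html", "<p>Hello <b>world</b></p>"):
    print(chunk.location, "->", chunk.text)
```

The keyword search then only ever sees `Chunk` objects, so the same matching code runs against every format you add a reader for.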