How to extract heading from a resume using Python

jameel.dyooti.com · June 20, 2022, 5:50am

I am working on extracting section titles from candidate resumes in Python. I have already parsed the PDF or DOC file into plain text, and now I need to identify the headings in it.
For example:
Objective: some text…
Education: some text…

I want to identify headings such as Objective and Education, along with any other headings, while parsing a resume.
Is there any way to achieve this in Python?

Thanks in advance!!

erlendaasland · June 20, 2022, 7:50am

A simple solution could be to split the file into a list of lines, loop over that list and identify the ones that start with "Objective: ", "Education: ", etc., and then use, for example, basic string operations to extract the data.
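
As a rough sketch of the line-by-line idea, assuming the headings are known in advance (the `KNOWN_HEADINGS` tuple and the function name are made up for illustration):

```python
# Hypothetical list of headings to look for; extend it to suit your data.
KNOWN_HEADINGS = ("Objective", "Education")


def extract_known_sections(text, headings=KNOWN_HEADINGS):
    """Map each known heading to the text that follows it on the same line."""
    found = {}
    for line in text.splitlines():
        stripped = line.strip()
        for heading in headings:
            prefix = heading + ":"
            if stripped.startswith(prefix):
                # Basic string slicing: keep what comes after "Heading:".
                found[heading] = stripped[len(prefix):].strip()
                break
    return found


sample = "Objective: some text\nEducation: some text\n"
print(extract_known_sections(sample))
```

This only finds headings that are already in the list, so it is a starting point rather than a general solution.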

Alternatively, you could look to the re module and use match groups, which will probably end up being a more robust solution, but it will add complexity.
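
A minimal sketch of the `re` approach with named match groups might look like this. The pattern is an assumption about what a heading looks like (a short run of letters and spaces followed by a colon), not something derived from real resumes, and `find_headings` is a made-up name:

```python
import re

# Assumed heading shape: starts with a letter, at most about 40 characters
# of letters, spaces, "&" or "/", then a colon. Anything after the colon
# is captured as the rest of the line.
HEADING_RE = re.compile(
    r"^\s*(?P<heading>[A-Za-z][A-Za-z &/]{0,40}?)\s*:\s*(?P<rest>.*)$"
)


def find_headings(text):
    """Return (heading, rest_of_line) pairs for lines matching HEADING_RE."""
    results = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            results.append((match.group("heading"), match.group("rest").strip()))
    return results


print(find_headings("Career Objective: \nxxxxx\nPERSONAL INFO:YYY"))
```

A pattern like this will also match ordinary lines that happen to contain a colon (for instance "Email: ..." or a URL), so expect to filter the results.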

Good luck!

jameel.dyooti.com · June 20, 2022, 8:08am

Thanks for your reply.

I have checked your suggestions and they were quite helpful. But in my case the headings will differ from resume to resume; they may not be the same as Objective and Education. So is there a general way to identify just the headings?

Any suggestions are welcome!!

erlendaasland · June 20, 2022, 8:51am

I’d suggest writing down the various data input scenarios using paper and pencil, and drawing a basic flow chart for the desired solution. After you’ve done that, proceed to converting the flow chart to a piece of code. It is (almost) always easier to implement a solution (no matter the programming language) when you’ve got it visualised on paper, using arrows and boxes.

cameron · June 20, 2022, 10:55pm

By Erlend E. Aasland via Discussions on Python.org at 20Jun2022 09:01:

I’d suggest writing down the various data input scenarios using paper and pencil, and drawing a basic flow chart for the desired solution. After you’ve done that, proceed to converting the flow chart to a piece of code. It is (almost) always easier to implement a solution (no matter the programming language) when you’ve got it visualised on paper, using arrows and boxes.

Also, get a few example documents.

I can imagine a plain text document having headers in both of these forms:

single line header here

paragraph
of text
under the header

and:

header here: paragraph of
text following on

You may want to recognise either or both. How well that works
probably depends on how well your PDF-to-plain-text conversion went.

So get a few header+paragraph examples; we may be able to help with
heuristics to recognise them. Don’t forget to obfuscate personal
information if you’re working off real world personal resumes.
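
To make the two layouts concrete, here is a rough heuristic sketch. The rules are assumptions, not tested against real resumes: a standalone header is a short line with a blank line on both sides, and an inline header is a short label before a colon. `classify_lines` and `max_words` are illustrative names.

```python
import re

# Inline form: a short label, a colon, then optional text on the same line.
INLINE_RE = re.compile(r"^(?P<head>[^:]{1,40}):\s*(?P<rest>\S.*)?$")


def classify_lines(text, max_words=5):
    """Return (kind, header) pairs, where kind is 'inline' or 'standalone'."""
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue
        match = INLINE_RE.match(stripped)
        if match and len(match.group("head").split()) <= max_words:
            found.append(("inline", match.group("head").strip()))
            continue
        # Standalone form: short line with a blank line before and after.
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i + 1 < len(lines) and not lines[i + 1].strip()
        short = len(stripped.split()) <= max_words
        if prev_blank and next_blank and short and not stripped.endswith("."):
            found.append(("standalone", stripped))
    return found


sample = """single line header here

paragraph
of text
under the header

header here: paragraph of
text following on
"""
print(classify_lines(sample))
```

A one-line paragraph sitting between blank lines would be misread as a header by these rules, which is why a handful of real examples matters.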

Cheers,
Cameron Simpson cs@cskk.id.au


jameel.dyooti.com · June 22, 2022, 6:48am

Thanks for your valuable reply!!!

I need a solution for both of the scenarios you mentioned in the previous post.
I used the Apache Tika module to convert the PDF to text. The plain text keeps the layout of the PDF but without bold or any other styling.
Here are some examples:

Name 
Email                                                          
Number
Career Objective: 
xxxxx.....
Education Qualification: 
yyyy....
Experience Summary:  
zzzzzz

Another example

NAME
DESIGNATION 
PERSONAL INFO:YYY 
RESPONSIBILITY: ZZZZ
SKILLS: xxxx

I hope the above examples help you understand the requirement. Is there any way to identify just the headings in both formats?

Thanks in advance!

mlgtechuser · June 26, 2022, 5:52am

Jameel, this looks like a classic case for Deep Learning. Any scenario with a wide and random set of parameters is going to be VERY difficult to evaluate and/or analyze with a system of linear or cascading rules (heuristics). For example, a neural network can be taught to grade how green or ripe a banana is. Imagine having to do that with rules. (How much brown is okay, desirable, not enough, or too much? What if it’s a lot of brown but the skin is still green? Or dark-ish yellow with no spots?) Rust on steel or tarnish on silver is the same way–many, many variables. Deep Neural Networks excel at tasks like that.

Neural Networks require a lot of overhead to train from scratch. You’d need at least 2,000 resumes and someone would probably have to classify them. However, you can take a DNN that was trained on tens of thousands of resumés, cut its head off, put an empty head on, and teach it about the resumés you tend to see. (Resume formats in a given industry might be more or less consistent on layout, content, key words, etc., so you can train a “headless” neural network for your industry with far less time and training data than starting from zero.)

Stepping back from the Deep Neural Network “black box”, your application might also work with classic machine learning, where you ask one or more statistical algorithms to find features and decide/guess what those features are. Here are some possibilities:

Decision Trees
Random Forests
Nearest neighbors
Pattern matching

There are more that might apply, but that’s the short list of techniques that come immediately to mind.
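
To give a flavour of the nearest-neighbours option, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the four hand-picked line features, the tiny made-up training list, and the names `line_features` and `predict`. A real attempt would need many labelled lines from actual resumes and probably a library such as scikit-learn.

```python
import math


def line_features(line):
    """Turn a line of text into a small numeric feature vector."""
    s = line.strip()
    letters = [c for c in s if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return (
        min(len(s.split()), 10) / 10,     # word count, capped at 10, scaled to 0..1
        1.0 if ":" in s else 0.0,         # contains a colon
        upper_ratio,                      # share of letters that are upper case
        1.0 if s.endswith(".") else 0.0,  # ends like a sentence
    )


# Made-up, hand-labelled examples; not taken from real resumes.
TRAINING = [
    ("Career Objective:", "heading"),
    ("SKILLS: Python, SQL", "heading"),
    ("Education", "heading"),
    ("EXPERIENCE SUMMARY", "heading"),
    ("I worked on several backend projects for three years.", "body"),
    ("Looking for a challenging role in software development.", "body"),
    ("Completed a bachelor's degree in computer science in 2015.", "body"),
    ("Responsible for testing and deployment of web applications.", "body"),
]


def predict(line, k=3):
    """Label a line by majority vote among its k nearest training examples."""
    target = line_features(line)
    nearest = sorted(
        TRAINING, key=lambda ex: math.dist(line_features(ex[0]), target)
    )[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


print(predict("PERSONAL INFO:YYY"))
print(predict("Worked as a developer at a small company for two years."))
```

The interesting part is the feature choice, not the algorithm: the same feature vectors could be fed to a decision tree or random forest instead.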

The first step, as several previous posts have said, is to find samples and break down the types of resume layouts and where you tend to find the different types of information.

Imagine that you have bad eyesight and can only see blurry blobs when you look at resumés. If you look at enough resumés and someone with good eyesight can tell you what the blobs are on each resumé, experience tends to develop where you know that a certain size and shape of blob(s) in one or two or three locations on the page tends to be the personal info. Similarly, the experience tends to be a vertical pattern of blobs of the same width and approximately the same height. You can also measure blob density and asymmetry–OpenCV can make these measurements. (Experience and lists of skills might be the easiest to find. Once those areas are identified, you can delete them from the unknown parts of the resume and the job will be simpler because you’ve reduced the number of variables.)

The next step is to unleash a new model that is trained on analyzing the specific sections:

Experience
Personal Data
Skills
Education
Honors
Professional Associations
etc

“References” is probably the very easiest to find in U.S. resumés because the word is almost always used and the section almost always comes at the end. Most resumés don’t list any actual references, though. They just say “available upon request” instead of listing actual contacts.

I’m going to end off here because I’m on my phone and it looks like this post is running long (sorry, I can’t really tell). I had a lot of fun doing machine learning work in graduate school recently along with some neural networks and follow-up with commercial applications. It’s completely fascinating and I hope you’re successful with your resume deconstructing!

mlgtechuser · August 3, 2022, 7:28am

This tutorial on document scanning with OpenCV in Python was in my newsfeed today and reminded me of this topic. It addresses image acquisition and conditioning but doesn’t cover any document content analysis.

Still, it has some good techniques for things like cropping and squaring the image to extract the document.

Document Scanner with OpenCV Using Video Footage

jameel.dyooti.com · August 3, 2022, 7:44am

Hi @mlgtechuser,

I am glad that you remembered my work. Thank you for sharing this information; it will be helpful for my development.

Thank you