Jameel, this looks like a classic case for Deep Learning. Any scenario with a wide and random set of parameters is going to be VERY difficult to evaluate and/or analyze with a system of linear or cascading rules (heuristics). For example, a neural network can be taught to grade how green or ripe a banana is. Imagine having to do that with rules. (How much brown is okay, desirable, not enough, or too much? What if it’s a lot of brown but the skin is still green? Or dark-ish yellow with no spots?) Rust on steel or tarnish on silver is the same way–many, many variables. Deep Neural Networks excel at tasks like that.
Neural networks require a lot of overhead to train from scratch. You’d need at least 2,000 resumés, and someone would probably have to classify them. However, you can take a DNN that was trained on tens of thousands of resumés, cut its head off, put an empty head on, and teach it about the resumés you tend to see. (This is usually called transfer learning. Resumé formats in a given industry might be more or less consistent on layout, content, key words, etc., so you can train a “headless” neural network for your industry with far less time and training data than starting from zero.)
Stepping back from the Deep Neural Network “black box”, your application might also work with classic machine learning, where you ask one or more statistical algorithms to find features and decide/guess what those features are. Here are some possibilities:
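To make the “cut its head off, put an empty head on” idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: `FROZEN_BODY_WEIGHTS` is a hypothetical stand-in for a real pretrained network body, the two input numbers are made-up features, and the new head is just a small logistic regression. A real project would more likely use a framework such as PyTorch or Keras.

```python
import math

# Hypothetical stand-in for the body of a pretrained network. In a real
# project these weights would come from a model trained on a large corpus.
FROZEN_BODY_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, -1.0],
]

def frozen_body(x):
    """Map raw inputs to features using fixed weights that are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_BODY_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_new_head(samples, labels, epochs=500, learning_rate=0.5):
    """Fit a fresh logistic-regression head on top of the frozen body."""
    features = [frozen_body(x) for x in samples]
    weights = [0.0] * len(features[0])  # the "empty head"
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
            error = p - y
            # Only the head's weights and bias change; the body stays frozen.
            weights = [w - learning_rate * error * fi
                       for w, fi in zip(weights, f)]
            bias -= learning_rate * error

    def predict(x):
        f = frozen_body(x)
        return sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)

    return predict

# Toy data: the label is 1 when the first made-up feature exceeds the second.
samples = [(2.0, 0.0), (0.0, 2.0), (3.0, 1.0), (1.0, 3.0)]
labels = [1, 0, 1, 0]
predict = train_new_head(samples, labels)
```

The point of the design is that training only touches the handful of numbers in the head, which is why it needs far less data than training the whole network.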
- Decision Trees
- Random Forests
- Nearest neighbors
- Pattern matching
There are more that might apply, but that’s the short list of techniques that come immediately to mind.
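As one example from that list, a nearest-neighbors classifier can be sketched in a few lines of plain Python. The three features here (vertical position on the page, share of digits, number of commas) and the tiny labeled set are hypothetical choices for illustration only; real features would come from studying your own resumé samples.

```python
import math
from collections import Counter

def line_features(line, position):
    """Hypothetical features for one resume line.

    position is the line's vertical location, 0.0 (top) to 1.0 (bottom).
    """
    chars = [c for c in line if not c.isspace()]
    digit_ratio = sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0
    comma_score = min(line.count(","), 5) / 5  # capped and scaled to 0..1
    return (position, digit_ratio, comma_score)

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A tiny, made-up labeled set: (features, section label).
train = [
    (line_features("Jane Doe, 555-123-4567", 0.02), "personal"),
    (line_features("jane@example.com, 555-987-6543", 0.05), "personal"),
    (line_features("Senior Engineer at Acme Corp 2015-2020", 0.30), "experience"),
    (line_features("Developer at Initech 2010-2015", 0.35), "experience"),
    (line_features("Python, SQL, OpenCV, Git, Docker", 0.60), "skills"),
    (line_features("Java, C++, Linux, AWS", 0.65), "skills"),
]

guess = knn_predict(train, line_features("Go, Rust, Kubernetes, Terraform", 0.62))
```

With only six training lines this is a toy, but the same shape works with hundreds of labeled lines and better features.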
The first step, as several previous posts have said, is to find samples and break down the types of resume layouts and where you tend to find the different types of information.
Imagine that you have bad eyesight and can only see blurry blobs when you look at resumés. If you look at enough resumés and someone with good eyesight tells you what the blobs are on each one, you develop a feel for it: a certain size and shape of blob(s) in one or two or three locations on the page tends to be the personal info. Similarly, the Experience section tends to be a vertical pattern of blobs of the same width and approximately the same height. You can also measure blob density and asymmetry; OpenCV can make these measurements. (Experience and lists of skills might be the easiest to find. Once those areas are identified, you can delete them from the unknown parts of the resumé and the job will be simpler because you’ve reduced the number of variables.)
The next step is to unleash a new model that’s trained on analyzing the specific sections:
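Here is a dependency-free sketch of the blob measurements, assuming the page has already been reduced to a grid of 0s (background) and 1s (ink). OpenCV does this kind of thing on real images (for example with `cv2.connectedComponentsWithStats`); the `find_blobs` function and the `PAGE` grid below are made up for illustration.

```python
from collections import deque

def find_blobs(grid):
    """Find 4-connected regions of 1s and report size, position and density."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill to collect every cell in this blob.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [y for y, _ in cells]
            xs = [x for _, x in cells]
            top, left = min(ys), min(xs)
            height = max(ys) - top + 1
            width = max(xs) - left + 1
            blobs.append({
                "top": top, "left": left,
                "width": width, "height": height,
                "area": len(cells),
                # Share of the bounding box that is actually ink.
                "density": len(cells) / (width * height),
            })
    return blobs

# A made-up, very low-resolution "page": one solid block and one ragged one.
PAGE = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
]
blobs = find_blobs(PAGE)
```

Each blob's position, width, height and density can then be fed to a classifier as features.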
- Experience
- Personal Data
- Skills
- Education
- Honors
- Professional Associations
- etc.
“References” is probably the very easiest to find in U.S. resumés because the word is almost always used and the section almost always comes at the end. Most resumés don’t list any actual contacts there, though. They just say “available upon request”.
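Before training anything, a plain keyword pass over heading lines can serve as a baseline for splitting a resumé into the sections listed above. This is only a sketch: the `SECTION_KEYWORDS` table, the four-word limit for a heading, and the sample text are all assumptions, not something taken from real data.

```python
import re

# Hypothetical keyword lists; extend them after looking at real samples.
SECTION_KEYWORDS = {
    "experience": ("experience", "employment", "work history"),
    "skills": ("skills",),
    "education": ("education",),
    "honors": ("honors", "awards"),
    "associations": ("associations", "memberships", "affiliations"),
    "references": ("references",),
}

def detect_heading(line):
    """Return a section name if the line looks like a short heading, else None."""
    cleaned = re.sub(r"[^a-z ]", "", line.lower()).strip()
    if not cleaned or len(cleaned.split()) > 4:
        return None
    for section, keywords in SECTION_KEYWORDS.items():
        if any(keyword in cleaned for keyword in keywords):
            return section
    return None

def split_sections(text):
    """Group non-blank lines under the most recent heading seen."""
    sections = {"unlabeled": []}  # lines before the first heading
    current = "unlabeled"
    for line in text.splitlines():
        if not line.strip():
            continue
        heading = detect_heading(line)
        if heading:
            current = heading
            sections.setdefault(current, [])
        else:
            sections[current].append(line.strip())
    return sections

SAMPLE = """Jane Doe
jane@example.com
EXPERIENCE
Engineer at Acme Corp, 2015-2020
Skills:
Python, SQL
References
Available upon request"""

sections = split_sections(SAMPLE)
```

Lines that land in `"unlabeled"` at the top are usually the personal data, and whatever the keywords miss is what the trained models would need to handle.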
I’m going to end off here because I’m on my phone and it looks like this post is running long (sorry, I can’t really tell). I had a lot of fun doing machine learning work in graduate school recently along with some neural networks and follow-up with commercial applications. It’s completely fascinating and I hope you’re successful with your resume deconstructing!