Unbalanced Dataset

Hey everyone.

I am working on a personal project that could change the face of the airline industry.

Let’s make it simple. I have a dataset of 63k rows and 7 columns: 6 characteristics that are significant for my target, plus the target itself, which has 2 values (show or no-show). I want to build a model that predicts whether a person will show up at a restaurant, knowing some characteristics (type of guest, party size, visits completed, day, hours, month). However, I have 53k rows for reservations that are qualified “Done” against 6k rows for my no-shows. I built a random forest and a regression model, and both give me terrible results. Why? How should I deal with that? I have something big, but my model… Any help would be appreciated!

In order to have any hope of solving the problem, we must first make sure that it could make sense. So, in your own words: if we know these things, why should they help us to make the prediction? Do you think that a different “type of guest” (what “types” do your guests have, anyway?) would be more or less likely to show up? Why would this matter? Similarly for the data.

Also: I don’t understand why being able to predict “if a person will show or not show at a restaurant” should be helpful in “the airline industry”.

Hey Karl,
Sorry if I caused any confusion. My English is not perfect either…

Let’s dig deeper and clear up the confusion.
First, thanks for trusting me and replying. I hope I can make it clearer.

  1. I mixed up the airline industry and the restaurant industry. I am working on the restaurant industry, which is similar to the airline industry. Why? Both industries are trying to fit people into a defined space (plane or restaurant) and have to face “no-shows”, meaning that people book but do not come. In response, both industries “overbook” to minimize the risk of losing a seat and to maximize revenue.

  2. My goal here is to create a prediction model that will anticipate those no-shows, based on my characteristics, in order to set an optimal overbooking level.

After some descriptive analysis, I gathered the following characteristics to analyse my data:

  1. Type of client: Regular, VIP, Member
  2. Hours: 5 to 6, 6 to 7, 7 to 8, 8 to 9, 9 to 10, 10 to 11, 11+
  3. Completed visits: 1, 2, 3…
  4. Size of the table: 1, 2, 3, 4… (meaning how many people booked)
  5. Month: January…
  6. Day: Monday…

Then, two target values: show or no-show.

That raises a lot of questions, because my model is bad at predicting those no-shows. Is it coming from my characteristics? The settings of my model? The unbalanced dataset?
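On the unbalanced-dataset question, one common first step is to weight the rare class more heavily during training. The following is only a minimal sketch, assuming the counts quoted earlier in the thread (about 53k shows and 6k no-shows). The function name `balanced_class_weights` and the label strings are hypothetical; the formula is the usual "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Counts taken from the thread: roughly 53k "show" rows and 6k "no_show" rows.
# The label strings themselves are an assumption for illustration.
labels = ["show"] * 53_000 + ["no_show"] * 6_000
weights = balanced_class_weights(labels)
print(weights)
```

With these counts a no-show row weighs roughly nine times as much as a show row. Most scikit-learn classifiers, including `RandomForestClassifier` and `LogisticRegression`, accept such a dictionary through their `class_weight` parameter, or `class_weight="balanced"` to compute it automatically.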

Would be more than happy to reply to your questions.

I’m not familiar with predicting these things, so I have questions. Let me refer to the airline industry. Are you ignoring the reasons why people do not show up, such as:

  1. Very bad weather. (Ice storm, tornado, flood, etc.)
  2. Their car broke down.
  3. They got sick and cannot make the flight.
  4. A meeting they were going to got cancelled.
  5. A relative is very sick and they have to care for them.
  6. They thought they set an alarm clock but didn’t.
  7. They hit snooze button on their alarm clock too many times.
  8. They are just bad at showing up on time. (I know a person like this. They are chronically late for everything. I’m sure they will be late for their own funeral.)

Hey Chuck.

Thanks for your time.

For most of your questions, it’s usually impossible for me to know those variables (a sick relative, a forgotten reminder, a cancelled meeting…).

I am putting myself in the shoes of a restaurant. How can they, from a reservation, get variables that can predict a no-show?

You just added two interesting variables that might be possible to extract:

  1. Weather: I can extract my shows/no-shows for 2023 and compare them to the weather at that time. I am sure we will find very interesting stuff.
  2. People who are always late: correct. I might be able to pull this out from the no-show/late-cancellation count.
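The second idea, a per-guest no-show count, could be sketched roughly as follows. This assumes each reservation carries a guest identifier and a date, which the thread does not confirm; the keys `guest_id`, `date`, and `no_show` and the function name are hypothetical. The point of the sketch is to count only earlier no-shows, so the feature never leaks the outcome it is supposed to predict.

```python
from collections import defaultdict


def add_prior_no_show_counts(reservations):
    """Annotate each reservation with the guest's number of EARLIER no-shows.

    `reservations` is a list of dicts with hypothetical keys
    "guest_id", "date" (ISO string) and "no_show" (bool).
    """
    seen = defaultdict(int)  # guest_id -> no-shows observed so far
    annotated = []
    for row in sorted(reservations, key=lambda r: r["date"]):
        # Record the count BEFORE updating it, so the feature never
        # includes the outcome of the reservation being predicted.
        annotated.append({**row, "prior_no_shows": seen[row["guest_id"]]})
        if row["no_show"]:
            seen[row["guest_id"]] += 1
    return annotated


# Made-up reservation history for illustration only.
history = [
    {"guest_id": "g1", "date": "2023-01-05", "no_show": True},
    {"guest_id": "g2", "date": "2023-01-06", "no_show": False},
    {"guest_id": "g1", "date": "2023-02-10", "no_show": False},
    {"guest_id": "g1", "date": "2023-03-01", "no_show": True},
]
features = add_prior_no_show_counts(history)
print([row["prior_no_shows"] for row in features])
```

A weather column could be attached in the same pass by looking up each reservation date in a date-keyed table, if such data is available.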

Other than that, it is impossible for me to get that information.

My ultimate goal is to extract my reservation report for day x and be able to predict the no-shows using the best characteristics. I am not sure month and day are relevant to my model.
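If a model eventually outputs a no-show probability for each reservation in the day's report, the expected number of no-shows is simply the sum of those probabilities. Here is a small sketch with made-up probabilities. The function name and the numbers are hypothetical, and rounding down is just one conservative choice, not an optimal overbooking rule.

```python
import math


def expected_no_shows(probabilities):
    """Expected number of no-shows = sum of per-reservation probabilities."""
    return sum(probabilities)


# Hypothetical model outputs for one evening's reservations.
day_probabilities = [0.05, 0.10, 0.30, 0.15, 0.40, 0.25, 0.50, 0.55]
expected = expected_no_shows(day_probabilities)

# Conservative choice: never overbook by more than the expected number.
extra_reservations = math.floor(expected)
print(f"expected no-shows: {expected:.2f}, extra reservations: {extra_reservations}")
```

This only works if the probabilities are reasonably well calibrated, so it is worth checking predicted rates against observed no-show rates before relying on the total.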

Looking forward to continuing this discussion,

AlainP

Isn’t this a statistical or sociological question?

I take it that your question is “Why does my model perform so badly?” I believe it’s impossible to give a general and useful answer to that, since there are just too many possible reasons why that might be so. The code could have bugs. The evaluation metric could be wrong. The test and training sets may be too small, or the training set may not be representative of the test set. The way you ran the training could be wrong or unsuitable. Your hyperparameters could be wrong or unsuitable. There could be too many (irrelevant and/or noisy) features or too few (relevant) ones.
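The evaluation-metric point matters a lot with a 53k/6k split. As a rough illustration, using the counts from the original post and hypothetical label strings, a "model" that always predicts "show" scores about 90% accuracy while catching no no-shows at all:

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` rows that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Class counts quoted in the thread; the label names are an assumption.
y_true = ["show"] * 53_000 + ["no_show"] * 6_000
y_pred = ["show"] * len(y_true)  # a "model" that always predicts show

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"no-show recall: {recall(y_true, y_pred, 'no_show'):.3f}")
```

For this kind of problem, recall and precision on the no-show class (or a precision-recall curve) are usually more informative than plain accuracy.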
(Did you try some simple linear regression models first, based on different partitions of the features, to get baselines? What makes you think a random forest would be a good model for the task?)
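An even cruder baseline than the regression models suggested above is to tabulate the no-show rate per value of each feature. This is not a model, only a quick check of whether any single feature separates shows from no-shows at all. The rows and column names below are made up for illustration.

```python
from collections import defaultdict


def no_show_rate_by(rows, feature):
    """Return {feature value: share of rows with no_show == True}."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        misses[row[feature]] += row["no_show"]  # True counts as 1
    return {value: misses[value] / totals[value] for value in totals}


# Tiny made-up sample; column names are hypothetical.
rows = [
    {"client_type": "Regular", "day": "Friday", "no_show": True},
    {"client_type": "Regular", "day": "Monday", "no_show": False},
    {"client_type": "Regular", "day": "Friday", "no_show": True},
    {"client_type": "VIP", "day": "Friday", "no_show": False},
    {"client_type": "VIP", "day": "Monday", "no_show": False},
    {"client_type": "Member", "day": "Monday", "no_show": True},
]
for feature in ("client_type", "day"):
    print(feature, no_show_rate_by(rows, feature))
```

If every value of every feature shows roughly the same rate as the overall no-show rate, a more complex model is unlikely to do much better with the same columns.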

It’s also not at all impossible, I think, that “show/non-show” is essentially a random process, given the data…
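One way to probe whether a feature carries any signal, or whether the outcome looks random with respect to it, is a permutation test: shuffle the labels many times and see how often chance alone produces a gap in no-show rates as large as the observed one. The sketch below uses made-up data; the function names, group names, and number of rounds are all assumptions.

```python
import random


def rate_gap(labels, groups, group_a):
    """No-show rate inside `group_a` minus the rate outside it."""
    inside = [l for l, g in zip(labels, groups) if g == group_a]
    outside = [l for l, g in zip(labels, groups) if g != group_a]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def permutation_p_value(labels, groups, group_a, n_rounds=500, seed=0):
    """Share of label shuffles whose gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(rate_gap(labels, groups, group_a))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(shuffled)
        if abs(rate_gap(shuffled, groups, group_a)) >= observed:
            hits += 1
    return (hits + 1) / (n_rounds + 1)  # +1 avoids reporting exactly zero


# Made-up data: 1 = no-show, 0 = show. "Regular" guests miss half the time,
# "VIP" guests never do, so the feature clearly carries signal here.
groups = ["Regular"] * 100 + ["VIP"] * 100
labels = [1, 0] * 50 + [0] * 100
p_value = permutation_p_value(labels, groups, "Regular")
print(f"p-value: {p_value:.3f}")
```

A small p-value suggests the feature is related to no-shows; a large one for every feature would support the "essentially random, given the data" reading.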

There are general, rule-of-thumb approaches and various methods to debug and analyze any or all of these issues, but it’s also not really possible to do so here, since we’d need to see your code. Also, none of these issues seem to be specifically related to Python!

If you want concrete help, one way could be to go to Kaggle and either find a problem that is similar to yours (binary classification based on some feature set, or some regression task) and participate in a competition, or find discussions about random forests there (plus lots of sample code provided by others) and see if others struggled with similar issues.
