Adam Dalal

Why Leadership Builds AI Tools Nobody Uses

Fri, 17 Apr 2026 00:00:00 +0000

The push is almost always top down. Leadership decides the organization needs to be doing AI, sets a target, and the teams below them start building. The problem is that the demand for most of what gets built was never really there. A goal was set and the goal was activity. Leadership is competing internally to show leadership above them how they are pushing the AI agenda, and the result is a race to ship more and ship faster where nobody stops to ask whether any of it is working.

Racing stripes on a Honda Accord

It is like putting racing stripes and a spoiler on a Honda Accord and calling it an F1 car. The optics are there but the engineering underneath has not changed. Quantity becomes the proof of progress, and quality never enters the conversation. And because each team is competing to show their own results, you end up with redundant tools solving the same problem in slightly different ways, cannibalizing each other’s adoption before either one has a chance to prove its value.

Looking for nails

It should not be a case of having a hammer and looking for something to hit. AI is not the answer to every problem, and treating it like one is how you end up with a catalog of tools that nobody asked for and nobody uses.

What that produces at the user level is cognitive overload. When there are too many prompts and too many tools, users do not know which one to trust or which one applies to their situation. Instead of reducing friction, you have added to it. The tool was supposed to make the job easier and instead it made the decision harder, because now the user has to figure out which tool to use before they can even start the work.

Show metrics versus impact metrics

What gets measured makes it worse. The metrics that get tracked are the ones that show activity, such as the number of models built, the number of prompts deployed, and how many teams have adopted AI in some form. These are show metrics because they demonstrate that something happened, not that anything changed. The impact metrics, whether users are actually using the tools, whether the tools are driving the outcome they were built for, and what the return on investment looks like, are rarely defined and almost never tracked.

The question that should come first

I build anyway. By the time it reaches the PM the decision has already been made above me, and I recognize that is the reality for most people in this position. But that does not mean the question should stop being asked earlier in the process, before the work starts and before the resources are committed.

Before any AI use case gets approved, someone should have to answer a few questions. What do you consider the value of the tool we are about to build, and how will it impact the users it is built for? Not in a vague sense, but specifically. What behavior are you trying to change, how will you know if it changed, and what does success look like beyond the number of people who have access to it?

AI is a capable tool. The problem is not the technology, it is the framing. When the goal is to show AI is happening rather than to solve a problem worth solving, the tools that get built reflect that. They exist to be counted, not to be used. And the data will eventually tell you what your users already know.

Author: Adam Dalal

If You Can't Measure It, You Can't Govern It

Sun, 12 Apr 2026 00:00:00 +0000

Most teams treat AI governance as a compliance exercise. Something you bolt on after the model is built, a checklist you run through before launch, a box you check so the right people feel comfortable. That is the wrong frame entirely.

Governance is a design requirement. It has to be defined before anything goes live.

The Autobahn is fast because it has structure

Think about the difference between driving on the Autobahn and driving on a dirt road. The dirt road feels like freedom. No rules, no structure, just go. But you can’t go fast because the road won’t let you. Every bump slows you down and every turn is a risk because nothing was built to handle speed.

The Autobahn has guardrails, lane discipline, and engineering underneath it. That structure is what enables speed. Slow is smooth, smooth is fast.

Most AI deployments are dirt roads dressed up as Autobahns. They move fast in the beginning because nothing is in the way. But without defined operating limits, monitoring thresholds, and response sequences, you are not going fast. You are just going without control.

Governance tells you when something is wrong

I came up in chemical engineering before I moved into product. Before a refinery unit goes live, every operating limit is defined, every alert condition is mapped, and every response sequence is documented. You cannot bolt that on after the fact. The consequences are too significant.

I think about AI governance the same way. At its core this is product thinking. Before a model is built, in the requirements phase, I am already asking what outcome we are trying to achieve, how we will know we have reached it, and how we will know when we haven’t. An AI PM is still a product person. The discipline does not change because the technology did.

In chemical engineering, laminar flow is controlled and predictable, moving in one direction. Turbulent flow is chaotic and hard to course correct once it starts. I think about AI model performance the same way. Governance is how you know which state your model is in. Without defined thresholds and monitoring, you cannot tell the difference between a model performing as expected and one that is quietly drifting toward failure.

If a model starts drifting, producing inaccurate results, or behaving in ways it shouldn’t, you need a defined response. In some cases that means taking it offline to prevent bad outputs from reaching users. But you can only execute that response if you defined it before the model went live.

What happens when nobody defines success

Most teams build the model, ship it, and then figure out what to monitor. The outcome they were trying to achieve was never precisely defined, so there is no baseline, no threshold, no way to know if the system is performing correctly or drifting toward failure.

I have seen this in my own work. My models are API based. I don’t own the application. I can advise, but I can’t always control what gets implemented. The application team decides what to track, what to store, and how to surface it. When those decisions are made without a governance framework in place, you end up with blind spots.

The way users interacted with one of my models changed over time. Originally the model was triggered manually by the user when they chose to run it. It became preemptive, running automatically before the user even began their review. I advised the team to track whether a user fixed a failed result when they saved, and to store a monthly snapshot of how the model was being used so we could do trend analysis over time. They stored only the latest run. Now we can’t measure month over month change. We can see the current state but we can’t see the direction of travel. That is a governance gap, and it exists because nobody defined what needed to be measured before the system was built.
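A minimal sketch of the difference between the two storage choices, assuming a hypothetical in-memory store. The names `record_run`, `latest`, and `snapshots` are illustrative only and are not the application team's actual schema.

```python
from collections import defaultdict

# Hypothetical stores: one overwrites, one keeps a snapshot per month.
latest = {}                    # description_id -> most recent result only
snapshots = defaultdict(dict)  # "YYYY-MM" -> {description_id: result}

def record_run(description_id, month, passed):
    """Store a run both ways so the two designs can be compared."""
    latest[description_id] = passed            # earlier results are overwritten
    snapshots[month][description_id] = passed  # earlier months are preserved

def monthly_pass_rate(month):
    """Pass rate for one stored month."""
    results = snapshots[month].values()
    return sum(results) / len(results)

record_run("d1", "2026-01", False)
record_run("d2", "2026-01", False)
record_run("d1", "2026-02", True)
record_run("d2", "2026-02", False)

# Only the snapshot store can answer "how did the pass rate move?"
trend = monthly_pass_rate("2026-02") - monthly_pass_rate("2026-01")
```

The `latest` store can report the current state, but nothing in it says what the state was a month earlier.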

The metric you are tracking is not the metric that matters

Most teams look at usage, adoption, and top line metrics like overall pass rate. Those are not bad metrics. But they do not quantify value and they do not tell you whether the AI is driving the outcome it was built for.

Here is what I track instead.

Which descriptions failed, the user saw the result, and then updated it. That is behavior change. That is the model doing its job. I also track when the result was surfaced and when the user made the change. Now I can measure time to resolution and age the backlog. And I look at which specific questions within a description are driving failures across the firm. If five questions are required and one is consistently failing, that is not always a data quality problem. Sometimes the question needs to be rewritten or users need to be educated on how to answer it. The model just told you something the program team did not know.
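The three measures above can be sketched as follows, assuming each surfaced failure is stored as one record. The field names and sample values are hypothetical and exist only to show the calculation.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event records; field names are illustrative assumptions.
events = [
    {"id": "d1", "failed_questions": ["q3"],
     "surfaced": datetime(2026, 1, 5), "updated": datetime(2026, 1, 8)},
    {"id": "d2", "failed_questions": ["q3", "q5"],
     "surfaced": datetime(2026, 1, 6), "updated": None},  # not fixed yet
    {"id": "d3", "failed_questions": ["q3"],
     "surfaced": datetime(2026, 1, 7), "updated": datetime(2026, 1, 9)},
]

fixed = [e for e in events if e["updated"] is not None]

# Behavior change: share of surfaced failures the user went on to fix.
fix_rate = len(fixed) / len(events)

# Time to resolution in days, for the failures that were fixed.
days_to_fix = [(e["updated"] - e["surfaced"]).days for e in fixed]
avg_days_to_fix = sum(days_to_fix) / len(days_to_fix)

# Which questions drive failures across all descriptions.
failures_by_question = Counter(q for e in events for q in e["failed_questions"])
```

None of these can be computed unless the surfaced and updated timestamps and the failing questions are stored for every run.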

That level of insight is only possible if you stored the right data before the model went live. If you stored only the latest run, you have a snapshot. You do not have a story.

Trust = Accuracy + Explainability

A model can be accurate and still not be trusted. I have seen this firsthand.

We built a model using RoBERTa to classify which global legal obligations should apply to a given process at a bank. The model went through internal validation, end user feedback, and passed the minimum success criteria. It was accurate. But usage was low. When we asked users why, they told us they were spending significant time trying to figure out which of the twelve input features drove the prediction. SHAP, a method that quantifies how much each input feature contributed to a specific prediction, had not been implemented and adding it would have required additional engineering effort. The model was right but users could not see why, so they did not trust it.

A sister model using XGBoost included feature importance out of the box. Users could see exactly which inputs drove the classification. They were grateful for it. Same use case, different trust outcome.
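To make "how much each input contributed" concrete, here is a small sketch that computes exact Shapley values for a toy three-input scoring function. It does not use the SHAP library, which relies on approximations to scale to real models. The toy model, the inputs, and the all-zero baseline are assumptions chosen so the arithmetic is easy to follow.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi

# Toy model (an assumption for illustration): a weighted sum plus an interaction.
def toy_model(f):
    return 2.0 * f[0] + 1.0 * f[1] + f[1] * f[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
contributions = shapley_values(toy_model, x, baseline)
```

The contributions add up to the gap between the prediction and the baseline prediction, which is what lets a user see which inputs moved a specific result.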

Trust = Accuracy + Explainability

Explainability looks different depending on the model type. For traditional ML and deep learning models it means surfacing feature importance so users can compare the model’s reasoning to their own mental model. For GenAI it means surfacing a concise rationale alongside the output so users are not left with a black box result.

For one of my GenAI models I was evaluating fields to determine whether consistency was present. The JSON output included three things: a consistency flag, which fields were inconsistent if any, and a plain language explanation of why. Users could read the reasoning, evaluate it against their own judgment, and decide whether to act. That is explainability in practice. The model is not asking to be trusted blindly. It is showing its work.
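A sketch of what consuming that kind of output might look like. The key names (`consistent`, `inconsistent_fields`, `explanation`) and the sample values are assumptions, not the real schema.

```python
import json

# Hypothetical model output; key names are assumptions, not the real schema.
raw = """{
  "consistent": false,
  "inconsistent_fields": ["owner", "frequency"],
  "explanation": "The owner named in the summary differs from the owner field."
}"""

def summarize(output_json):
    """Turn the structured output into one line a reviewer can act on."""
    result = json.loads(output_json)
    if result["consistent"]:
        return "Consistent: no action needed."
    fields = ", ".join(result["inconsistent_fields"])
    return f"Inconsistent ({fields}): {result['explanation']}"

message = summarize(raw)
```

Keeping the flag, the affected fields, and the explanation as separate keys means the flag can be counted, the fields can be aggregated, and the explanation can be shown to the user as written.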

SMEs do not want to reverse engineer a model. They want to look under the hood, confirm the reasoning matches how they would have done it manually, and then move on. Explainability is what makes that possible. Without it, even an accurate model is fighting for adoption.

You cannot govern what you have not defined

AI governance is not a compliance exercise. It is a design requirement.

Before anything goes live you need to know what good looks like and how you will measure it. You need to know what deviation looks like and at what point you act. You need to know who is responsible and what the options are: retrain the model, update the prompt, or take it offline. And you need to store the inputs, outputs, user actions, and timestamps that will make ongoing performance monitoring possible.
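One way to write those decisions down before launch is as a small table of limits and responses. This is a sketch under stated assumptions: the metric (accuracy on a reviewed sample), the numeric limits, and the action names are all hypothetical.

```python
# Hypothetical limits and responses, agreed before launch.
# The metric, the numbers, and the action names are illustrative assumptions.
LIMITS = [
    (0.90, "continue"),                  # at or above 0.90: operating normally
    (0.80, "retrain_or_update_prompt"),  # from 0.80 up to 0.90: intervene
    (0.00, "take_offline"),              # below 0.80: stop outputs reaching users
]

def response_for(accuracy):
    """Return the pre-agreed response for an observed accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    for floor, action in LIMITS:
        if accuracy >= floor:
            return action
```

For example, `response_for(0.85)` returns `"retrain_or_update_prompt"`. The point is less the code than the fact that the answer exists before anything drifts.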

The teams that govern well are not moving slower. They are on the Autobahn.

Author: Adam Dalal

How Chemical Engineering Shaped How I Think as a Product Manager

Tue, 31 Mar 2026 00:00:00 +0000

I had never set foot on a refinery when I volunteered for a construction project at Motiva’s PARCEP facility in Port Arthur. By the time we were done, I was the lead engineer for my discipline on that unit.

My job was to ensure that every instrument in that unit was tested, verified, and handed off correctly. We were responsible for thousands of instruments and every one of them had a function, a tolerance, and a failure mode. If a vessel ran too hot, the system automatically brought in cooling water. If a tank level climbed too high, the outlet valve opened. The unit was designed to correct itself, but only if every instrument was calibrated and wired correctly. My job was to make sure it would.

When the metric lied

During my testing, I noticed that the retest rates were high. We were hitting our quota on paper, but a significant portion of the work was being redone. I dug into it and found the root cause: we were testing instruments that mechanical had not yet cleared. The pipes hadn’t been tested, which meant the equipment had to be disconnected, tested, and then reassembled once mechanical was done. We were essentially testing the same instruments twice. I pulled the schedules for every department that had to complete work before we could test, and I built a system that only released instruments for testing once the upstream work was done. Testing was slow at first, but then it picked up, and the retest rate dropped. We stopped doing the same work twice and started making real progress.
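The release rule described above can be sketched in a few lines. The instrument tags and task names are invented for illustration, and the original system's actual form is not described here.

```python
# Hypothetical data: each instrument lists the upstream work it depends on.
prerequisites = {
    "FT-101": {"pipe_pressure_test", "mechanical_clearance"},
    "LT-205": {"mechanical_clearance"},
    "TT-310": set(),
}
# Upstream work finished so far, per instrument.
completed = {
    "FT-101": {"pipe_pressure_test"},
    "LT-205": {"mechanical_clearance"},
}

def ready_for_testing(prereqs, done):
    """Release an instrument only when all of its upstream work is complete."""
    return sorted(
        tag for tag, needed in prereqs.items()
        if needed <= done.get(tag, set())  # subset check: nothing outstanding
    )

released = ready_for_testing(prerequisites, completed)
```

Here `FT-101` is held back because mechanical has not cleared it, so it cannot be tested now and then tested again later.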

I didn’t know it at the time, but that is exactly how I approach product work today. Before I build anything, I map the workflow. What is the input at each stage, what is the output, what are the constraints, what happens downstream if something goes wrong. The retest problem looked like a testing problem, but wasn’t. It was a sequencing problem that only became visible when I looked at what every other department needed to complete before my team could do its work. I only found it because I was looking at the whole system, not just my piece of it.

Same instinct, different system

I left chemical engineering and moved into data and product work. The problems looked different but the way I approached them didn’t change. Years later I was brought into a conversation about where to integrate an AI solution into a digital review workflow. The manual process was straightforward. Reviewers read descriptions pulled from a random sample, determined whether required questions were answered, and recorded a pass or fail. The team was deep in discussion about how the workflow should be designed. I took a step back.

I started mapping each stage. What was the input, what was the output, what were the constraints at each step, what metrics would tell us whether the outcome we wanted was actually happening. Nobody had thought to track how long each stage took. There was a discussion about the exception workflow, what happens when a reviewer flags something as failing and another reviewer overturns it, and how that should be designed. Both are being addressed in the next iteration, one to improve visibility into where the process slows down, the other to ensure the exception workflow is designed correctly from the start. And it hit me: I had put my chemical engineering hat back on, trying to solve system problems just as I had on the refinery construction project.

With the AI model, we could now automatically target the descriptions that were failing, while taking a smaller sample of passes to verify the model was working correctly. The workflow did not change, but reviewer attention was now directed at the highest priority items.
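A sketch of that targeting, assuming each model result carries an id and a pass flag. The 10% sample rate for passes and the fixed seed are assumptions for illustration.

```python
import random

def build_review_queue(results, pass_sample_rate, seed=0):
    """Queue every model-flagged failure plus a random sample of passes."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    failures = [r["id"] for r in results if not r["passed"]]
    passes = [r["id"] for r in results if r["passed"]]
    sample_size = round(len(passes) * pass_sample_rate)
    return failures + rng.sample(passes, sample_size)

# Hypothetical model results: 3 failures (d0 to d2) and 20 passes.
results = [{"id": f"d{i}", "passed": i >= 3} for i in range(23)]
queue = build_review_queue(results, pass_sample_rate=0.10)
```

Reviewers see all three failures and two of the twenty passes. The sampled passes are what tell you whether the model is still getting its passes right.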

What it taught me

Chemical engineering didn’t teach me how to build products, it taught me how to think about systems. What goes in, what comes out, what happens at each stage, what breaks when something upstream fails, and what you need to measure to know whether the outcome you want is actually happening. Studying economics sharpened how I think about tradeoffs, incentives, and user behavior, but the instinct to map a system before touching any individual part of it, that came from what chemical engineering taught me.

If you are a PM, start with the workflow before you touch the solution. Map the inputs and outputs at every stage, understand the constraints, and think about downstream consequences. Ask what you need to track to know whether the outcome you want is actually happening. The solution is one component in a system, so treat it that way.

Author: Adam Dalal

If You Build It, Will They Come? - Why Your AI Model's Explainability Might Be the Problem

Tue, 10 Mar 2026 00:00:00 +0000

Why did the model make that prediction?

I shipped a deep learning model to internal users, 85% accurate, validated against a test set and signed off by end users. The users reviewing its recommendations were the same people who had been making those decisions manually for years. I presented it to stakeholders, walked through the problem it solved, shared the performance metrics, and explained how it was validated. By every measure I was tracking, we were ready.

After it went live, adoption sat at 20%. I asked myself why people weren’t using it. So I talked to several users. They asked me: why did the model make that prediction? They told me they had spent significant time trying to understand how the inputs affected the output, but couldn’t reconcile it. The model had 12 inputs. They were overwhelmed trying to understand which ones actually drove the prediction they were looking at. They ultimately told us they were not going to use a model they could not reverse engineer. These were experts who had been doing this work manually for years.

Explainability was never part of the conversation

Explainability is just as important as accuracy, but it never came up once, not in user research, not in stakeholder reviews, not in model validation. That’s because every conversation I had with users was designed to measure one thing: do you agree or disagree with this prediction? That feedback loop only ever returned accuracy. I didn’t know what I wasn’t asking.

Model Selection Is a Product Decision

Low adoption was the symptom. The root cause was that explainability was never considered.

For this deep learning model, extracting what drove a specific prediction is possible but not straightforward. Techniques exist, but they require additional effort and the results are not always clean or intuitive for end users. A traditional ML model with native feature importance tells you how much each input contributed to the prediction, out of the box. When explainability is a requirement, that difference has a real cost.

This is where it gets uncomfortable. What if the black box model is more accurate than the interpretable one? That is a real tradeoff, and there is no universal answer. But it is a product decision, not a data science decision. The right model depends on your user, not just your benchmark.

If explainability is a user requirement, and it should be, then model selection is not purely a data science decision. PMs need to be in that conversation, not just signing off on accuracy thresholds after the fact. The type of model selected should not be determined by performance metrics alone. It should be weighed against a simpler question, one that should be asked during user research alongside questions about workflow and model performance:

“What would you need to see from this model to not only use it, but trust it?”

Author: Adam Dalal

Stick a Fork In it

Sun, 25 Jan 2026 00:00:00 +0000

Decide on where to eat in 10 minutes or less

I have been working on a project on Lovable.

I built a small tool to solve a problem I kept watching happen — and it turns out I wasn’t alone.

Why This Exists

A few months ago, I started paying attention to how my friend groups decide where to eat. Here’s what it usually looks like:

Someone sends a message: “Where do you want to eat?”
The group chat goes quiet.
Then someone says “I don’t care.”
Then someone else says “Anything but sushi.”
Then a third person pulls out their iPhone Notes app with a list of restaurants they want to try but still have not been to.
Now you’ve got five people, restaurant lists spread across the Notes app, Instagram, and Google Maps, and zero consensus. And it’s already 7:15.

What should take five minutes stretches into 30. And somehow, you still end up at the same place you always go. I wanted to make sure this wasn’t just my experience. So I texted friends, brought it up in conversations, and watched it play out in real time. The same patterns kept coming up:

Ideas are scattered across Notes apps, Instagram posts, screenshots, and group chats
Constraints like price and distance surface after people have already anchored to something
No one wants to be the one to decide, so the loudest voice wins

One friend put it best: “We spend more time deciding than actually eating.”

What I Built

A shared session where everyone can submit restaurant options and vote. The organizer creates a session and shares a link. No accounts, no login. Participants open it, add their picks, and rank the options. The app surfaces a winner. I scoped it down deliberately. The hardest part of this problem isn’t finding a good restaurant — it’s getting a group of people to agree on one. That’s what I’m solving first. Reservations and menus stay offline for now.

PM Decisions

A few deliberate calls shaped what this became.

Preference tiers over simple voting (Favorites, Top Choices, Okay With, Dislike), because the goal isn’t just picking a winner, it’s finding where the group actually overlaps. A single vote per person tells you what people want. Tiers tell you what people can live with. That’s a different and more useful signal when you’re trying to get five people to agree on dinner.

No login, no accounts. Friction kills group tools before they start. If someone has to create an account to join a session, half the group won’t bother. The session link is the access.

No reservations, no menus, yet. It’s tempting to build the whole thing. But the hardest part of this problem is getting a group to agree, not finding a good restaurant. I scoped to the hard part first and left the rest for later.
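To show why tiers give a different signal than a single vote, here is a small scoring sketch. The post names the four tiers but not how they are weighted, so the point values below are purely an assumption, as are the ballots.

```python
# Hypothetical tier weights; how the real app scores tiers is not stated.
TIER_POINTS = {"Favorites": 3, "Top Choices": 2, "Okay With": 1, "Dislike": -2}

def pick_winner(ballots):
    """Sum tier points per restaurant and return the top option with the totals."""
    totals = {}
    for ballot in ballots:
        for restaurant, tier in ballot.items():
            totals[restaurant] = totals.get(restaurant, 0) + TIER_POINTS[tier]
    return max(totals, key=totals.get), totals

ballots = [
    {"Thai": "Favorites", "Tacos": "Okay With"},
    {"Thai": "Dislike", "Tacos": "Top Choices"},
    {"Thai": "Favorites", "Tacos": "Top Choices"},
]
winner, totals = pick_winner(ballots)
```

Counting only favorites, Thai would win two to zero. With tiers, Tacos comes out ahead because everyone can live with it and one person actively dislikes Thai.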

What I’m Measuring

I’m watching three things right now:

Drop-off point. Where do people abandon a session? If it’s at the voting step, the UX needs work. If it’s at the share link step, the onboarding isn’t landing.
Session completion rate. A session that gets created but never voted on is a failed session. That number matters more to me than signups.
Time to decision. The whole premise is 10 minutes or less. If real sessions are running 20–30 minutes, the product isn’t solving the problem yet.

I’m soft launching with small friend groups, people who’ve lived this problem and will give honest feedback. If you’ve felt this pain or have a take on how your group handles it, I’d love to hear from you.
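The three measures in this section could be computed from a simple session log. The step names, fields, and values below are hypothetical.

```python
# Hypothetical session log; step names and fields are illustrative assumptions.
sessions = [
    {"last_step": "decided", "minutes": 8},
    {"last_step": "voting", "minutes": None},
    {"last_step": "share_link", "minutes": None},
    {"last_step": "decided", "minutes": 14},
]

completed = [s for s in sessions if s["last_step"] == "decided"]

# Session completion rate: sessions that reached a decision over all sessions.
completion_rate = len(completed) / len(sessions)

# Drop-off point: the step where each unfinished session stopped.
drop_offs = {}
for s in sessions:
    if s["last_step"] != "decided":
        drop_offs[s["last_step"]] = drop_offs.get(s["last_step"], 0) + 1

# Time to decision, checked against the 10-minute premise.
avg_minutes = sum(s["minutes"] for s in completed) / len(completed)
within_target = avg_minutes <= 10
```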

Built with Lovable. Problem validated the old-fashioned way, by watching people argue about dinner.

USA Real Estate - Predicting Sales Price

Wed, 05 Oct 2022 00:00:00 +0000

Predicting Price in USA Real Estate Market

In this exercise, we will use regression techniques on the US housing market. The dataset used is a Kaggle dataset which can be found here.

We will go through data cleaning, investigate distributions, exploratory data analysis, and then finally modeling using several regression models.

The dataset consists of these fields:

Price
Bed
Bath
Acre Lot
House Size
Address
Zipcode
City
State

Exploratory Data Analysis

The dataset consists of 100,000 rows and 11 columns. After analyzing the dataset, it appears that some listings are whole buildings rather than single homes. For example, the max beds in the dataset is 86 and the max baths is 56. For this exercise the data was cut off at 6 bedrooms and 5 bathrooms.

After removing other outliers using IQR, the resulting dataframe had a little over 44,000 rows.
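The two cleaning steps can be sketched in plain Python. The rows are made up, and the 1.5 × IQR fence is the conventional multiplier, assumed here because the write-up does not state which one was used.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical rows: (price, bed, bath). Values are made up for illustration.
rows = [
    (250_000, 3, 2), (310_000, 4, 3), (275_000, 3, 2), (290_000, 3, 2),
    (305_000, 4, 2), (260_000, 2, 1), (9_500_000, 5, 4), (4_000_000, 86, 56),
]

# Step 1: cut off at 6 bedrooms and 5 bathrooms, as described above.
capped = [r for r in rows if r[1] <= 6 and r[2] <= 5]

# Step 2: drop rows whose price falls outside the IQR fences.
low, high = iqr_bounds([r[0] for r in capped])
cleaned = [r for r in capped if low <= r[0] <= high]
```

In this toy data the 86-bedroom listing is removed by the cutoff and the 9.5 million listing is removed by the price fence.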

The top 5 zipcodes in terms of average price were primarily located in MA (2), CT (1), NY (1), and VI (1). The top 5 zipcodes in terms of average $/sqft were in MA (4) and CT (1). I would have figured that the New York City area would have at least made it into the top 5, but according to the data, it did not.

Zipcode and City were transformed using Category Codes, which labeled them as numbers. I am not sure if this was the right approach, but when I used one-hot encoding, the processing time increased dramatically because the encoder increased the number of features to 4000+.
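A small illustration of why the two encodings differ so much in width. The zip codes are hypothetical, and the sketch uses plain Python rather than the pandas and scikit-learn calls from the original notebook.

```python
# Hypothetical zip codes; the real data has thousands of distinct values.
zipcodes = ["02108", "06830", "10001", "02108", "00802"]

# Category codes: one integer column, whatever the number of distinct values.
categories = sorted(set(zipcodes))
code_of = {z: i for i, z in enumerate(categories)}
coded = [code_of[z] for z in zipcodes]

# One-hot: one 0/1 column per distinct value, so width grows with cardinality.
one_hot = [[1 if z == c else 0 for c in categories] for z in zipcodes]

n_code_columns = 1
n_one_hot_columns = len(categories)
```

The tradeoff is that integer codes imply an ordering between zip codes that does not really exist, while one-hot avoids that at the cost of one column per distinct value.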

Box Plots

The box plots (below) show that the features with the most outliers are: price, acre lot, and house size.

Histograms

The histograms (below) show that the features that are right skewed (most values sit at the low end, with a long tail to the right) are: price, acre lot, and house size.

Heatmap

Modeling

The three models used were:

Decision Tree
Random Forest
Extra Trees

Extra Trees performed the best, but the high scores for all three models suggest there might be overfitting.

Decision Tree

The mean absolute error: 7602
Score: 0.956

Random Forest

The mean absolute error: 7778
Score: 0.956

Extra Trees

The mean absolute error: 5590
Score: 0.975
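For reference, this is how the two reported numbers are calculated, assuming the score is the coefficient of determination (R²), which is what scikit-learn regressors return from `score`. The prices are made up.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the residual sum of squares over the total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up prices for illustration only.
actual = [200_000, 300_000, 400_000, 500_000]
predicted = [210_000, 290_000, 405_000, 495_000]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

A high score on its own does not settle the overfitting question. Comparing the score on the training data with the score on held-out data is the usual check.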

Conclusion

The scores for Decision Tree, Random Forest, and Extra Trees suggest that the models could be overfitting. I am not sure if using the category codes was the best move, but when I tried one-hot encoding, fitting the regression models took a lot of computational time due to the 4000+ features they had to go through.

In terms of scores, the Extra Trees regressor was the best, then Random Forest, then Decision Tree.

In later iterations, I will have to see if there are ways to address the encoding cost and the overfitting.

Predicting Attrition in Healthcare Industry

Sun, 18 Sep 2022 00:00:00 +0000

Predicting Attrition in Healthcare Industry

In this exercise, classification models will be used to answer what factors lead to employee attrition in the Healthcare industry. The dataset used is a Kaggle dataset which can be found here.

We will go through data cleaning, investigating distributions, exploratory data analysis, and then finally modeling using several classification models.

The link to the jupyter notebook is located here.

Exploratory Data Analysis

The dataset consists of numeric features such as age, hourly, daily, and monthly rate, and monthly income. It also has features that deal with years at current company, years worked, years in current role, etc. The categorical features include department, education, job level, job involvement, gender, and marital status.

The data has no missing values, so there is no need to impute values or remove observations. The numeric columns do have outliers, so Standard Scaler will be applied before modeling.

The data consists of 1,676 rows and 35 columns. The dependent variable, Attrition, is imbalanced: 12% attrition and 88% no attrition.

The heatmap below shows the correlation among the numeric features. The following features are positively correlated with each other: Age, MonthlyIncome, TotalWorkingYears, YearsatCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearswithCurrManager.

Other plots looked at the interactions of various features. For those, please take a look at the jupyter notebook (link above).

Modeling

The data was transformed using Standard Scaler and One Hot Encoder. The models used in this analysis were:

Logistic Regression
- With and Without Recursive Feature Elimination (RFE)
Random Forest
Extra Trees
Gradient Boosting

GridSearchCV was used for hyperparameter tuning. The main goal should be to reduce false negatives, meaning the cases where the model predicted “No Attrition” and the data showed “Attrition”. The reason I feel this should be the main goal is that, as an employer, I would want to know before an employee leaves so that I am not short-handed and have enough time to find a replacement.
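A sketch of how false negatives are counted for this target. The labels are made up, and the write-up does not say how its False Negative column was normalized, so the share-of-all-rows figure below is only one plausible reading.

```python
def confusion_counts(actual, predicted, positive="Attrition"):
    """Count true/false positives and negatives for a binary target."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

# Made-up labels: 2 of 10 employees actually left.
actual = ["Attrition"] * 2 + ["No Attrition"] * 8
predicted = ["Attrition", "No Attrition"] + ["No Attrition"] * 8

c = confusion_counts(actual, predicted)
accuracy = (c["tp"] + c["tn"]) / len(actual)
recall = c["tp"] / (c["tp"] + c["fn"])  # share of leavers the model caught
fn_share = c["fn"] / len(actual)        # false negatives as a share of all rows
```

With an imbalanced target, accuracy can look strong while recall is poor. Here the toy model is 90% accurate but misses half of the employees who left.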

Logistic Regression

Using Logistic Regression, I fit four different models, using combinations of scaled/non-scaled data, with and without RFE. Results are shown below:

Model	Accuracy	AUC	False Negative
Logistic Regression - Scaled Data	0.934	0.972	0.052
Logistic Regression - Non-Scaled Data	0.922	0.945	0.060
Logistic Regression - Scaled w/RFE Data	0.903	0.909	0.076
Logistic Regression - Non-Scaled w/RFE Data	0.928	0.967	0.050

It seems like Logistic Regression is overfitting. We will compare this later on with the other models.

Random Forest

I decided to see how RFE worked with Random Forest, and the difference was negligible for the most part. The false negative figure was lower with RFE, but it was still higher than with Logistic Regression. Because Random Forest introduces bagging and random sampling of features, it leads me to believe that Logistic Regression indeed has an overfitting problem.

Model	Accuracy	AUC	False Negative
Random Forest - wo/RFE	0.901	0.950	0.092
Random Forest - w/RFE	0.907	0.950	0.082

The Confusion Matrix: For Non-RFE:

For RFE:

Extra Trees

Extra Trees and Random Forest w/RFE yielded similar results.

Model	Accuracy	AUC	False Negative
Extra Trees	0.907	0.941	0.082

The Confusion Matrix:

Gradient Boosting

Gradient Boosting does perform better than Random Forest and Extra Trees, but it is also known to overfit.

Model	Accuracy	AUC	False Negative
Gradient Boosting	0.922	0.954	0.060

The Confusion Matrix:

Conclusion

While Logistic Regression did perform better in terms of accuracy and lower false negatives, I believe it is due to overfitting. As far as Random Forest, Extra Trees, and Gradient Boosting, the latter performed the best in terms of accuracy and lower false negatives.

Enhancements are needed, especially to reduce overfitting caused by the low number of observations. I also need to investigate how to improve recall, which is very low for the non-Logistic Regression models.

In terms of what I wanted to accomplish, I did learn about various methods such as Recursive Feature Elimination and an efficient way to use column transformers. I looked at various sites for help, which are linked in the jupyter notebook.

AirBnB Open Data EDA

Mon, 12 Sep 2022 00:00:00 +0000

AirBnB Open Data Exploratory Data Analysis

TBD

Why Do Good Employees Leave?

Fri, 06 Jan 2017 00:00:00 +0000

Why Do Good Employees Leave?

This analysis looks at which employees with a high evaluation score will leave the company. The dataset was taken from Kaggle and is located here. The dataset does not need any data munging, so this analysis covers which features contribute to good employees leaving.

Overall analysis of employees that Leave

Departures peak around year 3 and decline with each year after that, with the bulk of employees leaving in years 3 to 5. Looking at the distribution of average monthly hours for employees who had been with the company between 3 and 5 years, we see two clusters: one that worked fewer than 160 hours and one that worked more than 225 hours. Thus, the employee was either underworked or overworked. The median satisfaction level for this group was 0.4, and the median last evaluation score was 0.52. If this group had only worked fewer than 160 hours a month, one could surmise that they did not work as much and thus got a low evaluation score, but there is also a group that worked more than 225 hours a month (45 hours a week or more) and still got a low evaluation score. One would need to understand the metrics behind the evaluations to understand what is driving the data. Either way, this group was likely not satisfied with their jobs and either decided to leave or was forced out.

Evaluating why good employees leave

Among those employees that left, 25% of the population got 0.52 on their evaluation. Half the population got 0.79 on their evaluation. The mean evaluation is 0.71. Employees with evaluation scores above 0.71 resulted in a count of 7606. Among those, the employees that left were 1893. This is about 25% of the employees.

Analyzing the relationship between salary and employees leaving, majority were low/medium salary. About 98% of the employees that left were low/medium salary while 92% of all employees were in this category.

The histogram (above) average monthly hours for employees that left show that they generally worked more than 200 hours (45 hours a week). Histogram for average monthly hours for employees that stayed show that there was a uniform distribution for hours worked between 125 and 225 hours. So not only do employees that left get paid low/medium salary, they also on average worked 50 hours more than employees that stayed.

Analyzing salary and whether an employee has been promoted in the last 5 years, results show (below) that 99% of employees that left, were not promoted in the last 5 years, even with an evaluation of 0.71 and higher. Bulk of those employees were low and medium salaries. Looking at the satisfaction level , the satisfaction was lower for by at least 20% for employees that were not promoted in the last 5 years.

Thus, employees with an evaluation rating of at least 0.71 who were not promoted in the last 5 years tended to leave, probably to find greener pastures.

Predicting Employees that Left (Classification)

The hypothesis is that employees with an evaluation above 0.71 who have not been promoted will leave the company. Fortunately, most of the left column matches up with this. Because Random Forest and Extra Trees are tree-based models, there is no need to scale the data.

Features used: After including all features in the Random Forest model, salary, promoted, work accident, and sales were not significant features and were thus omitted. The features used in this analysis are:

satisfaction level
number of projects
average monthly hours
time spent with the company

Random Forest

Random Forest has the highest accuracy, with 9 Type I errors and 17 Type II errors.

Extra Trees

Extra Trees has a higher Type I error and the same Type II error as Random Forest, although its overall accuracy is only 0.001 lower than Random Forest's.

Gradient Boosting

Gradient Boosting has the lowest Type II error of all three, though its accuracy score is 0.01 lower than the Random Forest model's.

Classification Model Conclusion

If the lowest Type II error is the goal (minimizing the cases where the model predicted that the employee stayed but the employee actually left), then the Gradient Boosting model is the best one, even though its accuracy is lower. An accuracy difference of 0.01 is pretty negligible.
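To make the error types concrete, here is a minimal sketch of how Type I and Type II errors can be counted from actual and predicted labels. It assumes "left" is the positive class (1) and "stayed" is 0; the function name and the sample labels are invented for illustration and are not the notebook's data.

```python
def error_counts(actual, predicted):
    """Count Type I / Type II errors, treating 1 (left) as the positive class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    # Type I: predicted "left" but the employee stayed (false positive).
    type_1 = sum(1 for a, p in pairs if a == 0 and p == 1)
    # Type II: predicted "stayed" but the employee left (false negative).
    type_2 = sum(1 for a, p in pairs if a == 1 and p == 0)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    return {"type_1": type_1, "type_2": type_2, "accuracy": accuracy}


# Made-up labels for illustration only (1 = left, 0 = stayed).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
result = error_counts(actual, predicted)
print(result)
```

A model with slightly lower accuracy can still have fewer Type II errors, which is the trade-off described above.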

Predicting Satisfaction Level (Regression)

Instead of predicting which employees left, let’s see if we can predict the satisfaction level of the employee. Ultimately that should be the goal, since happy employees will perform better and stay. This will be tackled later.

Conclusion

In determining why employees leave, the obvious reason is that their last evaluation score was low, below 0.5, which was about 25% of the sample. The question is, why do good employees leave? The data analysis shows that, among employees who scored at least 0.71 on their evaluation, 99% of those who left did not receive a promotion in the last 5 years.

In terms of prediction, Random Forest, Extra Trees, and Gradient Boosting all had high accuracy. Random Forest had the highest accuracy, and Gradient Boosting had the lowest Type II error. That being said, the difference between all three is minimal.

Short-Term Rental Platform vs Long-Term Tenant: Evaluating Expected Profits

Sun, 11 Dec 2016 00:00:00 +0000

Introduction/Summary

The short-term rental (STR) industry has lately been a big focus for the sharing economy and for local and state governments. People who put their dwellings on platforms such as AirBnB and HomeAway (Hosts) are able to make money from an empty room, or perhaps their whole house or apartment. Local and state governments are taking notice because these platforms bypass the Hotel Occupancy Taxes that local hotels have to pay. No doubt hotel lobbyists are putting pressure on officials, much like taxi lobbyists did with Uber.

Property management companies are also using these platforms because they can generate more profit than long-term leases, depending on the area (e.g., Santa Monica, New York City). The sharing economy tends to create a demand bubble, causing prices to increase. For this reason, as well as the bypassing of Hotel Occupancy Taxes, many local and state officials are banning listings that are entire units. Austin, Santa Monica, and San Francisco have all put severe restrictions on the number of days a unit can be listed, required permits, and/or banned entire-unit rentals.

The Jupyter notebook associated with this analysis can be located here. The .py script files for scraping and data munging are located in the GitHub repo, which is located here.

Scenario

There are two scenarios entertained in this analysis:

Does putting the extra room in a two bedroom unit on an STR platform generate higher expected profit than renting it to a long-term tenant? There will be costs associated with operating an STR versus having a long-term renter, such as time spent on guest management.
Assuming that putting an entire home/unit on an STR platform has no legal issues, which units will generate higher expected profits?

Methodology & Data Sources

Using AirBnB data and Zillow data, I will use a machine learning algorithm to predict Airbnb prices by month, neighborhood, and number of bedrooms to get a sense of how pricing differs by neighborhood, bedrooms, and month (seasonality). By calculating the average yearly price and occupancy for each neighborhood, I can evaluate rental listings on Zillow to see expected profits. Model I is for a private room, and Model II is for an entire home/apt: studios, one bedroom, two bedrooms, three bedrooms, and four bedrooms.

The pricing from Model I will be applied to the two bedroom homes in the Zillow dataset. Two bedrooms are analyzed because if I were to occupy one of the rooms, the other room could be put on an STR platform or made available to a long-term tenant. The baseline will be the latter, so the expected profit would be the expected revenue less the yearly cost for that apartment. Of course, there will be opportunity costs associated, such as furniture for the room, utilities, guest management (answering questions, meeting the guest at check-in/checkout), and possibly a brokerage fee.
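The comparison can be sketched as two small functions: expected revenue less the yearly cost of the apartment, once for the STR option and once for the long-term tenant baseline. This is a rough sketch under stated assumptions, not the notebook's actual code. The function names, the 365-day year, and every dollar figure below are hypothetical.

```python
def str_expected_profit(adr, occupancy_rate, yearly_rent, extra_costs=0.0, days=365):
    """Expected yearly profit from listing the spare room on an STR platform.

    adr: average nightly price; occupancy_rate: share of nights booked (0 to 1);
    extra_costs: furniture, utilities, guest management, fees (all assumed).
    """
    if not 0.0 <= occupancy_rate <= 1.0:
        raise ValueError("occupancy_rate must be between 0 and 1")
    expected_revenue = adr * occupancy_rate * days
    return expected_revenue - yearly_rent - extra_costs


def long_term_expected_profit(monthly_room_rent, yearly_rent, extra_costs=0.0):
    """Baseline: yearly profit from renting the spare room to a long-term tenant."""
    return monthly_room_rent * 12 - yearly_rent - extra_costs


# Hypothetical numbers for illustration only.
str_profit = str_expected_profit(adr=120, occupancy_rate=0.5,
                                 yearly_rent=36000, extra_costs=2000)
ltr_profit = long_term_expected_profit(monthly_room_rent=1500, yearly_rent=36000)
print(str_profit, ltr_profit, str_profit - ltr_profit)
```

Both results are negative here because the host also lives in the apartment and pays the full rent; what matters for the decision is the difference between the two options.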

For Model II, the algorithm will be applied to the Zillow data based on neighborhood and bedrooms.

The hotel and STR industries share some terminology, such as the following:

1) ADR: Average Daily Rate ($ per night)

2) Occupancy: The share of days occupied in a given time frame (e.g., month, year)

3) RevPAR: Revenue per Available Room. It is ADR * Occupancy, or equivalently monthly or yearly revenue / (number of days in the time frame)
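The two ways of computing RevPAR can be checked against each other with a short sketch. It treats occupancy as the share of available nights that were booked, which is an assumption about how the metric is used here, and the function names and sample figures are made up for illustration.

```python
def occupancy_rate(nights_booked, nights_available):
    """Share of available nights that were booked, as a fraction from 0 to 1."""
    if nights_available <= 0:
        raise ValueError("nights_available must be positive")
    return nights_booked / nights_available


def rev_par(adr, nights_booked, nights_available):
    """RevPAR as ADR multiplied by the occupancy rate."""
    return adr * occupancy_rate(nights_booked, nights_available)


def rev_par_from_revenue(total_revenue, nights_available):
    """RevPAR as total revenue divided by the number of days in the period."""
    if nights_available <= 0:
        raise ValueError("nights_available must be positive")
    return total_revenue / nights_available


# Hypothetical month: 20 of 30 nights booked at an ADR of $150.
by_rate = rev_par(adr=150, nights_booked=20, nights_available=30)
by_revenue = rev_par_from_revenue(total_revenue=150 * 20, nights_available=30)
print(by_rate, by_revenue)
```

Both routes give the same figure (up to floating-point rounding) because total revenue equals ADR times nights booked.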

The data sources used in this analysis were:

1) Zillow

Links were manually entered by finding the URL for specific neighborhood in Zillow.
Scraped Zillow Listings for most of Manhattan and for most of Brooklyn.

2) AirBnB

The data for AirBnB was downloaded from http://insideairbnb.com/get-the-data.html.
It is uncertain how the data was pulled, as a lot of contributors were associated with this website.

For more detailed information on Zillow and AirBnB methodology, please see the Jupyter notebook.

Trends

There is a definite sign of seasonality for NYC. Generally, May, June, September, and October are the peak months. The following are the occupancy %, average price, and RevPAR for 2015. Because the scrapes from insideairbnb.com were not consistent, May 2015 was missed, so May shows lower occupancy and ADR than June. On average, prices were flat for studios, one bedrooms, and two bedrooms, while three and four bedrooms saw some price fluctuations. Thus, on average, hosts do not implement a dynamic pricing model. If their goal is to maximize profits, hosts should price according to demand, increasing prices in peak season and dropping them in low season. The graphs show the min and max as horizontal lines at each point.

2015 Data

Model I

Predicting prices for Private Rooms yielded poor results. While the MSE for Extra Trees was lower than for Linear Regression and Gradient Boosting, and its R-squared correspondingly higher, there is a pattern of correlation in the errors when the residuals are plotted. The variables were Number of Reviews, Rating, Month, Number of Competitors, Occupancy %, and Neighborhood.

The MSE: 155

R-square: 0.74

The residuals show a pattern, which is an indication of a missing feature.
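For reference, MSE and R-squared can be computed from actual and predicted prices using their standard definitions, and the residuals are simply the differences that get plotted. The sketch below uses invented prices, not the Airbnb data, and the function names are chosen for illustration.

```python
def residuals(actual, predicted):
    """Per-listing errors (actual minus predicted), the values shown in a residual plot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return [a - p for a, p in zip(actual, predicted)]


def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    res = residuals(actual, predicted)
    return sum(r * r for r in res) / len(res)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have no variance")
    ss_res = sum(r * r for r in residuals(actual, predicted))
    return 1 - ss_res / ss_tot


# Made-up nightly prices for illustration only.
actual_prices = [100, 150, 200, 250]
predicted_prices = [110, 140, 210, 240]
example_mse = mse(actual_prices, predicted_prices)
example_r2 = r_squared(actual_prices, predicted_prices)
print(example_mse, example_r2)
```

A reasonable R-squared does not rule out structure in the residuals, which is why the residual plot is inspected separately.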

Model I Results

The patterns in the residuals indicate under-fitting. There seem to be one or more features missing from the model. It is harder to explain the price differences among private rooms because number of bedrooms isn’t a feature, since there is only one bedroom. Judging by the feature importances, number of reviews, neighborhood, and rating are the top 3 features. A person looking for a room is very likely looking for a credible listing, one with the most reviews and the highest rating in the neighborhood they want. Since these people are most likely traveling alone, they are looking to minimize the cost of a room and are therefore shopping for a Private Room and not an Entire home/Apt. Occupancy % and Month are not factors, meaning prices are not driven by seasonal factors. Two features that are not available are photo quality and listing quality. These two are important features when deciding on a listing.

Another model was built on 2016 data, since the listings csv had more columns to choose from, such as whether a host is a superhost, whether the unit is instant bookable, host response rate, and host response time. The model did not perform any better, so the conclusion is that other features might explain the price differences, such as listing quality and photo quality.

When applying prices to Zillow, yearly averages will be used (per neighborhood) to gauge the profitability per Zillow listing.

Model II

Model II focused on entire units using the same features as Model I, and also included number of bedrooms. Like Model I, Extra Trees performed better in terms of MSE/R-squared than Linear Regression and Gradient Boosting.

MSE: 1027

R-square: 0.89

The residuals show a slight hint of a correlation, which is an indication of a missing feature, especially in the lower price ranges.

Model II Results

Linear Regression had the highest MSE and lowest R-squared, followed by Gradient Boosting, and then Extra Trees. Extra Trees is the preferred model given that it yielded the best results. Extra Trees randomizes both the features considered and the split thresholds at each node, which usually lowers variance at the cost of slightly higher bias compared with other tree algorithms. There does seem to be a trend in the residual plots, meaning there is a feature that is not present in the model. This could be the photo and listing quality, which, as previously stated, are important to guests when deciding on a listing.

Findings/Results

Extra Trees was the model of choice. It builds each tree from randomly chosen features and split thresholds, which usually results in lower variance, though it can lead to slightly higher bias than other tree-based models. The patterns in the residuals also indicate missing features, such as photo quality and listing quality, which can make or break a listing. Unfortunately, those features were not available.

When applying the yearly average price by neighborhood to Zillow rental listings, the preferred area for both private room and entire home/apt is Midtown. This is because it is a high tourist area with low demand from locals to live there, so price per sqft is cheaper than in lower Manhattan. The spread between the price that a host can charge on an STR platform and the rental price per day (rental price divided by 30) is higher than in other neighborhoods in Manhattan. Clinton Hill fared pretty well for Brooklyn, as it is less costly than DUMBO but still quite accessible to lower Manhattan.
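The spread described above can be sketched as the nightly STR price minus the monthly rent divided by 30, then ranked by neighborhood. The neighborhood labels and dollar amounts below are placeholders, not figures from the Zillow or Airbnb data, and the function names are invented for illustration.

```python
def daily_rent(monthly_rent):
    """Rental price per day, using the rent-divided-by-30 convention from the text."""
    return monthly_rent / 30


def nightly_spread(avg_nightly_price, monthly_rent):
    """Difference between the STR nightly price and the rent per day."""
    return avg_nightly_price - daily_rent(monthly_rent)


# Placeholder inputs: (average STR nightly price, monthly rent) per neighborhood.
listings = {
    "Neighborhood A": (220, 3300),
    "Neighborhood B": (230, 4500),
}

# Rank neighborhoods from the widest spread to the narrowest.
ranked = sorted(
    ((name, nightly_spread(price, rent)) for name, (price, rent) in listings.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

In this toy case the neighborhood with the lower nightly price still ranks first because its rent is much lower, which is the same effect described for Midtown.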

Link for Tableau showing visualization for Private Rooms

Future Considerations

Getting calendar data that is scraped regularly, especially if it includes peak months, would yield better occupancy estimates and more reliable price computation. The next phase after this analysis would be tidying up the Zillow scrape so that it also captures coordinates and whether the building has a doorman (which would be used as a proxy for more luxurious apartments).

It would be useful to download future Airbnb calendar scrapes to give a better average price per month, which should give more accurate results for expected profits.