Assumptions are Everything — Why machine learning models have assumptions

Damini Vadrevu
May 21, 2024

The obvious might not be so obvious.

In data science, assumptions play a crucial role in shaping our analyses and interpretations. They can turn what seems obvious into something far more complex and nuanced. Recently, I was working on a project involving job seekers — specifically, predicting whether they would apply for a job or not. Kind of like a “loves me, loves me not” scenario.

I began with the ostensibly simple idea that increased job posting visibility translates into increased engagement (measured by the number of times the apply button was clicked). It seemed intuitive: after all, once a job seeker clicks on a job posting, there is a high probability that they will start the application process. The two variables, views and apply_clicks, had a positive Pearson correlation coefficient of 0.71, so even the numbers agreed. However, on closer inspection, I discovered that this presumption wasn’t as absolute as it initially seemed.
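For readers who want to see what that coefficient measures, here is a minimal sketch of Pearson’s r in plain Python. The `pearson` helper and the tiny `views` and `apply_clicks` lists are made up for illustration; they are not the project’s data and will not reproduce the 0.71 above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (co-variation) and the two sums of
    # squared deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented numbers, for illustration only.
views = [1, 2, 3, 4, 5, 6]
apply_clicks = [1, 1, 2, 2, 4, 3]

r = pearson(views, apply_clicks)
print(round(r, 2))
```

A value close to 1 only says the two columns tend to rise together across the whole sample. It says nothing about why, or whether that holds for every kind of job seeker.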

Unpacking the Assumptions

Sabrina Carpenter once said that she can’t relate to desperation, but that’s not everyone’s case. What I mean is that there are different types of job seekers: those who urgently need a job and those who don’t need one right away. The ones who are desperate for a job would apply anyway; they would engage anyway. They’re not assessing the job description thoroughly or carefully reading what the job has to offer them; they view it only to click apply.

In contrast, someone who has the time to transition or is considering a career upgrade is not viewing to apply; their view actually matters. This means that not all job seekers are on the same level. Sounds pretty straightforward, doesn’t it? But it completely changes our initial dynamic, where we thought views had a direct influence on applications. This is a total shift in perspective.
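One way to see this for yourself is to compare apply clicks per view within each group against the pooled figure. The sketch below assumes a segment label exists for each job seeker, which is a hypothetical column; the segment names, the `apply_rate_by_segment` helper, and every number are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (segment, views, apply_clicks).
# The segment label and all values are invented, not real project data.
records = [
    ("urgent", 10, 9),
    ("urgent", 6, 6),
    ("urgent", 8, 7),
    ("exploring", 12, 2),
    ("exploring", 9, 1),
    ("exploring", 15, 3),
]


def apply_rate_by_segment(rows):
    """Return apply clicks per view for each segment and for all rows pooled."""
    totals = defaultdict(lambda: [0, 0])  # key -> [views, apply_clicks]
    for segment, views, clicks in rows:
        for key in (segment, "pooled"):
            totals[key][0] += views
            totals[key][1] += clicks
    # Skip any key with zero views to avoid dividing by zero.
    return {key: c / v for key, (v, c) in totals.items() if v > 0}


rates = apply_rate_by_segment(records)
for key, rate in sorted(rates.items()):
    print(f"{key}: {rate:.2f} apply clicks per view")
```

In this toy data the pooled rate sits between the two groups and describes neither of them well, which is the kind of detail a single all-users assumption can hide.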

Accountability

So the model I’m working on predicts the number of times an individual clicks the apply button; let’s call this variable ‘apply_clicks’. Because we’re assuming all job seekers are equal, a model that depends heavily on the number of job views as a predictor of apply clicks may not reflect the true degree of interest. Since “desperate” job seekers are more inclined to click apply, the number of views may distort the picture, implying a higher level of interest than is actually present in the job seeker sample.

I became aware of the oversimplification in my initial assumption, that visibility inevitably results in engagement, after working through these insights. Although assumptions can simplify a complicated reality, they can also mask important details and lead to false conclusions. To ensure that my conclusions are as solid and correct as possible, I recognize and challenge the assumptions that underlie my work; I hold them accountable. Holding these assumptions accountable tells the true story the data has to tell. Data doesn’t lie and delulu is definitely not the solulu.
