On Being an Outlier

JoBo · 10 months ago

Isn’t that a continuation of “why the outlier was culled”?

Not sure I follow, but I think the answer is “no”.

If you control for all the causes of a difference, the difference will disappear. Which is fine if you’re looking for causal factors which are not already known to be causal factors, but no good at all if you’re trying to establish whether or not a difference exists.

It’s really quite difficult to ask a coherent question with real-world data from the messy, complicated reality of human beings.

A simple example:

Women are more likely to die from complications after a coronary artery bypass.

But if you include body surface area (a measure of body size) in your model, the difference between men and women disappears.

And if you go the whole hog and measure vein size, the importance of body size disappears too.

And, while we can never do an RCT to prove it, it makes perfect sense that smaller veins would increase the risk for a surgery which involves operating on blood vessels.

None of that means women do not, in fact, have a higher risk of dying after coronary artery bypass surgery. Collect all the data which has ever existed and women will still be more likely to die from the surgery. We have explained the phenomenon and found what is very likely to be the direct cause of higher mortality. Being a woman just makes you more likely to have that risk factor.
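A toy simulation makes the same point. This is only a sketch with made-up numbers, not real surgical data: the 60%/20% small-vein rates and the 6%/2% mortality risks are assumptions chosen for illustration, and `simulate` and `death_rate` are hypothetical helper names. Sex affects vein size, vein size affects mortality, and sex has no direct effect at all.

```python
import random

def simulate(n=200_000, seed=0):
    """Return (woman, small_veins, died) tuples from a made-up causal chain."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        woman = rng.random() < 0.5
        # Assumed: small veins are more common in women (60% vs 20%).
        small_veins = rng.random() < (0.6 if woman else 0.2)
        # Assumed: risk depends on vein size only, never on sex directly.
        died = rng.random() < (0.06 if small_veins else 0.02)
        rows.append((woman, small_veins, died))
    return rows

def death_rate(rows):
    """Share of patients in rows who died."""
    return sum(died for _, _, died in rows) / len(rows)

rows = simulate()

# Crude comparison: does a difference between women and men exist?
crude_women = death_rate([r for r in rows if r[0]])
crude_men = death_rate([r for r in rows if not r[0]])
print(f"crude: women {crude_women:.3f}, men {crude_men:.3f}")

# "Controlling" for vein size: compare women and men within each stratum.
adjusted = {}
for small in (True, False):
    women = death_rate([r for r in rows if r[0] and r[1] == small])
    men = death_rate([r for r in rows if not r[0] and r[1] == small])
    adjusted[small] = (women, men)
    print(f"small_veins={small}: women {women:.3f}, men {men:.3f}")
```

By construction the crude rates should come out near 4.4% for women and 2.8% for men, while within each vein-size stratum the two sexes have the same underlying risk. The crude gap is real; stratifying explains it rather than erasing it.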

It is rare that the answer is as neat and simple as this. It is very easy to ask a different question from the one you thought you were asking (or pretend to be answering one question when you answered another).

You can’t just throw masses of data into a pot and expect sensible answers to come out. This is the key difference between statisticians and data scientists. And, not to throw shade on data scientists, they often end up explaining to the world that oestrogen makes people more likely to die from complications of coronary artery bypass surgery.

ozymandias117@lemmy.world · 10 months ago

Maybe it’s a crude interpretation, but over-controlling for all the causes of a change and removing outliers from the data used to train these AI models seem like similar problems when you’re trying to actually understand the data.

JoBo · 10 months ago

The data cannot be understood. These models are too large for that.

Apple says it doesn’t understand why its credit card gives lower credit limits to women than men even if they have the same (or better) credit scores, because they don’t use sex as a datapoint. But it’s freaking obvious why, if you have a basic grasp of the social sciences and humanities. Women were not given the legal right to their own bank accounts until the 1970s. After that, banks could be forced to grant them bank accounts but not to extend the same amount of credit. Women earn and spend in ways that are different, on average, to men. So the algorithm does not need to be told that the applicant is a woman, it just identifies them as the sort of person who earns and spends like the class of people with historically lower credit limits.

Apple’s ‘sexist’ credit card investigated by US regulator

Garbage in, garbage out. Society has been garbage for marginalised groups since forever and there’s no way to take that out of the data. Especially not big data. You can try but you just end up playing whack-a-mole with new sources of bias, many of which cannot be measured well, if at all.
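To make the proxy mechanism concrete, here is a minimal sketch with invented numbers. It is not how any real card issuer's model works: the spending `pattern` feature, the 0.7 historical penalty and the lookup-table "model" are all assumptions for illustration. The model is fitted on historical limits and never sees sex, yet it still hands out lower limits to women with the same credit score.

```python
import random
from collections import defaultdict

rng = random.Random(1)

# Made-up historical data: (woman, pattern, score, limit).
applicants = []
for _ in range(100_000):
    woman = rng.random() < 0.5
    # Assumed proxy: spending pattern "A" is more common among women.
    pattern = "A" if rng.random() < (0.8 if woman else 0.2) else "B"
    score = rng.choice([600, 700, 800])  # credit score, independent of sex
    # Assumed historical bias: women got 70% of the limit men got.
    limit = score * 10 * (0.7 if woman else 1.0)
    applicants.append((woman, pattern, score, limit))

# "Train" a lookup-table model on (pattern, score). Sex is never an input.
totals = defaultdict(lambda: [0.0, 0])
for _, pattern, score, limit in applicants:
    totals[(pattern, score)][0] += limit
    totals[(pattern, score)][1] += 1
model = {key: total / count for key, (total, count) in totals.items()}

def mean_prediction(is_woman, score):
    """Average predicted limit for one sex at a fixed credit score."""
    preds = [model[(p, s)] for w, p, s, _ in applicants
             if w == is_woman and s == score]
    return sum(preds) / len(preds)

women_700 = mean_prediction(True, 700)
men_700 = mean_prediction(False, 700)
print(f"score 700: women {women_700:.0f}, men {men_700:.0f}")
```

Dropping the sex column does not help here, because the spending pattern carries most of the same information. Removing that feature too would just move the problem to whatever correlates with it next.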

ozymandias117@lemmy.world · 10 months ago

You are pointing out specific biases that we already know about. The article you posted seems to posit using the data to find the unknown biases we have as well

JoBo · 10 months ago

It’s asking why don’t we use it for that purpose, not suggesting that there is anything easy about doing so. I don’t know how you think science works, but it’s not like that.