Saturday, October 5, 2013

The bursting of the big data bubble | mathbabe

It’s been a good ride. I’m not gonna lie, it’s been a good time to be a data whiz, a quant-turned-data scientist. I get lots of attention and LinkedIn emails just for my title and my math Ph.D., and it’s flattering. But all of that is going to change, starting now.
You see, there are some serious headwinds. They started a while ago but they’re picking up speed, and the magical wave of hype propelling us forward is giving way. I can tell; I’ve got a nose for sinking ships and sailing metaphors.
First, the hype and why it’s been so strong.
It seems like data and the ability to use data are the secret sauce in so many of the big success stories. Look at Google. They managed to think of the entire web as their data source, and have earned quite a bit of respect and advertising money for their chore of organizing it like a huge-ass free library for our benefit. That took some serious data handling and modeling know-how.
We humans are pretty good at detecting patterns, so after a few companies made it big with the secret data sauce, we inferred that, when you take a normal tech company and sprinkle on data, you get the next Google.
Next, a few reasons it’s unsustainable
Most companies don’t have the data that Google has, and can never hope to cash in on stuff at the scale of the ad traffic that Google sees. Even so, there are lots of smaller but real gains that lots of companies – but not all – could potentially realize if they collected the right kind of data and had good data people helping them.
Unfortunately, this process rarely actually happens the right way, often because the business people ask their data people the wrong questions to begin with. And since they think of their data people as little more than pieces of software (data in, magic out), they don’t get their data people sufficiently involved in working out which problems data can actually address.
Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.
Plus, it’s super hard to accumulate hard evidence that you have a crappy data science team. If you’ve hired one or more unqualified data scientists, how can you tell? They still might be able to implement crappy models which don’t answer the right question, but in order to see that you’d need to also have a good data scientist who implements a better solution to the right question. But you only have one. It’s a counterfactual problem.
Here’s what I see happening. People have invested some real money in data, and they’ve gotten burned by a lack of medium-term results. Now they’re getting impatient for proof that data is an appropriate place to invest what little money their VCs have offered them. That means they want really short-term results, which means they’re lowballing data science expertise, which means they only attract people who’ve taken one machine learning class and fancy themselves experts.
In other words, data science expertise has been commodified, and it’s a race to the bottom. Who will solve my business-critical data problem on a short-term consulting basis for less than $5000? Less than $4000?
What’s next?
There really is a difference between A) crude models that someone constructs not really knowing what they’re doing and B) thoughtful models which gain an edge along the margin. It requires someone who actually knows what they’re doing to get the latter kind of model. But most people are unaware of even the theoretical difference between type A and type B models, nor would they recognize which type they’ve got once they get one.
Even so, over time, type B models outperform type A models, and if you care enough about the marginal edge between the two types, say because you’re in a competitive environment, then you will absolutely need type B to make money. And by the way, if you don’t care about that marginal edge, then by all means you should use a type A solution. But you should at least know the difference and make that choice deliberately.
My forecast is that, once the hype wave of big data is dead and gone, there will emerge reasonable standards of what a data scientist should actually be able to do, and moreover a standard of when and how to hire a good one. It’ll be a rubric, and possibly some tests, of both problem solving and communication.
Personally, I’m looking forward to a more reasonable and realistic vision of how data and data expertise can help with things. I might have to change my job title, but I’m used to it.
