Everybody wants to be Albert Einstein….

Normally when I write, there is a mantra in my head – “how can I show that what I’ve said is true?”. This is the product of writing in an academic environment. Everything has to be referenced, every statement qualified.

This blog will not do any of that. All of this is shooting from the hip and entirely based on opinion rather than on an evidence-based perspective. Much of what I say will have been said by others. I’ll either know that and hence I’ll apologise in advance for not referencing them or I won’t know because I didn’t do the research, so I’ll apologise in advance to them for not bothering.

In any case – Caveat Lector


We have a problem.

It’s not clear to me whether it applies just to inter-disciplinary scientific research, to inter-disciplinary research in general, or even to academic research in general. It certainly envelops Machine Learning and it has its influence on Bioinformatics. It’s certainly shaping the debate about what exactly we mean by Data Science. When I get to the punchline I’ll talk about Data Analysis and I’ll mean that in as general a way as possible.

In summary, it can be defined thus:

Everybody wants to be Albert Einstein.

Nobody wants to be Marie Skłodowska-Curie.

To be clear, this has nothing to do with their personal lives – given the choice I’d much rather live a full and long life, dying relatively quickly in old age, than lose a spouse in a horrendous accident and die horribly in late middle age. Likewise, this has nothing to do with gender either (though it’s tempting to be drawn to that argument).

This is about the way they did their research.

Albert Einstein was a theoretician. He did Gedankenexperimente – thought experiments that, over roughly a decade, provided incredibly useful insights into the photoelectric effect, the relationship between moving frames of reference (which led on to Special Relativity and the extraordinary insight of that equation) and then Gravity (the General Theory of Relativity). He encapsulated ideas into simple (some would say beautiful – though that type of vocabulary is one I shy away from now) mathematical equations (including of course that one).

Marie Skłodowska-Curie was an experimentalist. If she were working in Molecular Biology she would be called a wet lab researcher. Her major discoveries were the elements Polonium and then Radium, which she went on to extract: two chemical elements that provided a ready source of radioactivity. Because of them, we gained insights into the structure of the nucleus.

To do that, Marie Skłodowska-Curie and her husband had to extract them from a tonne of pitchblende.

That is not a metaphorical tonne. This was an actual heavy-as-a-car kind of tonne.

In addition, pitchblende is nasty unpleasant mucky stuff.

Their “lab” was not a well kitted out lab, even by early 20th Century European University standards.

It was a shed.

It was shitty work and they managed to extract milligrams of Polonium and Radium from all of that. But it was enough to go on.

They both did much else, but these are probably the things they’ll be remembered most for within Science.

Einstein discussed his ideas with friends and spent a great deal of time thinking very very deeply while staring out the window. Skłodowska-Curie sifted through a tonne of crap.

As with any other great intellectual shift, Einstein and Skłodowska-Curie couldn’t have done what they did without building on others’ work. With respect to Einstein, this is a truth that is often ignored. His insights depended on much of the experimental work done in electromagnetism in the previous century. In particular, the negative experimental result of Michelson and Morley, which showed that there was no such thing as the aether (a medium for light to propagate through), was key here. Michelson spent decades working on that experiment, and believed to his dying day that he was a failure, because the idea of the aether was so appealing, so obvious, so beautiful from the theoretical perspective.

I’m not here to have a pop at Einstein. He seems to have been in the nice guy camp of great theoretical physicists (the other being the utter dick camp). I don’t think he would have denied the importance of the experimental work to his thinking.

The trouble is this – as I said before, analysts of data want to be Einstein. They want to be the one who had the insights based on the prior work. Instead of a simple set of equations, the analogue is the algorithm. Instead of the prior insights that have come from experiment, they want data.

Curiously there is a great deal of selection here about what’s good and bad data that has nothing to do with the quality of the data. In this framework, “good” means a data set without obvious missing values, one that already lies in a relational database or can be queried directly via an API. Ostensibly complex data structures are okay as long as they can be described as a graph. Most of all there has to be lots of it.

Issues such as

  • cleaning up the data (which includes quality control, getting the data into the right format and so on),
  • considering how to deal with obvious missing values,
  • asking if there is some subset of the data that is of sufficient quality,
  • asking if the data set has implicitly not included relevant data,
  • asking if this is the right data set for the questions you want to ask,
  • or asking what questions can be asked of this data set

are seen as intellectually inferior to the algorithm. The fact that these steps typically take up most of the time in any analysis where the goal is to extract conclusions from a data set is largely forgotten. I should stress here of course that this point about the time taken by these steps has been discussed elsewhere at great length. The fact that an individual algorithm may not give much by way of insight is neither here nor there.
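For what it’s worth, even a toy first pass at a few of the steps listed above is real work. Here is a minimal sketch in Python. The records, the field names, the set of “missing” markers and the “keep only complete rows” rule are all made up for illustration; they are assumptions, not a recipe.

```python
# Hypothetical records: field names and values are invented for illustration.
RAW_ROWS = [
    {"sample_id": "s1", "mass_mg": "0.12", "site": "A"},
    {"sample_id": "s2", "mass_mg": "", "site": "A"},
    {"sample_id": "s3", "mass_mg": "n/a", "site": ""},
    {"sample_id": "s4", "mass_mg": " 0.30 ", "site": "B"},
]

# Assumption: these strings all mean "missing" in this made-up data set.
MISSING_MARKERS = {"", "n/a", "na", "null"}


def clean_value(value):
    """Strip whitespace and map missing markers to None."""
    text = value.strip()
    return None if text.lower() in MISSING_MARKERS else text


def clean_rows(rows):
    """Return a cleaned copy of the rows, with every value normalised."""
    return [{key: clean_value(val) for key, val in row.items()} for row in rows]


def missing_counts(rows):
    """Count missing values per field, to see where the gaps are."""
    counts = {}
    for row in rows:
        for key, val in row.items():
            counts[key] = counts.get(key, 0) + (val is None)
    return counts


def complete_subset(rows):
    """Keep only rows with no missing values (one possible quality rule)."""
    return [row for row in rows if all(val is not None for val in row.values())]


cleaned = clean_rows(RAW_ROWS)
print(missing_counts(cleaned))
print([row["sample_id"] for row in complete_subset(cleaned)])
```

None of this touches the harder questions in the list, such as whether this is the right data set at all. That takes judgement, not code.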

The algorithm is seen as the intellectual apex of the process. The rest, the grimy shitty work, is seen as detail. Researchers are rewarded accordingly – indeed we even shape the language so that researchers develop algorithms but service staff get the data into the right format.

Everybody wants to be Albert Einstein.

Nobody wants to be Marie Skłodowska-Curie.
