What is the best way for getting started in Statistics for Programmers/Data Science?

I am often asked this question: what is the best way to get started in Statistics for Programmers?

At the Data Science for IoT course – and also in my teaching at Oxford University – I have used the following approach.

Comments welcome:

Firstly, the interest in Statistics for Programmers is a fairly recent phenomenon.

This interest is based on the uptake of Data Science – a hot profession now.

Here’s how most people approach the problem:

They pick up an old high school statistics textbook, either their own from younger days or a standard one.

These books are often decades old.

They start at page one and work linearly through a few pages.

They quickly realize why they disliked stats earlier.

And that sentiment has not changed with the passage of time.

But here is a different approach:

For Data Science, you do not need to master Statistics per se.

You need to understand Statistical models.

A model is defined as a combination of predictive algorithms (based on Statistics) and Data.

Data Science is based on creating models that improve with experience/training.

In contrast to the textbook route, in the Data Science for IoT course we start with problems (the Engineering approach).

I recommend three sources which I am using (if you have others, please let me know at ajit.jaokar at futuretext.com and I shall link them and refer back to you)

Start with Understanding the problem

See these two links by @Brandon Rohrer (@Microsoft Data Science):

Which algorithm family can answer my question and

Which questions can Data Science answer.

See also these posts by Dr Vincent Granville (@DataScienceCtrl) on 24 uses of Statistical modelling, Part 1 and Part 2.

These posts give you an idea of the problems that can be solved using Data Science and stats (without going into the math itself initially).

Then read Allen Downey’s books

Allen Downey writes excellent books, and they are all free under Creative Commons. You can download them at Green Tea Press, which has an excellent ethos. I especially recommend Think Stats, Think Bayes and Think Complexity (in that order).

To support the author, I would also encourage you to buy these books, especially Think Stats.

You can follow him on Twitter @allendowney

Having mastered the material to this stage, start with code and small datasets.

I prefer UCI datasets and the Python scikit-learn library.

Sumit also works with the REPL approach and Paul uses Spark notebook in our course.

In any case, these are small sections of code run in a controlled environment that show you how the stats are implemented (libraries/APIs like scikit-learn are relatively easy to understand if you come from a Programming background).
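As an illustration of what such a small section of code can look like, here is a minimal sketch. It is not taken from the course material. It assumes scikit-learn is installed and uses the Iris data (a classic UCI dataset that ships with scikit-learn). The choice of logistic regression, the function name fit_and_score, the 70/30 split and the seed are all my own assumptions for the example.

```python
# A minimal sketch: fit a simple statistical model on a small UCI dataset.
# Assumptions (not from the course): Iris data, logistic regression,
# a 70/30 train/test split, and the helper name fit_and_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(test_size=0.3, seed=0):
    """Train on part of the Iris data and return accuracy on the held-out part."""
    # 150 flowers, 4 measurements each, 3 species to predict.
    X, y = load_iris(return_X_y=True)

    # Hold back some rows so the model is scored on data it has not seen.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # The "algorithm" half of the model; fitting it to data gives the model.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Fraction of held-out flowers whose species was predicted correctly.
    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    print(f"held-out accuracy: {fit_and_score():.3f}")
```

Run it in a REPL or a notebook, then change test_size or swap in another estimator and watch how the score moves. That kind of small experiment is the point of this stage.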

That’s the path we are using in the Data Science for IoT course.

Any comments/feedback welcome on your approach to teaching statistics (ajit.jaokar at futuretext.com)

Image source: Scatter plots – wikipedia