Data Science Map

This is a wiki for learning the various skills that make a great data scientist.

There are lots of online courses relevant to "big data". Here are some:

http://bigdatauniversity.com/courses/

http://datascience101.wordpress.com/

=Basics=
One of the first questions we can ask is: What, exactly, IS Big Data?

Another area of interest to many companies is Time Series Analysis.

=Road to becoming a data scientist=
We start with "road to a data scientist" by Swami Chandrasekaran.

Fundamentals
Matrices & Linear Algebra Fundamentals

Hash Functions, Binary Tree, O(n)

Relational Algebra, DB Basics

Inner, Outer, Cross, Theta Join

CAP Theorem

Tabular Data

Entropy

Data Frames & Series

Sharding

OLAP

Multidimensional Data Model

ETL

Reporting Vs BI Vs Analytics

JSON & XML

NoSQL

Regex

Vendor Landscape

Env Setup
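
As a small illustration of the join types listed above (inner, outer, cross, theta), here is a minimal sketch in plain Python. The two toy tables and all function names are hypothetical, invented for this example. A real database engine implements joins very differently; this only shows what rows each join produces.

```python
from itertools import product

# Hypothetical toy tables (assumption, not real data):
# employees rows are (name, dept_id); departments rows are (dept_id, dept_name).
employees = [("Ada", 1), ("Grace", 2), ("Alan", None)]
departments = [(1, "Research"), (2, "Engineering"), (3, "Sales")]


def cross_join(left, right):
    """Pair every left row with every right row."""
    return [l + r for l, r in product(left, right)]


def theta_join(left, right, predicate):
    """Keep only the pairs for which predicate(left_row, right_row) is true."""
    return [l + r for l, r in product(left, right) if predicate(l, r)]


def inner_join(left, right):
    """A theta join whose predicate is equality on dept_id (an equi-join).

    A missing dept_id (None) never matches, similar to SQL NULL.
    """
    return theta_join(left, right, lambda l, r: l[1] is not None and l[1] == r[0])


def left_outer_join(left, right):
    """Inner join, plus unmatched left rows padded with None."""
    width = len(right[0])
    rows = []
    for l in left:
        matches = [l + r for r in right if l[1] is not None and l[1] == r[0]]
        rows.extend(matches or [l + (None,) * width])
    return rows


if __name__ == "__main__":
    print(len(cross_join(employees, departments)))
    print(inner_join(employees, departments))
    print(left_outer_join(employees, departments))
```

The cross join is the largest result (every combination), and each of the other joins can be seen as a filtered or padded version of it.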

Statistics
Pick a Dataset

Descriptive Statistics - mean, median, range, SD, var

Exploratory Data Analysis

Histograms

Percentiles & Outliers

Probability Theory

Bayes Theorem

Random Variables

Cumul Dist Fn (CDF)

Distributions - Normal (Gaussian, continuous), Poisson (discrete)

Skewness

ANOVA

Prob Den Fn (PDF)

Central Limit Theorem

Monte Carlo Method

Hypothesis Testing

p-Value

Chi squared test

Estimation

Confidence Interval

MLE

Kernel Density Estimate

Regression

Covariance

Correlation

Pearson Coeff

Causation

Least Square Fit

Euclidean Distance
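
Several of the statistics topics above (descriptive statistics, covariance, the Pearson coefficient, and distance between points) can be tried out with nothing more than the Python standard library. The following is a minimal sketch. The sample values and function names are made up for illustration, and population (not sample) formulas are assumed throughout.

```python
import math
import statistics

# Hypothetical sample data, invented for this example.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 9.0]


def describe(values):
    """Return mean, median, range, variance and standard deviation.

    Uses the population variance and SD (divide by n, not n - 1).
    """
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "var": statistics.pvariance(values),
        "sd": statistics.pstdev(values),
    }


def covariance(a, b):
    """Population covariance of two equally long sequences."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(a, b) / (statistics.pstdev(a) * statistics.pstdev(b))


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    print(describe(xs))
    print(pearson(xs, ys))
    print(euclidean_distance([0, 0], [3, 4]))
```

In practice, libraries such as NumPy or pandas provide these operations for large datasets; the hand-written versions here are only meant to make the definitions concrete.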

Programming
Topics here include areas that are more directly related to the programming side of data science.

Machine Learning
Specific sections will be added in the future. For now, some useful resources:

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

Coursera Machine Learning Lecture Notes

Toolbox
MapReduce - originally the name of Google's proprietary implementation, now used more widely for the programming model, especially in Hadoop

Hadoop in Python

see http://johanlouwers.blogspot.com/2012/02/map-reduce-into-relation-of-big-data.html

Oracle Coherence - A distributed hash map from Oracle

Message Passing Interface (MPI) - a standard for programming parallel computers by passing messages between processes, started in the early 1990s.

Julia - A high-performance, open-source scientific programming language that uses a just-in-time (JIT) compiler. The syntax is very MATLAB-like.
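
To make the MapReduce entry in the toolbox above more concrete, here is a minimal single-process sketch of the programming model, using the classic word-count example. It is not Hadoop's API: the function names are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

Because each call to `map_phase` looks at only one document, and each call to `reduce_phase` at only one key, both steps can be distributed without the workers needing to communicate, which is the main idea behind the model.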