Awesome Open Source

A visitor to New York City asked a passerby for directions to the city's famous classical music venue:

Visitor: Excuse me, how do I get to Carnegie Hall?

Passerby: Practice, practice, practice!

Course Information


Run Python on the Cloud

  • Use Interactive Python Shell (Interpreter)
    • No registration is required
  • Run Python Scripts through Command Line
    • Requires account registration
    • Need basic Linux knowledge
  • Use Jupyter Notebooks
    • Google Colab
      • Plus: Seamless integration with GitHub
      • Plus: Access to a Linux terminal to run Python scripts
      • Plus: Data science community with public datasets and notebooks for learning

Four Levels of Python Code

  1. Syntax (most basic programming requirements)
  2. Idiom (use of .join for string concatenation)
  3. Design Patterns (best practices and approaches to common problems and issues)
  4. Architectural (Overall project structure)

Most books and courses teach levels 1 and 2 and rarely touch on levels 3 and 4.
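To make the gap between level 1 and level 2 concrete, here is a minimal sketch of the .join idiom mentioned in the list above. The word list is made up for illustration; both versions build the same string.

```python
words = ["practice", "practice", "practice"]

# Level 1 (syntax): correct, but builds a new string on every pass.
sentence = ""
for i, word in enumerate(words):
    if i > 0:
        sentence += ", "
    sentence += word

# Level 2 (idiom): str.join expresses the same intent in one call.
idiomatic = ", ".join(words)

print(idiomatic)
```

The loop is not wrong; the idiom is simply shorter and states the intent directly.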

Four Paradigms of Python Programming

  • Imperative
  • Procedural
  • Object-oriented
  • Functional
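As a rough illustration of the four paradigms, the sketch below solves one small task (sum of squares) four ways. The names sum_of_squares and SquareSummer are hypothetical, chosen only for this example.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Imperative: spell out each state change step by step.
total_imperative = 0
for n in numbers:
    total_imperative += n * n


# Procedural: package the steps into a reusable function.
def sum_of_squares(values):
    total = 0
    for v in values:
        total += v * v
    return total


total_procedural = sum_of_squares(numbers)


# Object-oriented: bundle the data and its behavior in a class.
class SquareSummer:
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(v * v for v in self.values)


total_oo = SquareSummer(numbers).total()

# Functional: fold the data with a pure function and no mutation.
total_functional = reduce(lambda acc, v: acc + v * v, numbers, 0)

print(total_imperative, total_procedural, total_oo, total_functional)
```

Python does not force a choice among these; real programs usually mix them.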

Your Responsibility

  • Written Code - Solely your responsibility. Make sure it is clean, correct, and commented (the 3C rule).
  • Source Data - Primary data is your responsibility. You have no control over secondary data, so be careful in selecting and cleansing it.
  • Existing Libraries - You have no control over existing libraries and algorithms, so be careful in selecting and using them.
  • Interpretation of Results - Distinguish what is objective from what is subjective, and what the data show from what experts know.

The one thing you have absolute control over is the code you write. Make sure you don't write bad code (complicated, incorrect, and undocumented), so-called spaghetti code.
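One possible reading of the 3C rule (clean, correct, commented), sketched on a tiny function. The function name mean and the choice to raise ValueError are assumptions made for this example, not a prescribed style.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty collection of numbers."""
    # Clean: one small function with one job and a descriptive name.
    values = list(values)
    if not values:
        # Correct: fail loudly instead of dividing by zero.
        raise ValueError("mean() requires at least one value")
    # Commented: the docstring says what, the comments say why.
    return sum(values) / len(values)


print(mean([1, 2, 3]))
```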

Wikipedia's Definition of Spaghetti Code:

"Spaghetti code is a pejorative phrase for unstructured and difficult-to-maintain source code. Spaghetti code can be caused by several factors, such as volatile project requirements, lack of programming style rules, and insufficient ability or experience."

Jupyter Notebooks

The name Jupyter comes from the fact that it supports writing code in three popular languages:

  • Julia
  • Python
  • R

Julia and R are popular for statistical analysis and data science. Python is a general-purpose programming language that is also popular in data science, and it suits all kinds of development, not just data science.

Dataviz Six Steps

  1. Define Problem and Ask Questions
  2. Define Data Source and Elements
  3. Tidy up Data (normalize "messy" data so that it is "tidy")
  4. Summarize Data (summarize/tabulate, descriptive statistics)
  5. Visualize Data (static and interactive)
  6. Interpret and Communicate Results

Check out this paper for data tidying.

All six steps must be guided by domain knowledge, principles, and purposes.
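Steps 3 and 4 can be sketched with the standard library alone. The table below is invented for illustration, and the column names city, year, and value are assumptions; in practice a library such as pandas would usually do this work.

```python
from statistics import mean

# Hypothetical "messy" table: one row per city, one column per year.
wide = [
    {"city": "A", "2020": 10, "2021": 12},
    {"city": "B", "2020": 7, "2021": 9},
]

# Step 3 (tidy): one row per observation, one column per variable.
tidy = [
    {"city": row["city"], "year": int(year), "value": value}
    for row in wide
    for year, value in row.items()
    if year != "city"
]

# Step 4 (summarize): mean value per year.
by_year = {}
for obs in tidy:
    by_year.setdefault(obs["year"], []).append(obs["value"])
summary = {year: mean(values) for year, values in by_year.items()}

print(summary)
```

Once the data is tidy, grouping by any variable becomes the same short loop.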

Dataviz - Plots/Charts

  • Univariate
    • Categorical Variable
      • Frequency table
      • Bar chart (x=Categories, y=Count)
      • Pareto Chart (sorted + accumulated %)
      • Pie Chart (avoid it when there are too many categories)
    • Numerical Variable (discrete or continuous)
      • Histogram - frequency distribution
      • Boxplot - five-number summary statistics (centrality and dispersion)
      • Line Chart - trend over time
      • Area Chart - trend over time
    • Textual Variable/Data
      • Wordcloud
  • Multivariate
    • Two categorical variables
      • Contingency table, pivot table
      • Stacked Bar Chart (one bar on top of the other)
      • Grouped Bar Chart (bars placed side by side)
    • Two numerical variables (correlation)
      • 2D Scatter Plot
      • 3D scatter plot
      • Bubble Chart (Scatter plot with varying size of dots based on the third numerical variable)
      • Motion Chart (Scatter plot with time frame for playback)
      • Scatter Plot with varying colors and shapes of marks reflecting additional categorical variables (dimensions)
      • Line chart with multiple lines differentiated by color
    • One Numerical and one categorical variable
      • Bar chart (x=categories of the categorical variable, y=statistics of the numerical variable)
      • Statistics include mean, min, max, median, ...
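Most of the charts above are drawings of a small summary table. The sketch below computes several of those tables with the standard library only. The sample records are made up, and the "inclusive" quartile method is one of several conventions, so plotting libraries may show slightly different quartiles.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical sample: (color, group, numeric value).
records = [
    ("red", "x", 4.0), ("red", "y", 6.0), ("blue", "x", 3.0),
    ("red", "x", 8.0), ("green", "y", 5.0), ("blue", "y", 7.0),
]
colors = [r[0] for r in records]
values = [r[2] for r in records]

# Frequency table: the counts a bar chart would plot.
freq = Counter(colors)

# Pareto chart data: categories sorted by count, with cumulative percentage.
pareto, running = [], 0
for category, count in freq.most_common():
    running += count
    pareto.append((category, count, round(100 * running / len(colors), 1)))

# Five-number summary: the values a boxplot draws.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
five_number = (min(values), q1, q2, q3, max(values))

# Contingency table for two categorical variables (stacked/grouped bars).
contingency = Counter((r[0], r[1]) for r in records)

# One numerical and one categorical variable: a statistic per category.
groups = {}
for color, _, value in records:
    groups.setdefault(color, []).append(value)
median_by_color = {c: median(v) for c, v in groups.items()}

print(freq, pareto, five_number, contingency, median_by_color, sep="\n")
```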

