Data Science
Data Science
Data Science is a process, not an event. It is the process of using data to understand different things, to understand the world. It is when you have a model or hypothesis of a problem and you try to validate that hypothesis or model with your data. Data science is the art of uncovering the insights and trends that are hiding behind data.
Big Data
Data Science / What is Data Science / Big Data
Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value. The V's of Big Data: velocity, volume, variety, veracity, and value.
Hadoop
Data Science / What is Data Science / Big Data / Hadoop
Hadoop took the data, sliced it into pieces, and replicated (or triplicated) each piece, then sent the pieces of these files out to thousands of computers (first it was hundreds, now it is tens of thousands). It then sent the same program to all the computers in the cluster. Each computer ran the program on its little piece of the file and sent the results back. The results were then sorted, and those results were redistributed to another process. The first process is called a map (or mapper) process and the second is called a reduce process. One thing that is nice about these big data clusters is that they scale linearly: with twice as many servers you get twice the performance and can handle twice the amount of data.
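A minimal sketch of the map and reduce idea in plain Python (a made-up word-count example, not Hadoop's actual API):

from collections import defaultdict

# Map step: each "node" turns its slice of the file into (word, 1) pairs
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Reduce step: pairs with the same key are combined into a single count
def reducer(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

slices = [["big data big"], ["data science"]]        # pieces of the file on two "nodes"
all_pairs = [p for s in slices for p in mapper(s)]   # run the mapper on every slice
print(reducer(all_pairs))                            # {'big': 2, 'data': 2, 'science': 1}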
Python
Data Science / Tools for Data Science / Languages for Data Science / Python
1. Python is a high-level general-purpose programming language that can be applied to many different classes of problems.
2. It has a large, standard library that provides tools suited to many different tasks, including but not limited to databases, automation, web scraping, text processing, image processing, machine learning, and data analytics.
3. For data science, you can use Python's scientific computing libraries such as Pandas, NumPy, SciPy, and Matplotlib.
4. For artificial intelligence, it has TensorFlow, PyTorch, Keras, and Scikit-learn.
5. Python can also be used for Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK).
R
Data Science / Tools for Data Science / Languages for Data Science / R
R is most often used by statisticians, mathematicians, and data miners for developing statistical software, graphing, and data analysis. The language’s array-oriented syntax makes it easier to translate from math to code.
Pandas (Data structures & tools)
Data Science / Packages, APIs, Data Sets and Models / Libraries / Scientific Computing / Pandas (Data structures & tools)
Pandas offers data structures and tools for effective data cleaning, manipulation, and analysis. It provides tools to work with different types of data. The primary instrument of Pandas is a two-dimensional table consisting of columns and rows. This table is called a “DataFrame” and is designed to provide easy indexing so you can work with your data.
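A minimal sketch of a DataFrame (the column names and values are made up for illustration):

import pandas as pd

# A DataFrame is a two-dimensional table of rows and columns with easy indexing
df = pd.DataFrame({"make": ["honda", "subaru", "honda"],
                   "price": [13950, 17450, 16500]})
print(df.head())           # first rows of the table
print(df["price"].mean())  # index a column and analyze it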
Numpy (Arrays & matrices)
Data Science / Packages, APIs, Data Sets and Models / Libraries / Scientific Computing / Numpy (Arrays & matrices)
NumPy is based on arrays, enabling you to apply mathematical functions to those arrays. Pandas is actually built on top of NumPy. Data visualization methods are a great way to communicate with others and show the meaningful results of an analysis.
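A minimal sketch of applying mathematical functions to an array:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
print(a * 2)       # element-wise arithmetic: [2. 4. 6.]
print(np.sqrt(a))  # the function is applied to every element of the array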
SciPy (Integrals, solving differential equations, optimization)
Data Science / Packages, APIs, Data Sets and Models / Libraries / Scientific Computing / SciPy (Integrals, solving differential equations, optimization)
SciPy includes functions for some advanced math problems, such as integrals, solving differential equations, and optimization, as well as data visualization.
Open data
Data Science / Packages, APIs, Data Sets and Models / Data Sets / Open data
Where to find open data
Open data portal list from around the world
- http://datacatalogs.org/
Governmental, intergovernmental and organization websites
- http://data.un.org/ (United Nations)
- https://www.data.gov/ (USA)
- https://www.europeandataportal.eu/en/ (Europe)
Kaggle
- https://www.kaggle.com/datasets
Google data set search
- https://datasetsearch.research.google.com/
Community Data License Agreement
Data Science / Packages, APIs, Data Sets and Models / Data Sets / Community Data License Agreement
Methodology
Data Science / Methodology
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modeling to Evaluation
5. From Deployment to Feedback
Importing Datasets
Data Science / Data Analysis with Python / Importing Datasets
Data Wrangling
Data Science / Data Analysis with Python / Data Wrangling
Missing values
Data Science / Data Analysis with Python / Data Wrangling / Missing values
Drop the missing values
- drop the variable
- drop the data entry
Replace the missing values
- replace with an average (of similar datapoints)
- replace by frequency
- replace based on other functions
Leave it as missing data
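A minimal pandas sketch of the drop and replace options (df and the 'horsepower' column are assumed from the course's car dataset):

import numpy as np

# Drop the data entries where 'horsepower' is missing
df_dropped = df.dropna(subset=["horsepower"], axis=0)

# Replace the missing values with the average of the column
mean_hp = df["horsepower"].astype("float").mean()
df["horsepower"] = df["horsepower"].replace(np.nan, mean_hp)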
Data Normalization
Data Science / Data Analysis with Python / Data Wrangling / Data Normalization

Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling the variable so the average is 0, scaling the variable so the variance is 1, or scaling the variable so its values range from 0 to 1.
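A minimal sketch of three common scalings on a pandas column (the 'length' column is just an illustrative name; in practice you would pick one of the three):

# Simple feature scaling: divide by the maximum, values end up roughly between 0 and 1
df["length"] = df["length"] / df["length"].max()

# Min-max scaling: values range exactly from 0 to 1
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score scaling: the average becomes 0 and the variance becomes 1
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()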
Binning
Data Science / Data Analysis with Python / Data Wrangling / Binning
Grouping of values into "bins". Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.
Example:
In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288 with 57 unique values. What if we only care about the price difference between cars with high, medium, and low horsepower (3 types)? Can we rearrange them into three 'bins' to simplify the analysis?
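A minimal pandas sketch of binning the horsepower example above into three bins:

import numpy as np
import pandas as pd

bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)  # 4 edges -> 3 bins
group_names = ["Low", "Medium", "High"]
df["horsepower-binned"] = pd.cut(df["horsepower"], bins,
                                 labels=group_names, include_lowest=True)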
One-hot encoding
Data Science / Data Analysis with Python / Data Wrangling / One-hot encoding
Turn categorical variables into quantitative variables.
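A minimal pandas sketch (assuming a categorical 'fuel-type' column with values such as 'gas' and 'diesel'):

import pandas as pd

dummies = pd.get_dummies(df["fuel-type"])  # each category becomes its own 0/1 column
df = pd.concat([df, dummies], axis=1)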
Exploratory Data Analysis
Data Science / Data Analysis with Python / Exploratory Data Analysis
Descriptive Statistics
Data Science / Data Analysis with Python / Exploratory Data Analysis / Descriptive Statistics
df.describe(): count, mean, std, min, 25%, 50%, 75%, max
value_counts(): drive_wheels_counts = df["drive-wheels"].value_counts().to_frame()
Box Plot
Data Science / Data Analysis with Python / Exploratory Data Analysis / Descriptive Statistics / Box Plot
Type plots

import seaborn as sns
sns.boxplot(x="drive-wheels", y="price", data=df)
Scatter Plot
Data Science / Data Analysis with Python / Exploratory Data Analysis / Descriptive Statistics / Scatter Plot
Type plots
GroupBy
Data Science / Data Analysis with Python / Exploratory Data Analysis / GroupBy
df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
Pivot
Data Science / Data Analysis with Python / Exploratory Data Analysis / GroupBy / Pivot
One variable displayed along the columns and the other variable displayed along the rows
Eg: drive wheels displayed along the columns and body style displayed along the rows
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
Heatmap
Data Science / Data Analysis with Python / Exploratory Data Analysis / GroupBy / Heatmap
Type plots
Correlation
Data Science / Data Analysis with Python / Exploratory Data Analysis / Correlation
A statistical metric for measuring to what extent different variables are interdependent; in other words, if one variable changes, how does this affect change in the other variable?
Pearson Correlation
Data Science / Data Analysis with Python / Exploratory Data Analysis / Correlation / Pearson Correlation
Measure the strength of the correlation between two features
- Correlation coefficient
- P-value
Correlation coefficient
- Close to +1: Large positive relationship
- Close to -1: Large negative relationship
- Close to 0: No relationship
P-value
- < 0.001 Strong certainty in the result
- < 0.05 Moderate certainty in the result
- < 0.1 Weak certainty in the result
- >0.1 No certainty in the result
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
ANOVA (Analysis of Variance)
Data Science / Data Analysis with Python / Exploratory Data Analysis / Correlation / ANOVA (Analysis of Variance)
ANOVA can be used to find the correlation between different groups of a categorical variable. Using the car dataset, we can use ANOVA to see whether there is any difference in mean price for different car makes such as Subaru and Honda. The ANOVA test returns two values: the F-test score and the p-value. The F-test calculates the ratio of the variation between the group means to the variation within each of the sample groups, and the p-value shows whether the obtained result is statistically significant.
df_anova = df[['make', 'price']]
grouped_anova = df_anova.groupby(['make'])
anova_results = stats.f_oneway(grouped_anova.get_group('honda')['price'], grouped_anova.get_group('subaru')['price'])
ANOVA results: F_onewayResult(statistic=0.197, pvalue=0.66)
Model Development
Data Science / Data Analysis with Python / Model Development
Regression Plot
Data Science / Data Analysis with Python / Model Development / Evaluation using Visualization / Regression Plot
Type plots
Residual Plot
Data Science / Data Analysis with Python / Model Development / Evaluation using Visualization / Residual Plot
Type plots

We subtract the predicted value from the target value and plot these residuals. We expect the results to have zero mean, distributed evenly around the x-axis with similar variance.
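A minimal seaborn sketch (the 'highway-mpg' and 'price' columns are assumed from the car dataset):

import seaborn as sns
import matplotlib.pyplot as plt

# seaborn fits a regression internally and plots the residuals around y = 0
sns.residplot(x=df["highway-mpg"], y=df["price"])
plt.show()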
Distribution Plot
Data Science / Data Analysis with Python / Model Development / Evaluation using Visualization / Distribution Plot
Type plots
One Dimension
Data Science / Data Analysis with Python / Model Development / Polynomial Regression / One Dimension
Many Dimensions
Data Science / Data Analysis with Python / Model Development / Polynomial Regression / Many Dimensions
Pipelines
Data Science / Data Analysis with Python / Model Development / Pipelines
There are many steps to getting a prediction, for example normalization, polynomial transform, and linear regression. We simplify the process using a pipeline: a pipeline sequentially performs a series of transformations, and the last step carries out a prediction.
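A minimal scikit-learn sketch of that exact sequence; X_train, y_train and X_test are placeholder data:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([("scale", StandardScaler()),
                 ("polynomial", PolynomialFeatures(degree=2)),
                 ("model", LinearRegression())])

pipe.fit(X_train, y_train)     # runs every transformation, then trains the model
y_pred = pipe.predict(X_test)  # the last step carries out the prediction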
R-squared/R^2/Coefficient of Determination
Data Science / Data Analysis with Python / Model Development / Evaluation / R-squared/R^2/Coefficient of Determination

A value near 1 indicates that the model fits the data well.
Model Evaluation and Refinement
Data Science / Data Analysis with Python / Model Evaluation and Refinement
Cross Validation
Data Science / Data Analysis with Python / Model Evaluation and Refinement / Cross Validation

The dataset is split into K equal groups. Each group is referred to as a fold. Some of the folds can be used as a training set which we use to train the model and the remaining parts are used as a test set, which we use to test the model. For example, we can use three folds for training, then use one fold for testing. This is repeated until each partition is used for both training and testing. At the end, we use the average results as the estimate of out-of-sample error.
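A minimal scikit-learn sketch (X and y are placeholders for the features and the target):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
scores = cross_val_score(lr, X, y, cv=4)  # 4 folds: each fold is used once for testing
print(scores.mean())                      # the average result estimates the out-of-sample error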
Underfitting
Data Science / Data Analysis with Python / Model Evaluation and Refinement / Underfitting
Where the model is too simple to fit the data. If we increase the order of the polynomial, the model fits better, but the model is still not flexible enough and exhibits underfitting.
Overfitting
Data Science / Data Analysis with Python / Model Evaluation and Refinement / Overfitting
The model is too flexible and fits the noise rather than the function.
Ridge Regression
Data Science / Data Analysis with Python / Model Evaluation and Refinement / Ridge Regression

In many cases real data has outliers. This is especially evident for higher-order polynomials. Ridge regression controls the magnitude of the polynomial coefficients by introducing the parameter alpha, a parameter we select before fitting or training the model.
If we tabulate the estimated coefficients for increasing values of alpha (the columns corresponding to the different polynomial coefficients and the rows to the different values of alpha), we see that as alpha increases the parameters get smaller. This is most evident for the higher-order polynomial features. But alpha must be selected carefully: if alpha is too large, the coefficients approach zero and underfit the data. If alpha is zero, the overfitting is evident. For alpha equal to 0.001, the overfitting begins to subside. For alpha equal to 0.01, the estimated function tracks the actual function. When alpha equals 1, we see the first signs of underfitting; the estimated function does not have enough flexibility. At alpha equal to 10, we see extreme underfitting; it does not even track the data points. In order to select alpha, we use cross validation.
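A minimal scikit-learn sketch; X_poly_train and X_poly_test are placeholders for the polynomial-transformed features, and the alpha values are just the ones discussed above:

from sklearn.linear_model import Ridge

for alpha in [0.001, 0.01, 1, 10]:
    ridge = Ridge(alpha=alpha)         # larger alpha shrinks the polynomial coefficients
    ridge.fit(X_poly_train, y_train)
    print(alpha, ridge.score(X_poly_test, y_test))  # R^2 on the test data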
Grid Search
Data Science / Data Analysis with Python / Model Evaluation and Refinement / Grid Search
Type Algorithm
Grid-search is used to find the optimal hyperparameters of a model which results in the most ‘accurate’ predictions.


A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins. For example, C in Support Vector Machines, k in k-Nearest Neighbors, or the number of hidden layers in Neural Networks.
In contrast, a parameter is an internal characteristic of the model and its value can be estimated from data. For example, the beta coefficients of linear/logistic regression or the support vectors in Support Vector Machines.
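A minimal scikit-learn sketch that searches over alpha for ridge regression (the parameter values are illustrative; X and y are placeholders):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

parameters = {"alpha": [0.001, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), parameters, cv=4)  # tries every value with 4-fold cross validation
grid.fit(X, y)
print(grid.best_estimator_)                     # the model refit with the best hyperparameter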
Data Visualization with Python
Data Science / Data Visualization with Python
Line Plots
Data Science / Data Visualization with Python / Line Plots
Type plots
Area Plots
Data Science / Data Visualization with Python / Area Plots
Type plots
Histograms
Data Science / Data Visualization with Python / Histograms
Type plots
Bar Charts
Data Science / Data Visualization with Python / Bar Charts
Type plots
Pie Charts
Data Science / Data Visualization with Python / Pie Charts
Type plots
Box Plots
Data Science / Data Visualization with Python / Box Plots
Type plots
Scatter Plots
Data Science / Data Visualization with Python / Scatter Plots
Type plots
Waffle Charts
Data Science / Data Visualization with Python / Waffle Charts
Type plots

Normally created to display progress toward goals
Word Clouds
Data Science / Data Visualization with Python / Word Clouds
Type plots

A Word cloud is a depiction of the frequency of different words in some textual data.
Regression Plots
Data Science / Data Visualization with Python / Regression Plots
Type plots
Maps with Folium
Data Science / Data Visualization with Python / Maps with Folium
Type plots
Choropleth Maps
Data Science / Data Visualization with Python / Choropleth Maps
Type plots
Regression/Estimation
Data Science / Machine Learning with Python / Techniques / Regression/Estimation
Predicting continuous values
Classification
Data Science / Machine Learning with Python / Techniques / Classification
Predicting the item class/category of a case
Associations
Data Science / Machine Learning with Python / Techniques / Associations
Associating frequent co-occurring items/events
Anomaly detection
Data Science / Machine Learning with Python / Techniques / Anomaly detection
Discovering abnormal and unusual cases
Sequence mining
Data Science / Machine Learning with Python / Techniques / Sequence mining
Predicting next events; click-stream (Markov Model, HMM)
Dimension Reduction
Data Science / Machine Learning with Python / Techniques / Dimension Reduction
Reducing the size of data
Principal Component Analysis (PCA)
Data Science / Machine Learning with Python / Techniques / Dimension Reduction / Linear / Principal Component Analysis (PCA)
Type Method

Calculating PCA
1. Centre the data
- Compute the covariance matrix
- Find the eigenvectors and eigenvalues of the covariance matrix
2. The eigenvectors become the principal components
3. The eigenvalues provide the explained variance
4. Select the new dimensions and project the data
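A minimal NumPy sketch of those steps (random data, purely for illustration):

import numpy as np

X = np.random.rand(100, 3)                    # 100 samples, 3 features
X_centred = X - X.mean(axis=0)                # 1. centre the data
cov = np.cov(X_centred, rowvar=False)         #    compute the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)      #    eigenvectors and eigenvalues
order = np.argsort(eig_vals)[::-1]            #    sort components by eigenvalue (largest first)
components = eig_vecs[:, order[:2]]           # 2. the top eigenvectors are the principal components
explained = eig_vals[order] / eig_vals.sum()  # 3. the eigenvalues give the explained variance
X_projected = X_centred @ components          # 4. project the data onto the new dimensions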
Recommendation systems
Data Science / Machine Learning with Python / Techniques / Recommendation systems
Regression
Data Science / Machine Learning with Python / Supervised Learning / Regression
Regression is the process of predicting a continuous value as opposed to predicting a categorical value in classification.
Simple Linear Regression
Data Science / Machine Learning with Python / Supervised Learning / Regression / Types / Simple Linear Regression
Simple Non-linear Regression
Data Science / Machine Learning with Python / Supervised Learning / Regression / Types / Simple Non-linear Regression
Multiple Linear Regression
Data Science / Machine Learning with Python / Supervised Learning / Regression / Types / Multiple Linear Regression
Multiple Non-linear Regression
Data Science / Machine Learning with Python / Supervised Learning / Regression / Types / Multiple Non-linear Regression
Non-linear regression models a relationship between independent variables x and a dependent variable y that results in a non-linear function of the modeled data. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree k (the maximum power of x).
Non-linear functions can have elements like exponentials, logarithms, fractions, and others.
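A minimal NumPy sketch fitting one such non-linear (cubic) relationship; the data is made up:

import numpy as np

x = np.arange(-5, 5, 0.1)
y = x**3 + 2 * x**2 + np.random.normal(size=len(x))  # a cubic relationship plus noise

coeffs = np.polyfit(x, y, deg=3)  # fit a polynomial of degree k = 3
y_hat = np.polyval(coeffs, x)     # predicted values from the fitted model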
Evaluation Metrics
Data Science / Machine Learning with Python / Supervised Learning / Regression / Evaluation Metrics
Classification
Data Science / Machine Learning with Python / Supervised Learning / Classification
Classification is the process of predicting a discrete class label, or category.
Decision Trees (ID3, C4.5, C5.0)
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Decision Trees (ID3, C4.5, C5.0)
Algorithm:
- Choose an attribute from your dataset.
- Calculate the significance of the attribute in splitting the data.
- Split the data based on the value of the best attribute.
- Go to step 1.
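A minimal scikit-learn sketch (X_train, y_train and X_test are placeholder features and labels):

from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" picks each split by information gain, i.e. the significance of the attribute
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)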
Building Decision Tree
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Decision Trees (ID3, C4.5, C5.0) / Building Decision Tree
Entropy
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Decision Trees (ID3, C4.5, C5.0) / Entropy
Random Forest
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Decision Trees (ID3, C4.5, C5.0) / Random Forest
Type Ensemble Learning Method
Bootstrap Aggregation (Bagging)
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Decision Trees (ID3, C4.5, C5.0) / Random Forest / Bootstrap Aggregation (Bagging)
Type Technique
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits trees to these samples.
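A minimal scikit-learn sketch; n_estimators plays the role of B, and X_train, y_train, X_test are placeholders:

from sklearn.ensemble import RandomForestClassifier

# Each of the 100 trees is fit on a bootstrap sample (a random sample with replacement)
forest = RandomForestClassifier(n_estimators=100, bootstrap=True)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)  # the trees' predictions are aggregated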
k-Nearest Neighbor
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / k-Nearest Neighbor
Algorithm
- Pick a value for K.
- Calculate the distance of the unknown case from all cases.
- Select the K observations in the training data that are nearest to the unknown data point.
- Predict the response of the unknown data point using the most popular response value from the K nearest neighbors.
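A minimal scikit-learn sketch of the same algorithm (K = 4 is just an example; X_train, y_train, X_test are placeholders):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=4)  # pick a value for K
knn.fit(X_train, y_train)                  # stores the training cases
y_pred = knn.predict(X_test)               # most popular response among the 4 nearest neighbors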
Logistic Regression
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Logistic Regression
Support Vector Machines (SVM)
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Support Vector Machines (SVM)
SVM is a supervised algorithm that classifies cases by finding a separator.
- Mapping data to a high-dimensional feature space
- Finding a separator


Applications:
- Image recognition
- Text category assignment
- Detecting spam
- Sentiment analysis
- Gene Expression Classification
- Regression, outlier detection and clustering
Kernelling
Data Science / Machine Learning with Python / Supervised Learning / Classification / Algorithms / Support Vector Machines (SVM) / Kernelling
Mapping data into a higher-dimensional space. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:
- Linear
- Polynomial
- Radial Basis Function (RBF)
- Sigmoid.
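A minimal scikit-learn sketch; the kernel argument selects one of the functions above (X_train, y_train, X_test are placeholders):

from sklearn import svm

clf = svm.SVC(kernel="rbf")  # could also be "linear", "poly" or "sigmoid"
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)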
Jaccard index
Data Science / Machine Learning with Python / Supervised Learning / Classification / Evaluation metrics / Jaccard index
F1 Score
Data Science / Machine Learning with Python / Supervised Learning / Classification / Evaluation metrics / F1 Score
Log Loss
Data Science / Machine Learning with Python / Supervised Learning / Classification / Evaluation metrics / Log Loss

Near 0 -> Higher accuracy
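A minimal scikit-learn sketch of the three metrics; y_test, y_pred and y_prob are placeholders for the true labels, predicted labels and predicted probabilities:

from sklearn.metrics import jaccard_score, f1_score, log_loss

print(jaccard_score(y_test, y_pred))  # 1 means the predicted and true labels overlap completely
print(f1_score(y_test, y_pred))       # harmonic mean of precision and recall
print(log_loss(y_test, y_prob))       # closer to 0 means higher accuracy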
Dimension reduction
Data Science / Machine Learning with Python / Unsupervised Learning / Dimension reduction
Dimensionality reduction, and/or feature selection, play a large role in this by reducing redundant features to make the classification easier.
Density estimation
Data Science / Machine Learning with Python / Unsupervised Learning / Density estimation
Density estimation is a very simple concept that is mostly used to explore the data to find some structure within it.
Market basket analysis
Data Science / Machine Learning with Python / Unsupervised Learning / Market basket analysis
Market basket analysis is a modeling technique based upon the theory that if you buy a certain group of items, you're more likely to buy another group of items.
Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering

Clustering is the grouping of data points or objects that are somehow similar. It is used for:
- Discovering structure
- Summarization
- Anomaly detection
Partitioned-based Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Partitioned-based Clustering
Relatively efficient
E.g. k-Means, k-Median, Fuzzy c-Means
k-Means Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Partitioned-based Clustering / k-Means Clustering
1. Initialize k centroids randomly (k = number of clusters)
2. Calculate the distance of each point from each centroid
3. Assign each data point (object) to its closest centroid, creating a cluster
4. Recalculate the position of the k centroids
5. Repeat until there are no more changes => Converged
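A minimal scikit-learn sketch (X is a placeholder feature matrix; k = 3 is just an example):

from sklearn.cluster import KMeans

k_means = KMeans(n_clusters=3, n_init=10)  # steps 2-5 are repeated internally until convergence
k_means.fit(X)
labels = k_means.labels_                   # the cluster assigned to each data point
centroids = k_means.cluster_centers_       # final positions of the k centroids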
Hierarchical Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Hierarchical Clustering
Builds a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes
Produces trees of clusters
E.g. Agglomerative, Divisive
Top-down: divisive; Bottom-Up: agglomerative
Agglomerative Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Hierarchical Clustering / Agglomerative Clustering
1. Create n clusters, one for each data point
2. Compute the proximity matrix (the distance between each pair of clusters)
3. Select the two closest clusters according to the distance between each pair of points (the distance measurement can be Euclidean, Pearson, average distance, or many others, depending on the data type and domain knowledge)
4. Merge the two closest clusters into one cluster and calculate the distance between the new cluster (the center of the two merged clusters) and the other clusters
5. Repeat steps 3 and 4 until only a single cluster remains
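A minimal scikit-learn sketch (X is a placeholder feature matrix):

from sklearn.cluster import AgglomerativeClustering

# linkage controls how the distance between two clusters is measured
agglom = AgglomerativeClustering(n_clusters=4, linkage="average")
labels = agglom.fit_predict(X)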
Distance between clusters
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Hierarchical Clustering / Distance between clusters
Density-based Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Density-based Clustering
Produces arbitrary shaped clusters
E.g. DBSCAN
DBSCAN Clustering
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Density-based Clustering / DBSCAN Clustering
Density-Based Spatial Clustering of Applications with Noise
R (radius of neighborhood): if a neighborhood of radius R includes enough points within it, we call it a dense area
M (minimum number of neighbors): the minimum number of data points we want in a neighborhood to define a cluster
Let's pick a point randomly. First, we check to see whether it's a core data point.
A data point is a core point if, within the neighborhood of the point, there are at least M points.
For example, as there are six points in the two-centimeter neighborhood of the red point, we mark this point as a core point.
A data point is a border point if (A) its neighborhood contains fewer than M data points, or (B) it is reachable from some core point.
Here, reachability means it is within R distance from a core point. It means that even though the yellow point is within the two-centimeter neighborhood of the red point, it is not by itself a core point, because it does not have at least six points in its neighborhood.
The grey point is neither a core point nor a border point, so we label it as an outlier. An outlier is a point that is not a core point and is also not close enough to be reachable from a core point. We continue and visit all the points in the dataset and label them as either core, border, or outlier.
The next step is to connect core points that are neighbors and put them in the same cluster. So a cluster is formed from at least one core point, plus all reachable core points, plus all their border points. This simply shapes all the clusters and finds the outliers as well.
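A minimal scikit-learn sketch; eps plays the role of R and min_samples the role of M (X is a placeholder feature matrix):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=6)  # example values for R and M
labels = db.fit_predict(X)           # cluster labels; outliers are labeled -1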
Spherical-shape clusters
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Density-based Clustering / Types / Spherical-shape clusters
Arbitrary-shape clusters
Data Science / Machine Learning with Python / Unsupervised Learning / Clustering / Algorithms / Density-based Clustering / Types / Arbitrary-shape clusters
Applications
Data Science / Machine Learning with Python / Recommendation System / Applications
Where to buy? E-commerce, books, movies, beer, shoes
Where to eat?
Which job to apply to?
Who you should be friends with?
Personalize your experience on the web: News platforms, news personalization
Types
Data Science / Machine Learning with Python / Recommendation System / Types
Content-Based
Data Science / Machine Learning with Python / Recommendation System / Types / Content-Based
Collaborative Filtering
Data Science / Machine Learning with Python / Recommendation System / Types / Collaborative Filtering
User-based
Data Science / Machine Learning with Python / Recommendation System / Types / Collaborative Filtering / User-based
Based on users' neighborhood
Item-based
Data Science / Machine Learning with Python / Recommendation System / Types / Collaborative Filtering / Item-based
Based on items' similarity
Data Sparsity
Data Science / Machine Learning with Python / Recommendation System / Types / Collaborative Filtering / Challenges / Data Sparsity
Users in general rate only a limited number of items
Cold start
Data Science / Machine Learning with Python / Recommendation System / Types / Collaborative Filtering / Challenges / Cold start
Difficulty in recommendation to new users or new items
Scalability
Data Science / Machine Learning with Python / Recommendation System / Types / Collaborative Filtering / Challenges / Scalability
Increase in number of users or items
Memory-based
Data Science / Machine Learning with Python / Recommendation System / Implementing / Memory-based
Uses the entire user-item dataset to generate a recommendation
Uses statistical techniques to approximate users or items, e.g. Pearson correlation, cosine similarity, Euclidean distance, etc.
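A minimal sketch of the user-based, memory-based idea with Pearson correlation (the ratings matrix is made up):

import pandas as pd

# Rows = users, columns = items, values = ratings
ratings = pd.DataFrame({"movie_a": [5, 4, 1], "movie_b": [4, 5, 2], "movie_c": [1, 2, 5]},
                       index=["user_1", "user_2", "user_3"])

# Pearson correlation between users: higher means a more similar neighborhood
similarity = ratings.T.corr(method="pearson")
print(similarity.loc["user_1"].drop("user_1").idxmax())  # the user most similar to user_1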
Model-based
Data Science / Machine Learning with Python / Recommendation System / Implementing / Model-based
Develops a model of users in an attempt to learn their preferences
Models can be created using Machine Learning techniques like regression, clustering, classification, etc.