Quick Start

Last updated: September 20th, 2020

Data Science

Data Science is a process, not an event. It is the process of using data to understand different things, to understand the world. For me, it is when you have a model or hypothesis about a problem and you try to validate that hypothesis or model with your data. Data science is the art of uncovering the insights and trends that are hiding behind data.

Big Data

Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value. The V's of Big Data: velocity, volume, variety, veracity, and value.

Hadoop

Hadoop takes the data, slices it into pieces, replicates (or triplicates) each piece, and distributes the pieces of these files to thousands of computers (first it was hundreds, then thousands, now tens of thousands). It then sends the same program to every computer in the cluster. Each computer runs the program on its little piece of the file and sends the results back. The results are sorted, and those results are then redistributed to another process. The first process is called a map (or mapper) process and the second one is called a reduce process. The nice thing about these big data clusters is that they scale linearly: with twice as many servers you get twice the performance and can handle twice the amount of data.
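
A minimal, in-memory Python sketch of the map and reduce idea (word counting over pieces of a file); this only illustrates the concept and is not Hadoop's actual API:

from collections import defaultdict

chunks = ["big data is big", "data science uses big data"]  # pieces of a file sent to different workers

def mapper(chunk):
    # each worker emits (word, 1) pairs for its own piece of the file
    return [(word, 1) for word in chunk.split()]

def reducer(pairs):
    # the reduce step groups the emitted pairs by key and sums them
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

mapped = [pair for chunk in chunks for pair in mapper(chunk)]
print(reducer(mapped))  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}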

Python

1. Python is a high-level general-purpose programming language that can be applied to many different classes of problems.
2. It has a large, standard library that provides tools suited to many different tasks, including but not limited to databases, automation, web scraping, text processing, image processing, machine learning, and data analytics.
3. For data science, you can use Python's scientific computing libraries such as Pandas, NumPy, SciPy, and Matplotlib.
4. For artificial intelligence, it has TensorFlow, PyTorch, Keras, and Scikit-learn.
5. Python can also be used for Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK).

R

R is most often used by statisticians, mathematicians, and data miners for developing statistical software, graphing, and data analysis. The language’s array-oriented syntax makes it easier to translate from math to code.

Data Science Tools

Data Management Tools

Relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the Hadoop File System or cloud file systems like Ceph. Elasticsearch is mainly used for storing text data and creating a search index for fast document retrieval.

Data Integration And Transformation

Apache Airflow, originally created by Airbnb; Kubeflow, which enables you to execute data science pipelines on top of Kubernetes; Apache Kafka, which originated at LinkedIn; Apache NiFi, which delivers a very nice visual editor; Apache SparkSQL (which enables you to use ANSI SQL and scales up to compute clusters of thousands of nodes); and Node-RED, which also provides a visual editor. Node-RED consumes so few resources that it even runs on small devices like a Raspberry Pi.

Data Visualization

Hue, which can create visualizations from SQL queries. Kibana, a data exploration and visualization web application, is limited to Elasticsearch (the data provider). Apache Superset is a data exploration and visualization web application.

Model Deployment

Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap. Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache SparkML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy SparkML models is by using MLeap. Finally, TensorFlow can serve any of its models using TensorFlow Serving. You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js.

Model Monitoring and Assessment

ModelDB is a machine learning model metadata database where information about the models is stored and can be queried. It natively supports Apache Spark ML Pipelines and scikit-learn. A generic, multi-purpose tool called Prometheus is also widely used for machine learning model monitoring, although it's not specifically made for this purpose. Model performance is not exclusively measured through accuracy. Model bias against protected groups like gender or race is also important. The IBM AI Fairness 360 open source toolkit does exactly this: it detects and mitigates bias in machine learning models. Machine learning models, especially neural-network-based deep learning models, can be subject to adversarial attacks, where an attacker tries to fool the model with manipulated data or by manipulating the model itself. The IBM Adversarial Robustness 360 Toolbox can be used to detect vulnerability to adversarial attacks and help make the model more robust. Machine learning models are often considered to be a black box that applies some mysterious "magic." The IBM AI Explainability 360 Toolkit makes the machine learning process more understandable by finding similar examples within a dataset that can be presented to a user for manual comparison. It can also illustrate training of a simpler machine learning model by explaining how different input variables affect its final decision.

Code Asset Management

Git is now the standard. Multiple services have emerged to support Git, with the most prominent being GitHub, which provides hosting for software development version management. The runner-up is definitely GitLab, which has the advantage of being a fully open source platform that you can host and manage yourself. Another choice is Bitbucket.

Data Asset Management

Data asset management, also known as data governance or data lineage, is another crucial part of enterprise grade data science. Data has to be versioned and annotated with metadata. Apache Atlas is a tool that supports this task. Another interesting project, ODPi Egeria, is managed through the Linux Foundation and is an open ecosystem. It offers a set of open APIs, types, and interchange protocols that metadata repositories use to share and exchange data. Finally, Kylo is an open source data lake management software platform that provides extensive support for a wide range of data asset management tasks.

Development Environments

Execution Environments

Sometimes your data doesn’t fit into a single computer’s storage or main memory capacity. That’s where cluster execution environments come in. The well-known cluster-computing framework Apache Spark is among the most active Apache projects and is used across all industries, including in many Fortune 500 companies. The key property of Apache Spark is linear scalability. This means that if you double the number of servers in a cluster, you’ll also roughly double its performance. After Apache Spark began to gain market share, Apache Flink was created. The key difference between them is that Apache Spark is a batch data processing engine, capable of processing huge amounts of data file by file, whereas Apache Flink is a stream processing engine, with its main focus on processing real-time data streams. Although both engines support both data processing paradigms, Apache Spark is usually the choice in most use cases. One of the latest developments in data science execution environments is called “Ray,” which has a clear focus on large-scale deep learning model training.

Fully Integrated Visual Tools

The most important tasks are supported by these tools; these tasks include data integration and transformation, data visualization, and model building. KNIME originated at the University of Konstanz in 2004. KNIME has a visual user interface with drag-and-drop capabilities and built-in visualization capabilities. It can be extended by programming in R and Python, and has connectors to Apache Spark. Another example of this group of tools is Orange. It’s less flexible than KNIME, but easier to use.

Data Management Commercial Tools

Data Integration And Transformation Commercial Tools

Data Visualization Commercial Tools

Model Building Commercial Tools

Model Deployment Commercial Tools

Data Asset Management Commercial Tools

Development Environments Commercial Tools

Execution Environments Commercial Tools

IBM Watson Studio Desktop

Commercial Fully Integrated Visual Tools

Cloud Based Fully Integrated Visual Tools

IBM Watson Studio
IBM Watson OpenScale
Azure Machine Learning
H2O Driverless AI

Data Management Cloud Based Tools

Amazon DynamoDB
Cloudant
Apache CouchDB
IBM Db2

Data Integration And Transformation Cloud Based Tools

Informatica
IBM Data Refinery

Data Visualization Cloud Based Tools

Datameer
IBM Cognos Analytics

Model Building Cloud Based Tools

IBM Watson Machine Learning
Google AI Platform Training

Model Deployment Cloud Based Tools

IBM SPSS Collaboration and Deployment Services
IBM Watson Machine Learning

Model Monitoring and Assessment Cloud Tools

Amazon SageMaker Model Monitor
IBM Watson OpenScale

Pandas (Data structures & tools)

Pandas offers data structures and tools for effective data cleaning, manipulation, and analysis. It provides tools to work with different types of data. The primary instrument of Pandas is a two-dimensional table consisting of columns and rows. This table is called a “DataFrame” and is designed to provide easy indexing so you can work with your data.
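
A minimal sketch of creating and indexing a DataFrame (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"make": ["honda", "subaru"], "price": [13950, 17450]})
print(df.head())           # first rows of the table
print(df["price"].mean())  # easy column indexing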

Numpy (Arrays & matrices)

The NumPy library is based on arrays, enabling you to apply mathematical functions to those arrays. Pandas is actually built on top of NumPy. Data visualization methods are a great way to communicate with others and show the meaningful results of an analysis.

SciPy (Integrals, solving differential equations, optimization)

SciPy includes functions for some advanced math problems, such as integrals, solving differential equations, and optimization, as well as data visualization.

Open data

Where to find open data
Open data portal list from around the world
- http://datacatalogs.org/
Governmental, intergovernmental and organization websites
- http://data.un.org/ (United Nations)
- https://www.data.gov/ (USA)
- https://www.europeandataportal.eu/en/ (Europe)
Kaggle
- https://www.kaggle.com/datasets
Google data set search
- https://datasetsearch.research.google.com/

Community Data License Agreement

http://cdla.io

Methodology

1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modeling to Evaluation
5. From Deployment to Feedback

Importing Datasets

Data Wrangling

Missing values

Drop the missing values
- drop the variable
- drop the data entry

Replace the missing values
- replace with an average (of similar datapoints)
- replace by frequency
- replace based on other functions

Leave it as missing data
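
A minimal sketch of the drop and replace options with pandas, assuming a DataFrame df whose 'horsepower' column contains missing values:

# Option 1: drop the data entries where 'horsepower' is missing
df_dropped = df.dropna(subset=["horsepower"], axis=0)

# Option 2: replace the missing values with the column average
mean_hp = df["horsepower"].mean()
df["horsepower"] = df["horsepower"].fillna(mean_hp)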

Data Normalization

Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling the variable so its average is 0, scaling the variable so its variance is 1, or scaling the variable so its values range from 0 to 1.
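
A minimal sketch of three common normalizations in pandas, assuming a numeric column 'length'; in practice you would pick one of them:

# Simple feature scaling: divide by the maximum value
df["length"] = df["length"] / df["length"].max()
# Min-max: values end up ranging from 0 to 1
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())
# Z-score: the average becomes 0 and the variance 1
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()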

Binning

Grouping of values into "bins". Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.

Example:
In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288 with 57 unique values. What if we only care about the price difference between cars with high, medium, and low horsepower (3 types)? Can we rearrange them into three 'bins' to simplify the analysis?
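
A minimal sketch of this binning with pandas, assuming df has the 'horsepower' column described above:

import numpy as np
import pandas as pd

bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)  # 3 bins need 4 edges
group_names = ["Low", "Medium", "High"]
df["horsepower-binned"] = pd.cut(df["horsepower"], bins, labels=group_names, include_lowest=True)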

One-hot encoding

Turn categorical variables into quantitative variables.
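
A minimal sketch with pandas get_dummies, assuming a categorical column 'fuel-type' (the column name is just an example):

import pandas as pd

dummies = pd.get_dummies(df["fuel-type"])  # one 0/1 column per category
df = pd.concat([df, dummies], axis=1)
df = df.drop("fuel-type", axis=1)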

Exploratory Data Analysis

Descriptive Statistics

df.describe(): count, mean, std, min, 25%, 50%, 75%, max
value_counts(): drive_wheels_counts=df["drive-wheels"].value_counts().to_frame()

Box Plot

Type plots


sns.boxplot(x="drive-wheels", y="price", data=df)

Scatter Plot

Type plots

GroupBy

df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()

Pivot

One variable displayed along the columns and the other variable displayed along the rows
Eg: drive wheels displayed along the columns and body style displayed along the rows
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')

Heatmap

Type plots
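
A minimal sketch of a heatmap built from the pivot table above (df_pivot), using matplotlib's pcolor:

import matplotlib.pyplot as plt

plt.pcolor(df_pivot, cmap="RdBu")  # color encodes the mean price for each drive-wheels/body-style cell
plt.colorbar()
plt.show()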

Correlation

A statistical metric for measuring the extent to which different variables are interdependent; in other words, if one variable changes, how does this affect the other variable?

Pearson Correlation

Measure the strength of the correlation between two features
- Correlation coefficient
- P-value

Correlation coefficient
- Close to +1: Large positive relationship
- Close to -1: Large negative relationship
- Close to 0: No relationship

P-value
- < 0.001 Strong certainty in the result
- < 0.05 Moderate certainty in the result
- < 0.1 Weak certainty in the result
- >0.1 No certainty in the result

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

ANOVA (Analysis of Variance)

ANOVA can be used to find the correlation between different groups of a categorical variable. Using the car dataset, we can use ANOVA to see whether there is any difference in mean price between different car makes, such as Subaru and Honda. The ANOVA test returns two values, the F-test score and the p-value. The F-test calculates the ratio of the variation between the group means to the variation within each of the sample groups. The p-value shows whether the obtained result is statistically significant.
from scipy import stats
df_anova = df[['make', 'price']]
grouped_anova = df_anova.groupby(['make'])
anova_results = stats.f_oneway(grouped_anova.get_group('honda')['price'], grouped_anova.get_group('subaru')['price'])
# Example output: F_onewayResult(statistic=0.197, pvalue=0.66)

Model Development

Regression Plot

Type plots

Residual Plot

Type plots

A residual is obtained by subtracting the predicted value from the target (actual) value; the residuals are then plotted. We expect the residuals to have zero mean and to be distributed evenly around the x-axis with similar variance.

Distribution Plot

Type plots

One Dimension

Many Dimensions

Pipelines

There are many steps to getting a prediction, for example normalization, a polynomial transform, and linear regression. We can simplify the process using a pipeline. A pipeline sequentially performs a series of transformations; the last step carries out a prediction.
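
A minimal sketch of such a pipeline with scikit-learn, assuming a feature matrix X and a target y (e.g. price) have already been prepared:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scale", StandardScaler()),                  # normalization
    ("polynomial", PolynomialFeatures(degree=2)), # polynomial transform
    ("model", LinearRegression()),                # the last step carries out the prediction
])
pipe.fit(X, y)
y_pred = pipe.predict(X)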

R-squared/R^2/Coefficient of Determination

The closer R^2 is to 1, the better the model fits the data.

Model Evaluation and Refinement

Cross Validation


The dataset is split into K equal groups. Each group is referred to as a fold. Some of the folds can be used as a training set which we use to train the model and the remaining parts are used as a test set, which we use to test the model. For example, we can use three folds for training, then use one fold for testing. This is repeated until each partition is used for both training and testing. At the end, we use the average results as the estimate of out-of-sample error.
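
A minimal sketch with scikit-learn's cross_val_score, assuming X and y as before; the returned array holds one score per fold:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
scores = cross_val_score(lr, X, y, cv=4)  # 4 folds; each fold is used once as the test set
print(scores.mean())                      # average R^2 as the out-of-sample estimate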

Underfitting

Where the model is too simple to fit the data. If we increase the order of the polynomial, the model fits better, but the model is still not flexible enough and exhibits underfitting.

Overfitting

The model is too flexible and fits the noise rather than the function.

Ridge Regression

In many cases real data has outliers. This is especially evident for the higher-order polynomials. Ridge regression controls the magnitude of these polynomial coefficients by introducing the parameter alpha. Alpha is a parameter we select before fitting or training the model.
As alpha increases, the parameters get smaller. This is most evident for the higher-order polynomial features. But alpha must be selected carefully: if alpha is too large, the coefficients approach zero and underfit the data. If alpha is zero, overfitting is evident. For alpha equal to 0.001, the overfitting begins to subside. For alpha equal to 0.01, the estimated function tracks the actual function. When alpha equals 1, we see the first signs of underfitting; the estimated function does not have enough flexibility. At alpha equal to 10, we see extreme underfitting; it does not even track the data points. To select alpha, we use cross validation.
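
A minimal sketch of fitting a ridge model with scikit-learn, assuming X holds the (polynomial) features and y the target:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.1)  # alpha chosen beforehand, ideally via cross validation
ridge.fit(X, y)
y_pred = ridge.predict(X)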

Data Visualization with Python

Line Plots

Type plots

Area Plots

Type plots

Histograms

Type plots

Bar Charts

Type plots

Pie Charts

Type plots

Box Plots

Type plots

Scatter Plots

Type plots

Waffle Charts

Type plots

Normally created to display progress toward goals

Word Clouds

Type plots

A Word cloud is a depiction of the frequency of different words in some textual data.

Regression Plots

Type plots

Maps with Folium

Type plots

Choropleth Maps

Type plots

Regression/Estimation

Predicting continuous values

Classification

Predicting the item class/category of a case

Associations

Associating frequent co-occurring items/events

Anomaly detection

Discovering abnormal and unusual cases

Sequence mining

Predicting next events; click-stream (Markov Model, HMM)

Dimension Reduction

Reducing the size of data

Principle Component Analysis (PCA)

Type Method


Calculating PCA:
1. Centre the data
2. Compute the covariance matrix
3. Find the eigenvectors and eigenvalues of the covariance matrix
4. The eigenvectors become the principal components
5. The eigenvalues provide the explained variance
6. Select the new dimensions and project the data
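
A minimal sketch with scikit-learn's PCA, assuming a feature matrix X; the centring and projection steps above are handled internally:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)             # keep two principal components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance explained by each component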

Recommendation systems

Recommending items

Regression

Regression is the process of predicting a continuous value as opposed to predicting a categorical value in classification.

Simple Linear Regression

Simple Non-linear Regression

Multiple Linear Regression

Multiple Non-linear Regression


A non-linear regression models the relationship between independent variables x and a dependent variable y with a non-linear function. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree k (the maximum power of x).
Non-linear functions can have elements like exponentials, logarithms, fractions, and others.
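
A minimal sketch of fitting a non-linear (exponential) model with scipy's curve_fit on synthetic data, purely for illustration:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)  # an example non-linear relationship

x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(0.8 * x) + np.random.normal(0, 1, 50)  # made-up data
params, _ = curve_fit(model, x, y)
print(params)  # estimated a and b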

Evaluation Metrics

Classification

Classification is the process of predicting a discrete class label, or category.

Decision Trees (ID3, C4.5, C5.0)

Algorithm:
- Choose an attribute from your dataset.
- Calculate the significance of the attribute in splitting the data.
- Split the data based on the value of the best attribute.
- Go to step 1.
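
A minimal sketch with scikit-learn's DecisionTreeClassifier, assuming X_train/X_test and y_train/y_test have already been prepared (e.g. with train_test_split):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)  # entropy as the significance measure
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)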

Building Decision Tree

Entropy

Information Gain


Information gain is the increase in the level of certainty (the reduction in entropy) obtained by splitting the data on an attribute
Information Gain = (Entropy before split) - (weighted entropy after split)
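
A minimal sketch of computing entropy and information gain for a single binary split; the labels are made up for illustration:

import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    # (entropy before split) - (weighted entropy after split)
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
print(information_gain(parent, ["yes", "yes", "yes"], ["no", "no", "no"]))  # perfect split -> 1.0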

Random Forest

Type Ensemble Learning Method


Bootstrap Aggregation (Bagging)

Type Technique

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits trees to these samples.
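
A minimal sketch with scikit-learn's RandomForestClassifier, which applies bagging to decision trees internally (X_train, y_train, X_test assumed as before):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)  # 100 trees, each fit on a bootstrap sample
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)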

k-Nearest Neighbor


Algorithm
- Pick a value for K
- Calculate the distance of unknown case from all cases.
- Select the K-observations in the training data that are nearest to the unknown data point
- Predict the response of the unknown data point using the most popular response value from the K-nearest neighbors.
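
A minimal sketch with scikit-learn's KNeighborsClassifier, assuming prepared training and test sets:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=4)  # K = 4
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)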

Logistic Regression

Support Vector Machines (SVM)

SVM is a supervised algorithm that classifies cases by finding a separator.
- Mapping data to a high-dimensional feature space
- Finding a separator


Applications:
- Image recognition
- Text category assignment
- Detecting spam
- Sentiment analysis
- Gene Expression Classification
- Regression, outlier detection and clustering

Kernelling

Mapping data into a higher-dimensional space. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:
- Linear
- Polynomial
- Radial Basis Function (RBF)
- Sigmoid.
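
A minimal sketch with scikit-learn's SVC, where the kernel function is selected through the kernel parameter (training and test sets assumed as before):

from sklearn.svm import SVC

svm = SVC(kernel="rbf")  # 'linear', 'poly' and 'sigmoid' are also available
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)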

Jaccard index

F1 Score

Log Loss


Near 0 -> Higher accuracy
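
A minimal sketch of these metrics with scikit-learn, assuming binary labels y_test, predictions y_pred, and predicted probabilities y_prob from a fitted classifier:

from sklearn.metrics import jaccard_score, f1_score, log_loss

print(jaccard_score(y_test, y_pred, pos_label=1))    # 1.0 means predicted and true label sets overlap perfectly
print(f1_score(y_test, y_pred, average="weighted"))
print(log_loss(y_test, y_prob))                      # near 0 -> higher accuracy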

Dimension reduction

Dimensionality reduction and/or feature selection play a large role here by reducing redundant features to make classification easier.

Density estimation

Density estimation is a very simple concept that is mostly used to explore the data to find some structure within it.

Market basket analysis

Market basket analysis is a modeling technique based upon the theory that if you buy a certain group of items, you're more likely to buy another group of items.

Clustering


Clustering is the grouping of data points or objects that are somehow similar. It is used for:
- Discovering structure
- Summarization
- Anomaly detection

Partitioned-based Clustering

Relatively efficient
E.g. k-Means, k-Median, Fuzzy c-Means

k-Means Clustering

1. Randomly initialize k centroids (k = number of clusters)
2. Calculate the distance of each point from each centroid
3. Assign each data point (object) to its closest centroid, creating a cluster
4. Recalculate the position of the k centroids
5. Repeat until there are no more changes => Converged
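
A minimal sketch with scikit-learn's KMeans, assuming a feature matrix X:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10)
kmeans.fit(X)
labels = kmeans.labels_              # cluster assignment for each data point
centroids = kmeans.cluster_centers_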

Hierarchical Clustering

Builds a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes
Produces trees of clusters
E.g. Agglomerative, Divisive

Top-down: divisive; Bottom-Up: agglomerative

Agglomerative Clustering

1. Create n clusters, one for each data point
2. Compute the proximity matrix (the distance between each pair of clusters)
3. Select the two closest clusters according to the distance between each pair of points (the distance measurement can be Euclidean, Pearson, average distance, or many others, depending on the data type and domain knowledge)
4. Merge the two closest clusters into one cluster and calculate the distance between the new cluster (the center of the two clusters) and the other clusters
5. Repeat steps 3 and 4 until only a single cluster remains
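
A minimal sketch with scikit-learn's AgglomerativeClustering, assuming a feature matrix X and using average linkage as the distance between clusters:

from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=4, linkage="average")
labels = agglom.fit_predict(X)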

Distance between clusters

Density-based Clustering

Produces arbitrary shaped clusters
E.g. DBSCAN

DBSCAN Clustering

Density-Based Spatial Clustering of Applications with Noise
R (radius of neighborhood): if a neighborhood of radius R includes enough points, we call it a dense area
M (minimum number of neighbors): the minimum number of data points we want in a neighborhood in order to define a cluster

Pick a point at random and first check whether it is a core data point. A data point is a core point if its neighborhood contains at least M points. For example, if there are six points within the two-centimeter neighborhood of a point (with M = 6), we mark that point as a core point.
A data point is a border point if its neighborhood contains fewer than M data points but it is reachable from some core point. Here, reachability means it is within R of a core point. In other words, a point can lie within the two-centimeter neighborhood of a core point and still not be a core point itself, because it does not have at least six points in its own neighborhood.
A point that is neither a core point nor a border point is labelled an outlier: a point that is not a core point and is also not close enough to be reachable from a core point. We continue and visit all the points in the dataset, labelling each as either core, border, or outlier.
The next step is to connect core points that are neighbors and put them in the same cluster. A cluster is formed from at least one core point, plus all reachable core points, plus all their border points. This simply shapes all the clusters and finds the outliers as well.
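
A minimal sketch with scikit-learn's DBSCAN, assuming a feature matrix X; eps plays the role of R and min_samples the role of M:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=6)
labels = db.fit_predict(X)  # points labelled -1 are outliers; the rest form the clusters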

Spherical-shape clusters

Arbitrary-shape clusters

Applications

Where to buy? E-commerce, books, movies, beer, shoes
Where to eat?
Which job to apply to?
Who should you be friends with?
Personalize your experience on the web: News platforms, news personalization

Types

Content-Based

Collaborative Filtering

User-based

Based on users' neighborhood

Item-based

Based on items' similarity

Data Sparsity

Users in general rate only a limited number of items

Cold start

Difficulty in recommendation to new users or new items

Scalability

Increase in number of users or items

Memory-based

Uses the entire user-item dataset to generate a recommendation
Uses statistical techniques to approximate users or items e.g., Pearson Correlation, Cosine Similarity, Euclidean Distance, etc.

Model-based

Develops a model of users in an attempt to learn their preferences
Models can be created using Machine Learning techniques like regression, clustering, classification, etc.
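
A minimal memory-based sketch: computing the Pearson correlation between users' rating vectors with pandas (the ratings matrix is made up for illustration):

import pandas as pd

# rows: users, columns: items, values: ratings (NaN = not rated)
ratings = pd.DataFrame(
    {"movie_a": [5, 4, None], "movie_b": [3, None, 4], "movie_c": [4, 5, 2]},
    index=["user1", "user2", "user3"],
)
user_similarity = ratings.T.corr(method="pearson")  # pairwise similarity between users
print(user_similarity)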