Data Science - Journey through Life Cycles. Part 1

@Abdul Azeez Thekkekandy

4.94/5 (10 votes)

26 Nov 2019CPOL10 min read

15.7K

This article explores Data Science lifecycles - Business Understanding, Data Understanding and Data Preparation

Introduction

Everyone is talking about Data Science and its different steps and phases. This article explores Data Science lifecycles, different steps (in each lifecycle) and would be a great start for Data Scientist beginners in Data Science journey.

Data Science Lifecycle

By its simple definition, Data Science is a multi-disciplinary field that contains multiple processes to extract knowledge or useful output from Input Data. The output may be Predictive or Descriptive analysis, Report, Business Intelligence, etc. Data Science has well-defined lifecycles similar to any other projects and CRISP-DM and TDSP are some of the proven standards.

CRISP – DM: Cross Industry Process for Standard Data Mining

TDSP: Team Data Science Process by Microsoft

Let's check common lifecycles in Data Science:

Business Understanding
Data Understanding
Data Explore and Preparation
Create and Evaluate Model
Deploy Model and turn out effective output

Each lifecycle has different steps and rules to achieve the desired outcome. Multiple iterations of each lifecycle and different component in each cycle make Data Science output as more accurate. In this article, we are going to discuss the first 3 phases in Data Science lifecycles - Business Understanding, Data Understanding and Data Explore and Preparation.

Business Understanding

Business Understanding is always a key phase in any SDLC but it is more critical in Data Science lifecycle. If we misunderstood business, then we would end up with the wrong outcome or even we predicted good output but not acceptable by the customer. The main steps in this phase are:

Identify Stakeholder(s)

Stakeholders are Business Analyst/Expert people and their responsibility is clear all business query from Data Scientist in any phase of Data Science lifecycles.

Set Objective

Understand the business problem and identify whether the problem is applicable to an analytical solution or in other words, if Data science can target business problems. To achieve this, Data Scientist frames the business objective by asking relevant and sharp question to stakeholders. Please find this blog for some relevant questions to be asked to stakeholders.

After this step, Data Scientist summarizes:

Proper business requirement and define requirement like - How to increase business profit from 50 to 100 / How to prevent Customer Churn rate, etc.
Identity Data Science problem type from the Business requirement and find some Data Science Problem Types.

Data Science Problem Type

Predictive Analysis	What will happen in the future?
Descriptive Analysis	What is happening in the past and now?
Prescriptive Analysis	What should be done to enhance or prevent current or future happening?

Identify and Define Target Variable

Each Data science project contains either Supervised Or Unsupervised learning data.

Supervised learning – In Supervised learning, Data Scientist identifies input and output(target) feature. There are 2 types of Supervised learning.

Classification Problem
1. Binary Classification - Identified target feature value is either 0 or 1. Examples – whether people survived or not, is this email Spam or not, etc.
2. Category Classification - Identified target feature value contains multiple discrete values. Examples - How Products are tagged by different category.
Regression Problem – Identified target feature values is a continuous type. Examples - Expected mileage for the different type of Cars.

Unsupervised learning – Learn data without being given the correct output feature and mainly focus on the grouping of data or Clustering of Data. Examples - Identify human behavior from video visuals.

Feature and Observation (Columns and Rows)

Data Science Project Execution Plan

Use Project Management Tool like VSTS, JIRA, etc. and create a project execution plan and track each milestone and deliverable in different stages of Data Science lifecycles.

Data Understanding

The main steps in this phase are:

Collect Proper Data Set

Data Scientist collects proper data set that covers all business objectives and ensures Data Sets have required input features that answer all business questions. Data might be stored in a CSV file, database or in different formats and storage media. We can access either entire data and download from its source or through data streaming using a secured API.

Setup Environment for Data

Set up Data hosting environment after Data Scientist collect Data Set. This environment might be either in Local Computer, Cloud and On-premise, etc. Example of Cloud environment is - Azure Blob Storage, Azure SQL Database.

Also, sometimes Data could be in an unstructured format and it has to be converted as structured format before analyzing the same.

Please find Azure environments.

Azure Notebooks - Free subscription and supports multiple target environments and URL is - https://notebooks.azure.com/
Data Science Virtual Machine (DSVM) / Deep Learning Virtual Machine (IaaS solution) - Customized virtual machines that contain different tools preconfigures and preinstalled on Azure
Cloud-based Notebook VM
ML Studio - UI based tool from Microsoft - https://studio.azureml.net/
ML Service - PaaS solution

Setup Tools and Package

Setup Tools and Install Packages for data processing. Please find some Tools used in DS - Python, R, Azure Machine Language Learning Environment, SQL and RapidMiner, TensorFlow, etc. There are multiple packages available in each tool to process, manipulate and visualize the data. Panda and Numpy are some of the main packages with Python.

Identity Feature Category

Each feature in Data Set is broadly divided into either Categorical (for string type) or Continuous (for numeric type) Type. Categorical type is further divided into - Ordinal and Binary.

Feature Type	Description
Categorical (Or Nominal)	One or more Category but not quantitative values. Examples - Color with values Red, Green, Yellow, etc. and no numeric significance compare to others.
Ordinal	One or more categories but they are quantifiable compared to others. Examples - Bug Priority with values Low, Medium, High and Critical and these values have numerical significance compared to others.
Binary	Value of this feature should be either 1 or 0. Examples - Male or Female
Continuous	Value between negative infinity to positive infinity. Example: Age

Also, there are some other feature types like - Interval, Image, Text, Audio, and Video.

Data Explore and Preparation

The main steps in this phase are:

Data Explore

In this phase, Data Scientist familiarizes each feature in Data Set that includes: Identifies feature type like Categorical, Continuous, etc. How data spread and their distribution, identify if there is any relationship between two features, etc.

Data Scientist uses a different visualization tool to explore the Data and this step helps to fix - missing value, extreme value (outlier) and noisy value in Data preparation phase and determine proper scaling factor in model creation step as well. Here, we are mainly focused on two feature type - Categorical (assume for all string type) and Continuous (for numeric type). Let's discuss different approaches in Data exploring.

Data Exploring - Continuous Feature Type

Check Centrality Measure Data - Best value that summarizes the measurement of the specific feature
Check Dispersion Measure - How Data are spreading and its distribution of the specific feature

Mean	Average value but affected by extreme values
Median	Central value of Sorted List and it is not affected by the extreme values and of course Median is best centrality measure if data having extreme values.

Check Centrality Measure Data

Range	Difference between maximum and minimum values and if the difference is a small number, then Data packed closely otherwise data packed widely.
Box-Whisker Plot	This chart shows how data spread and represented with different Percentiles such as 25%, 50%,75%, minimum, and maximum. Also this chart show any lower and higher outlier values. Examples: For Age feature with 1000 values and if 25% is 32 meant, 25% values are below 32.
Variance	Represents how far each value is from the mean. Small variance meant Data closely packed otherwise widely packed.
Histogram	This Graph represents data spread across different bin and this would help to identify outlier and missing value. From Histogram, we can identify – Data Normal Distribution (skewness =0), Positively skewed distribution and Negatively skewed distribution.

Dispersion Measure

Data Exploring - Categorical Feature Type

Following options are used are for Categorical feature type data explore.

Total count for category
Unique count for category
BAR chart used to show individual category status

Also, Scatter Plot, Grouping, Cross Tab and Pivot Table and other statistical function can be used to explore both continuous and Categorical Data. There are plenty of frameworks and packages available in each Data Science language to practice any of the above-explained methods. Scatter plot (check the relation between two features) and line chart (for Time series data) are other options for Data explore and mainly used in Advance Data Analysis.

IDEAR and AMAR are custom build utilities for data exploration and these utilities provide clear Data Insight report for each feature based on its type.

Data Preparation

Based on Data Explore, some abnormal behavior can be identified like – Missing Data, Extreme Values (outlier) and Noisy data. These behaviors may impact the accuracy of Data science output and recommended to fix it before creating a model.

Remove Unwanted Feature - Remove features which make no impact in Data Analysis like Name, Roll Number, etc.
Fix missing values
- Remove row (observation) which contain the missing feature
- Imputation
  - Replace the missing value with a possible value using Mean, Median or Range
  - Replace the missing value with Dummy value
  - Replace with most frequent value and this would be more applicable for the category feature type
- Forward or Backward fill especially for Time series data
Fix Outlier (Extreme value) - Outlier value can be identified by Histogram, Box Plot or Scatter plot
- Remove row (observation) which contain the missing feature
- Imputation
  - Replace the missing value with a possible value using Mean, Median or Range
  - Replace the missing value with Dummy value
  - Replace with most frequent value and this would be more applicable for the category feature type
- Binning - Create discrete category from the continuous feature and this would help to place outlier value in any of the bins.
Fix Erroneous or noisy Values - Use the same approach of outlier fix
Text cleaning - Clean character for Enter, Tab in text data.

In some cases, missing or outlier value cannot be fixed and so Data Scientist creates two models - one with excluding missing data and outlier and other with all data and takes an average of both result as an output.

Feature Engineering

Process of transforming data to another better understanding feature to create better Predictive model. Feature Engineering is a crucial step in any of the Data Science journey and cannot implement without proper Domain knowledge.

Examples: Derive new feature called IsAdult from the Age feature, like IsAdult = 1 if Age > 18 else 0.

Categorical Feature Encoding

Most of the Machine learning algorithms work on numeric data and not Categorical Data. So Data Scientist needs to convert Categorical Type feature to Continuous feature.

The following methods can be used to convert continuous feature from a categorical feature based on feature Type.

Binary Encoding - Use if Category feature has only 2 values like - Male/Female, In this case, Data Scientist creates a new feature like Is_Male with value 0 or 1 accordingly.

Label Encoding - Use if the Category value type is Ordinal Type, i.e., feature values have clear intrinsic sort order. Examples: Software Bug severity - Low, Medium, High and Critical. Here, Data Scientist creates 1 to 4 for Low to Critical values.

One Hot Encoding - Used for Nominal Category Type. Examples - Color with values Red, Blue, and Green. And here, we have to create 3 features like - color_red, color_blue and color_green and fill values appropriately.

So far, we have covered 3 important lifecycles of Data Science - Business Understanding, Data Understanding and Data Preparation and now the Data are ready for model creation. I will discuss the other two lifecycles in the next article and requesting your valuable feedback.

Points of Interest

Familiarizing Data Science lingo like - Supervised learning, Feature Engineering, etc. is one of the most important aspects in Data Science journey.

In this article, I have covered most of the definitions as part of the first 3 lifecycles but recommended to read some other definitions such as Data Correlation, Data Wrangling, Data Factorization and Normalization and advanced statistical functions, etc.

Thank you for reading this article.

History

29^th April, 2019: First version
26^th November, 2019: Added Azure environments

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)