1. Why data matters

  1. Recognize why data is essential in business.
  2. Understand what data science is and draw conclusions about how it can be used to gain a competitive advantage.
  3. Understand the main components of using data to accelerate business success: an introduction to big data.
  4. Classify the different uses and applications of big data.
  5. Understand why data quality matters: data cleaning, common issues with data, and the role of humans in the process.
  6. Recognize machine learning as a subset of AI and how it can help extract meaningful insights from data.

Main components in Big Data and ML in Business.

Why is data essential for all?

What is big data, skills required, main differences between small and large datasets, data quality

What are the pitfalls in storing and processing data?

A holistic view of ML, Big Data, and Data Science in business terms.

The uses of ML and Big Data in business.

The terminology used in Big Data and ML.

Identify when data can add value as a solution.

The rising value of data

Data have become a strategic element of organizations. Treat data as a key business strategy rather than as a purely technical matter.

Data science can add value in various ways, especially in three key areas.

First, it can help improve operational performance, allowing a company to do more with less: extracting the greatest possible value from the available resources by rationalizing costs and thus becoming more efficient.

Second, it can be used to create additional sources of income. Many companies have created new products based on data analytics or have monetized their data by selling it to third parties.

Third, if a company is capable of adequately investing time, effort, and resources in data science, it can obtain additional competitive advantages through the implementation of ML algorithms, which become ever more efficient as they learn from the data.

"There is no longer any doubt that data science has the potential to create value and competitive advantages"

Operational efficiency

There are two possible approaches to improving operational efficiency: reducing costs or increasing the income generated by the company's core business. Data science can be harnessed to find the optimal combination of the two and adjust it in response to external conditions. Such an approach can increase the company's sales volume while keeping the marketing department's budget under control.

Example: UPS

This shipping company has used data science and advanced geographical algorithms to optimize its delivery routes and carry out predictive fleet maintenance.

Business transformation

A company's journey in data science begins with monitoring. Once it has taken that step, it is capable of extracting knowledge, which it can then incorporate into its operations. The most advanced companies end up transforming their entire business in order to sustain a data-based competitive advantage.

The trend, therefore, is to collect, structure, standardize, and transform data, to choose the most appropriate model, and most importantly, to design visualizations that facilitate more effective and valuable decision-making.

2. WHAT IS DATA SCIENCE AND HOW CAN IT HELP GAIN A COMPETITIVE ADVANTAGE?

Data Science: The foundations

Definition 1: Data science is an umbrella of techniques used to extract insights and information from data.

Definition 2: Data science is the transformation of data using mathematics and statistics into valuable insights, decisions, and products.

It is the union of math and statistics, business knowledge, and computer science.

Fundamental questions about decisions:

What data do I have to make a decision?

What do I do with this data?

There are two similar concepts, Business Intelligence (BI) and Data Analytics (DA); the difference lies in the fact that BI helps in making business decisions based on past results, while DA helps in making predictions about the future.

Business Intelligence, moving from data to action

Information becomes knowledge, and then strategy.

  • Data: How many? How much? What are they?
  • Information: Why is it so?
  • Knowledge: How do we do it? What can we do?
  • Therefore, a plan to guide actions or key decisions.

Business Analytics


Why is now, and not before, the time of Data and Business Analytics?

  1. Data explosion: 2.5 billion GB of data per day in 2012 (IBM); an average of 1.7 MB of new information created per second per person in 2020; Big Data.
  2. Technology: faster computers, increased storage capacity, IoT, high computational power, new and improved programming languages.

In short, we can get more accurate answers, change how decisions are made, and find trends more easily.

What skills does a data scientist need?

Know the essential skills needed to become a data scientist.

But what is a data scientist? A data scientist is a person responsible for cleaning, processing, and analyzing data.

In summary, a data scientist must combine knowledge of statistics, algorithm design, and programming with communication skills and a deep understanding of the problem domain.

On the other hand, a data analyst must be able to identify problematic areas in data and find possible solutions. For example, spotting irregularities in data sets.

3. AN INTRODUCTION TO BIG DATA

Statistics: The backbone of data analysis

Statistics is all about choices: quantifying certainty and making statements that help in decision-making.

Definition: Statistics is a branch of mathematics concerned with the collection, classification, analysis, and interpretation of numerical facts, for drawing inferences on the basis of their quantifiable likelihood (probability of occurrence).

Statistics can be divided into three disciplines:

  1. Data Analysis: gathering, displaying, and summarizing data (i.e., descriptive statistics)
  2. Probability: the laws of chance, used to quantify certainty
  3. Inferential Statistics: the science of drawing statistical conclusions from specific data, using knowledge of probability

Start asking questions, and not just any questions but the right ones: who, where, or how. Remember "the impact of asking the wrong question."

Exploring data: The variables

Data science is all about data, which is defined as any information collected about a particular subject.

Population vs Sample

For example, suppose a study claims that Fox viewers are older than 54 on average, and we need to test whether that statement is true or false. For our "experiment", we survey 200 people who are currently Fox viewers.

In this case, the population is the complete set of all Fox viewers, and the sample is the set of 200 people.

In other words, the population (N) includes all of the elements of a data set, while a sample (n) consists of one or more observations drawn from the population.

Getting to know the data

Data comes in various forms. It can be dynamic (time series), meaning the variable is measured at different points in time, or static (cross-section), meaning the data describes a particular situation at a single point in time. It can also be univariate or multivariate.

Variables can be classified in several ways (see the sketch after this list):

  1. Dependent/independent
  2. Levels of measurement
  3. Categorical (Qualitative) vs Numerical (Quantitative)
    1. Categorical. Represented as strings or labels.
      1. Ordinal. Examples: professor ranking, level of satisfaction.
      2. Nominal. Examples: marital status, eye color, or questions that can be answered with yes or no.
    2. Numerical. Represented as floats or ints.
      1. Discrete. Examples: number of children, defects per hour. A finite (countable) set of values, mostly from counting.
      2. Continuous. Examples: height, weight. Any value within a given range, mostly from measurement.
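
To make the scheme concrete, here is a minimal pandas sketch; the DataFrame and its column names are invented for illustration, not taken from the course.

```python
# Minimal sketch: one column per variable type from the scheme above.
# The DataFrame and its column names are invented for illustration only.
import pandas as pd

df = pd.DataFrame({
    "satisfaction": ["low", "high", "medium", "high"],    # categorical, ordinal
    "eye_color":    ["brown", "blue", "green", "brown"],  # categorical, nominal
    "children":     [0, 2, 1, 3],                         # numerical, discrete (counts)
    "height_cm":    [172.5, 180.2, 165.0, 158.4],         # numerical, continuous (measured)
})

# An ordinal categorical carries an explicit order; a nominal one does not.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)
df["eye_color"] = df["eye_color"].astype("category")

print(df.dtypes)
```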

Some further points on populations vs. samples.

Nomenclature: a measurable characteristic of a population is called a parameter, while a measurable characteristic of a sample is called a statistic.

Notation: Greek letters are generally used to denote parameters, whereas Roman letters denote statistics.

Summary
  1. An experimental unit is defined as every item of interest upon which we collect data. E.g. A FOX viewer.
  2. Population is the set of all items of interest. E.g., all FOX viewers.
  3. A Variable represents a characteristic of each experimental unit, e.g. The age of each FOX viewer.
  4. A sample is a subset of the population of interest. E.g. a subset of FOX viewers.
  5. Statistical inference is an estimate, prediction, or other generalization based on the information contained in a sample. E.g. computing the average age of the sample of FOX viewers and using this average to estimate the population average age (see the sketch after this list).
  6. Data can be static or dynamic over time. Data that varies over time is called time-series.
  7. Data can be univariate, i.e. containing one variable, or multivariate, i.e. containing multiple variables.
  8. Variables can be dependent or independent.
  9. Variables can also be categorical or numerical. Categorical variables take on values that are labels or names. Numerical variables take on values that are numbers. E.g. The variable gender is categorical and the variable age is numerical.
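
A small simulation can make the parameter/statistic distinction concrete. This is a minimal sketch with made-up numbers (a hypothetical viewer population), not real data.

```python
# Minimal sketch: estimating a population average from a sample (made-up data).
import random
import statistics

random.seed(0)

# Hypothetical population: ages of all viewers (the population mean is a parameter).
population = [random.gauss(56, 12) for _ in range(100_000)]
population_mean = statistics.mean(population)

# Sample of 200 viewers (the sample mean is a statistic), used as an estimate.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.1f}")
print(f"sample mean (statistic):     {sample_mean:.1f}")
```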

What makes data 'Big'?

To put this in context, in just 60 seconds on the internet:

  • 38 million WhatsApp messages are sent.
  • 3.7 million Google searches are made
  • 4.3 million YouTube videos are viewed

More than 90% of big data has been generated in just the past two years.

There are five keys (the five V's) to understanding Big Data:

  1. Volume: The amount of data produced today is mind-boggling.
  2. Velocity: Data is generated and transmitted in real time.
  3. Variety: Data can be structured, unstructured, or semi-structured and comes in various forms.
  4. Veracity: Data is collected from multiple sources, and it can be difficult to determine how accurate it is, so when you work with Big Data you have to assume that not everything is perfect.
  5. Value (Most important feature): Data will do nothing for your business unless it is relevant.

It is essential to ensure that data actually generates knowledge. Processing and contextualization are therefore crucial intermediate steps. The most effective decisions are based on facts and data.

The common problems companies have with Big Data are related to making appropriate use of it:

  1. Silos: A company can become disconnected if its information is locked away in different departments.
  2. Information analysis: Companies need to learn to work with information throughout the company—and not just with dry numbers and figures.
  3. Finding talent: Workers qualified for these new roles are scarce.
  4. Getting started with big data: Some companies find it necessary to reorganize how they work.

Data must be memorable, so it must be presented correctly; 90% of all information is transmitted visually. The data needs to tell a story by itself.

Data-Driven Enterprise

"With the advent of the cloud and new technological solutions, big data has been democratized and now offers opportunities for companies of all sizes."

The Seven Principles of an Insight-Driven Enterprise

  • The first principle centers on defining our business objectives and focusing the data and our digital capabilities on a key point of our strategy to achieve them.
  • The second principle is geared toward technological capacity-building. Not all company technology platforms are up to date and prepared for what is coming, not all are connected to external data sources and equipped to absorb everything, and not all are in the cloud and ready to start capitalizing on this information. Consequently, we need to enable a technology platform for this new model, under the new paradigm of working with “unlimited volume and time.”
  • The goal of the third principle is to establish a suitable governance model. We must bear in mind that we are going to be handling third-party information, information from social media, information on positioning, customers’ personal data, and/or information on other business issues. Ensuring the privacy and security of how that information is processed and governed is thus extremely important, as is ensuring that the resulting ideas are launched in an environment of innovation and detecting when any of them fails in order to stop it in time. Not all ideas will yield value. In this environment, we thus have to be able to make mistakes quickly in order to learn quickly and change course. Without a defined governance model, the organization may react too late to possible missteps.
  • The fourth principle is to cultivate a culture that values data. Anyone at a company can use data to identify how to improve tasks, processes, and everyday decision-making.
  • The fifth principle is to think about the mode of consumption, which is also crucial, as the philosophy of using versus owning has changed. The new paradigm offers new ways to develop projects and make investments.
  • The sixth principle is to quantify how much using the information impacts the bottom line and the percentage of business generated through monetization by applying data-driven insights. This should become another business KPI.
  • Finally, the seventh principle is that the use of data-driven insights must be incorporated into the daily routines of everyone at the company and into every transaction. The real revolution lies in reinventing every business moment involving people, devices, and data.

BIG DATA APPLICATIONS

The applications of Big Data

Machine Learning and Big Data can support multiple business objectives: making data-informed decisions, understanding customer trends, creating customer-centric value propositions, and automating processes.

Big Data + IoT + Artificial Intelligence = 4th Industrial Revolution

  • Data-informed decisions. Example company: Google.
  • Understand customer trends: focused sales and efficiency at work.
  • Create customer value propositions: smarter products and services.
  • Automation: do things better and cheaper.
  • Data monetization.
  • Others.

4. DATA QUALITY MATTERS

Problems with data

The main data issues to consider are (a small cleaning sketch follows this list):

  • Noise: data that can distort the conclusions with false values
  • Duplicates: the same piece of information repeated, which can produce inconsistencies and confusion in some instances
  • Wrong records: incorrect entries in the data set, e.g. human errors
  • Incorrect measurements: errors in entries due to faulty recordings or to the recording method itself
  • Format errors: data entered in the wrong format that cannot be processed
  • Poor column naming: uninformative names or informal categorization in data sets, leading to confusion, considering that the data may be used by other people or teams
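
As a rough illustration of how some of these issues (duplicates, format errors, poor column naming) might be handled, here is a minimal pandas sketch; the DataFrame, values, and column names are made up.

```python
# Minimal sketch: handling duplicates, format errors, and poor column naming.
# The DataFrame and its column names are made up for illustration.
import pandas as pd

raw = pd.DataFrame({
    "CustID": [1, 2, 2, 3],                        # duplicate record for customer 2
    "amt":    ["10.5", "20,0", "20,0", "thirty"],  # inconsistent / wrong formats
    "col3":   ["2021-01-05", "2021-01-06", "2021-01-06", "2021-01-07"],  # uninformative name
})

clean = (
    raw.drop_duplicates()                          # remove duplicate rows
       .rename(columns={"CustID": "customer_id",   # informative column names
                        "amt": "amount",
                        "col3": "purchase_date"})
)
clean["amount"] = pd.to_numeric(                   # fix format errors; invalid values become NaN
    clean["amount"].str.replace(",", ".", regex=False), errors="coerce"
)
clean["purchase_date"] = pd.to_datetime(clean["purchase_date"])

print(clean)
```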

Data assessment: Missing values and outliers.

Options to remedy missing values (see the sketch after this list):

  • Easiest: discard the missing records
  • Fill the voids with a computed average: it is important to identify the pattern and relationships underlying the missing data in order to keep the distribution of values as close as possible to the original.
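
A minimal pandas sketch of both remedies on a made-up column of ages; the mean imputation shown here is the simplest case and ignores the missing-data pattern discussed above.

```python
# Minimal sketch: two remedies for missing values on a made-up column of ages.
import numpy as np
import pandas as pd

ages = pd.Series([34, 41, np.nan, 29, np.nan, 52, 38], name="age")

# Option 1 (easiest): discard the missing observations.
dropped = ages.dropna()

# Option 2: fill the voids with the computed average of the observed values.
filled = ages.fillna(ages.mean())

print(dropped.tolist())
print(filled.round(1).tolist())
```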

Helpful software:

  • Wrangler

An outlier is an observation or measurement that is unusually large or small relative to the rest of the data set. It can arise because:

  1. The measurement was recorded incorrectly
  2. The measurement comes from a different population

How to detect outliers:

There are two types of fences: inner fences, located 1.5 interquartile ranges (IQR) beyond the quartiles, and outer fences, located 3 IQRs beyond the quartiles (a code sketch follows the worked example below).

Lower inner fence = Q1 - 1.5 IQR

Upper inner fence = Q3 + 1.5 IQR

Lower outer fence = Q1 - 3 IQR

Upper outer fence = Q3 + 3 IQR

IQR = Q3 - Q1

Find Quartile:

Q1 = median of the lower half of the data (the smallest values)

Q3 = median of the upper half of the data (the largest values)

Analytic method:

First, find the position of Q1 with the formula (N+1)/4. If the result is not an integer, split it into its integer part i and its decimal part d.

Example, with a population of N = 20:

(N+1)/4 = 5.25, so i = 5 and d = 0.25,

and interpolate: Q1 = Xi + d(Xi+1 - Xi), where Xi is the i-th smallest value.

Find the position for Q3 in the same way, using 3(N+1)/4.

Example, with a population of N = 20:

3(N+1)/4 = 15.75, so i = 15 and d = 0.75, and

Q3 = Xi + d(Xi+1 - Xi)
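
Putting the interpolation rule and the fences together, here is a minimal Python sketch; the list of ages is invented and the function names are illustrative.

```python
# Minimal sketch: quartiles via the (N+1)/4 interpolation rule, then the fences.
# The data and function names are illustrative only.

def quartile(sorted_values, position):
    """Interpolated quartile; position is (N+1)/4 for Q1 or 3*(N+1)/4 for Q3."""
    i = int(position)            # integer part
    d = position - i             # decimal part
    if d == 0 or i >= len(sorted_values):
        return sorted_values[i - 1]          # position falls exactly on an observation
    # Q = X_i + d * (X_{i+1} - X_i), with X_1 the smallest value
    return sorted_values[i - 1] + d * (sorted_values[i] - sorted_values[i - 1])

def outlier_fences(values):
    data = sorted(values)
    n = len(data)
    q1 = quartile(data, (n + 1) / 4)
    q3 = quartile(data, 3 * (n + 1) / 4)
    iqr = q3 - q1
    return {
        "Q1": q1, "Q3": q3, "IQR": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

ages = [23, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 95]   # 95 is a suspicious value
fences = outlier_fences(ages)
outliers = [x for x in ages if x < fences["inner"][0] or x > fences["inner"][1]]
print(fences)
print("outliers:", outliers)
```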

Manipulating data and the role of managers

How to lie with statistics

  1. Sample size: how large is the sample?
  2. Sampling technique: is the sample representative of the population?
  3. Measures of centrality: which measure is used to typify the data set? (see the sketch after this list)
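
A tiny example of how the choice of centrality measure can mislead; the salary figures are made up.

```python
# Minimal sketch (hypothetical salary data): the mean and the median can tell
# very different stories about the "typical" value when the data is skewed.
import statistics

salaries = [30_000, 32_000, 33_000, 34_000, 35_000, 36_000, 38_000, 40_000, 1_000_000]

print("mean:  ", statistics.mean(salaries))    # pulled up by the single extreme value
print("median:", statistics.median(salaries))  # closer to what most people earn
```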

Use statistics correctly by asking the right questions, so the data is described the correct way:

  • How do they know?
  • Is there data to support the claim?
  • How large is the sample?
  • Are there any missing values?

Be careful with hidden biases in algorithms.

The role of managers in data quality:

Need to be a team leader.

Must be able to motivate the team and know the roles of its members.

Acting as a facilitator.

What good managers do:

  • Instill a learning philosophy
  • Be involved as a leader in the analytical process
  • Ensure the team has access to the required resources
  • Ensure all interdisciplinary teams are aware of each other's work

5. MACHINE LEARNING

Machine learning, robots and business

THE GENESIS OF THE NEW DATA SCIENTIST

Location analytics, data analysis, and machine learning have given rise to a new professional profile that would have been inconceivable just a few years ago.

A new executive post called chief artificial intelligence officer has been created to oversee AI solutions and to address complex problems for which little information is available. The CAIO must have scientific skills—mainly in cognitive processes, prediction, simulation, etc.—plus a vision of engineering for the application of symbolic, constructionist, and hybrid paradigms.

Data scientists are a perfect fit for smart cities, since their job is to solve problems in various disciplines by analyzing data.

Machine learning

A branch of artificial intelligence (AI). The science of teaching computers to learn and behave like humans do, improving their learning over time autonomously by feeding them data and information in the form of observations & real-world interactions.

Types of analytics

  • Descriptive: Describe what happened, employed heavily across all industries.
  • Predictive: Focus on ML. Anticipates what will happen (probabilistic); employed in data-driven organizations as a key source of insight.
  • Prescriptive: Provide recommendations on what to do to achieve goals, employed heavily by leading data and Internet companies.

Analytics scheme.

Descriptive (Foundational)

  • What happened in the past? // Focus on reporting and use static and interactive reports as tools and techniques.
  • What is happening now? // Focus on measuring key performance indicators, use dashboards and performance scorecards as tools.

Diagnostic (Operational)

  • Why did it happen and what are the relationships? Focus on trend analysis, situational analysis, root cause analysis, cluster analysis, and cause and effect. Uses data mining, statistical modeling, query tools, spreadsheets, OLAP tools, and decision trees as tools and techniques.

Predictive (Insightful)

  • What will happen in the future? Focus on forecasting, probability assessment, risk management, and prediction. Uses what-if analysis, machine learning, predictive modeling, neural networks, and data visualization as tools and techniques.

Prescriptive (Strategic)

  • How should we act in the future? Focus on scenario-based planning, strategy formulation and simulation, and option optimization. Uses discrete choice modeling, linear and nonlinear programming, and value analysis as tools and techniques.

Infrastructure Needed and Types of ML

  1. Infrastructure:

    • Cloud Services
    • GPU
    • TPU
  2. Types of ML

    • Supervised Learning (Classification algorithms. Artificial Neural Networks, ANN. Deep Learning)
    • Unsupervised Learning (Clustering. Dimensional Reduction)
    • Reinforcement Learning (Q-Learning)

    Tips

    1. To select the right ML solution, try to sketch out the problem you are trying to solve (see the code sketch after these tips).
    2. Words like predict, estimate, etc., should hint at supervised learning.
    3. Words like cluster, group and identify patterns hint at unsupervised learning.
    4. Reinforcement learning is suitable for dynamic environments, in which changes must be understood and factored into the algorithm.
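
As a rough illustration of the supervised vs. unsupervised distinction, here is a minimal sketch using scikit-learn on synthetic data; the dataset and parameter choices are illustrative only.

```python
# Minimal sketch contrasting supervised and unsupervised learning on toy data.
# Assumes scikit-learn is installed; the data is synthetic and illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: 200 observations described by 4 numeric features, with a known label.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: "predict" a known label from labeled examples.
clf = LogisticRegression().fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: "group" the same observations without using the labels at all.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in (0, 1)])
```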
