Monday, November 26, 2018

The Cart Before the Horse in Data Science Projects: A Return to Basics

Introduction

A data science project can easily span several months, involve a large number of people and end up embedded in complex, expensive networked systems with dependencies that are hard to untangle. For this reason, it is best to front-load as much work as possible at the design and conceptual level, so that we can avoid errors that are difficult to reverse later.


Once the first iteration from left to right is complete, we can move on to the stage where data availability and tools become important inputs to our design decisions and implementation approach.

The proposed guide is structured around six interconnected, modular building blocks, which can be summarized as:

1. The overall challenge guiding the project, which should serve as the yardstick against which success is ultimately evaluated.

2. The key questions we want to answer with this project.

3. Indicators (which can often be expressed as algorithms) that will help us find data-based answers to the questions.

4. Data visualization solutions that will help us communicate the indicators to change agents / decision-makers / stakeholders.

5. The analytics and data infrastructure needed to produce and deploy the indicators and visualizations.

6. The data used to feed our indicators and visualizations.
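To make the six building blocks concrete, here is a minimal illustrative sketch (all names are hypothetical, not part of any real tooling) that tracks them as an ordered checklist. Because changes cascade from earlier blocks to later ones, the earliest undefined block is the one to work on first:

```python
from dataclasses import dataclass

# Hypothetical sketch: the six building blocks as an ordered checklist
# a team can fill in during design workshops.
@dataclass
class BuildingBlock:
    name: str
    definition: str = ""  # agreed wording, filled in with stakeholders

    def is_defined(self) -> bool:
        return bool(self.definition.strip())

BLOCKS = [
    BuildingBlock("challenge"),
    BuildingBlock("questions"),
    BuildingBlock("indicators"),
    BuildingBlock("visualizations"),
    BuildingBlock("analytics_infrastructure"),
    BuildingBlock("data"),
]

def first_undefined(blocks):
    """Return the name of the earliest block still lacking a definition.

    Earlier gaps should be closed first, since changes to earlier
    blocks cascade into all the later ones."""
    for block in blocks:
        if not block.is_defined():
            return block.name
    return None
```

With a fresh project, `first_undefined(BLOCKS)` points at `"challenge"`; only once that is agreed does the checklist move on to `"questions"`, mirroring the ordering argument above.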

We should strive to clarify each of the six building blocks and their interfaces as early as possible in the project. As new information emerges during design and implementation, we should openly assess how it affects the overall alignment and scope of the project.

In general, changes to the earlier building blocks have more cascading consequences than changes later in the list. For this reason, we should pay special attention to defining the overall challenge and the key questions at the start of the design process.

In contrast, conventional approaches to modelling the logical steps of a data science project often start from the wrong end. An example is the "data collection => data analysis => interpretation" pipeline, where the utility and potential value of the project only become clear at the end of the pipe.

PUTTING THE GUIDE INTO PRACTICE

Every data science project is different, and there is no single, one-size-fits-all solution. However, collectively asking and answering the right questions from the start can be very helpful. Doing so helps ensure that everyone is on the same page and exposes dangerous hidden assumptions, which may be plainly wrong or not shared among the stakeholders. In this context, for each of the first four crucial building blocks, we can identify some important questions to discuss before any implementation work:

1) Challenge

• Description of the challenge: First, take the time to develop a clear and precise formulation of the challenge that is compelling and shared among stakeholders. The formulation should be such that we can return to it at the end of the project and easily determine whether or not the project helped solve the challenge.

• Identification of key stakeholders: List the key stakeholders of the challenge and briefly describe their roles. The list may include employees from different departments, customers, suppliers, regulators, etc.

• Description of the "pain" that justifies investing in this challenge: Begin by specifying the current situation (the tools, methods, processes, etc. currently in use) and its limitations. Next, describe the desired situation (the ideal solution to the challenge).

• The total net value expected from meeting this challenge, in $: Assuming the "ideal" solution can be achieved, make an effort to quantify in monetary terms the value the organization could capture by meeting this challenge. This should be expressed as the incremental value of moving from the current situation to the desired situation, without regard to development costs. The purpose is to provide context for the development budget and the maximum total effort that can be justified in meeting the challenge.
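As a minimal sketch of this valuation step (all figures below are illustrative assumptions, not data from any real project), the expected net value is simply the gap between the current and desired situations, summed over a chosen time horizon:

```python
# Hypothetical sketch: incremental value of moving from the current
# situation to the desired one, ignoring development costs.
def incremental_value(current_annual_value, desired_annual_value, horizon_years):
    """Value captured by closing the gap, summed over the horizon."""
    return (desired_annual_value - current_annual_value) * horizon_years

# Illustrative assumption: a process worth $1.2M/yr today and $1.5M/yr
# in the desired state, evaluated over a 3-year horizon.
budget_ceiling = incremental_value(1_200_000, 1_500_000, 3)  # 900_000
```

The resulting figure then bounds the maximum development effort that can be justified, which is exactly the context the bullet above asks for.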

• List your assumptions: make explicit what lies behind your assessment of the desired situation and your calculation of the expected incremental value of moving to it.

2) Questions

• Description of each question: This is where we define each of the key questions whose answers are the inputs needed to address the identified challenge.

Questions should be described so that they can be answered using data-based algorithms. Typical questions may contain one or more of the following data dimensions:

- Where (geographical / local)

- When (time)

- What (object / entity)

- Who (subject)

- How (process)

Example question: Which organizations should I approach first to develop a component for product "Y", and where are they located?
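As an illustrative sketch (the mapping below is hypothetical, simply matching the example question against the dimension list above), decomposing a question this way also makes visible which dimensions it does not yet cover:

```python
# Hypothetical sketch: the example question mapped onto the data
# dimensions listed above.
dimensions = {
    "what": "a component for product 'Y'",       # object / entity
    "who": "candidate partner organizations",    # subject
    "where": "location of each organization",    # geographical
    "how": "order in which to approach them",    # process
}

# Dimensions not covered by the question can flag missing scope:
# here no "when" (time window) has been specified yet.
missing = [d for d in ("where", "when", "what", "who", "how") if d not in dimensions]
```

Surfacing the uncovered dimension ("when") early prompts stakeholders to decide whether a time window matters before any algorithm is built.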

• Purpose of each question: Is the question descriptive, predictive or prescriptive? What is the description, prediction or prescription that the question seeks?

• Ranking of questions: rank the questions according to their importance to the overall project so that, if necessary, they can be prioritized.

3) Indicators

• Description of each indicator: Indicators are algorithmic solutions to the questions posed. Although at an early stage we may not be able to define a complete algorithm, we can express it at a higher level of abstraction, indicating the kind of algorithmic solution that is most useful and achievable. For example, two such indicators could be:

- A collaboration algorithm that provides a ranked list of potential collaborators for company [X] based on geographical, technological and relational proximity

- A capability-mapping algorithm that identifies the main technology clusters in a given sector based on the co-occurrence of R&D-related key terms
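To show what "a higher level of abstraction" might look like in code, here is a minimal sketch of the first indicator. The weighting scheme, scores and company names are all illustrative assumptions, not an actual method from the project:

```python
# Hypothetical sketch: rank candidate collaborators for a focal company
# by a weighted blend of geographical, technological and relational
# proximity, each scored in [0, 1].
def rank_collaborators(candidates, weights=(0.3, 0.4, 0.3)):
    """candidates: list of (name, geo, tech, rel) proximity tuples.
    Returns names sorted from strongest to weakest overall match."""
    w_geo, w_tech, w_rel = weights
    scored = [
        (w_geo * geo + w_tech * tech + w_rel * rel, name)
        for name, geo, tech, rel in candidates
    ]
    return [name for score, name in sorted(scored, reverse=True)]

# Illustrative inputs only.
ranking = rank_collaborators([
    ("Acme Labs", 0.9, 0.4, 0.2),   # close by, weak tech/relational fit
    ("Beta Corp", 0.3, 0.9, 0.8),   # distant, strong tech/relational fit
    ("Gamma Ltd", 0.5, 0.5, 0.5),   # middling on all three
])
```

Even at this level of abstraction, the sketch forces useful design conversations, for instance whether technological fit should outweigh geography, which is exactly the kind of decision stakeholders can debate before any data is collected.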

4) Data visualizations

• Define the target data visualizations: Before writing a single line of code or touching any data, the people who are meant to benefit from the project's deliverables can provide critical input on which data visualization formats would be most useful to them.

A simple but powerful approach is to ask these target users to produce sketches of the types of visual representations they believe would best communicate the results that the indicators in point 3 should produce.

Other important properties to take into account when defining data visualizations include simplicity, familiarity, intuitiveness, and fit with the dimensions of the question.

• Characteristics of each data visualization: the characteristics of the data visualization solution include:

- The degree of interactivity required.

- The number of dimensions that must be displayed simultaneously.

• Purpose of each data visualization: The purpose of a visualization can be:

- Exploration: provide open-ended means to dig into the data and analyze relationships without predefining specific viewpoints or questions

- Narrative: the visualization is designed to deliver a predefined message convincingly to the target user, providing robust, data-based arguments

- Synthesis: the main purpose of the visualization is to integrate multiple angles of a complex dataset, condensing its key features into an intuitive and accessible format

- Analysis: the visualization helps to break a typically large and complex dataset into smaller parts, characteristics or dimensions that can be treated separately

WRAP-UP

Data science projects tend to overemphasize the analysis, visualization, data and infrastructure elements too early in the design process. This leaves too little time in the early stages for jointly defining the challenge with project stakeholders, identifying the right questions, and understanding which indicators and visualizations are needed and usable to answer those questions.

The guide presented here seeks to share lessons learned by proposing a structure that helps steer the early stages of data-driven projects and makes the interdependencies between design decisions explicit. The framework integrates design thinking and systems thinking, as well as practical project experience.

The main components of this framework were developed during work produced for EURITO (Grant Agreement No. 770420). EURITO is a European Union Horizon 2020 research and innovation project that aims to build "relevant, inclusive, timely, reliable and open innovation indicators" by taking advantage of new data sources and advanced analytics.

