Monday, November 26, 2018

The Cart Before the Horse in Data Science Projects: Return to the Basics

Introduction

A data science project can easily span several months, affect a large number of people, and end up embedded in expensive, complex networked systems with dependencies that are hard to untangle. For this reason, it is best to front-load as much work as possible at the design and conceptual level, so that we avoid errors that are difficult to reverse afterwards.


Once the first iteration from left to right is complete, we can move on to the stage where data opportunities and tools become important parts of our design decisions and data-driven implementation approaches.

The proposed guide is structured around six interconnected, modular building blocks, which can be summarized as follows (a minimal template capturing them appears after the list):

1. The overall challenge guiding the project, which should act as the yardstick against which success is ultimately evaluated.

2. The main questions that we want to answer with this project.

3. Indicators (often expressible in the form of algorithms), which will help us find data-based answers to the questions.

4. Data visualization solutions that will help us communicate the indicators to change agents / decision-makers / stakeholders.

5. The analytics and data infrastructure needed to produce and deploy the indicators and visualizations.

6. The data used to feed our indicators and views.
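To make these building blocks concrete before any implementation starts, they can be captured in a lightweight design-document template. The sketch below is a minimal, hypothetical example in Python; the field names and sample content are illustrative assumptions rather than part of the guide itself:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProjectDesign:
    """Minimal template for the six building blocks of a data science project."""
    challenge: str                    # 1. overall challenge used to judge success
    questions: List[str]              # 2. key questions the project must answer
    indicators: List[str]             # 3. algorithmic indicators answering the questions
    visualizations: List[str]         # 4. how indicators are communicated to stakeholders
    infrastructure: List[str] = field(default_factory=list)  # 5. analytics infrastructure
    data_sources: List[str] = field(default_factory=list)    # 6. data feeding indicators/views

# Hypothetical example of filling the template during a kick-off workshop
design = ProjectDesign(
    challenge="Increase collaboration opportunities for regional SMEs",
    questions=["Which organizations should we approach first to develop component X?"],
    indicators=["Ranked list of potential collaborators by proximity"],
    visualizations=["Interactive map of candidate partners"],
    infrastructure=["Cloud notebook environment", "Dashboard server"],
    data_sources=["Public R&D project registry", "Company registry"],
)
print(design.challenge)
```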

We should strive to clarify each of the six building blocks and their interfaces as early as possible in the project. As new information emerges during the design and implementation of the project, we should examine how it affects the overall alignment and scope of the project.

In general, changes to the earlier building blocks have more cascading consequences than changes to later ones in the list. For this reason, we should pay special attention to the definition of the overall challenge and the key questions at the beginning of the design process.

In contrast, conventional data science approaches to modelling the logical steps of a project often start from the back. An example of this is the "data collection => data analysis => interpretation" pipeline, where the utility and potential value offered by the project only become clear at the end of the pipe.

PUTTING THE GUIDE INTO PRACTICE

Each data science project is different, and there is no one-size-fits-all solution. However, collectively asking and answering the right questions from the start can be very helpful. Doing so helps to ensure that everyone is on the same page and exposes hidden assumptions, which may be plainly erroneous or not shared among the stakeholders. In this context, for the first four crucial building blocks, we can identify some important questions to be discussed before any implementation work begins:

1) Challenge

• Description of the challenge: First, take the time to develop a clear and precise formulation of the challenge that is compelling and shared among stakeholders. The formulation should be such that we can come back at the end of the project and easily determine whether or not the project helped solve the challenge.

• Identification of key stakeholders: List the key stakeholders of the challenge and briefly describe their roles. The list may include employees from different departments, customers, suppliers, regulators, etc.

• Description of the "pain" that justifies the investment in resolving this challenge: Begin by specifying the current situation (including the tools, methods, and processes currently used) and its limitations. Next, describe the desired situation (the ideal solution to the challenge).

• The total net value expected from meeting this challenge, in $: Assuming an "ideal" solution can be achieved, make an effort to quantify in monetary terms the value the organization could capture by meeting this challenge. This should be expressed as the incremental value of moving from the current situation to the desired situation, without regard to development costs. The purpose is to provide context for the development budget and the maximum total effort that can be justified to meet the challenge.

• List your assumptions: make explicit what lies behind your assessment of the desired situation and your calculation of the expected incremental value of moving to it.

2) Questions

• Description of each question: This is where we define each of the key questions whose answers are the inputs needed to address the identified challenge.

Questions should be described so that they can be answered using data-based algorithms. A typical question may involve one or more of the following data dimensions:

- Where (geographical / local)

- When (time)

- What (object / entity)

- Who (subject)

- How (process)

Example question: What are the top organizations I should approach first to develop a component for the "Y" product, and where are they located? (A sketch showing how this example can be decomposed into the data dimensions above appears at the end of this section.)

• Purpose of each question: Is the question descriptive, predictive or prescriptive? What is the description, prediction or prescription that the question seeks?

• Ranking of the questions: rank the questions according to their overall importance to the project so that, if necessary, they can be prioritized.
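To make the example question above concrete, the small sketch below decomposes it into the data dimensions listed earlier. The field values are illustrative assumptions about how such a decomposition might be recorded, not part of the original guide:

```python
# Hypothetical decomposition of the example question into data dimensions.
example_question = {
    "text": ("What are the top organizations I should approach first to develop "
             "a component for the 'Y' product, and where are they located?"),
    "what": "organizations able to develop the component",   # object / entity
    "who": "the requesting company and candidate partners",  # subject
    "where": "geographical location of each candidate",      # geographical / local
    "how": "ranking candidates by suitability",               # process
}
for dimension in ("what", "who", "where", "how"):
    print(f"{dimension}: {example_question[dimension]}")
```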

3) Indicators

• Description of each indicator: Indicators are algorithmic solutions to the questions posed. Although at an early stage we may not be able to define a complete algorithm, we can express it at a higher level of abstraction, indicating the kind of algorithmic solution that would be most useful and achievable. For example, two such indicators could be (a rough sketch of the first one follows the list):

- A collaboration algorithm that provides a ranked list of potential collaborators for company [X] given geographical, technological and relational proximity

- A capability-mapping algorithm that identifies the main technology clusters in a given sector based on the co-occurrence of R&D-related key terms
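As a very rough sketch of what the first of these indicators could look like in practice, the snippet below combines three hypothetical proximity scores into a single ranking. The weights, score values and company names are illustrative assumptions, not part of the guide:

```python
# Hypothetical collaboration indicator: rank candidate partners for company X
# by a weighted combination of geographical, technological and relational proximity.
candidates = {
    # name: (geographical, technological, relational) proximity scores in [0, 1]
    "Company A": (0.9, 0.4, 0.7),
    "Company B": (0.3, 0.9, 0.5),
    "Company C": (0.6, 0.8, 0.2),
}
weights = (0.3, 0.5, 0.2)  # assumed relative importance of each proximity dimension

def collaboration_score(scores, weights):
    """Weighted average of the proximity scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

ranking = sorted(candidates,
                 key=lambda c: collaboration_score(candidates[c], weights),
                 reverse=True)
print(ranking)  # ['Company B', 'Company C', 'Company A']
```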

4) Data visualizations

• Define the target data visualizations: Before writing a single line of code or touching any data, the people who should benefit from the deliverables of the data science project can provide critical information about the data visualization formats that would be most useful for them.

A simple but powerful approach is to ask these target users to produce sketches of the types of visual representations they believe would best communicate the results that the indicators in point 3 should produce.

Other important features to take into account when defining data visualizations include simplicity, familiarity, intuitiveness, and fit to the dimensions of the question.

• Characteristics of each data visualization: the characteristics of the data visualization solution include:

- The degree of interactivity required.

- The number of dimensions that must be displayed simultaneously.

• Purpose of each data visualization: The purpose of a visualization can be:

- Exploration: provide open-ended means to dig into the data and analyze relationships without previously defining specific points of view or questions

- Narrative: Data visualization is designed to deliver a predefined message convincingly to the target user and aims to provide robust arguments based on data

- Synthesis: the main purpose of the visualization is to integrate multiple angles of a complex data set, condensing its key features into an intuitive and accessible format.

- Analysis: Visualization helps to divide a generally large and complex dataset into smaller parts, characteristics or dimensions that can be treated separately

WRAP-UP

Data science projects suffer from a tendency to emphasize the analysis, visualization, data, and infrastructure elements of the project too early in the design process. This leaves very little time in the early stages for working jointly with project stakeholders on a definition of the challenge, identifying the right questions, and understanding the kinds of indicators and visualizations that are necessary and usable to answer those questions.

The guide presented here seeks to share the lessons learned from this work through a structure that helps guide the early stages of data-driven projects and makes explicit the interdependencies between design decisions. The framework integrates elements of design and systems thinking, as well as practical project experience.

The main components of this framework were developed during the work produced for EURITO (Grant Agreement No. 770420). EURITO is a European Union Horizon 2020 research and innovation project that aims to build "relevant, inclusive, timely, reliable and open innovation indicators" by taking advantage of new data sources and advanced analytics.

Sunday, November 25, 2018

How can Deep Learning help solve the problem of climate change?


Introduction

As a pioneer in the fight against global climate change, Germany is investing more and more in renewable energy, especially wind energy. With about 300 new turbines built between 2016 and 2017, North Rhine-Westphalia is one of the leading federal states in the construction of new wind turbines. To assess wind power potential and plan new turbines, it is essential to track the spatial location of wind turbines, along with their type, and to combine this with information on average wind speeds.

In the present use case, a methodology is described that can locate and segment wind turbines in satellite images. The implemented neural network architecture, called U-Net, is a leading standard for image segmentation. The output is a pixel-level prediction of the likelihood that a pixel belongs to a wind turbine. The deep learning model was trained to predict wind turbine polygons in 280,000 satellite image tiles covering the entire region of North Rhine-Westphalia. The output was transferred to the ArcGIS Geographic Information System and can be accessed online from multiple devices. The wind turbine layer can be used for comprehensive analysis of wind power potential and for the spatial planning of turbines.

In view of the growing scarcity of fossil fuels, renewable energy sources are becoming increasingly important economically, socially and politically as an efficient and ecological way of generating electricity.

The objective of the project was to support the Federal Ministry of North Rhine-Westphalia in generating regional registers of the sites and types of wind turbines, to guide national energy producers in the spatial planning of new power plants.
A convolutional neural network (CNN) was trained to identify and segment wind turbines and wind farms on the ground, based on satellite images. The output wind turbine polygons can be fed into Geographic Information Systems (GIS) and enriched with current wind data to efficiently monitor wind energy production.

Data provided

The satellite imagery used in this project comprised 280,000 image tiles from the World Imagery service of Esri (the market leader in geoinformation systems). The tiles cover the area of the federal state of North Rhine-Westphalia, with each tile covering 1 km². For each image tile, the corresponding geographic metadata was crawled. 500 images were selected for the training data set and 200 for the validation data set, including 200 wind turbines of different types and in different land cover situations. Wind turbine polygons were created for both sets.

Applied methods

First, image pre-processing was performed to normalize the satellite images for different levels of brightness, saturation and contrast. Then the training and validation data were generated by visually locating wind farms and turbines in ArcGIS Pro (Esri's professional GIS tool). The located turbines were marked and converted into georeferenced polygons. After combining the resulting polygons with the corresponding image tiles, they were turned into image masks. A mask indicates whether or not each image pixel belongs to a wind turbine and serves as the target labelling for the neural network. The deep learning framework is based on a U-Net architecture, which has proven to work very well for segmentation tasks with small amounts of training data. Segmentation performance was tracked using the Jaccard index, an intersection-over-union measure. Training was calibrated to achieve maximum accuracy on the validation set in order to avoid overfitting. The final layer of the neural network generates an image mask with a pixel-level prediction of the likelihood that each pixel belongs to a wind turbine.
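The project's actual training code is not included here, but a minimal sketch of the two central ingredients described above, a small U-Net-style segmentation network and an intersection-over-union (Jaccard) metric, might look like the following PyTorch snippet. The layer sizes and names are illustrative assumptions; the real model trained on 280,000 tiles would be substantially larger:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Heavily reduced U-Net: one down-sampling and one up-sampling stage."""
    def __init__(self):
        super().__init__()
        self.enc = conv_block(3, 16)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = conv_block(32, 16)                # 32 = 16 upsampled + 16 skip connection
        self.head = nn.Conv2d(16, 1, kernel_size=1)  # per-pixel turbine logit

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))
        return torch.sigmoid(self.head(d))           # probability of "wind turbine" per pixel

def jaccard_index(pred, target, threshold=0.5, eps=1e-7):
    """Intersection over union between predicted and reference turbine masks."""
    pred_mask = (pred > threshold).float()
    intersection = (pred_mask * target).sum()
    union = pred_mask.sum() + target.sum() - intersection
    return (intersection + eps) / (union + eps)

# Toy usage on a random 3-channel tile and an empty reference mask
model = TinyUNet()
tile = torch.rand(1, 3, 128, 128)
prediction = model(tile)
print(prediction.shape, jaccard_index(prediction, torch.zeros(1, 1, 128, 128)).item())
```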

Challenges

The first challenge was to create training labels by generating wind turbine polygons within the 700 satellite images. Unsupervised clustering, specifically K-Means colour clustering, aided pattern recognition and polygon shape extraction. Another time-consuming challenge was detecting the network's false positives, i.e. image segments that were wrongly identified as wind turbines, such as roads, aircraft or tree branches. More training epochs were required to teach the neural network to differentiate between objects with a similar appearance.
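As an illustration of the K-Means colour clustering mentioned above, the snippet below groups the pixels of a tile by colour so that bright turbine pixels can be separated from the darker background. The cluster count and the "brightest cluster corresponds to turbines" heuristic are simplifying assumptions for this sketch, not the project's actual labelling pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

def colour_cluster_mask(image, n_clusters=4):
    """Cluster pixels by colour and return a mask of the brightest cluster.

    `image` is an (H, W, 3) array; bright, uniform turbine blades often fall into
    their own colour cluster, which gives a rough starting polygon for labelling.
    """
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pixels)
    # Mean brightness of each cluster; assume the brightest one corresponds to turbines.
    brightness = [pixels[labels == k].mean() for k in range(n_clusters)]
    brightest = int(np.argmax(brightness))
    return (labels == brightest).reshape(h, w)

# Toy usage on a random "tile"; a real tile would come from the satellite imagery.
tile = np.random.randint(0, 255, size=(128, 128, 3), dtype=np.uint8)
mask = colour_cluster_mask(tile)
print(mask.shape, mask.mean())  # fraction of pixels assigned to the brightest cluster
```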

Project result

Using the deep learning model developed, a regional register of wind turbines was successfully created for the state of North Rhine-Westphalia. In total, about 3,300 wind turbines have been identified in satellite imagery.

This register was also captured as a layer in ArcGIS Pro and is now available as map material within the software, showing the locations of all identified wind turbines as polygons of their shapes. As a next step, the model can be applied to other German states, or even used to create a global register of wind turbines. The wind turbine layer can be combined with current wind speed data to monitor wind power generation, and with average wind speed layers to support the spatial planning of new wind turbine installations.

Other applications

The deep learning model developed has already been used in other projects within the field of satellite image segmentation. Based on the satellite images provided, a stylized map material was created. The neural network was trained to detect different objects and types of ground cover in satellite images, such as roads, trees, forests, vehicles, buildings, rivers and agricultural fields.

Outlook
Satellite imagery has a multitude of application fields and can be used to gain a better understanding of other domains, e.g. to help identify natural resources more easily, to visualize and monitor climate or vegetation changes, or to accurately map the impacts of natural disasters. Until now, these tasks have mostly been accomplished through manual or semi-automatic methods. AI-based feature detection in satellite images can contribute significantly to these geographic applications.


Saturday, November 24, 2018

Why is ethics essential in the big data era?


Introduction

Digital technology and the rise of big data have raised continuing questions about maintaining integrity in business and marketing. Business leaders are navigating a new paradigm in which everyone must take into account important issues such as privacy, monitoring, and how data is used for personal gain and benefit.

Despite the global proliferation of technology, technological knowledge among consumers is deteriorating. As a result, there is a greater need to implement ethical protections that ensure consumer privacy. With the evolution of big data and digital marketing, it is more important than ever to keep the values that align companies with ethical standards.

Ethics covers how people react to moral decisions. Transparency, as well as how companies use data and influence their customers, are key components of standards that promote ethical corporate behaviour. For businesses, there is a need for clearly defined ethical policies that align with company values and behaviours and allow team members to make consistent and exemplary moral decisions.

As is often the case, technology has preceded ethics. In some cases, this has resulted in marketing initiatives that place the pursuit of profit above the interests of consumers. Fortunately, industry leaders are recognizing the need to protect consumers as innovative technologies, such as big data, take the global market by storm.

Doing social media the right way

Beyond handing information to fraudulent accounts, users of social networks can disclose information that their employers, or agencies such as the Internal Revenue Service (IRS), may view less positively. Many consumers post on social networks, providing detailed information about their lives for everyone to see. These users generally do not consider who can view and track their personal information.

Netiquette is important for companies that have a presence on social networks. Just as it is rude to stare at people, it is considered rude to lurk on social network pages without engaging. Companies are not immune to this basic principle of good netiquette. Companies that engage with consumers strengthen their brands and gain authority over time.

Originality is another important aspect of ethical behaviour on social networks. Whenever employees post an idea or image on behalf of the organization, it is important to give credit to the creator of that intellectual property. Companies that do not comply with this rule can be called out by consumers and suffer damaging blows to their credibility.

By sharing the content of followers, brands can generate goodwill among their online audience. This is a great way to establish relationships with trend-setting social media users, known as influencers. When companies publish content created by their audience, the audience is more likely to return the favour and provide endorsements and coveted online referrals.

Maintain professional ethics

Behind the scenes, companies today can collect huge amounts of detailed consumer data. This is where problems arise in collecting and sharing people's data. In such cases, business leaders must ensure the safe and ethical handling of consumer information. It is vital not to trample on consumers' privacy rights while promoting a good or service. Companies should also be careful not to manipulate public opinion about their brand, and should instead treat negative posts as opportunities to solve consumer problems and build relationships.

Some of today's business leaders can get lost in the technical aspects of digital marketing. With effort, you can climb to the top of the search results and build credibility online in the short term. However, this credibility will not last unless it is also complemented by ethical practices. Social networks give consumers a powerful platform to express their opinions. As a result, business leaders must pair their growth strategies with ethical frameworks to ensure long-term credibility and growth.

Walking on moral ground

As a general rule, ethical marketing involves activities that do not cause negative or unsatisfactory outcomes for consumers. Consumers can express their dissatisfaction very easily through social media channels, resulting in a loss of trust in a particular brand. For large companies that can mount campaigns to counter this, it is a minor problem. However, a negative public outcry may force a small or medium-sized company to close its doors.

Today's business leaders must think carefully about ethics while developing new and exciting ways to take advantage of the latest technology innovations. By formally establishing ethical standards, business leaders can strike a balance between how their organizations use online resources and the protection and privacy of their consumers.


Thursday, November 22, 2018

Data science on a Chromebook


Introduction

About nine months ago, I announced that I was trying the Chromebook experiment for a second time. At first, I thought it would be a short-term experiment just to see whether it was even possible to get by with only a Chromebook. But in an interesting twist, I got used to it and have worked exclusively on a Chromebook in the months since the experiment began.

I defined the following requirements:

1. I could only use Chrome OS without installing/booting Linux

2. I could not use another computer for any task

3. It had to be "fully cloud-based" in the sense that I had no additional hardware

One of the reasons I did this was that I wanted to see whether it was possible to be a working, day-to-day data scientist without using an expensive laptop. This is part of a larger experiment I have just started on how to democratize data science education.
I'm not going to go into extreme detail about how I set it all up here (more on that in a second), but I thought I'd describe the Chromebook setup I've been using over the past few months.

I have used two Samsung Chromebook Plus computers, one of which stays at home and the other at work. One of the best parts of the Chrome OS-only requirement is that, from the user's point of view, everything is always synchronized. I log off the computer at home, come to work, log in, and it's as if I'm on the same computer.

I thought I would simply go through, at a high level, the software I am using to keep everything running.

• Google Slides for presentations - (Cost: free) Most of the time this has been very easy, and it's a smooth transition from PowerPoint. One thing I found very useful is the Chromebook Plus's laser pointer mode for highlighting things on the screen while presenting. I also discovered that, since they use USB-C adapters, I fit right in with the Apple users. I had to figure out the screen-mirroring menus in Chrome OS, but after that it was easy.

• Google Docs / Paperpile for writing - (Cost: free) This works great and has been my workflow, as described in my book, since before the Chromebook experiment began.

• DocHub for signing things - (Cost: $4.99/month, billed annually) I often have to "sign" a document by adding my electronic signature. I used the note-taking function to create a JPEG of my signature, which I can then add to documents in DocHub.

• Overleaf for writing LaTeX - (Cost: free, or $10/month billed annually) This is not necessary for all data scientists, but it has some nice features, including letting collaborators see a grant as I write it.

• Gmail by email - (Cost: free) This is pretty obvious.

• Google Sheets - (Cost: free) This is a choice I had already made before switching to Chromebooks. The googlesheets R package lets you do all sorts of interesting things with Google Sheets.

• DigitalOcean for RStudio - (Cost: $20/month) I set up an RStudio Server instance and run it remotely on DigitalOcean. I currently use the $20/month option, but sometimes scale up or down as needed. One great thing about this setup is that I can pause the instance, expand the computing infrastructure, restart, and everything is just as I left it, but with more computational power. So I can use the extra power for a few hours as needed and then scale back down. I use the terminal in RStudio for most of my code management etc. on GitHub.

• Google Hangouts for video conferencing - (Cost: free) This is the default, but honestly I wish I had a better option. I often find it finicky and slow to work with, but it's still better than Skype. I would be open to suggestions on that front.

• Slack for communication - (Cost: $6.67/month) Several different teams here at JHU and across the country use Slack for group communication. I use it through the web browser, although the Chromebook Plus allows you to install Android applications.

• Google Music to listen to music/podcasts (Cost: $ 10 / month) This is an unnecessary expense, but I like to have something to listen to while working.

• Tweetdeck for Twitter - (Cost: free) I manage a few accounts and do so through the browser. Most of the time this works very well.

So my total monthly cost comes to approximately $35 per month for various cloud services. At first, working this way was like writing a haiku: I could still write, but the constraints made me think hard about how I did things. But after a while I became so accustomed to it that this way of working now feels natural, and I do not miss my Apple products (which are really expensive).

The biggest headaches were:

• Wi-Fi connectivity issues: not as big a problem as I expected; most places where I work have Wi-Fi, and it is mostly fine. When I have problems, I tether to my phone.

• Wi-Fi networks blocking my DigitalOcean server: this has been more of a headache. I think that if I used a custom domain for the web server instead of the raw IP address, I could avoid it. Again, when I have problems, I tether to my phone.

• httr and RStudio Server: when I need to log in to sites I can run into problems, but if I set the httr_oob_default option to TRUE (documentation here), the OAuth process generates a code that I can paste into my server session.

Beyond that, it's been pretty simple to do just about everything I need. Stay tuned, because this experiment inspired a broader effort we're making with Chromebooks here at the JHU Data Science Lab. To hear about this effort as it progresses, subscribe to our weekly newsletter and be the first to see new announcements.

