Data science course in Chennai: Innovation in Data Analysis

Introduction

I have heard that it is necessary for data analysts to be creative in their work. But why? Where and how exactly is this creativity exercised?

At one extreme, one might think that a data analyst could easily be replaced by a machine. For a given type of data and a given type of question, there should be a deterministic approach to analysis that does not change. Presumably, this could be coded into a computer program, and data could be fed into the program each time, with a result presented at the end. For a start, this would eliminate the notorious problem of researcher degrees of freedom. If there were substantial institutional knowledge of data analysis, this might be possible. How can each data analysis be so different that a human being is needed to design a solution?

Well, it is not true that every analysis is different. Many power calculations, for example, are identical or very similar and can be automated to some extent. However, exactly how those power calculations are used or interpreted can vary greatly from project to project. Even the same calculation for the same study design can be interpreted differently in different projects, depending on the context. The same holds for other types of analysis, such as regression modelling or machine learning.
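As an illustration of a calculation that is mechanically identical from project to project, here is a minimal sketch of a sample-size (power) calculation for comparing two group means. It assumes the standard normal-approximation formula for a two-sided test; the function name `n_per_group` and the example inputs are illustrative and not taken from any particular study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation).

    effect_size is the standardized difference (difference in means
    divided by the common standard deviation). Illustrative helper only.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed inputs: a medium standardized effect, 5% significance, 80% power.
n = n_per_group(0.5)
print(n)
```

The arithmetic never changes, but what the resulting number means does: the same per-group sample size might be trivial to reach in one project and out of reach in another, and that judgment comes from the context rather than from the formula.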

Innovation is necessary because of the constraints imposed on the analysis by context, resources and audience, all of which we can think of as "external" to the data. The context in which the data are created, the resources (time, money, technology) available to perform the analysis, and the audience to which the results will be presented all play a key role in determining the strategy an analyst develops to analyze the data. The analyst will usually need some creativity to execute a strategy that produces useful results.

The role of context

The context of a problem has a great influence on how we frame a question, how we translate it into a data problem, and how we collect data. The context also allows us to answer questions about why the data look the way they do. The same number for the same type of measurement may have different interpretations depending on the context.

Missing data

Missing data is present in almost all datasets, and the most important question a data analyst can ask when encountering missing data is "Why are the data missing?" The answer is needed to develop an appropriate strategy for dealing with the missing data (e.g., do nothing, impute, etc.). But the data themselves often provide little information about the mechanism behind the missingness; often the mechanism is recorded outside the data, possibly not even written down, but stored in the minds of the people who originally collected the data.

Consider a two-arm clinical trial with an experimental treatment and a placebo. Sometimes experimental treatments have side effects, and people leave the study (or even die) because they cannot tolerate them. The result is more missing data in the experimental arm of the study than in the placebo arm. The data will reveal a differential in the rate of missing data between the arms, since it will be clear that the treatment arm has a higher rate. But the data will not reveal the exact reason people dropped out. Depending on the nature of the study and the question being asked, there may be different ways of dealing with this problem. Imputation may be viable, or perhaps some kind of matching scheme. The exact choice of how to proceed will depend on what external data are available, how much data is missing and how the results will be used, among many other factors.
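The part the data can show on their own, the difference in missingness between arms, is easy to compute. The sketch below assumes each record is an `(arm, outcome)` pair with `None` marking a missing outcome; the helper name `missing_rate_by_arm` and the toy records are invented purely for illustration.

```python
from collections import defaultdict


def missing_rate_by_arm(records):
    """Return the fraction of missing outcomes in each study arm.

    records: iterable of (arm, outcome) pairs; outcome is None if missing.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for arm, outcome in records:
        totals[arm] += 1
        if outcome is None:
            missing[arm] += 1
    return {arm: missing[arm] / totals[arm] for arm in totals}


# Toy records, made up for this example (not real trial data).
toy_records = [
    ("treatment", 4.1), ("treatment", None),
    ("treatment", None), ("treatment", 3.8),
    ("placebo", 5.0), ("placebo", 4.7),
    ("placebo", None), ("placebo", 5.2),
]

rates = missing_rate_by_arm(toy_records)
print(rates)
```

A table like this tells the analyst that something differs between the arms. It says nothing about why, which is exactly the information that has to come from outside the dataset.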

Another example comes from the analysis of particulate matter (PM) air pollution data. Monitors operated by the US EPA typically take measurements once every six days. The reason is that it is expensive to process the filters for PM data, so the 1-in-6-day schedule is a compromise designed to balance cost against the amount of data. Of course, this means that 5 out of every 6 days are missing from the data record, even though the missingness has been introduced deliberately. Again, the data do not say why they are missing. One could easily imagine a scenario in which the monitor fails to record data when PM values are too high or too low, a kind of informative missingness. But in this case, the missing data can be ignored and, in general, do not have much impact on subsequent modelling. In fact, imputation can be detrimental here because it offers little benefit but can greatly increase uncertainty.
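A small sketch makes the "missing by design" point concrete. It assumes a monitor that samples on a fixed every-sixth-day calendar; the start date, the 60-day window and the function name `sampled_days` are arbitrary choices for illustration, not details of any real monitoring network.

```python
from datetime import date, timedelta


def sampled_days(start, n_days, every=6):
    """Dates on which a monitor samples, given a fixed every-Nth-day schedule."""
    return [start + timedelta(days=i) for i in range(0, n_days, every)]


start = date(2018, 1, 1)  # arbitrary start date for the example
n_days = 60               # arbitrary window length

sampled = sampled_days(start, n_days)
missing_fraction = 1 - len(sampled) / n_days

print(len(sampled), "sampled days out of", n_days)
print(f"fraction of days with no measurement: {missing_fraction:.3f}")
```

The gaps here are fixed by the calendar alone and never depend on how high or low the pollution was on a given day. That is the property that makes this kind of missingness ignorable, and it is known only from the schedule, not from the measurements.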

In both cases, the job of the data analyst is to assess the situation, examine the data, learn about the context and why the data are missing (from a subject-matter expert), and then decide on the appropriate path forward. Even between these two scenarios, there is no generic route forward.

The role of the audience

The audience is another important factor, one that mainly influences how we analyze the data and present the results. A useful approach is to think about what final products need to be produced and then work backwards from there. For example, if the "audience" is another algorithm or procedure, the exact nature of the output may not matter as long as it can be properly fed into the next part of the pipeline. The priority will be to ensure that the output is machine readable. Moreover, interpretability may not weigh so heavily, because no human being will look at the output of that part. However, if a person will look at the results, you may want to focus on a modelling approach that allows that person to reason about the data and understand how the data inform the results.
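The contrast between output meant for another program and output meant for a person can be sketched in a few lines. The example assumes the result is a small summary of numeric values; the function names (`summarize`, `for_machine`, `for_human`) and the toy values are hypothetical.

```python
import json


def summarize(values):
    """Compute a small summary of a list of numbers."""
    n = len(values)
    return {"n": n, "mean": sum(values) / n, "min": min(values), "max": max(values)}


def for_machine(summary):
    """Serialize for a downstream program: stable key order, no prose."""
    return json.dumps(summary, sort_keys=True)


def for_human(summary):
    """Phrase the same result as a sentence a reader can reason about."""
    return (
        f"Across {summary['n']} observations the mean was "
        f"{summary['mean']:.1f} (range {summary['min']} to {summary['max']})."
    )


summary = summarize([2, 4, 6, 8])  # toy values, made up for the example
print(for_machine(summary))
print(for_human(summary))
```

Both functions carry the same numbers. Only the second is written with a human reader in mind, and only the first can be parsed reliably by whatever step comes next.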

In an extreme case, if the audience is another data analyst, you may want to do a relatively "light" analysis (maybe just some pre-processing) and then prepare the data in such a way that they can be easily distributed to others to do their own analysis. This could take the form of an R package, a CSV file or something else. Other analysts may not care about your visualizations or fancy models; they would rather have the data for themselves and produce their own results.

A data analyst should make a reasonable assessment of the needs, background, and preferences of the audience that will receive the results of the data analysis. This may require some creative guessing. If the audience is available to the analyst, the analyst should ask questions about how best to present the results. Otherwise, reasonable assumptions must be made, or contingencies (e.g., backup slides, appendices) can be prepared for the presentation itself.

Resources and tools

The data analyst will probably have to work under a set of resource constraints that place limits on what can be done with the data. The first and most important constraint is probably time. Only so many things can be tried in the time available, and some analyses simply take a long time to complete. Therefore, compromises may have to be made unless more time and resources can be negotiated. Tools will also be limited, because certain combinations of models and software may not exist and there may not be time to develop new tools from scratch.

A good data analyst should estimate the time available and determine whether it is sufficient to complete the analysis. If resources are insufficient, the analyst must either negotiate for more or adapt the analysis to fit what is available. Creativity will almost certainly be needed when resource constraints are severe, in order to squeeze as much productivity as possible out of what is available.

Wednesday, October 10, 2018
