Monday, October 15, 2018

The Role of Resources in Data Analysis


Introduction

When you learn about data analysis in school, you do not hear much about the role that resources (time, money, and technology) play in the development of an analysis. This is a conversation that often happens "in the hallway" with advisors or senior mentors. But the available resources play an important role in determining what can be done with a given question and dataset. It is tempting to think the situation is binary: either you have enough resources to do the "right" analysis, or you simply do not do the analysis. But in the real world there are many shades of grey between these two endpoints. There are many situations in data analysis where the optimal approach is not feasible, but it is nevertheless important to do some sort of analysis. So a key skill for a data analyst to master is the ability to reconcile these conflicting demands while still producing something useful.

All analyses must deal with constraints on time and technology, and these often determine the plan for what can be done. For example, the complexity of the statistical model used may be limited by the computing power available to the analyst, the ability to buy more computing power, and the time available to run lengthy Markov chain Monte Carlo simulations. The analysis needed tomorrow will be different from the analysis needed next week, even though the only difference between the two is the time available to get the job done.

Time, money, and technology each affect how a data analysis turns out, in different ways:

• Time. Time is generally the biggest constraint, and it is obviously related to money. However, even if money is plentiful, you cannot buy more time if none is available. Complex analyses often involve many separate parts, and complex data must be validated, checked, and interrogated before confidence can be placed in the results. All of this takes time, and having less time means doing less of all of these things. Similarly, some analyses may require the time of several people if one person cannot fit everything into their schedule. If those additional people are not available, the nature of the analysis that gets done will change.

• Technology. I use the word "technology" broadly to refer to both computational resources and statistical "resources". Some models may be more ideal than others, but characteristics of the dataset (such as its size) may prevent them from being applied. A better analysis might be possible with more computing power, but a constraint on the computing power available will determine which models are feasible and how much additional work gets done. Technological constraints may also relate to the audience receiving the analysis: depending on how sophisticated the audience is, one may adjust the technology used to carry out and present the analysis.

Approximations

Approximation is perhaps the oldest tool that statisticians have in their toolbox for dealing with resource constraints. Often it is easy to write down the exact or ideal solution to a problem, but the computational burden makes that solution impractical to compute. For example, many Bayesian calculations require evaluating complex, high-dimensional integrals that were impossible before the invention of the digital computer. For complex nonlinear problems, a classic trick is to use a linear approximation, perhaps combined with an assumption of asymptotic normality.
In most cases where the computation was intractable, statisticians resorted to (asymptotic) approximations, substituted (sometimes dubious) assumptions for difficult calculations, or chose different methods altogether. The key point is that the harsh realities of real-world resource constraints forced a different approach to analyzing the data. While it may be unsatisfying to use a less-than-ideal approach, it may be equally unsatisfying not to analyze the data at all.
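
As one concrete instance of that trick (my own illustration, not an example from the original work): the delta method linearizes a smooth function g of an estimator and borrows the estimator's asymptotic normality, so that

\[
\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} N(0, \sigma^2)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(\hat{\theta}) - g(\theta)\bigr) \xrightarrow{d} N\bigl(0,\; g'(\theta)^2\,\sigma^2\bigr).
\]

A difficult exact sampling distribution gets replaced by a Normal approximation whose variance requires only a derivative.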

As computing power has increased over the past century, we have slowly been replacing these old approximations and assumptions with computation. There is no need to assume asymptotic normality if we can compute a less restrictive solution with a powerful computer. A simple example is the two-sample permutation test, which is as powerful as a standard t-test but makes no distributional assumptions. The catch, of course, is that old habits die hard, and to this day it can seem harder to code up a computational solution when a simple formula is already at hand.
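
As a rough sketch (in Python; the function name and simulated data are mine, purely for illustration), a two-sample permutation test on the difference in means takes only a few lines:

import numpy as np

def permutation_test(x, y, n_perm=10000, seed=0):
    # Two-sample permutation test on the difference in means.
    # Repeatedly shuffle the pooled data, re-split it into two groups,
    # and count how often the permuted statistic is at least as extreme
    # as the observed one. Returns an approximate two-sided p-value.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = pooled[:len(x)].mean() - pooled[len(x):].mean()
        if abs(stat) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)

# Simulated example: two groups whose true means differ by 0.5
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.5, 1.0, size=50)
print(permutation_test(x, y))

No Normal approximation is needed here; the price is simply the computing time spent on the shuffles, which is exactly the kind of resource trade-off discussed above.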

Cheaper hierarchical modelling

An example from my own work involves hierarchical modelling of air pollution and mortality time series data. In the early 2000s, we had assembled national data on mortality and air pollution in the USA: daily data on mortality and pollution (and many other covariates) in 100 major US cities, covering a period of approximately 14 years. To make efficient use of this huge dataset, the goal was to use a hierarchical model to estimate a "national" association between air pollution and mortality, as well as city-specific estimates of the association. It was a familiar approach that had worked flawlessly on smaller datasets. The "correct" approach would have been to use a Poisson likelihood for each city (to model the mortality count data) and then Normal random effects for the intercept and the air pollution coefficient.

But at that time, we did not have a computer that could fit that model (or, in our case, compute the posterior distributions). So the "right" model was not an option. What we ended up doing was using a Normal approximation to the Poisson likelihood, justified by the rather large samples we had, which allowed a two-stage Normal-Normal model that could be fit without loading all the data into memory at once. Better still, the second stage could be computed in closed form. This is now the standard approach for modelling time series data on air pollution and health across multiple locations, because it is fast, cheap, and easy to understand.
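
To give a flavour of why this is cheap, here is a minimal sketch of a generic two-stage Normal-Normal calculation (not the exact model from that project; the between-city variance is simply fixed here rather than estimated, and all names and numbers are hypothetical):

import numpy as np

def normal_normal_pool(beta_hat, se, tau2):
    # Closed-form second stage of a two-stage Normal-Normal model.
    # beta_hat: first-stage (city-specific) estimates
    # se:       their standard errors
    # tau2:     between-city variance (assumed known here for simplicity)
    w = 1.0 / (se**2 + tau2)                    # inverse-variance weights
    mu_hat = np.sum(w * beta_hat) / np.sum(w)   # pooled "national" estimate
    mu_se = np.sqrt(1.0 / np.sum(w))
    shrink = tau2 / (tau2 + se**2)              # per-city shrinkage factors
    city = shrink * beta_hat + (1.0 - shrink) * mu_hat
    return mu_hat, mu_se, city

# Hypothetical city-level estimates of an air pollution coefficient
beta_hat = np.array([0.8, 1.2, 0.3, 1.0, 0.6])
se = np.array([0.4, 0.5, 0.3, 0.6, 0.2])
print(normal_normal_pool(beta_hat, se, tau2=0.1))

Only a summary per city (an estimate and a standard error) needs to be in memory, and everything reduces to weighted averages. In practice the between-city variance would be estimated from the data, but the closed-form structure is what keeps the computation cheap.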

Integrity

Finally, these resource constraints can affect the reliability of an analysis. In a reliable analysis, what is presented is typically supported by many facts, checks, and details that are not presented. Those other analyses were done, but the analyst decided (probably based on the narrative being built around the data) that they did not meet the threshold for presentation. That said, if anyone asks for those details, they are available. With greater resources, the sum total of things that can be done is larger, which gives us more hope that the things that were not done are orthogonal to the things that were.

With fewer resources, however, there are at least two consequences. First, fewer things can likely be done with the data: fewer data checks, fewer checks of model assumptions, fewer convergence checks, fewer model validations, and so on. This increases the number of things left undone and therefore the likelihood that one of them would have affected the final (presented) results. Second, certain types of analysis may require more computation time or power than is available. In order to present any analysis at all, we may have to resort to "cheaper" approximations or methodologies.

These approximations are not necessarily inaccurate, but they may produce noisier or less-than-optimal results. That said, the other parties involved in the analysis, such as the audience or the user, may prefer that some analysis be done, however suboptimal, rather than no analysis at all. Sometimes the question itself is vague or rough, in which case it is fine for the accompanying analysis to be equally "quick and dirty". Still, analysts have to draw the line between what is a reasonable analysis and what is not, given the resources available.

Although resource constraints can affect the reliability of an analysis, the approximations used to cope with them can sometimes produce benefits. In the air pollution and mortality example above, the approximation we used made the models very fast to fit. The benefit of cheap computation was that the analyst could run many different models to examine the robustness of the results to various confounding factors and to perform important sensitivity analyses. If each model took days to fit, one might have to settle for a single model fit. In other words, resource constraints can produce an analysis that, although approximate, is actually more reliable than the "optimal" analysis.

The work of the analyst

The job of the data analyst is to manage the resources available for an analysis and to produce the best analysis possible subject to the existing constraints. The allocation of resources may not be under the analyst's control, but the job nevertheless requires recognizing what is available and determining whether those resources are sufficient for the task.

A good data analyst minimizes the chance of a major mismatch and continually re-evaluates the resource needs of an analysis as it moves forward. If a large gap appears between what was expected and the reality of the analysis, the analyst must communicate with the others involved (the user, or perhaps the subject-matter expert) to obtain more resources or to modify the analysis plan. Negotiating additional resources or a modified analysis plan requires the analyst to have a good relationship with the various parties involved.
