How to find the right model?
Before starting a regression analysis, one has to decide on a model. If this seems difficult, it is tempting to leave the selection of the model to a computer program (e.g. TableCurve from SYSTAT). However, we would like to warn against this procedure. For example, if one considers only the class of polynomial functions, one will usually find a function that reproduces the measured data sufficiently well, provided the degree of the polynomial is chosen large enough. However, the parameters of the fitted polynomial can be interpreted only in the rarest cases, so the model is not usable for a scientific evaluation. Exactly this problem also arises when a computer selects the model automatically. The computer has no knowledge of the scientific background of the underlying experiment and therefore cannot take it into account when selecting the model. This knowledge, however, is a prerequisite for the interpretability of the model parameters.
In summary, model selection is not a mathematical or statistical task, but a scientific one. If one wants to explain a physical, chemical or biological relationship, the model must be selected by scientists with the appropriate expert knowledge. After the regression analysis has been performed, the goodness of fit of the model can be evaluated and, if necessary, an extended or different model can be fitted.
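The warning about polynomials can be illustrated with a small sketch. The data are made up for illustration (noise-free samples of an exponential decay with assumed values Span = 100 and K = 0.5), and `lagrange` is a hypothetical helper name. A polynomial of degree n - 1 passes exactly through n data points, yet its coefficients have no physical meaning and it can behave nonsensically outside the measured range.

```python
import math

def lagrange(xs, ys, x):
    """Evaluate the unique polynomial of degree len(xs) - 1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Illustrative data (assumed): exponential decay sampled at six time points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [100.0 * math.exp(-0.5 * x) for x in xs]

# Inside the data range the polynomial reproduces every measurement.
max_error = max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys))

# Outside the data range it is compared with the true decay curve.
poly_at_10 = lagrange(xs, ys, 10.0)
true_at_10 = 100.0 * math.exp(-0.5 * 10.0)

print("largest deviation at the data points:", max_error)
print("polynomial at X = 10:", poly_at_10)
print("decay model at X = 10:", true_at_10)
```

For these values the polynomial predicts a negative amount at X = 10, which no decay process can produce, although it matches every data point.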
Step 2: Selection of parameters to be adjusted, constraints
Once a model has been selected, it is necessary to decide which parameters are to be fitted to the data, in what range they are allowed to vary, and which parameters are set to a fixed value before fitting. Consider again the model for radioactive decay, Y = Span ⋅ exp(−K ⋅ X) + Plateau. It is known that in the limit of large time all radioactive nuclei have decayed. Therefore, in this case, it is appropriate to fix the parameter Plateau = 0 and not to adjust it by the regression. Since the model describes a decay process, the constraint K > 0 is also reasonable, because a negative K would represent a growth process.
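A minimal sketch of how these two decisions might look in code. The function names `decay_model` and `unpack` are hypothetical. Fixing Plateau simply removes it from the list of fitted parameters. For K > 0, the sketch uses one common option that is assumed here and not taken from the text: the optimizer varies log K instead of K, so every value it tries corresponds to a positive K.

```python
import math

PLATEAU = 0.0  # fixed before fitting: at large times everything has decayed

def decay_model(x, span, k, plateau=PLATEAU):
    """Y = Span * exp(-K * X) + Plateau."""
    return span * math.exp(-k * x) + plateau

def unpack(theta):
    """Map the optimizer's free variables to the model parameters.

    theta = [span, log_k]. Because K = exp(log_k) is positive for every
    real log_k, the constraint K > 0 holds automatically. Plateau is not
    part of theta, so the regression cannot change it.
    """
    span, log_k = theta
    return span, math.exp(log_k)

# Example values (assumed): Span = 100 and K = 0.5.
span, k = unpack([100.0, math.log(0.5)])
print("K is positive:", k > 0)
print("model value at X = 0:", decay_model(0.0, span, k))
```

Many fitting routines accept lower and upper bounds directly, in which case the reparametrization is unnecessary.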
Step 3: Selection of start values
Nonlinear regression is an iterative process. Therefore, it is necessary to assign start values to the parameters to be adjusted. The choice of start values can be of great importance, because the iteration may fail to converge if they are poorly chosen. Once start values have been selected, it is recommended to draw the initial model over the given data to verify that the initial curve lies at least roughly near the data.
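One possible way to obtain start values for the decay model with Plateau = 0 is sketched below. It is a heuristic assumed for illustration, not a method prescribed by the text: since ln Y = ln Span − K ⋅ X is a straight line, an ordinary linear fit through (X, ln Y) yields rough guesses for Span and K. The measurements and the helper name `start_values` are hypothetical. The printed table is a numerical stand-in for drawing the initial curve over the data.

```python
import math

# Hypothetical measurements (time, counts); the values are made up.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [98.0, 62.0, 36.0, 23.0, 13.0, 9.0]

def start_values(xs, ys):
    """Rough guesses for Span and K from a straight line through (X, ln Y)."""
    logs = [math.log(y) for y in ys]  # requires every Y to be positive
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    slope = sxl / sxx
    intercept = mean_l - slope * mean_x
    return math.exp(intercept), -slope  # Span0, K0

span0, k0 = start_values(xs, ys)
print(f"start values: Span = {span0:.1f}, K = {k0:.3f}")

# Compare the initial model with the data before starting the iteration.
for x, y in zip(xs, ys):
    initial = span0 * math.exp(-k0 * x)
    print(f"X = {x:3.1f}  measured = {y:6.1f}  initial model = {initial:6.1f}")
```

If the initial model column were far from the measured column, the start values should be revised before running the regression.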
Step 4: Execution of the analysis and interpretation of the results
Once you have run the nonlinear regression, here are some things to consider:
- Does the fitted model describe the data well? To answer this question, it is occasionally sufficient to look at the graph of the function and the data. For example, if one has chosen the wrong model, the parameter values to which the iteration converges may have little to do with the data. A similar thing can happen if the start values are poorly chosen. In addition, there are statistical tests to evaluate the goodness of fit.
- Are the fitted parameters plausible? The computer doing the fitting has no knowledge of the scientific meaning of the parameters. Therefore, the first thing to check is whether the calculated parameters are plausible in the sense of being scientifically interpretable. If, for example, the fit of the radioactive decay yields a parameter Span < 0, this may give the best fit statistically, but physically it makes no sense, since Span represents the number of undecayed nuclei at time t = 0. A scientifically nonsensical result of this kind must be discarded. An additional constraint and a renewed analysis may then lead to a meaningful result.
- How precise are the parameters? As with any statistical point estimator, the associated confidence intervals are of utmost importance for the values calculated by nonlinear regression. Usually, in addition to the estimates of the parameters, their standard error (the standard deviation of the point estimator) and the 95% confidence interval are given. If the interval is relatively narrow, the estimate is relatively reliable; otherwise, the estimate should be viewed with great caution.
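The questions above can be made concrete with a small sketch. It fits Y = Span ⋅ exp(−K ⋅ X) with Plateau fixed at 0 to hypothetical data using a plain Gauss-Newton iteration, which is one of several possible algorithms and is chosen here only because it fits in a few lines. The data, the start values and the function name `normal_equations` are assumptions. The standard errors come from the usual linear approximation s² ⋅ (JᵀJ)⁻¹, so the confidence intervals are approximate.

```python
import math

# Hypothetical measurements (time, counts); the values are made up.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [98.0, 62.0, 36.0, 23.0, 13.0, 9.0]

# Two-sided 95% quantile of Student's t for n - p = 6 - 2 = 4 degrees of freedom.
T_975_DF4 = 2.776

def normal_equations(span, k):
    """Return the entries of J^T J, of J^T r, and the residual sum of squares."""
    a = b = d = g1 = g2 = ssr = 0.0
    for x, y in zip(xs, ys):
        e = math.exp(-k * x)
        r = y - span * e          # residual
        j_span = e                # derivative of the model w.r.t. Span
        j_k = -span * x * e       # derivative of the model w.r.t. K
        a += j_span * j_span
        b += j_span * j_k
        d += j_k * j_k
        g1 += j_span * r
        g2 += j_k * r
        ssr += r * r
    return a, b, d, g1, g2, ssr

span, k = 100.0, 0.5              # start values (assumed)
for _ in range(50):
    a, b, d, g1, g2, _ssr = normal_equations(span, k)
    det = a * d - b * b
    step_span = (d * g1 - b * g2) / det   # solve the 2x2 system (J^T J) step = J^T r
    step_k = (a * g2 - b * g1) / det
    span += step_span
    k += step_k
    if abs(step_span) < 1e-10 and abs(step_k) < 1e-10:
        break

# Precision: standard errors and approximate 95% confidence intervals.
a, b, d, _, _, ssr = normal_equations(span, k)
det = a * d - b * b
s2 = ssr / (len(xs) - 2)                  # residual variance
se_span = math.sqrt(s2 * d / det)
se_k = math.sqrt(s2 * a / det)
ci_span = (span - T_975_DF4 * se_span, span + T_975_DF4 * se_span)
ci_k = (k - T_975_DF4 * se_k, k + T_975_DF4 * se_k)

# Plausibility: both parameters must be positive to be physically meaningful.
plausible = span > 0 and k > 0

print(f"Span = {span:.2f} +/- {se_span:.2f}, 95% CI {ci_span[0]:.2f} to {ci_span[1]:.2f}")
print(f"K    = {k:.4f} +/- {se_k:.4f}, 95% CI {ci_k[0]:.4f} to {ci_k[1]:.4f}")
print("parameters plausible:", plausible)
```

In practice a library routine such as `scipy.optimize.curve_fit` would normally perform the iteration and return the parameter covariance; the hand-written loop only serves to show where the numbers come from.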
Step 5: Checking assumptions
Every regression analysis is based on certain assumptions. Therefore, it must be checked whether these are fulfilled:
- X is deterministic, the variation is entirely in Y.
- The dispersion in Y follows a known (usually normal) distribution for fixed X.
- The scatter in Y is the same regardless of X.
- The observations are independent.
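Two of these assumptions can be examined informally on the residuals of the fit. The sketch below is an illustration under assumptions, not a complete diagnostic: the residuals are made up, `spread_ratio` is a hypothetical helper that compares the scatter in the lower and upper half of the X range, and `durbin_watson` computes the Durbin-Watson statistic, for which values near 2 indicate no first-order correlation between neighbouring residuals.

```python
import statistics

# Hypothetical residuals (measured minus fitted), ordered by increasing X.
residuals = [0.4, 0.9, -1.1, -0.3, 1.2, -0.8, -0.6, 0.5, 0.2, -0.4]

def spread_ratio(res):
    """Standard deviation in the upper half of X divided by that in the lower half.

    A ratio far from 1 hints that the scatter in Y depends on X.
    """
    half = len(res) // 2
    return statistics.stdev(res[half:]) / statistics.stdev(res[:half])

def durbin_watson(res):
    """Durbin-Watson statistic; it always lies between 0 and 4."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(r * r for r in res)
    return num / den

print("spread ratio (upper half / lower half):", spread_ratio(residuals))
print("Durbin-Watson statistic:", durbin_watson(residuals))
```

A plot of the residuals against X remains the most direct check; the two numbers only summarize what such a plot would show.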