The best approach is to take a random sub-sample at the beginning of the study, without any particular stratification, and to set it aside for the confirmatory stage. Many scientists are sparing with their data and keep only just enough to fit a model, but nowadays the expense of an extra 25% or so should be accepted, especially when the consequences of the study are medical. This is what Tukey and Mallows call a careful separate diagnostic.
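As an illustration of setting data aside, here is a minimal sketch in Python with NumPy. The function name `split_holdout`, the default fraction of 25% and the fixed seed are assumptions made for this example, not part of the original notes.

```python
import numpy as np

def split_holdout(n_obs, holdout_frac=0.25, seed=0):
    """Return (explore_idx, confirm_idx): a simple random split, no stratification.

    Hypothetical helper for illustration; holdout_frac and seed are assumed values.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_obs)              # random order of all observations
    n_hold = int(round(holdout_frac * n_obs))  # size of the confirmatory sample
    confirm_idx = np.sort(perm[:n_hold])       # set aside, untouched while modelling
    explore_idx = np.sort(perm[n_hold:])       # used for exploration and model fitting
    return explore_idx, confirm_idx

# Example: 200 observations, 25% kept back for the confirmatory stage.
explore_idx, confirm_idx = split_holdout(200)
```

The split is drawn once, before any modelling, so that the confirmatory sample plays no part in the choices made during exploration.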
For instance in Discriminant Analysis
For each observation, repeat the analysis without that observation and check whether it is then correctly classified. The proportion of observations misclassified in this way gives an almost unbiased estimate of the misclassification rate. Cross Validation can thus be used whenever one variable has the particular status of being the one to be explained.
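The leave-one-out procedure just described can be sketched as follows. To keep the example self-contained, a nearest-class-mean rule is used as a simple stand-in for a full discriminant analysis; the function names and the toy data are assumptions made for illustration only.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, x_new):
    """Assign x_new to the class whose training mean is closest (Euclidean distance).

    A simplified stand-in for a discriminant rule, not a full discriminant analysis.
    """
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(means - x_new, axis=1)
    return classes[np.argmin(dists)]

def loo_error_rate(X, y):
    """Leave-one-out estimate of the proportion of badly classified observations."""
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i                        # drop observation i
        pred = nearest_mean_predict(X[keep], y[keep], X[i])
        wrong += int(pred != y[i])                      # count it if misclassified
    return wrong / n

# Toy data (made up for the example): two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
rate = loo_error_rate(X, y)
```

Each observation is classified by a rule that never saw it, which is what keeps the estimate from being optimistic in the way the resubstitution error is.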
And in regression
We want to estimate the prediction error, that is, the error expected when the fitted model is used to predict a new observation.
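A minimal sketch of a leave-one-out estimate of the prediction error in regression is given here. It assumes ordinary least squares with an intercept and squared-error loss; both choices, and the function name `loo_prediction_error`, are assumptions for illustration rather than something stated in these notes.

```python
import numpy as np

def loo_prediction_error(X, y):
    """Leave-one-out mean squared prediction error for least squares with intercept.

    Assumes squared-error loss; X may be a 1-D array or an (n, p) matrix.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept column
    sq_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # fit without observation i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        sq_errors[i] = (y[i] - A[i] @ coef) ** 2   # error on the left-out point
    return sq_errors.mean()

# Toy data (made up for the example): a noisy straight line.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
pe_hat = loo_prediction_error(x, y)
```

Because each point is predicted from a fit that excluded it, this estimate is never smaller than the mean squared residual of the fit on all the data.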
However, cross validation has also been used at the diagnostic stage in principal components, and in classification and regression trees, where it helps choose the size of an `optimal tree'.