The sources of error

The three main sources of error:

Approximation
Statistical
Optimization

Using geometric priors (symmetry, scale separation) we decrease the size/complexity of the hypothesis class (e.g. only admit equivariant NNs), without discarding useful hypotheses.

This should:

- Improve statistical error
- Not worsen approximation error (universality results: we want to be able to approximate any equivariant continuous function)

ref. Cohen, T. (n.d.). AMMI 2022 Course “Geometric Deep Learning” - Lecture 3 (Geometric Priors I) - Taco Cohen. link
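To make the equivariance constraint concrete, here is a minimal NumPy sketch. It is not taken from the lecture: the helper `circular_conv`, the toy signal, and the sizes are illustrative assumptions. It checks that a circular convolution with shared weights commutes with cyclic shifts, while an unconstrained linear layer with many more parameters does not.

```python
import numpy as np

def circular_conv(x, w):
    """Circular cross-correlation: out[i] = sum_k w[k] * x[(i + k) mod n]."""
    n = len(x)
    return np.array([sum(w[k] * x[(i + k) % n] for k in range(len(w)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=8)        # toy 1D signal (assumption)
w = rng.normal(size=3)        # shared filter weights: 3 parameters
W = rng.normal(size=(8, 8))   # unconstrained linear layer: 64 parameters
shift = 3

# Equivariance test: shifting the input should shift the output by the same amount.
conv_equivariant = np.allclose(circular_conv(np.roll(x, shift), w),
                               np.roll(circular_conv(x, w), shift))
dense_equivariant = np.allclose(W @ np.roll(x, shift), np.roll(W @ x, shift))
print(conv_equivariant, dense_equivariant)
```

The convolutional layer lives in a much smaller hypothesis class (3 weights instead of 64), yet it can still represent every shift-equivariant linear map with that filter width, which is the sense in which the prior shrinks the class without discarding useful hypotheses.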

In machine learning, there are many sources of error that can affect the performance of models. The three main sources of error are: approximation, statistical, and optimization error.
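One common way to make this three-way split precise is the following decomposition of the excess risk. It is standard learning-theory background rather than something stated in these notes, and all symbols are assumptions introduced here: $R$ is the expected risk, $f^*$ the best possible predictor, $f_{\mathcal F}$ the best predictor within the hypothesis class $\mathcal F$, $\hat f_n$ the minimizer of the training loss over $\mathcal F$ on $n$ samples, and $\tilde f$ the predictor the optimizer actually returns.

$$
R(\tilde f) - R(f^*) =
\underbrace{R(f_{\mathcal F}) - R(f^*)}_{\text{approximation}}
+ \underbrace{R(\hat f_n) - R(f_{\mathcal F})}_{\text{statistical}}
+ \underbrace{R(\tilde f) - R(\hat f_n)}_{\text{optimization}}
$$

The right-hand side telescopes to the left-hand side, so the three terms account for the whole gap between the trained model and the best possible predictor.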

Approximation error

Approximation error, also known as bias error or model error, is the error introduced when the model approximates a complex function using a simpler one: even the best function the model can represent may differ from the true function. It is inherent to the choice of model class and can be reduced by using a more expressive model or by adjusting the hyperparameters that control its capacity.

The degree of approximation error depends on the complexity of the model used. When the machine learning model is too simple, it may not capture the true underlying pattern in the data, leading to a difference between the predicted output and the actual output. This error cannot be reduced by simply increasing the amount of training data.

For example, a linear regression model may have a high approximation error when trying to predict non-linear relationships between variables, while a neural network model with multiple layers may have a lower approximation error as it can capture more complex patterns.
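A small NumPy sketch of this point follows, under assumed toy settings (a quadratic target on [-1, 1], Gaussian noise, and the helper name `fit_mse` are all illustrative). It fits a line and a quadratic to the same kind of data and shows that adding training data does not help the line.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(-1, 1, 1000)

def fit_mse(n_train, degree):
    """Fit a polynomial of the given degree to noisy samples of y = x**2
    and return its mean squared error against the noise-free target."""
    x = rng.uniform(-1, 1, size=n_train)
    y = x**2 + rng.normal(scale=0.1, size=n_train)
    coeffs = np.polyfit(x, y, deg=degree)
    return float(np.mean((np.polyval(coeffs, x_test) - x_test**2) ** 2))

results = {(n, d): fit_mse(n, d) for n in (50, 5000) for d in (1, 2)}
for (n, d), mse in results.items():
    print(f"n={n:5d} degree={d} mse={mse:.5f}")
```

In this setup the linear model's error stays near a fixed floor no matter how large `n_train` is, because no straight line matches a parabola, while the quadratic model's error shrinks toward zero as data is added.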

It’s important to note that no single architecture fits all purposes. Certain architectures perform better on specific types of data or tasks, but none is superior in every way. For instance, on 2D images a convolutional neural network (CNN) would typically perform better than a recurrent neural network (RNN), since CNNs are designed around the structure of 2D images (such as locality and translation symmetry), while RNNs are not. Conversely, RNNs are designed for sequences and are often a more natural fit for language data.

Reducing approximation error typically involves using a more complex model, choosing a more appropriate architecture, or adjusting the hyperparameters of the model to better fit the data. However, this must be balanced against the risk of overfitting and of increasing other types of error, such as variance error. Regularization works in the opposite direction: by constraining the complexity of the model it reduces variance, usually at the cost of some additional approximation error.

Statistical error

Statistical error refers to the difference between the true underlying distribution of the data and what a machine learning model infers from the finite training sample it was given. It can be caused by bias, variance, overfitting, or underfitting, and can be reduced by adjusting the complexity of the model, regularization, or data augmentation.

Bias occurs when a machine learning model consistently makes predictions that are different from the true values. This can happen when the model is too simple and cannot capture the complexity of the data or when it is trained on a dataset that is not representative of the true population.

Variance occurs when a machine learning model is too sensitive to small fluctuations in the training data. This can happen when the model is too complex and overfits to the training data, leading to poor performance on new data.

Overfitting occurs when a machine learning model becomes too complex and begins to fit the noise in the training data rather than the underlying true pattern. This leads to poor performance on new data that is not part of the training set.

Underfitting occurs when a machine learning model is too simple to capture the underlying pattern in the data, leading to poor performance on both the training and test data.

Reducing statistical error requires finding the right balance between model complexity and the size and representativeness of the training data. Techniques such as regularization, cross-validation, and data augmentation can also help mitigate statistical error.
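The variance described above can be measured directly in a toy setting. The sketch below is only an illustration under assumed choices (a sine target, 15 training points, polynomial degrees 1 and 9, and the helper name `prediction_variance`). It refits a simple and a complex model on many independently drawn training sets and compares how much their predictions fluctuate.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(-1, 1, 200)

def prediction_variance(degree, n_train=15, n_repeats=200):
    """Average variance of a polynomial model's predictions across
    independently drawn small training sets."""
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(-1, 1, size=n_train)
        y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, deg=degree), x_test))
    # Variance over training sets at each test point, averaged over test points.
    return float(np.var(np.array(preds), axis=0).mean())

var_simple = prediction_variance(degree=1)
var_complex = prediction_variance(degree=9)
print(f"degree 1: {var_simple:.4f}   degree 9: {var_complex:.4f}")
```

The high-degree model reacts strongly to which 15 points it happened to see, which is the sensitivity to small fluctuations in the training data; more data or a stronger constraint on the model reduces it.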

Optimization error

Optimization error is the error introduced by the optimization algorithm used to train the model, which attempts to minimize the difference between the predicted output and the actual output on the training data but may not find the best parameters. It can be caused by local minima, convergence speed, regularization, or hyperparameters, and can be reduced by careful selection of optimization algorithms and hyperparameters.

Local minima: Optimization algorithms such as gradient descent can get stuck in local minima, where the model parameters are not optimized for the global minimum and hence lead to suboptimal performance.

Convergence speed: The convergence speed of optimization algorithms can affect the final performance of the model. If convergence is slow, training may be stopped before a good solution is reached, leaving the model underfit; training for too long on a flexible model can instead lead to overfitting.

Regularization: Regularization techniques such as L1, L2, and dropout can help reduce overfitting, but they may also introduce additional optimization errors.

Hyperparameters: The choice of hyperparameters such as learning rate, batch size, and number of epochs can have a significant impact on the optimization error.
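The local-minima issue from the list above can be reproduced on a one-dimensional toy function. This is a minimal sketch: the function `f`, the step size, and the starting points are illustrative assumptions, not part of the original notes. Plain gradient descent ends up in a different minimum depending on where it starts.

```python
def f(x):
    """Non-convex toy objective with two minima."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=500):
    """Plain gradient descent from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_left = gradient_descent(-2.0)   # reaches the global minimum
x_right = gradient_descent(2.0)   # gets stuck in the shallower local minimum
print(f"start -2.0 -> x={x_left:.3f}, f={f(x_left):.3f}")
print(f"start  2.0 -> x={x_right:.3f}, f={f(x_right):.3f}")
```

Both runs converge, but only one finds the lower value of `f`; the gap between the two final losses is optimization error in this toy setting.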

Reducing optimization error requires careful selection of optimization algorithms and hyperparameters. Techniques such as learning rate annealing and momentum can help improve the convergence speed and reduce the likelihood of getting stuck in poor local minima, while early stopping limits overfitting by halting training once validation performance stops improving. Regularization techniques can help reduce overfitting while balancing the optimization error. Finally, tuning hyperparameters through cross-validation can help find a combination of hyperparameters that keeps the optimization error low.
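The sensitivity to the learning rate can be seen even on the simplest convex problem. The sketch below uses an assumed toy objective, f(x) = x², and three arbitrary learning rates; `run_gd` is a hypothetical helper. Every run gets the same fixed budget of steps.

```python
def run_gd(lr, x0=1.0, steps=50):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x and whose minimum is at 0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

# Distance from the minimum after the same number of steps.
finals = {lr: abs(run_gd(lr)) for lr in (0.001, 0.1, 1.1)}
for lr, dist in finals.items():
    print(f"lr={lr}: distance from minimum = {dist:.3g}")
```

With the smallest rate the iterate has barely moved after 50 steps, with the middle rate it is essentially at the minimum, and with the largest rate each step overshoots and the iterate moves away from the minimum. Learning rate schedules aim to get the benefits of large early steps and small late steps.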

Other sources of error include evaluation and sampling errors.

Evaluation error: This error occurs when the model is evaluated using a dataset that is different from the one it was trained on. Overfitting or underfitting can cause evaluation errors, which can be reduced by using cross-validation or an independent test dataset.

Sampling error: This error occurs when the training data is not representative of the true population or is biased towards a specific class or subset. Sampling errors can result in overfitting or underfitting, and can be mitigated by collecting a larger and more diverse dataset.
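Cross-validation, mentioned several times above, can be sketched in a few lines of NumPy. The helper `k_fold_mse`, the polynomial model, and the synthetic data are illustrative assumptions; in practice a library routine would normally be used instead.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Estimate the held-out MSE of a polynomial fit with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on k-1 folds, score on the held-out fold.
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        residuals = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = x**2 + rng.normal(scale=0.1, size=200)   # synthetic data (assumption)

cv_scores = {d: k_fold_mse(x, y, d) for d in (1, 2, 10)}
for d, score in cv_scores.items():
    print(f"degree={d:2d} cv_mse={score:.4f}")
```

Because every point is held out exactly once, the averaged score estimates performance on unseen data without setting aside a separate test set, and comparing the scores across degrees is one way to choose model complexity.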

Reducing these sources of error requires a combination of techniques, such as collecting a representative and diverse dataset, using a suitable model with optimal hyperparameters and regularization techniques, and carefully evaluating the model’s performance on independent datasets.
