Surrogate model-based Derivative-Free Optimization

Surrogate model

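$$\ell(\theta) = m(\theta) + e(\theta)$$

A surrogate model is a cheap model that approximates an expensive objective function.

- $\ell(\theta)$ : upper-level objective (the true objective function, expensive to evaluate)
- $m(\theta)$ : surrogate model (cheap approximation)
- $e(\theta)$ : noise (residual)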

Optimization steps

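Choosing a hyperparameter $\theta$ and training the model with it yields a test objective value $\ell(\theta, w^*(\zeta), \mathcal{D}_{\text{test}})$. This evaluation is slow, and the training randomness $\zeta$ makes it noisy. So instead of optimizing $\ell(\theta)$ directly, the goal is to build an approximating function (the surrogate model) from a small number of evaluations and use it to optimize more efficiently. The procedure alternates between two steps:

1. Approximate a function
2. Select next hyperparameter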

Radial Basis Functions (RBF) + Stochastic Sampling

radial = distance from a center

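$$m_{\mathrm{RBF}}(\theta) = \sum_{i=1}^{N} \lambda_i \phi(\lVert \theta - \theta_i \rVert) + p(\theta)$$

- $\theta_i$ : hyperparameter values evaluated so far
- $\ell_i$ : mean (expected value) of the test objective obtained at $\theta_i$
- $\phi$ : a function that depends only on the distance from the center $\theta_i$ (Gaussian kernel, cubic $r^3$, etc.)
- $\lambda_i$ : coefficient that sets how strongly the center $\theta_i$ influences its neighborhood
- $p(\theta) = b_0 + b^\top \theta$ : captures the overall linear trend (slope)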

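For this surrogate to work properly, two conditions must hold:

- Interpolation condition: the surrogate must match the observed value at every evaluated point.
  $m_{\mathrm{RBF}}(\theta_i) = \ell_i, \quad i=1,\dots,N$
- Trend-separation condition: the RBF part must not mix with the linear trend, so the RBF coefficients are required to be orthogonal to the linear basis.
  $P^T \lambda = 0$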

Combining the two conditions gives the following linear system.

$$\begin{bmatrix} \Phi & P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ b \end{bmatrix} = \begin{bmatrix} \ell \\ 0 \end{bmatrix}$$

$$\Phi_{ij} = \phi(\lVert \theta_i - \theta_j \rVert), \quad P = \begin{bmatrix} \theta_1^T & 1 \\ \theta_2^T & 1 \\ \vdots & \vdots \\ \theta_N^T & 1 \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{d} \\ b_0 \end{bmatrix}, \quad \ell = \begin{bmatrix} \mathbb{E}_{\zeta}\!\left[ \ell(\theta_1, w_1^*(\zeta), \mathcal{D}_{\text{test}}) \right] \\[6pt] \mathbb{E}_{\zeta}\!\left[ \ell(\theta_2, w_2^*(\zeta), \mathcal{D}_{\text{test}}) \right] \\[6pt] \vdots \\[6pt] \mathbb{E}_{\zeta}\!\left[ \ell(\theta_N, w_N^*(\zeta), \mathcal{D}_{\text{test}}) \right] \end{bmatrix}$$

- $\Phi_{ij} = \phi(\lVert \theta_i - \theta_j \rVert)$ : RBF value for the distance between each pair of points
- $P$ : matrix whose rows are the coordinates of each point followed by a constant 1
- $b$ : in this system, the $d$ slope coefficients of $p(\theta)$ stacked with the intercept $b_0$
- $\ell$ : mean objective value obtained at each evaluated point

Solving this system determines $\lambda$ and $b$.
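The system above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a cubic kernel $\phi(r) = r^3$ and a small made-up data set; the helper names `fit_rbf` and `predict_rbf` are hypothetical and exist only for this example. It checks both conditions numerically: interpolation at the evaluated points and $P^T \lambda = 0$.

```python
import numpy as np

def fit_rbf(thetas, ell, phi=lambda r: r**3):
    """Solve [[Phi, P], [P^T, 0]] [lambda; b] = [ell; 0] (cubic kernel by default)."""
    thetas = np.asarray(thetas, dtype=float)
    ell = np.asarray(ell, dtype=float)
    N, d = thetas.shape
    # Pairwise distances between all evaluated points
    dists = np.linalg.norm(thetas[:, None, :] - thetas[None, :, :], axis=-1)
    Phi = phi(dists)
    P = np.hstack([thetas, np.ones((N, 1))])          # rows are [theta_i^T, 1]
    A = np.block([[Phi, P], [P.T, np.zeros((d + 1, d + 1))]])
    rhs = np.concatenate([ell, np.zeros(d + 1)])
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N:]                           # lambda, (b_1..b_d, b_0)

def predict_rbf(theta, thetas, lam, b, phi=lambda r: r**3):
    """Evaluate m_RBF(theta) = sum_i lambda_i phi(||theta - theta_i||) + p(theta)."""
    theta = np.asarray(theta, dtype=float)
    r = np.linalg.norm(np.asarray(thetas, dtype=float) - theta, axis=1)
    return lam @ phi(r) + b[:-1] @ theta + b[-1]

# Made-up evaluated points and (averaged) objective values
thetas = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]])
ell = np.array([1.0, 2.0, 0.5, 3.0, 1.2])

lam, b = fit_rbf(thetas, ell)
P = np.hstack([thetas, np.ones((len(thetas), 1))])
fitted = np.array([predict_rbf(t, thetas, lam, b) for t in thetas])

print("max interpolation error:", np.max(np.abs(fitted - ell)))
print("max |P^T lambda|:", np.max(np.abs(P.T @ lam)))
```

Both printed values should be at the level of floating-point rounding. The solve assumes the evaluated points are distinct and do not all lie on one line; otherwise the matrix becomes singular.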

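Note (stochastic sampling)

The $\mathbb{E}_\zeta[\cdot]$ inside the vector $\ell$ (stochastic sampling)

In practice the expectation $\mathbb{E}_\zeta[\cdot]$ cannot be computed directly, so the same hyperparameter $\theta_i$ is run several times and the results are averaged.

$$\ell_i \approx \frac{1}{M_i}\sum_{m=1}^{M_i}\ell(\theta_i, w_i^*(\zeta_{i,m}), \mathcal{D}_{\text{test}})$$

- Averaging reduces the noise caused by training randomness.
- Important candidates are sometimes given more repetitions so that their mean is estimated more accurately.
- When the noise is severe, "interpolation" is replaced by "regression", which allows the surrogate to deviate slightly from the data (for example, a ridge-style penalty).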
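The effect of averaging can be illustrated with a small simulation. This is a sketch under stated assumptions: `noisy_objective` is a hypothetical stand-in for one training run (a quadratic plus Gaussian noise with standard deviation 0.1), not the real training pipeline. With $M_i$ independent repetitions, the standard deviation of the averaged estimate should shrink by roughly $1/\sqrt{M_i}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_objective(theta, rng):
    """Hypothetical single run: true value sum(theta^2) plus seed-dependent noise."""
    return float(np.sum(np.asarray(theta) ** 2) + 0.1 * rng.standard_normal())

def averaged_objective(theta, n_repeats, rng):
    """Estimate E_zeta[l(theta)] by the sample mean over n_repeats runs."""
    return float(np.mean([noisy_objective(theta, rng) for _ in range(n_repeats)]))

theta = np.array([0.5, -0.5])   # true objective value is 0.5

# Repeat each estimator many times to measure its spread
single = np.array([averaged_objective(theta, 1, rng) for _ in range(2000)])
avg16 = np.array([averaged_objective(theta, 16, rng) for _ in range(2000)])

print("std with M=1 :", single.std())
print("std with M=16:", avg16.std())
```

With 16 repetitions the spread should be about a quarter of the single-run spread, at 16 times the evaluation cost, which is why extra repetitions are usually reserved for promising candidates.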

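The full optimization loop is:

1. Pick a few hyperparameters $\theta_i$ and compute the true objective values $\ell_i$
2. Solve the system above to build the surrogate $m_{\mathrm{RBF}}$
3. Use this model to select a new $\theta$
4. Evaluate that $\theta$ for real and go back to step 2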

Selecting the next evaluation point

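Once the surrogate is built, the next $\theta$ could be chosen by simply minimizing it, but that tends to keep digging into regions that have already been explored. The following procedure therefore accounts for both exploration and exploitation:

1. Local search: perturb each coordinate of the best point found so far by ±(a small random amount) to generate $M$ candidates
2. Global search: in addition, sample $M$ more candidates at random from the whole search space
3. For each of the $2M$ candidates, compute two scores:
   - RBF score: the value predicted by the surrogate $m_{\mathrm{RBF}}$
     $s_{\text{RBF}}(x_i) = \sum_{j=1}^N \hat{\lambda}_j \phi(\lVert x_i - \theta_j \rVert) + p(x_i)$
   - Distance score: how far the candidate is from the points evaluated so far
     $\Delta(x_i, \Theta) = \min_{j=1,\dots,N} \lVert x_i - \theta_j \rVert$
4. Scale both scores to the range 0 to 1, with the minima and maxima taken over the $2M$ candidates:
   $V_\Delta(x_i) = \frac{\Delta_{\max} - \Delta(x_i,\Theta)}{\Delta_{\max} - \Delta_{\min}}, \quad V_s(x_i) = \frac{s_{\text{RBF}}(x_i) - s_{\min}}{s_{\max} - s_{\min}}$
   Both scaled scores are lower-is-better: $V_\Delta$ is 0 for the candidate farthest from the evaluated points, and $V_s$ is 0 for the candidate with the lowest predicted value.
5. Combine them in a weighted sum:
   $V(x_i) = \omega V_\Delta(x_i) + (1-\omega) V_s(x_i)$
   - $\omega$ controls the balance between exploration and exploitation (large $\omega$ favors exploration, small $\omega$ favors exploitation)
6. Select the candidate with the smallest $V(x_i)$ as the next evaluation point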

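As a result, the next point is not simply the surrogate minimum; the two scores play complementary roles:

- RBF score → a good location according to the surrogate learned so far (exploitation)
- Distance score → an attempt to explore a new region (exploration)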
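A tiny worked example makes the role of $\omega$ concrete. The sketch below assumes three candidates with made-up surrogate predictions and minimum distances; `weighted_score` is a hypothetical helper that applies the scaling and weighted sum described above (it assumes the candidates do not all share the same score, so the denominators are non-zero).

```python
import numpy as np

def weighted_score(s_pred, min_dist, omega):
    """V = omega * V_delta + (1 - omega) * V_s, both scaled to [0, 1]; lower is better."""
    s_pred = np.asarray(s_pred, dtype=float)
    min_dist = np.asarray(min_dist, dtype=float)
    v_s = (s_pred - s_pred.min()) / (s_pred.max() - s_pred.min())
    v_d = (min_dist.max() - min_dist) / (min_dist.max() - min_dist.min())
    return omega * v_d + (1.0 - omega) * v_s

# Candidate 0: best prediction but close to known points
# Candidate 1: worst prediction but farthest from known points
s_pred = [1.0, 3.0, 2.0]
min_dist = [0.5, 2.0, 1.0]

explore = weighted_score(s_pred, min_dist, omega=0.7)
exploit = weighted_score(s_pred, min_dist, omega=0.2)

print("omega=0.7:", explore, "-> pick", int(np.argmin(explore)))
print("omega=0.2:", exploit, "-> pick", int(np.argmin(exploit)))
```

With $\omega = 0.7$ the far-away candidate 1 wins, and with $\omega = 0.2$ the low-prediction candidate 0 wins.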

Example (RBF + Stochastic Sampling python example)

import numpy as np

# -------------------------
# Objective function (example)
# -------------------------
def objective(x):
    """Noisy quadratic."""
    return np.sum(np.array(x)**2) + np.random.randn()*0.1

# -------------------------
# RBF surrogate
# -------------------------
class RBF_Surrogate:
    def __init__(self, phi="cubic"):
        if phi == "gaussian":
            self.phi = lambda r: np.exp(-(r**2))
        elif phi == "linear":
            self.phi = lambda r: r
        else:  # cubic
            self.phi = lambda r: r**3

    def fit(self, X, y):
        self.X = np.array(X)
        self.y = np.array(y)
        N, d = self.X.shape

        # Phi[i, j] = phi(||x_i - x_j||)
        Phi = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                Phi[i, j] = self.phi(np.linalg.norm(self.X[i] - self.X[j]))

        # P: coordinates of each point with a constant column appended
        P = np.hstack([self.X, np.ones((N, 1))])
        A = np.block([
            [Phi, P],
            [P.T, np.zeros((d+1, d+1))]
        ])
        b = np.concatenate([self.y, np.zeros(d+1)])

        sol = np.linalg.solve(A, b)
        self.lmbda = sol[:N]   # RBF coefficients
        self.beta = sol[N:]    # slope coefficients followed by the intercept

    def predict(self, X_new):
        X_new = np.atleast_2d(X_new)
        y_pred = np.zeros(len(X_new))

        for k, xk in enumerate(X_new):
            rbf_sum = 0
            for i in range(len(self.X)):
                r = np.linalg.norm(xk - self.X[i])
                rbf_sum += self.lmbda[i] * self.phi(r)
            trend = np.dot(self.beta[:-1], xk) + self.beta[-1]
            y_pred[k] = rbf_sum + trend
        return y_pred

# -------------------------
# Data management class
# -------------------------
class DataStore:
    def __init__(self, dim, m_init=5, bounds=(-2, 2)):
        self.dim = dim
        self.bounds = bounds
        self.m = m_init  # number of evaluated points

        self.S = np.random.uniform(bounds[0], bounds[1], size=(m_init, dim))
        self.Y = np.array([objective(s) for s in self.S])
        self.rbf = RBF_Surrogate()
        self.update_rbf()

    def update_rbf(self):
        self.rbf.fit(self.S, self.Y)

    def add_point(self, x):
        y = objective(x)
        self.S = np.vstack([self.S, x])
        self.Y = np.append(self.Y, y)
        self.m += 1
        self.update_rbf()

# -------------------------
# Candidate scoring (RBF prediction + distance)
# -------------------------
def RBF_score(candidates, data, omega=0.5):
    s = data.rbf.predict(candidates)
    dist = np.array([np.min(np.linalg.norm(data.S - c, axis=1)) for c in candidates])

    # Scale to 0~1; both scaled scores are lower-is-better
    s_scaled = (s - s.min()) / (s.max() - s.min() + 1e-8)
    d_scaled = (dist.max() - dist) / (dist.max() - dist.min() + 1e-8)

    score = omega*d_scaled + (1-omega)*s_scaled
    return score

# -------------------------
# Optimization loop
# -------------------------
def run_rbf(dim=2, n_iter=10):
    np.random.seed(0)
    data = DataStore(dim=dim)

    for it in range(n_iter):
        best_idx = np.argmin(data.Y)
        best_x = data.S[best_idx]

        # Generate candidates (local perturbations are clipped to the search bounds)
        local_candidates = best_x + np.random.uniform(-0.5, 0.5, size=(20, dim))
        local_candidates = np.clip(local_candidates, data.bounds[0], data.bounds[1])
        global_candidates = np.random.uniform(data.bounds[0], data.bounds[1], size=(20, dim))
        candidates = np.vstack([local_candidates, global_candidates])

        # Compute scores
        score = RBF_score(candidates, data, omega=0.5)

        # Select the next point
        next_x = candidates[np.argmin(score)]
        data.add_point(next_x)

        print(f"Iter {it}: best_y={data.Y.min():.4f}, next_x={next_x}")

    return data

# Run
if __name__ == "__main__":
    result = run_rbf(dim=2, n_iter=10)

Gaussian Process (GP) + Expected Improvement (EI)

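A Gaussian process (GP) is a probabilistic model that describes the function $\ell(\theta)$ through a mean and a covariance structure.

$$m_{GP}(\theta) = \mu + Z(\theta)$$

- $\mu$ : mean of the stochastic process
- $Z(\theta)$ : GP random term, $Z(\theta) \sim \mathcal{N}(0, \sigma^2)$
- The values $Z(\theta)$ at different locations are correlated, and the correlation depends on the distance between them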

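Note (interpreting the GP prediction)

- Mean $m_{GP}(\theta^{new})$ : the most likely value (expected value) of $\ell(\theta^{new})$, predicted from the data observed so far.
- Variance $s^2(\theta^{new})$ : how uncertain that prediction is.
  - Near observed data points the variance is very small (almost 0), so the prediction is confident.
  - In unexplored regions far from the observed data points the variance is large, so the prediction is highly uncertain.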

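The correlation is usually defined by a kernel function such as the one below, which encodes the assumption that nearby inputs produce similar outputs:

$$\text{Corr}(Z(\theta_k), Z(\theta_l)) = \exp\left(-\sum_{i=1}^d \gamma_i \lvert \theta_k^{(i)} - \theta_l^{(i)} \rvert^{q_i}\right)$$

- $d$ : number of hyperparameter dimensions
- $\gamma_i$ : scaling parameter for each dimension (estimated by maximum likelihood)
- $q_i$ : smoothness parameter (determines the shape of the kernel, usually 1 or 2)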
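The kernel above translates directly into NumPy. The following is a minimal sketch with hand-picked $\gamma_i = 1$ and $q_i = 2$ (in practice $\gamma_i$ would be estimated by maximum likelihood, which is not shown here); `corr_matrix` is a hypothetical helper name.

```python
import numpy as np

def corr_matrix(A, B, gamma, q):
    """Correlation exp(-sum_i gamma_i |a_i - b_i|^q_i) between rows of A and rows of B."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    diff = np.abs(A[:, None, :] - B[None, :, :])      # shape (n_a, n_b, d)
    return np.exp(-np.sum(gamma * diff ** q, axis=-1))

# Two nearby points and one distant point in d = 2 dimensions
thetas = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
gamma = np.array([1.0, 1.0])   # assumed values, not fitted
q = np.array([2.0, 2.0])

R = corr_matrix(thetas, thetas, gamma, q)
print(np.round(R, 4))
```

The nearby pair has a correlation close to 1, while the distant point is almost uncorrelated with both, which is the "nearby inputs give similar outputs" assumption in matrix form.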

Given data $\{(\theta_i, \ell_i)\}_{i=1}^n$, the mean and variance are estimated as:

$$\hat{\mu} = \frac{1^T R^{-1} \ell}{1^T R^{-1} 1}, \quad \hat{\sigma}^2 = \frac{(\ell - 1 \hat{\mu})^T R^{-1} (\ell - 1 \hat{\mu})}{n}$$

- $R$ : $n\times n$ correlation matrix ( $R_{kl} = \text{Corr}(Z(\theta_k), Z(\theta_l))$ )
- $\ell$ : vector of upper-level objective values (the actual evaluations)
- $1$ : all-ones vector of length $n$
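As a quick numerical check of these two estimators, the sketch below evaluates them on two small hand-made cases: an identity correlation matrix (where they reduce to the ordinary sample mean and population variance) and a $2 \times 2$ matrix with correlation 0.5. The function name `estimate_mu_sigma2` is hypothetical.

```python
import numpy as np

def estimate_mu_sigma2(R, ell):
    """Return (mu_hat, sigma2_hat) for correlation matrix R and observations ell."""
    ell = np.asarray(ell, dtype=float)
    n = len(ell)
    ones = np.ones(n)
    # Use linear solves instead of forming R^{-1} explicitly
    mu_hat = (ones @ np.linalg.solve(R, ell)) / (ones @ np.linalg.solve(R, ones))
    resid = ell - mu_hat
    sigma2_hat = (resid @ np.linalg.solve(R, resid)) / n
    return mu_hat, sigma2_hat

# Case 1: uncorrelated points, so the estimators are the plain mean and variance
mu_a, s2_a = estimate_mu_sigma2(np.eye(4), [1.0, 2.0, 3.0, 6.0])

# Case 2: two points with correlation 0.5
R = np.array([[1.0, 0.5], [0.5, 1.0]])
mu_b, s2_b = estimate_mu_sigma2(R, [1.0, 3.0])

print(mu_a, s2_a)   # 3.0 and 3.5
print(mu_b, s2_b)   # 2.0 and 2.0
```

In the second case the residuals $(-1, 1)$ point against the positive correlation, so $\hat{\sigma}^2 = 2$ is larger than the uncorrelated value of 1.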

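Mean (prediction) at a new point $\theta^{new}$:

The prediction is $\hat{\mu}$ plus a weighted sum of the residuals of the existing observations.

$$m_{GP}(\theta^{new}) = \hat{\mu} + r^T R^{-1} (\ell - 1\hat{\mu})$$

$$r = \begin{bmatrix} \text{Corr}(Z(\theta^{new}), Z(\theta_1)) \\ \vdots \\ \text{Corr}(Z(\theta^{new}), Z(\theta_n)) \end{bmatrix}$$

- $r^T R^{-1}$ : the part that acts as the weights
- $r$ : vector of correlations between $\theta^{new}$ and the existing points
- $R$ : correlation matrix among the existing points

In other words, the prediction gives more weight to the function values $\ell_i$ of existing points that are closer to (more strongly correlated with) $\theta^{new}$.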
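The predictor can be checked on a one-dimensional toy problem. This is a sketch assuming fixed kernel parameters $\gamma = 1$, $q = 2$ and three made-up observations; `corr_matrix` and `gp_mean` are hypothetical helper names. Two properties follow from the formula: the prediction reproduces the data at an observed point, and it falls back to $\hat{\mu}$ far from all observations, where $r \approx 0$.

```python
import numpy as np

def corr_matrix(A, B, gamma, q):
    """Correlation exp(-sum_i gamma_i |a_i - b_i|^q_i) between rows of A and rows of B."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(gamma * diff ** q, axis=-1))

def gp_mean(theta_new, thetas, ell, gamma, q):
    """Return (m_GP(theta_new), mu_hat)."""
    R = corr_matrix(thetas, thetas, gamma, q)
    r = corr_matrix(thetas, theta_new, gamma, q)[:, 0]
    ones = np.ones(len(ell))
    mu_hat = (ones @ np.linalg.solve(R, ell)) / (ones @ np.linalg.solve(R, ones))
    weights = np.linalg.solve(R, r)          # R^{-1} r (R is symmetric)
    return mu_hat + weights @ (ell - mu_hat), mu_hat

thetas = np.array([[0.0], [1.0], [2.0]])
ell = np.array([1.0, 0.0, 4.0])
gamma, q = np.array([1.0]), np.array([2.0])   # assumed, not fitted

at_data, mu_hat = gp_mean([1.0], thetas, ell, gamma, q)
far_away, _ = gp_mean([100.0], thetas, ell, gamma, q)

print("prediction at observed point 1.0:", at_data)    # matches ell[1] = 0
print("prediction far away:", far_away, "mu_hat:", mu_hat)
```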

Variance (uncertainty) estimate at a new point $\theta^{new}$:

$$s^2(\theta^{new}) = \hat{\sigma}^2 \left( 1 - r^T R^{-1} r + \frac{(1 - 1^T R^{-1} r)^2}{1^T R^{-1} 1} \right)$$

The key term is $r^T R^{-1} r$, which measures how strongly $\theta^{new}$ is correlated with the existing data points.
- If $\theta^{new}$ is very close to existing points (a region that has already been explored), $r^T R^{-1} r$ approaches 1 and the overall variance $s^2(\theta^{new})$ approaches 0 (uncertainty decreases).
- Conversely, if $\theta^{new}$ is very far from existing points (an unexplored region), this term approaches 0 and the variance grows (uncertainty increases).
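The two limiting cases can be verified numerically. The sketch below reuses the same one-dimensional toy setup as an assumption (fixed $\gamma = 1$, $q = 2$, three made-up observations); `corr_matrix` and `gp_variance` are hypothetical helper names. It evaluates $s^2$ at an observed point, between observations, and far from all of them.

```python
import numpy as np

def corr_matrix(A, B, gamma, q):
    """Correlation exp(-sum_i gamma_i |a_i - b_i|^q_i) between rows of A and rows of B."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(gamma * diff ** q, axis=-1))

def gp_variance(theta_new, thetas, ell, gamma, q):
    """Predictive variance s^2(theta_new) from the formula above."""
    R = corr_matrix(thetas, thetas, gamma, q)
    r = corr_matrix(thetas, theta_new, gamma, q)[:, 0]
    n = len(ell)
    ones = np.ones(n)
    R_inv_r = np.linalg.solve(R, r)
    R_inv_one = np.linalg.solve(R, ones)
    mu_hat = (ones @ np.linalg.solve(R, ell)) / (ones @ R_inv_one)
    resid = ell - mu_hat
    sigma2_hat = (resid @ np.linalg.solve(R, resid)) / n
    s2 = sigma2_hat * (1.0 - r @ R_inv_r + (1.0 - ones @ R_inv_r) ** 2 / (ones @ R_inv_one))
    return max(float(s2), 0.0)   # clamp tiny negative values caused by rounding

thetas = np.array([[0.0], [1.0], [2.0]])
ell = np.array([1.0, 0.0, 4.0])
gamma, q = np.array([1.0]), np.array([2.0])   # assumed, not fitted

s2_at = gp_variance([1.0], thetas, ell, gamma, q)     # observed point
s2_mid = gp_variance([0.5], thetas, ell, gamma, q)    # between observations
s2_far = gp_variance([100.0], thetas, ell, gamma, q)  # far from all data

print(s2_at, s2_mid, s2_far)
```

The variance is essentially zero at the observed point, small but positive between observations, and largest far away, where it is slightly above $\hat{\sigma}^2$ because of the last term in the formula.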

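The full optimization loop is:

1. Evaluate $\ell(\theta)$ at a few initial points
2. Fit the GP model to estimate $m_{GP}(\theta)$ and $s(\theta)$
3. Compute the Expected Improvement $E[I]$ for candidate values of $\theta$
4. Select $\theta^{new} = \arg\max_\theta E[I]$
5. Evaluate it for real, add it to the dataset, and return to step 2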

Expected Improvement (EI)

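To choose the next evaluation point, consider how much a candidate is expected to improve on the best value found so far.

- Current best objective value:
  $\ell^{best} = \ell(\theta^{best})$
- Improvement (zero when the candidate is no better than the current best):
  $I = \max(\ell^{best} - L,\ 0), \quad L \sim \mathcal{N}(m_{GP}(\theta), s^2(\theta))$
  - $L$ : random variable that reflects the uncertainty (normally distributed with mean $m_{GP}(\theta)$ and variance $s^2(\theta)$)
- Expected improvement:
  $E[I] = s(\theta) \left( Z \cdot \Phi(Z) + \phi(Z) \right) \quad \text{where} \quad Z = \frac{\ell^{best} - m_{GP}(\theta)}{s(\theta)}$
  - $\Phi$ : standard normal cumulative distribution function (CDF)
  - $\phi$ : standard normal probability density function (PDF)
  - Here $Z$ is the standardized improvement, not the GP random term $Z(\theta)$, and $\Phi$, $\phi$ are not the RBF matrix and kernel used earlier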
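The closed form can be sanity-checked against a Monte Carlo estimate of $E[\max(\ell^{best} - L, 0)]$. This is a minimal sketch with made-up values for the mean, standard deviation, and current best; the function name `expected_improvement` is hypothetical, and the $s = 0$ case is handled by its limit $\max(\ell^{best} - m_{GP}, 0)$, which is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best):
    """Closed-form EI for minimization; mean and std must have the same shape."""
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    std = np.atleast_1d(np.asarray(std, dtype=float))
    ei = np.maximum(best - mean, 0.0)          # limit of EI as std -> 0
    ok = std > 0
    z = (best - mean[ok]) / std[ok]
    ei[ok] = std[ok] * (z * norm.cdf(z) + norm.pdf(z))
    return ei

mean, std, best = 1.0, 0.5, 1.2               # made-up values
closed_form = expected_improvement(mean, std, best)[0]

# Monte Carlo estimate of E[max(best - L, 0)] with L ~ N(mean, std^2)
rng = np.random.default_rng(0)
samples = rng.normal(mean, std, size=200_000)
monte_carlo = np.maximum(best - samples, 0.0).mean()

print("closed form :", closed_form)
print("Monte Carlo :", monte_carlo)
```

The two numbers should agree to about two decimal places, which confirms that the formula is the expectation of the truncated improvement rather than of $\ell^{best} - L$ itself.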

An optimizer such as a genetic algorithm is then used to find the $\theta$ that maximizes $E[I]$.

Example (GP + EI python example)

import array
import random

import numpy as np
from deap import base, creator, tools, algorithms
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from scipy.stats import norm

# -------------------------
# Objective function (black-box example)
# -------------------------
def objective(x):
    """Noisy quadratic."""
    return np.sum(np.array(x) ** 2) + np.random.randn() * 0.1

# -------------------------
# Simple data store
# -------------------------
class DataStore:
    def __init__(self, dim, m_init=5, bounds=(-2, 2)):
        self.dim = dim
        self.bounds = bounds
        self.m = m_init  # number of evaluated points

        self.S = np.random.uniform(bounds[0], bounds[1], size=(m_init, dim))
        self.Y = np.array([objective(s) for s in self.S])

        # GP surrogate; alpha is set to the noise variance of objective() (0.1**2)
        # so the GP does not try to interpolate the noise exactly
        self.gpr = GaussianProcessRegressor(kernel=RBF(), alpha=0.1**2,
                                            normalize_y=True, random_state=0)
        self.update_gp()

    def update_gp(self):
        # 1-D targets, so predict() returns 1-D mean and std arrays
        self.gpr.fit(self.S, self.Y)

    def add_point(self, x):
        y = objective(x)
        self.S = np.vstack([self.S, x])
        self.Y = np.append(self.Y, y)
        self.m += 1
        self.update_gp()

# -------------------------
# Expected Improvement (negated, because the GA below minimizes)
# -------------------------
def Expected_improvement(x, data):
    x_to_predict = np.array(x).reshape(1, -1)
    mu, sigma = data.gpr.predict(x_to_predict, return_std=True)

    greater_is_better = False
    if greater_is_better:
        loss_optimum = np.max(data.Y[:data.m])
    else:
        loss_optimum = np.min(data.Y[:data.m])

    # -1 for minimization, so that Z = (loss_optimum - mu) / sigma
    scaling_factor = (-1) ** (not greater_is_better)

    with np.errstate(divide='ignore', invalid='ignore'):
        Z = scaling_factor * (mu - loss_optimum) / sigma
        expected_improvement = scaling_factor * (mu - loss_optimum) * norm.cdf(Z) + sigma * norm.pdf(Z)
        expected_improvement[sigma == 0.0] = 0.0
    # DEAP expects fitness values as a tuple
    return (-float(expected_improvement[0]),)

# -------------------------
# DEAP setup
# -------------------------
# Fitness and individual types (created once at import time)
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", array.array, typecode='d', fitness=creator.FitnessMin)

def run_gp_ei(dim=2, n_iter=10):
    data = DataStore(dim=dim)
    low, high = data.bounds

    toolbox = base.Toolbox()

    # Each gene is sampled uniformly within the bounds
    toolbox.register("attr_float", random.uniform, low, high)
    toolbox.register("individual", tools.initRepeat, creator.Individual,
                     toolbox.attr_float, n=dim)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)

    # Evaluate = negative EI; real-valued, bound-respecting variation operators
    toolbox.register("evaluate", Expected_improvement, data=data)
    toolbox.register("mate", tools.cxUniform, indpb=0.5)
    toolbox.register("mutate", tools.mutPolynomialBounded,
                     eta=20.0, low=low, up=high, indpb=0.2)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # -------------------------
    # Optimization loop
    # -------------------------
    for it in range(n_iter):
        pop = toolbox.population(n=20)
        hof = tools.HallOfFame(1)

        algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.3,
                            ngen=15, halloffame=hof, verbose=False)

        next_x = np.array(hof[0])
        data.add_point(next_x)

        print(f"Iter {it}: best_y={data.Y.min():.4f}, next_x={next_x}")

    return data

# Run
if __name__ == "__main__":
    result = run_gp_ei(dim=2, n_iter=10)