Rによるデータ同化と機械学習

榎本剛

2025-08-27

Rとは

Rは統計言語を超越

Gagolewski (2024)

Let’s get one thing straight: R is not just a statistical package. It is a general-purpose, high-level programming language that happens to be very powerful for numerical, data-intense computing activities of any kind. It offers extensive support for statistical, machine learning, data analysis, data wrangling, and data visualisation applications, but there is much more.

Why use R for oceanographic analysis?

With its broad statistical support, R is a natural choice for oceanographers in the biological, chemical and geological sub-disciplines. However, some physical oceanographers have remained attached to Matlab, … Lately, this has been changing, as oceanographers turn to open-source systems such as Python and R. A particular strength of R is its provision of many powerful and well-vetted packages for handling specialized calculations. The oce package is a prime example.

Rは何なのか

思いつきの確認、データ探索、高速プロトタイプの道具
Python + pandas + matplotlibやJulia、MATLABに相当するプログラミング環境
再現性のある科学 (Schwab et al. (2000);Gentleman and Temple Lang (2007);Marwick et al. (2018))

Rの特徴

基本的な数学函数、統計、行列計算、描画を含む。
添字は1始まり。負の添字は削除。
配列array()や行列matrix()は列優先で数式と対応。
（緩い）函数型 func()
Copy on modify
環境構築のしやすさ
- バッケージ CRAN
- WindowsのRtoolsでコンパイル環境まで整う。

Pythonに疲れてない？

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

x = np.linspace(-5, 5, 101)
y = norm.pdf(x)
ax.plot(x, y)
ax.set_title("normal distribution")
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()

curve(dnorm, -5, 5, main="normal distribution", xlab="x", ylab="y")

アンサンブル偏差

R
Python

(x <- matrix(0:11, ncol=4))

     [,1] [,2] [,3] [,4]
[1,]    0    3    6    9
[2,]    1    4    7   10
[3,]    2    5    8   11

(xbar <- apply(x, 1, mean)) # rowMeans(x)

[1] 4.5 5.5 6.5

x - xbar # sweep(x, 1, xbar)

     [,1] [,2] [,3] [,4]
[1,] -4.5 -1.5  1.5  4.5
[2,] -4.5 -1.5  1.5  4.5
[3,] -4.5 -1.5  1.5  4.5

x = np.arange(12).reshape([4, 3]).transpose()
print(x)

[[ 0  3  6  9]
 [ 1  4  7 10]
 [ 2  5  8 11]]

xbar = x.mean(axis=1)
print(xbar)

[4.5 5.5 6.5]

x - xbar[:,None]

array([[-4.5, -1.5,  1.5,  4.5],
       [-4.5, -1.5,  1.5,  4.5],
       [-4.5, -1.5,  1.5,  4.5]])

Rと私

渋谷政昭・柴田里程, 1992: Sによるデータ解析
アンサンブル予報の父、経田さんにRを勧められる。
rglやRMarkdownがきっかけで、2019年頃からRを再び使い始める。
2023年からILASセミナーはPythonからRに。
2024・2025年データ同化夏の学校の課題をRで書く。

Rの環境

CRAN

RStudio

Jupyter

install.packages("IRkernel")
installspec()

Google colaboratory

Rでできること

Rによる気象データサイエンス

テキスト処理 stringi, stringx, stringr
回帰分析、主成分分析 SpectrA
データ同化夏学2024、2025
モデル: 前線形成、Poisson方程式、Lorenz-96
機械学習、画像処理、3D可視化rgl
他の言語とのインターフェース

R中心生活

線型回帰

df <- read.csv("co2_annual_20221026.csv")
lm.co2 <- lm(df$co2.global.mean.ppm. ~ df$year)
plot(df$year, df$co2.global.mean.ppm.,
     main="Global Mean CO2 concentration",
     xlab="year", ylab="CO2 ppm")
abline(lm.co2)

3次元描画

rgl

NCEP再解析

terraとRNetCDF

奥能登豪雨

湿潤函数

Lorenz-96

Code

l96 <- function(x, F) {
  n <- length(x)
  (x[c(2:n, 1)] - x[c(n-1, n, 1:(n-2))]) * x[c(n, 1:(n-1))] - x + F
}

rk4 <- function(f, x, dt, opts) {
  k1 <- f(x, opts)
  k2 <- f(x + 0.5 * dt * k1, opts)
  k3 <- f(x + 0.5 * dt * k2, opts)
  k4 <- f(x + dt * k3, opts)
  x + (k1 + 2 * k2 + 2 * k3 + k4) * dt / 6
}

nj <- 40
nstep <- 1001
x.hist <- matrix(0, nj, nstep)
x <- rnorm(nj)
F <- 8
dt <- 0.05
for (i in 1:nstep-1) {
  x <- rk4(l96, x, dt, F)
  x.hist[,i] <- x
}

t <- seq(0, nstep*dt, length.out=nstep)
filled.contour(1:nj, t, x.hist, nlevel=11, main=paste("Lorenz 96 F=", F),
               ylim=rev(range(t)), xlab="j", ylab="time")

Rで機械学習?

Rの機械学習ライブラリ

CRAN TaskView
e1071: 潜在クラス解析、短時間フーリエ解析、ファジークラスタリング、SVM、最短経路計算、バッグドクラスタリング、単純ベイズ分類、一般化k近傍
kernlab: カーネル学習。分類（SVM）、回帰（分位点回帰、ガウス過程回帰）、クラスタリング（スペクトラルクラスタリング）異常検知、次元削減（カーネルPCA）、最適化（二次計画法ソルバ）
caret: 分類、回帰、訓練

自動微分と数値最適化

torch for R 自動微分を使って
L-BFGSで数値最適化

Code

library(torch)

rosenbrock <- function(x, y, a = 1, b = 100) {
  (a - x)^2 + b * (y - x^2)^2
}

x <- torch_tensor(c(-1, -1), requires_grad = TRUE)

optimizer <- optim_lbfgs(x, line_search_fn = "strong_wolfe")

calc_loss <- function() {
  optimizer$zero_grad()
  value <- rosenbrock(x[1], x[2])
  cat("value is:", as.numeric(value), "\n")
  value$backward()
  value
}

num_iterations <- 2
tol <- 1e-10
xhist <- as.numeric(x)
for (i in 1:num_iterations) {
  cat("\n", "iteration:", i, "\n")
  optimizer$step(calc_loss)
  cat("x=", as.numeric(x), "\n")
  xhist <- rbind(xhist, as.numeric(x))
}


 iteration: 1 
value is: 404 
value is: 62.32629 
value is: 30.0694 
value is: 2.630802 
value is: 1.178554 
value is: 1.15742 
value is: 1.132393 
value is: 1.00142 
value is: 1.091282 
value is: 0.6181912 
value is: 0.8905501 
value is: 0.6098283 
value is: 0.5655912 
value is: 0.3922533 
value is: 0.2774411 
value is: 1.417702 
value is: 0.2190705 
value is: 0.1747326 
value is: 0.1380794 
value is: 0.08087045 
value is: 0.05257415 
value is: 0.1490689 
value is: 0.0400695 
value is: 0.02954894 
value is: 0.01238139 
x= 0.9039377 0.8114879 

 iteration: 2 
value is: 0.01238139 
value is: 0.006821597 
value is: 0.002873288 
value is: 0.001204705 
value is: 0.0006028978 
value is: 4.76946e-05 
value is: 2.687239e-06 
value is: 5.929914e-08 
value is: 1.555293e-09 
value is: 3.588241e-13 
x= 0.9999999 0.9999998

Code

x <- seq(-1, 2, 0.01)
y <- seq(-1, 2, 0.01)
z <- outer(x, y, rosenbrock)

フーリエニューラル演算子

Li et al. (2021)

FNO in torch for R

RからPythonを使う

reticulate

library(reticulate)

np <- import("numpy")
x <- np$load("x.npy")

tensorflow へのインターフェースに利用されている。

Rcpp

RとC++のインターフェース
順行・勾配計算はCoDiPack
フレームワークのオーバーヘッドを回避して高速化
Rパッケージoptimxのnvm
真値: \((r,\sigma, \beta) = (32, 10, 8/3),\, (X, Y, Z) = (1, 3, 5)\)
第一推定値: 30 11 2 1.1 3.3 5.5
torch for R: 62回 32. 10.00 2.667 0.8989 3.128 5.012
nvm: 358回機械精度で厳密に推定

表をきれいに

library(tinytable)
trunc <- c(39, 79, 119, 239, 639, 1279)
nlon <- 3 * (trunc + 1)
nlat <- nlon / 2
df <- data.frame(trunc, nlon, nlat)
tab <- tt(df)
style_tt(tab, align="r")

trunc	nlon	nlat
39	120	60
79	240	120
119	360	180
239	720	360
639	1920	960
1279	3840	1920

Quarto

RMarkdownの進化形

このスライドはMarkdownを拡張した記法で、
Quartoを使って作成。
数式、RやPythonなどコードの実行。
Pandocを使ってHTML、LaTeX、Wordなどに出力。
bibファイルによる文献引用。

まとめ

Rは探索的データ解析に適した統計・描画環境。
海洋・気象・地理データの描画や解析もできる。
統計学者など専門家が作成した高品質なパッケージ。
モデル、データ同化、機械学習のプロトタイプ作成。
Pythonのモジュールの利用やC++による高速化も可能。

参考文献

Gagolewski, M., 2024: Deep R Programming. Zenodo,.

Gentleman, R., and D. Temple Lang, 2007: Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16, 1–23, https://doi.org/10.1198/106186007X178663.

Li, Z., N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, 2021: Fourier Neural Operator for parametric partial differential equations. https://doi.org/10.48550/arXiv.2010.08895.

Marwick, B., C. Boettiger, and L. Mullen, 2018: Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72, 80–88, https://doi.org/10.1080/00031305.2017.1375986.

Schwab, M., N. Karrenbach, and J. Claerbout, 2000: Making scientific computations reproducible. Computing in Science & Engineering, 2, 61–67, https://doi.org/10.1109/5992.881708.