A Multi-stage Data Mining Approach for Liquid Bulk Cargo Volume Analysis based on Bill of Lading Data

Suhyeon Kim, Wonho Sohn, Dongcheol Lim, 이정혜 (2021) · Expert Systems with Applications · DOI ↗

항만 액체 벌크 화물 (Liquid Bulk Cargo, LBC) 물동량 분석의 3 단계 multi-stage data mining framework — (1) Item Text-based Segmentation (ITS) — HS code 의 ambiguity + typo 문제 해결을 위한 LBC-specific item dictionary (LBC-ID) + LBC-specific spell checker (LBC-SC, Levenshtein distance 기반); (2) Exploratory volume analysis — geography + timeline + purpose (import/export/transshipment) 의 multi-view 통계; (3) Volume prediction — manifold-learning (pca, lle-locally-linear-embedding, mds-multidimensional-scaling) + regressor (svr-support-vector-regression, random-forest) + deep learning (LSTM many-to-one). ulsan-port BL 데이터 172,802 sample (2007.1-2017.12) 적용. 카테고리 수준 +34%, 부속품 수준 +18% 정확도 개선 vs ARIMA/SARIMA baseline. Smart Port digital transformation 의 decision support system. 이정혜 의 2 기 UNIST → 3 기 TEMEP 전환 의 응용 도메인 다변화 — 의료 AI + 기술 트렌드 + 항만 물류 의 3 축.

RQ: (a) BL (bill of lading) 데이터의 micro-level 정보가 customs/PCS data 보다 LBC volume 분석에 더 효과적인가? (b) HSCS (HS code segmentation system) 의 vague description + typo 문제를 어떻게 해결하는가? (c) Univariate ARIMA/SARIMA baseline 대비 multivariate manifold + deep learning 의 정확도 우위는?
방법론: Stage 1 — Item Text-based Segmentation (ITS) = LBC-ID (category + subcategory + keyword 사전) + LBC-SC (Levenshtein distance 기반 4-step 알고리즘: training set 구성 → edit distance 계산 → candidate set 추출 → token matching). Stage 2 — Exploratory volume analysis (geography + timeline + import/export/transshipment). Stage 3 — Volume prediction: (i) Manifold learning (pca globally linear, lle-locally-linear-embedding locally linear, mds-multidimensional-scaling distance-preserving) × regressor (RBF-kernel svr-support-vector-regression, random-forest) = 6 combinations (PCA-SVR, PCA-RF, LLE-SVR, LLE-RF, MDS-SVR, MDS-RF). (ii) Deep learning: LSTM (many-to-one, 3-rank tensor input: batch × time lag × variables)
데이터: ulsan-port (세계 4 위 LBC 항만, 동북아 oil/gas hub) BL data 2007.1-2017.12, 172,954 → 172,802 LBC sample (152 null 제외), 15,249 unique item text. 변수: Ship name, Call letter, Date, Facility, BL number, Import/Export, Unloading/Loading port + country, HS code, Item text, Weight ton, Volume ton, Consignor. LBC volume = max(weight ton, 0.883 × volume ton). External: 환율, 경제지표 등 13 외부 변수
주요 발견: (i) ITS 가 HSCS 의 typo 해결: LBC-SC 의 Levenshtein algorithm 으로 ‘Cuude’ → ‘Crude’ 같은 typo 수정. Candidate set 추출 + token probability 기반 best match. (ii) 카테고리 수준 prediction 정확도 +34% 평균 개선 vs ARIMA/SARIMA. (iii) 부속품 수준 +18% 평균 개선 — sample 부족 + variability 로 카테고리 수준보다 낮음. (iv) LSTM > Manifold + regressor combinations — sequential time dependency 의 활용. (v) Exploratory analysis 가 국가별, 항만별, 월별, 화물 type 별 hidden pattern 식별 — raw HSCS 가 못 보이는 sub-category 통계
시사점: (a) Smart Port digital transformation 의 정량 기반 — port operator 에게 category + subcategory 별, 지역별, 시간별 예측 제공. (b) Stakeholder (shipper, consignee, port authority, government agent) 의 mid-/long-term port planning 의 decision support. (c) Multivariate technique 가 LBC 의 volatility + 외부 요인 sensitivity (환율, 경제, 국제수지) 반영. (d) Naphtha 0.18 톤 ( $125) → petrochemical$ 9,000 added value 의 경제적 ripple effect 의 정량 anchor (Kim-Ko 2007 인용). (e) 일반 cargo 외 container, gas 등 다른 cargo type 응용 가능

Fig. 2 의 3 단계 framework: (1) Item Segmentation (ITS + LBC-ID + LBC-SC) → (2) Exploratory Volume Analysis (geography + timeline + purpose) → (3) Volume Prediction (manifold learning + LSTM). BL 데이터의 micro-level 정보를 활용해 LBC 의 volatile + non-stationary 특성을 multi-view 로 분석.

요약

본 paper 는 이정혜 의 2 기 후반 / 3 기 초반 의 응용 도메인 다변화 의 핵심 작업. UNIST Department of Industrial Engineering 의 Suhyeon Kim + Wonho Sohn (equal contribution) + Dongcheol Lim + 이정혜 (corresponding) 의 공저. 동기는 port cargo volume 분석의 컨테이너 편향 — 기존 연구 대부분이 container cargo 에 focus, liquid bulk cargo (LBC) 의 정량 분석 공백. LBC 의 직간접 경제적 ripple effect (석유 → 자동차·조선·석유화학) 가 큼에도, volatility + non-stationarity + 외부요인 sensitivity 때문에 분석 어렵다. Customs data + PCS data 가 BL data 의 aggregated form — micro-level (item text, ship name) 정보 손실. HS code system 의 vague description + typo 문제 — wrong code assignment 의 systematic error. 본 paper 는 BL data 의 micro-level information 을 3 단계 data mining 으로 활용해 category + subcategory + multi-view + multivariate 분석.

Stage 1 — Item Text-based Segmentation (ITS). LBC-specific item dictionary (LBC-ID) = category + subcategory + keyword 의 hierarchical 사전. Subcategory 정의 기준: 화학적 특성 + 재료/components + 산업 활용. Keyword 는 item text 에서 추출, 다른 subcategory 와 겹치는 keyword 제외. Subcategory 분리 가능 — 같은 “Oil Products” 카테고리 안에 Light Oil, Gasoline, Naphtha 가 서로 다른 volumetric properties. LBC-specific spell checker (LBC-SC, Algorithm 1) — Levenshtein edit distance 기반 4-step:

Training set 구성: Clean set $C$ = LBC-ID keyword 의 $q$ 회 duplicate; Dirty set $D$ = BL item text 의 typo; $S = \{C, D\}$
Levenshtein distance 계산: 모든 token pair 의 minimum edit distance + token probability $p_v = n_v / n_S$
Candidate set 추출: typo $w$ 에 대해 $A_w^1 = \{u | edit(u, w) = 1\}$ , $A_w^2 = \{u | edit(u, w) = 2\}$ , token probability 내림차순 정렬
Token matching: $A_w$ 의 각 candidate $g$ 가 clean set $C$ 에 있으면 $c_w \leftarrow g$ (highest probability + min distance)

예: ‘Cuude’ → ‘Crude’ (vs ‘Curde’, ‘Cruda’ 의 candidate 중 highest probability).

Stage 2 — Exploratory volume analysis. Category + subcategory × country × period × purpose (import/export/transshipment) 의 multi-view 통계. Hidden pattern — 같은 category 안의 subcategory 별 volumetric 차이, 국가별 dominant subcategory 차이, 월별/계절별 패턴.

Stage 3 — Volume prediction. (i) Manifold learning + regressor: 입력 $\boldsymbol{X} = [\boldsymbol{X}^{(1)}, \ldots, \boldsymbol{X}^{(K)}]$ , $K$ 개 cargo-related variable. Time lag $d$ 의 window 로 $N - d + 1$ obs. Manifold (PCA, LLE, MDS) 로 latent variable $\boldsymbol{l}_z \in R^{N-d+1}$ , $z = 1, \ldots, Z$ 추출. Regressor (RBF-SVR, RF) 로 volume forecast. 6 combinations. (ii) LSTM: many-to-one RNN, 3-rank tensor input (batch $N$ × time lag $d$ × variables $K$ ), 출력 $\hat{x}_{kt}$ . Gradient vanishing 방지로 LSTM 채택 (Gers-Schmidhuber 2001).

Case study: Ulsan port (세계 4 위 LBC 항만), 2007-2017 BL 데이터 172,802 sample. ARIMA/SARIMA baseline 대비 정확도 비교. 결과: 카테고리 수준 +34%, 부속품 수준 +18% 평균 개선. LSTM 이 manifold + regressor 보다 우위 — sequential dependency 활용. 한계: (i) Ulsan port 단일 case — 다른 항만 generalizability 미검증, (ii) BL data 의 consignor-specific typo pattern — 다른 country/port 에서 다를 수 있음, (iii) external variable (환율, 경제) selection 의 ad-hoc, (iv) cross-country trade route 의 bidirectional dynamics 미모형화. 이정혜 author page 의 3 기 응용 다변화 의 port logistics 첫 anchor.

핵심 결과

예측 정확도 개선 (vs ARIMA/SARIMA baseline)

수준	평균 정확도 개선	Note
카테고리 (예: Oil Products, Chemicals)	+34%	manifold + LSTM combined
부속품 (예: Light Oil, Gasoline, Naphtha)	+18%	sample 부족 + variability

Data summary (Ulsan port BL, 2007-2017)

차원	값
총 BL sample	172,954
LBC sample (null 제외)	172,802
Unique item text	15,249
분석 기간	2007.1 - 2017.12 (132 month)
Variables (BL)	18

Manifold + Regressor combinations (Stage 3 평가)

Manifold	Regressor	특징
PCA	SVR / RF	Globally linear
LLE	SVR / RF	Locally linear
MDS	SVR / RF	Distance-preserving
—	LSTM (many-to-one)	Sequential time dependency

정량 결론. LSTM (deep learning) 이 manifold + regressor 보다 일관되게 높은 정확도 — time lag 의 sequential dependency 활용. 카테고리 수준 (+34%) > 부속품 수준 (+18%) 의 sample 크기 차이 반영. Smart Port operator 의 category + subcategory 별 예측 의 정확한 quantitative tool.

방법론 노트

LBC-SC (Algorithm 1) — Levenshtein distance 기반 spell checker:

edit(v, j) = \text{Levenshtein 최소 edit distance}(v, j)

p_v = \frac{n_v}{n_S}

여기서 $n_v$ = token $v$ frequency, $n_S$ = training set size.

Manifold learning-based prediction: 입력 $\boldsymbol{X}^{(k)} = [\boldsymbol{x}_1^{(k)}, \ldots, \boldsymbol{x}_{N-d+1}^{(k)}]^T$ , time variable $\boldsymbol{x}_i^{(k)} = [x_i^{(k)}, \ldots, x_{i+d-1}^{(k)}]^T$ . Manifold ( $\phi$ = PCA, LLE, MDS) 후 latent $\boldsymbol{l}_z \in R^{N-d+1}$ . Regressor $f$ :

\hat{y}_{kt} = f(\boldsymbol{l}_z), \quad f \in \{\text{SVR}_\text{RBF}, \text{Random Forest}\}

LSTM — many-to-one:

\hat{x}_{k,t} = \text{LSTM}(\boldsymbol{X}_{t-d:t-1}), \quad \boldsymbol{X}_{t-d:t-1} \in R^{N \times d \times K}

여기서 $N$ = batch size, $d$ = time lag, $K$ = cargo-related variables 수. Many-to-one = sequence input → single scalar output (time $t$ ).

식별은 (i) 172,802 sample 의 충분한 cross-section + temporal variation, (ii) LSTM 의 gated mechanism 이 long-range dependency 학습, (iii) manifold 의 latent representation 이 curse of dimensionality 회피, (iv) ARIMA/SARIMA baseline 의 univariate 한계 surface. Limitation: (a) Ulsan port 단일 case, (b) external variable 의 ad-hoc selection, (c) hyperparameter (time lag $d$ , latent dimension $Z$ , LSTM hidden units) 의 data-dependent tuning.

연구 계보

본 paper 의 port cargo volume prediction lineage: Jai Sankar et al (2016, Chennai LBC ARIMA), Kim (2008, Busan 3-category SARIMA), Kim-Woo (2017, Ulsan moving average + regression with seasonality), Kim et al (2018, two-way seasonality multiplied regressive). Univariate baseline 의 한계 → multivariate. LBC exploratory analysis lineage: Zhang-Xing (2018, China/India crude oil import), Wang et al (2019, Northern Sea Route 25 port). HS code + BL data lineage: Adland et al (2017), Lee (2020), Guszczak-Mencarelli (2020), Spyridoula (2019). Manifold learning lineage: Jolliffe (2002 PCA), Mead (1992 MDS), Saul-Roweis (2000 LLE), Lin-Zha (2008), Ma-Fu (2011). Regressor: Drucker et al (1997 SVR), Ho (1995 Random Forest). LSTM: Gers-Schmidhuber (2001). Levenshtein: Levenshtein (1966).

TEMEP 내 sibling: 이정혜 의 suhyeon-kim 연속 4 작 의 두 번째 — Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis (text embedding) → 본 paper (multistage data mining) → Risk score-embedded deep learning for biological age estimation: Development and validation (RSAE 의료 AI) → TMF-GNN: Temporal matrix factorization-based graph neural network for multivariate time series forecasting with missing values (graph neural network 시계열). 방법론적 진화: text embedding → multistage segmentation + manifold + LSTM → autoencoder embedding → GNN. Wonho Sohn (equal contribution) 의 항만 물류 라인 후속 연구 (port trade graph + LSTM 등) 의 출발점.