
The Arena

Every algorithm has a Python implementation and has been run across all datasets, real and synthetic, regression and logistic, with up to 50 hyperparameter configurations per algorithm where applicable.

Each (algorithm, config) run produces one coefficient vector \(\widehat\beta \in \mathbb{R}^p\). Two visualisations are currently available:

  • Embedding map: unit-normalised vectors \(\widehat\beta/\|\widehat\beta\|\) projected to 2D by t-SNE and UMAP. Algorithmically similar solutions cluster; hover to see the algorithm name and exact config.
  • Coefficient heatmap: all runs for one dataset as a matrix (rows = runs, columns = feature indices). The top bar shows mean \(|\widehat\beta_j|\) per feature. Hover over a cell for its exact value.
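
The two preprocessing steps named above, unit-normalising each run's coefficient vector and averaging \(|\widehat\beta_j|\) per feature, can be sketched with NumPy as follows. This is only an illustration: the function names `unit_normalise` and `mean_abs_per_feature`, the `eps` guard for all-zero vectors, and the toy matrix `B` are assumptions, not the code behind this page.

```python
import numpy as np


def unit_normalise(B, eps=1e-12):
    """Scale each row of B (one run per row) to unit Euclidean norm."""
    B = np.asarray(B, dtype=float)
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    # All-zero coefficient vectors stay zero instead of producing NaNs.
    return B / np.maximum(norms, eps)


def mean_abs_per_feature(B):
    """Mean |beta_j| over runs, one value per feature (the heatmap's top bar)."""
    return np.abs(np.asarray(B, dtype=float)).mean(axis=0)


# Toy example: 3 runs, 3 features (hypothetical values).
B = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 0.0],
              [-1.0, 0.0, 1.0]])
U = unit_normalise(B)
bar = mean_abs_per_feature(B)
```

The rows of `U` could then be handed to a 2D projection such as `sklearn.manifold.TSNE(n_components=2).fit_transform(U)` or `umap.UMAP(n_components=2).fit_transform(U)`. The projection settings used for the maps on this page are not stated here.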

All algorithms share the interface SolverCls(config={...}).fit(X, y, link=...) -> FitResult.
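
To make the shape of this interface concrete, here is a minimal sketch of a toy solver that follows it. Everything beyond the call pattern itself is an assumption for illustration: the class name `GradientDescentSolver`, the config keys `lr`, `n_iter` and `l2`, the link names `"identity"` and `"logit"`, and the fields of `FitResult` are hypothetical and may differ from the actual implementations.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class FitResult:
    """Hypothetical result container; the real FitResult may hold other fields."""
    coef: np.ndarray                      # estimated coefficient vector, shape (p,)
    n_iter: int                           # number of iterations run
    config: dict = field(default_factory=dict)


class GradientDescentSolver:
    """Toy solver with the SolverCls(config={...}).fit(X, y, link=...) shape."""

    def __init__(self, config=None):
        defaults = {"lr": 0.1, "n_iter": 500, "l2": 0.0}  # assumed config keys
        self.config = {**defaults, **(config or {})}

    def fit(self, X, y, link="identity"):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(self.config["n_iter"]):
            eta = X @ beta
            if link == "identity":        # least-squares regression
                mu = eta
            elif link == "logit":         # logistic regression, y in {0, 1}
                mu = 1.0 / (1.0 + np.exp(-eta))
            else:
                raise ValueError(f"unknown link: {link!r}")
            # Gradient of the mean loss plus an optional ridge penalty.
            grad = X.T @ (mu - y) / n + self.config["l2"] * beta
            beta -= self.config["lr"] * grad
        return FitResult(coef=beta, n_iter=self.config["n_iter"],
                         config=dict(self.config))


# Usage on noiseless toy data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true
result = GradientDescentSolver(config={"lr": 0.1, "n_iter": 2000}).fit(X, y, link="identity")
```

Because every solver returns its coefficient vector through the same result object, one loop over (algorithm, config) pairs can collect the vectors into a single runs-by-features matrix for plotting.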


Tip

This page is best explored on a desktop or laptop browser, where you can zoom, pan, and hover. To suggest a dataset or a new arena function, email chattelion.luo@connect.polyu.hk

Real datasets

Diabetes

n = 442 · p = 10 · Regression · UCI / sklearn

Breast Cancer

n = 569 · p = 30 · Logistic · UCI / sklearn

Digits (>= 5)

n = 1 797 · p = 64 · Logistic · NIST / sklearn

Fair Affairs

n = 2 000 · p = 8 · Regression · statsmodels

RAND HIE

n = 2 000 · p = 9 · Regression · statsmodels

STAR98

n = 303 · p = 21 · Logistic · statsmodels

ANES96

n = 944 · p = 10 · Logistic · statsmodels

Mode Choice

n = 840 · p = 8 · Logistic · statsmodels


Synthetic — regression

Sparse

n = 2 000 · p = 200 · 20 informative features · Regression

Dense

n = 2 000 · p = 100 · all features informative · Regression

Correlated

n = 2 000 · p = 150 · effective rank 10 · Regression

High-dimensional

n = 2 000 · p = 300 · 30 informative features · Regression

Friedman #1

n = 2 000 · p = 10 · nonlinear ground truth · Regression


Synthetic — logistic

Dense

n = 2 000 · p = 100 · 80 informative features · Logistic

Sparse

n = 2 000 · p = 200 · 25 informative features · Logistic

Noisy

n = 2 000 · p = 150 · 12 % label noise · Logistic


Datasets

| Dataset | Source | n | p | Task |
|---|---|---|---|---|
| Diabetes | UCI / sklearn | 442 | 10 | Regression |
| Breast Cancer | UCI / sklearn | 569 | 30 | Logistic |
| Digits (>= 5) | NIST / sklearn | 1 797 | 64 | Logistic |
| Fair Affairs | statsmodels | 2 000 | 8 | Regression |
| RAND HIE | statsmodels | 2 000 | 9 | Regression |
| STAR98 | statsmodels | 303 | 21 | Logistic |
| ANES96 | statsmodels | 944 | 10 | Logistic |
| Mode Choice | statsmodels | 840 | 8 | Logistic |
| Synth sparse (p=200) | synthetic | 2 000 | 200 | Regression |
| Synth dense (p=100) | synthetic | 2 000 | 100 | Regression |
| Synth correlated (p=150) | synthetic | 2 000 | 150 | Regression |
| Synth high-dim (p=300) | synthetic | 2 000 | 300 | Regression |
| Synth Friedman #1 | synthetic | 2 000 | 10 | Regression |
| Synth logit dense (p=100) | synthetic | 2 000 | 100 | Logistic |
| Synth logit sparse (p=200) | synthetic | 2 000 | 200 | Logistic |
| Synth logit noisy (p=150) | synthetic | 2 000 | 150 | Logistic |

Real datasets are standardised (zero mean, unit variance). Synthetic datasets are generated with controlled sparsity, rank structure, and noise level.
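
As a rough illustration of both steps, the sketch below standardises a feature matrix column by column and generates a sparse regression problem with the same shape as the "Sparse" synthetic dataset (n = 2 000, p = 200, 20 informative features). The generator is an assumption: the Gaussian design, the Gaussian coefficients on the support, the noise level, and the names `standardise` and `make_sparse_regression` are illustrative and not taken from the arena's own code.

```python
import numpy as np


def standardise(X, eps=1e-12):
    """Centre each column to zero mean and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns become all zeros instead of being divided by ~0.
    std = np.where(std < eps, 1.0, std)
    return (X - mean) / std


def make_sparse_regression(n=2000, p=200, k=20, noise=1.0, seed=0):
    """Hypothetical generator: k of the p coefficients are nonzero."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)   # informative features
    beta[support] = rng.standard_normal(k)
    y = X @ beta + noise * rng.standard_normal(n)
    return X, y, beta


X, y, beta = make_sparse_regression()
Z = standardise(X)
```

Knowing the true `beta` is what makes synthetic datasets useful here: the support recovered by each solver can be compared directly against the informative features.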

LIBSVM Benchmark Datasets

The following visualisations cover 38 datasets from the LIBSVM data repository, comprising 27,732 solver runs across regression and binary-classification tasks. Datasets range from tiny (n = 38, p = 7,129 for Leukemia) to large (n = 80,000, p = 2,000 for Epsilon).

Regression

Abalone

n = 4,177 · p = 8 · Regression · LIBSVM · 862 runs

Body Fat

n = 252 · p = 14 · Regression · LIBSVM · 872 runs

California Housing (cadata)

n = 20,640 · p = 8 · Regression · LIBSVM · 831 runs

CPU Small

n = 8,192 · p = 12 · Regression · LIBSVM · 880 runs

EUNITE 2001 Electricity

n = 336 · p = 16 · Regression · LIBSVM · 891 runs

Boston Housing

n = 506 · p = 13 · Regression · LIBSVM · 872 runs

Mackey-Glass (mg)

n = 1,385 · p = 6 · Regression · LIBSVM · 878 runs

Auto MPG

n = 392 · p = 7 · Regression · LIBSVM · 870 runs

Pyrimidines

n = 74 · p = 27 · Regression · LIBSVM · 890 runs

Space GA

n = 3,107 · p = 6 · Regression · LIBSVM · 870 runs

Triazines

n = 186 · p = 60 · Regression · LIBSVM · 884 runs

Year Prediction MSD

n = 80,000 · p = 90 · Regression · LIBSVM · 437 runs

Binary Classification

Australian Credit

n = 690 · p = 14 · Binary logistic · LIBSVM · 781 runs

Breast Cancer (LIBSVM)

n = 683 · p = 10 · Binary logistic · LIBSVM · 765 runs

COD-RNA

n = 59,535 · p = 8 · Binary logistic · LIBSVM · 711 runs

Colon Cancer

n = 62 · p = 2,000 · Binary logistic · LIBSVM · 600 runs

Diabetes (Pima)

n = 768 · p = 8 · Binary logistic · LIBSVM · 771 runs

Duke Breast Cancer

n = 44 · p = 7,129 · Binary logistic · LIBSVM · 594 runs

Epsilon

n = 80,000 · p = 2,000 · Binary logistic · LIBSVM · 280 runs

Fourclass

n = 862 · p = 2 · Binary logistic · LIBSVM · 761 runs

German Credit (numerical)

n = 1,000 · p = 24 · Binary logistic · LIBSVM · 791 runs

Gisette

n = 6,000 · p = 5,000 · Binary logistic · LIBSVM · 367 runs

Heart Disease

n = 270 · p = 13 · Binary logistic · LIBSVM · 781 runs

IJCNN1

n = 49,990 · p = 22 · Binary logistic · LIBSVM · 631 runs

Ionosphere

n = 351 · p = 34 · Binary logistic · LIBSVM · 777 runs

Leukemia

n = 38 · p = 7,129 · Binary logistic · LIBSVM · 598 runs

Liver Disorders

n = 145 · p = 5 · Binary logistic · LIBSVM · 771 runs

Madelon

n = 2,000 · p = 500 · Binary logistic · LIBSVM · 794 runs

Mushrooms

n = 8,124 · p = 112 · Binary logistic · LIBSVM · 777 runs

Phishing Websites

n = 11,055 · p = 68 · Binary logistic · LIBSVM · 761 runs

Skin/Non-Skin

n = 80,000 · p = 3 · Binary logistic · LIBSVM · 701 runs

Sonar

n = 208 · p = 60 · Binary logistic · LIBSVM · 778 runs

Splice

n = 1,000 · p = 60 · Binary logistic · LIBSVM · 801 runs

SVMguide1

n = 3,089 · p = 4 · Binary logistic · LIBSVM · 771 runs

SVMguide3

n = 1,243 · p = 22 · Binary logistic · LIBSVM · 775 runs

w8a

n = 49,749 · p = 300 · Binary logistic · LIBSVM · 290 runs

a9a

n = 32,561 · p = 123 · Binary logistic · LIBSVM · 645 runs

Covertype (binary)

n = 80,000 · p = 54 · Binary logistic · LIBSVM · 623 runs

Dataset Summary

| Dataset | n | p | Task | Runs |
|---|---|---|---|---|
| Abalone | 4,177 | 8 | Regression | 862 |
| Body Fat | 252 | 14 | Regression | 872 |
| California Housing (cadata) | 20,640 | 8 | Regression | 831 |
| CPU Small | 8,192 | 12 | Regression | 880 |
| EUNITE 2001 Electricity | 336 | 16 | Regression | 891 |
| Boston Housing | 506 | 13 | Regression | 872 |
| Mackey-Glass (mg) | 1,385 | 6 | Regression | 878 |
| Auto MPG | 392 | 7 | Regression | 870 |
| Pyrimidines | 74 | 27 | Regression | 890 |
| Space GA | 3,107 | 6 | Regression | 870 |
| Triazines | 186 | 60 | Regression | 884 |
| Year Prediction MSD | 80,000 | 90 | Regression | 437 |
| Australian Credit | 690 | 14 | Binary logistic | 781 |
| Breast Cancer (LIBSVM) | 683 | 10 | Binary logistic | 765 |
| COD-RNA | 59,535 | 8 | Binary logistic | 711 |
| Colon Cancer | 62 | 2,000 | Binary logistic | 600 |
| Diabetes (Pima) | 768 | 8 | Binary logistic | 771 |
| Duke Breast Cancer | 44 | 7,129 | Binary logistic | 594 |
| Epsilon | 80,000 | 2,000 | Binary logistic | 280 |
| Fourclass | 862 | 2 | Binary logistic | 761 |
| German Credit (numerical) | 1,000 | 24 | Binary logistic | 791 |
| Gisette | 6,000 | 5,000 | Binary logistic | 367 |
| Heart Disease | 270 | 13 | Binary logistic | 781 |
| IJCNN1 | 49,990 | 22 | Binary logistic | 631 |
| Ionosphere | 351 | 34 | Binary logistic | 777 |
| Leukemia | 38 | 7,129 | Binary logistic | 598 |
| Liver Disorders | 145 | 5 | Binary logistic | 771 |
| Madelon | 2,000 | 500 | Binary logistic | 794 |
| Mushrooms | 8,124 | 112 | Binary logistic | 777 |
| Phishing Websites | 11,055 | 68 | Binary logistic | 761 |
| Skin/Non-Skin | 80,000 | 3 | Binary logistic | 701 |
| Sonar | 208 | 60 | Binary logistic | 778 |
| Splice | 1,000 | 60 | Binary logistic | 801 |
| SVMguide1 | 3,089 | 4 | Binary logistic | 771 |
| SVMguide3 | 1,243 | 22 | Binary logistic | 775 |
| w8a | 49,749 | 300 | Binary logistic | 290 |
| a9a | 32,561 | 123 | Binary logistic | 645 |
| Covertype (binary) | 80,000 | 54 | Binary logistic | 623 |