The Arena¶

Every algorithm has a Python implementation and has been run across all datasets, real and synthetic, regression and logistic, with up to 50 hyperparameter configurations per algorithm where applicable.

Each (algorithm, config) run produces one coefficient vector \(\widehat\beta \in \mathbb{R}^p\). Two visualisations are currently available:

Embedding map: unit-normalised vectors \(\widehat\beta/\|\widehat\beta\|\) projected to 2D by t-SNE and UMAP. Solutions from algorithmically similar methods tend to cluster together; hover over a point to see the algorithm name and exact config.
Coefficient heatmap: all runs for one dataset shown as a matrix (rows = runs, columns = feature indices). The top bar shows the mean \(|\widehat\beta_j|\) per feature across runs. Hover over a cell for its exact value.
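
The preprocessing behind both views can be sketched in a few lines. This is only an illustration, not the arena's own code: it assumes runs are stored as plain Python lists, that the norm is Euclidean, and that all-zero vectors are skipped. The helper names `unit_normalise` and `mean_abs_per_feature` are hypothetical. The t-SNE/UMAP projection itself is omitted.

```python
import math

def unit_normalise(beta):
    """Return beta / ||beta||_2, or None for an all-zero vector.

    Assumption: an all-zero solution has no direction, so it is skipped
    rather than embedded.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0:
        return None
    return [b / norm for b in beta]

def mean_abs_per_feature(runs):
    """Mean |beta_j| over all runs, one value per feature column."""
    n_runs = len(runs)
    n_features = len(runs[0])
    return [sum(abs(row[j]) for row in runs) / n_runs for j in range(n_features)]

# Made-up coefficient vectors (rows = runs, columns = feature indices).
runs = [
    [3.0, 0.0, -4.0],   # e.g. a lightly regularised fit
    [0.6, 0.0, -0.8],   # same direction, heavily shrunk
    [0.0, 0.0, 0.0],    # everything shrunk to zero
]

unit_rows = [unit_normalise(r) for r in runs]   # input to the embedding map
top_bar = mean_abs_per_feature(runs)            # top bar of the heatmap
```

The first two runs differ only in scale, so they map to the same point after normalisation. The embedding therefore compares the direction of solutions rather than the amount of shrinkage.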

All algorithms share the interface SolverCls(config={...}).fit(X, y, link=...) -> FitResult.
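
To make the call shape concrete, here is a toy solver that follows the same pattern. Everything beyond the call shape is an assumption made for illustration: the class name `GradientDescentSolver`, the `FitResult` fields, the config keys `lr` and `n_iter`, and the link names `"identity"` and `"logit"` are hypothetical and may differ from the real implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class FitResult:
    """Hypothetical result container; the real FitResult may carry other fields."""
    coef: list
    n_iter: int
    config: dict = field(default_factory=dict)

class GradientDescentSolver:
    """Toy solver exposing SolverCls(config={...}).fit(X, y, link=...)."""

    def __init__(self, config=None):
        # Defaults are illustrative; user-supplied keys override them.
        self.config = {"lr": 0.1, "n_iter": 500, **(config or {})}

    def fit(self, X, y, link="identity"):
        if link not in ("identity", "logit"):
            raise ValueError(f"unsupported link: {link!r}")
        n, p = len(X), len(X[0])
        beta = [0.0] * p
        for _ in range(self.config["n_iter"]):
            grad = [0.0] * p
            for xi, yi in zip(X, y):
                eta = sum(b * x for b, x in zip(beta, xi))
                # Mean response: eta for regression, sigmoid(eta) for logistic.
                mu = eta if link == "identity" else 1.0 / (1.0 + math.exp(-eta))
                # For both links the averaged loss gradient is (mu - y) * x / n.
                for j in range(p):
                    grad[j] += (mu - yi) * xi[j] / n
            beta = [b - self.config["lr"] * g for b, g in zip(beta, grad)]
        return FitResult(coef=beta, n_iter=self.config["n_iter"], config=dict(self.config))

# Noise-free toy data generated by y = 2*x1 - 1*x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
y = [2.0, -1.0, 1.0, 5.0]

result = GradientDescentSolver(config={"lr": 0.1, "n_iter": 2000}).fit(X, y, link="identity")
```

Because every solver returns the same kind of object, the visualisations only need the coefficient vector and the config from each run, whatever algorithm produced them.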

Tip

This page is best explored on a desktop or laptop browser, where you can zoom, pan, and hover. If you want to suggest a dataset or a new arena function, feel free to reach out at chattelion.luo@connect.polyu.hk

Real datasets¶

Diabetes¶

n = 442 · p = 10 · Regression · UCI / sklearn

Breast Cancer¶

n = 569 · p = 30 · Logistic · UCI / sklearn

Digits (>= 5)¶

n = 1 797 · p = 64 · Logistic · NIST / sklearn

Fair Affairs¶

n = 2 000 · p = 8 · Regression · statsmodels

RAND HIE¶

n = 2 000 · p = 9 · Regression · statsmodels

STAR98¶

n = 303 · p = 21 · Logistic · statsmodels

ANES96¶

n = 944 · p = 10 · Logistic · statsmodels

Mode Choice¶

n = 840 · p = 8 · Logistic · statsmodels

Synthetic — regression¶

Sparse¶

n = 2 000 · p = 200 · 20 informative features · Regression

Dense¶

n = 2 000 · p = 100 · all features informative · Regression

Correlated¶

n = 2 000 · p = 150 · effective rank 10 · Regression

High-dimensional¶

n = 2 000 · p = 300 · 30 informative features · Regression

Friedman #1¶

n = 2 000 · p = 10 · nonlinear ground truth · Regression

Synthetic — logistic¶

Dense¶

n = 2 000 · p = 100 · 80 informative features · Logistic

Sparse¶

n = 2 000 · p = 200 · 25 informative features · Logistic

Noisy¶

n = 2 000 · p = 150 · 12 % label noise · Logistic

Datasets¶

Dataset	Source	n	p	Task
Diabetes	UCI / sklearn	442	10	Regression
Breast Cancer	UCI / sklearn	569	30	Logistic
Digits (>= 5)	NIST / sklearn	1 797	64	Logistic
Fair Affairs	statsmodels	2 000	8	Regression
RAND HIE	statsmodels	2 000	9	Regression
STAR98	statsmodels	303	21	Logistic
ANES96	statsmodels	944	10	Logistic
Mode Choice	statsmodels	840	8	Logistic
Synth sparse (p=200)	synthetic	2 000	200	Regression
Synth dense (p=100)	synthetic	2 000	100	Regression
Synth correlated (p=150)	synthetic	2 000	150	Regression
Synth high-dim (p=300)	synthetic	2 000	300	Regression
Synth Friedman #1	synthetic	2 000	10	Regression
Synth logit dense (p=100)	synthetic	2 000	100	Logistic
Synth logit sparse (p=200)	synthetic	2 000	200	Logistic
Synth logit noisy (p=150)	synthetic	2 000	150	Logistic

Real datasets are standardised (zero mean, unit variance). Synthetic datasets are generated with controlled sparsity, rank structure, and noise level.
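
As a rough sketch of these two steps, the snippet below generates a small sparse regression problem and standardises its columns. It does not reproduce the arena's actual generators. The function names, the Gaussian design, the noise level, and the use of the population standard deviation are all assumptions. The sizes are kept small so the example runs quickly.

```python
import random
import statistics

def standardise_columns(X):
    """Scale each column to zero mean and unit variance.

    Assumption: population standard deviation (ddof = 0); constant
    columns are mapped to zero instead of dividing by zero.
    """
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in X
    ]

def make_sparse_regression(n, p, n_informative, noise_sd, seed=0):
    """Gaussian design with only n_informative non-zero true coefficients."""
    rng = random.Random(seed)
    support = rng.sample(range(p), n_informative)   # indices of informative features
    beta = [0.0] * p
    for j in support:
        beta[j] = rng.gauss(0.0, 1.0)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, noise_sd) for row in X]
    return X, y, beta

X, y, beta_true = make_sparse_regression(n=200, p=20, n_informative=5, noise_sd=0.1)
Z = standardise_columns(X)
```

Keeping `beta_true` alongside the data makes it possible to compare each solver's recovered support with the known informative features.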

LIBSVM Benchmark Datasets¶

The following visualisations cover 38 datasets from the LIBSVM data repository, comprising 27,732 solver runs across regression and binary-classification tasks. Datasets range from tiny (Leukemia: n = 38, p = 7,129) to large (Epsilon: n = 80,000, p = 2,000).

Regression¶

Abalone¶

n = 4,177 · p = 8 · Regression · LIBSVM · 862 runs

Body Fat¶

n = 252 · p = 14 · Regression · LIBSVM · 872 runs

California Housing (cadata)¶

n = 20,640 · p = 8 · Regression · LIBSVM · 831 runs

CPU Small¶

n = 8,192 · p = 12 · Regression · LIBSVM · 880 runs

EUNITE 2001 Electricity¶

n = 336 · p = 16 · Regression · LIBSVM · 891 runs

Boston Housing¶

n = 506 · p = 13 · Regression · LIBSVM · 872 runs

Mackey-Glass (mg)¶

n = 1,385 · p = 6 · Regression · LIBSVM · 878 runs

Auto MPG¶

n = 392 · p = 7 · Regression · LIBSVM · 870 runs

Pyrimidines¶

n = 74 · p = 27 · Regression · LIBSVM · 890 runs

Space GA¶

n = 3,107 · p = 6 · Regression · LIBSVM · 870 runs

Triazines¶

n = 186 · p = 60 · Regression · LIBSVM · 884 runs

Year Prediction MSD¶

n = 80,000 · p = 90 · Regression · LIBSVM · 437 runs

Binary Classification¶

Australian Credit¶

n = 690 · p = 14 · Binary logistic · LIBSVM · 781 runs

Breast Cancer (LIBSVM)¶

n = 683 · p = 10 · Binary logistic · LIBSVM · 765 runs

COD-RNA¶

n = 59,535 · p = 8 · Binary logistic · LIBSVM · 711 runs

Colon Cancer¶

n = 62 · p = 2,000 · Binary logistic · LIBSVM · 600 runs

Diabetes (Pima)¶

n = 768 · p = 8 · Binary logistic · LIBSVM · 771 runs

Duke Breast Cancer¶

n = 44 · p = 7,129 · Binary logistic · LIBSVM · 594 runs

Epsilon¶

n = 80,000 · p = 2,000 · Binary logistic · LIBSVM · 280 runs

Fourclass¶

n = 862 · p = 2 · Binary logistic · LIBSVM · 761 runs

German Credit (numerical)¶

n = 1,000 · p = 24 · Binary logistic · LIBSVM · 791 runs

Gisette¶

n = 6,000 · p = 5,000 · Binary logistic · LIBSVM · 367 runs

Heart Disease¶

n = 270 · p = 13 · Binary logistic · LIBSVM · 781 runs

IJCNN1¶

n = 49,990 · p = 22 · Binary logistic · LIBSVM · 631 runs

Ionosphere¶

n = 351 · p = 34 · Binary logistic · LIBSVM · 777 runs

Leukemia¶

n = 38 · p = 7,129 · Binary logistic · LIBSVM · 598 runs

Liver Disorders¶

n = 145 · p = 5 · Binary logistic · LIBSVM · 771 runs

Madelon¶

n = 2,000 · p = 500 · Binary logistic · LIBSVM · 794 runs

Mushrooms¶

n = 8,124 · p = 112 · Binary logistic · LIBSVM · 777 runs

Phishing Websites¶

n = 11,055 · p = 68 · Binary logistic · LIBSVM · 761 runs

Skin/Non-Skin¶

n = 80,000 · p = 3 · Binary logistic · LIBSVM · 701 runs

Sonar¶

n = 208 · p = 60 · Binary logistic · LIBSVM · 778 runs

Splice¶

n = 1,000 · p = 60 · Binary logistic · LIBSVM · 801 runs

SVMguide1¶

n = 3,089 · p = 4 · Binary logistic · LIBSVM · 771 runs

SVMguide3¶

n = 1,243 · p = 22 · Binary logistic · LIBSVM · 775 runs

w8a¶

n = 49,749 · p = 300 · Binary logistic · LIBSVM · 290 runs

a9a¶

n = 32,561 · p = 123 · Binary logistic · LIBSVM · 645 runs

Covertype (binary)¶

n = 80,000 · p = 54 · Binary logistic · LIBSVM · 623 runs

Dataset Summary¶

Dataset	n	p	Task	Runs
Abalone	4,177	8	Regression	862
Body Fat	252	14	Regression	872
California Housing (cadata)	20,640	8	Regression	831
CPU Small	8,192	12	Regression	880
EUNITE 2001 Electricity	336	16	Regression	891
Boston Housing	506	13	Regression	872
Mackey-Glass (mg)	1,385	6	Regression	878
Auto MPG	392	7	Regression	870
Pyrimidines	74	27	Regression	890
Space GA	3,107	6	Regression	870
Triazines	186	60	Regression	884
Year Prediction MSD	80,000	90	Regression	437
Australian Credit	690	14	Binary logistic	781
Breast Cancer (LIBSVM)	683	10	Binary logistic	765
COD-RNA	59,535	8	Binary logistic	711
Colon Cancer	62	2,000	Binary logistic	600
Diabetes (Pima)	768	8	Binary logistic	771
Duke Breast Cancer	44	7,129	Binary logistic	594
Epsilon	80,000	2,000	Binary logistic	280
Fourclass	862	2	Binary logistic	761
German Credit (numerical)	1,000	24	Binary logistic	791
Gisette	6,000	5,000	Binary logistic	367
Heart Disease	270	13	Binary logistic	781
IJCNN1	49,990	22	Binary logistic	631
Ionosphere	351	34	Binary logistic	777
Leukemia	38	7,129	Binary logistic	598
Liver Disorders	145	5	Binary logistic	771
Madelon	2,000	500	Binary logistic	794
Mushrooms	8,124	112	Binary logistic	777
Phishing Websites	11,055	68	Binary logistic	761
Skin/Non-Skin	80,000	3	Binary logistic	701
Sonar	208	60	Binary logistic	778
Splice	1,000	60	Binary logistic	801
SVMguide1	3,089	4	Binary logistic	771
SVMguide3	1,243	22	Binary logistic	775
w8a	49,749	300	Binary logistic	290
a9a	32,561	123	Binary logistic	645
Covertype (binary)	80,000	54	Binary logistic	623
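
The per-task totals implied by this table can be tallied with a short script. The run counts below are copied from the Runs column in table order; the variable names are illustrative.

```python
# Run counts copied from the Dataset Summary table, in table order.
runs_by_task = {
    "Regression": [862, 872, 831, 880, 891, 872, 878, 870, 890, 870, 884, 437],
    "Binary logistic": [
        781, 765, 711, 600, 771, 594, 280, 761, 791, 367, 781, 631, 777,
        598, 771, 794, 777, 761, 701, 778, 801, 771, 775, 290, 645, 623,
    ],
}

n_datasets = sum(len(counts) for counts in runs_by_task.values())
total_runs = sum(sum(counts) for counts in runs_by_task.values())

for task, counts in runs_by_task.items():
    print(f"{task}: {len(counts)} datasets, {sum(counts):,} runs")
print(f"Total: {n_datasets} datasets, {total_runs:,} runs")
# Regression: 12 datasets, 10,037 runs
# Binary logistic: 26 datasets, 17,695 runs
# Total: 38 datasets, 27,732 runs
```

The total agrees with the 38 datasets and 27,732 solver runs quoted at the top of this section.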