The Arena¶
Every algorithm has a Python implementation and has been run on every dataset (real and synthetic, regression and logistic), with up to 50 hyperparameter configurations per algorithm where applicable.
Each (algorithm, config) run produces one coefficient vector \(\widehat\beta \in \mathbb{R}^p\). Two visualisations are currently available:
- Embedding map: unit-normalised vectors \(\widehat\beta/\|\widehat\beta\|\) projected to 2D by t-SNE and UMAP. Algorithmically similar solutions cluster; hover to see the algorithm name and exact config.
- Coefficient heatmap: all runs for one dataset as a matrix (rows = runs, columns = feature indices). The top bar shows the mean \(|\widehat\beta_j|\) per feature, taken over runs. Hover for the exact value.
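The preprocessing behind both views can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the site's actual pipeline: the function names `unit_normalise` and `mean_abs_per_feature` are hypothetical, and so is the choice to leave all-zero coefficient vectors at zero. The unit-normalised rows are what a t-SNE or UMAP implementation would then project to 2D.

```python
import numpy as np

def unit_normalise(betas, eps=1e-12):
    """Scale each row (one run's coefficient vector) to unit Euclidean norm."""
    betas = np.asarray(betas, dtype=float)
    norms = np.linalg.norm(betas, axis=1, keepdims=True)
    # Assumption: an all-zero vector (a fully sparse solution) stays at zero
    # instead of triggering a division by zero.
    return betas / np.maximum(norms, eps)

def mean_abs_per_feature(betas):
    """Mean |beta_j| over runs, i.e. the quantity shown in the heatmap's top bar."""
    return np.abs(np.asarray(betas, dtype=float)).mean(axis=0)

# Toy stand-in for real results: 3 runs (rows), p = 2 features (columns).
betas = np.array([[3.0, 4.0], [0.0, -2.0], [0.0, 0.0]])
unit = unit_normalise(betas)            # input to the 2D embedding
top_bar = mean_abs_per_feature(betas)   # one value per feature
```

Normalising before embedding means two runs that differ only in overall shrinkage land on the same point, so the map compares coefficient directions rather than magnitudes.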
All algorithms share the interface `SolverCls(config={...}).fit(X, y, link=...) -> FitResult`.
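To make the calling pattern concrete, here is a toy solver that follows the same shape. Everything beyond the pattern itself is an assumption made for illustration: the class `RidgeSolver`, the `coef` and `config` fields of `FitResult`, the config key `"lam"`, and the link name `"identity"` are hypothetical and may not match the real implementation.

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class FitResult:
    # Hypothetical fields; the real FitResult may carry different data.
    coef: np.ndarray
    config: dict = field(default_factory=dict)

class RidgeSolver:
    """Toy solver following the SolverCls(config=...).fit(X, y, link=...) pattern."""

    def __init__(self, config=None):
        self.config = dict(config or {})

    def fit(self, X, y, link="identity"):
        if link != "identity":
            raise NotImplementedError("this sketch covers the regression case only")
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        lam = float(self.config.get("lam", 1.0))  # hypothetical config key
        p = X.shape[1]
        # Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y
        beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        return FitResult(coef=beta, config=self.config)

# One coefficient vector per (algorithm, config) pair, on noiseless toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.0, -2.0])
results = [
    RidgeSolver(config={"lam": lam}).fit(X, y, link="identity")
    for lam in (0.0, 1.0, 10.0)
]
```

A uniform constructor and `fit` signature is what allows a single loop over (algorithm, config) pairs to collect every coefficient vector for the plots.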
Tip
This page is best explored on a desktop or laptop browser, where you can zoom, pan, and hover. To suggest a dataset or a new arena function, feel free to reach out at chattelion.luo@connect.polyu.hk.
Real datasets¶
Diabetes¶
Breast Cancer¶
Digits (>= 5)¶
Fair Affairs¶
RAND HIE¶
STAR98¶
ANES96¶
Mode Choice¶
Synthetic — regression¶
Sparse¶
Dense¶
Correlated¶
High-dimensional¶
Friedman #1¶
Synthetic — logistic¶
Dense¶
Sparse¶
Noisy¶
Datasets¶
| Dataset | Source | n | p | Task |
|---|---|---|---|---|
| Diabetes | UCI / sklearn | 442 | 10 | Regression |
| Breast Cancer | UCI / sklearn | 569 | 30 | Logistic |
| Digits (>= 5) | NIST / sklearn | 1,797 | 64 | Logistic |
| Fair Affairs | statsmodels | 2,000 | 8 | Regression |
| RAND HIE | statsmodels | 2,000 | 9 | Regression |
| STAR98 | statsmodels | 303 | 21 | Logistic |
| ANES96 | statsmodels | 944 | 10 | Logistic |
| Mode Choice | statsmodels | 840 | 8 | Logistic |
| Synth sparse (p=200) | synthetic | 2,000 | 200 | Regression |
| Synth dense (p=100) | synthetic | 2,000 | 100 | Regression |
| Synth correlated (p=150) | synthetic | 2,000 | 150 | Regression |
| Synth high-dim (p=300) | synthetic | 2,000 | 300 | Regression |
| Synth Friedman #1 | synthetic | 2,000 | 10 | Regression |
| Synth logit dense (p=100) | synthetic | 2,000 | 100 | Logistic |
| Synth logit sparse (p=200) | synthetic | 2,000 | 200 | Logistic |
| Synth logit noisy (p=150) | synthetic | 2,000 | 150 | Logistic |
Real datasets are standardised (zero mean, unit variance). Synthetic datasets are generated with controlled sparsity, rank structure, and noise level.
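As a rough illustration of the two preparation steps, the sketch below standardises a feature matrix column by column and generates one sparse regression problem. The generator is an assumption, not the recipe used for the arena: the function names, the Gaussian design, the support size `k`, and the noise scale are all hypothetical choices.

```python
import numpy as np

def standardise(X):
    """Centre each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0.0] = 1.0  # constant columns are centred but not rescaled
    return (X - mu) / sd

def make_sparse_regression(n=2000, p=200, k=10, noise=0.5, seed=0):
    """Gaussian design where only k of the p true coefficients are non-zero."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)  # indices of active features
    beta[support] = rng.normal(size=k)
    y = X @ beta + noise * rng.normal(size=n)
    return X, y, beta

X, y, beta_true = make_sparse_regression()
Z = standardise(X)
```

Knowing `beta_true` is the main advantage of the synthetic problems: a solver's \(\widehat\beta\) can be compared against the coefficients that generated the data.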
LIBSVM Benchmark Datasets¶
The following visualisations cover 38 datasets from the LIBSVM data repository, comprising 27,732 solver runs across regression and binary-classification tasks. Datasets range from tiny (n = 38, p = 7,129 for Leukemia) to large (n = 80,000, p = 2,000 for Epsilon).
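Files in the LIBSVM repository use a sparse text format, one sample per line: a label followed by `index:value` pairs with 1-based feature indices. The parser below is a minimal standard-library sketch of reading that format into dense rows. It is not the loader used for this page, and the function names are hypothetical. For the larger datasets a sparse loader such as scikit-learn's `load_svmlight_file` would be the more practical choice.

```python
def parse_libsvm_line(line, p):
    """Parse one 'label idx:val idx:val ...' line into (label, dense row)."""
    parts = line.split()
    label = float(parts[0])
    row = [0.0] * p  # features absent from the line are implicitly zero
    for token in parts[1:]:
        idx, val = token.split(":")
        row[int(idx) - 1] = float(val)  # LIBSVM feature indices start at 1
    return label, row

def parse_libsvm(text, p):
    """Parse a whole file's text, skipping blank lines."""
    labels, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, row = parse_libsvm_line(line, p)
        labels.append(label)
        rows.append(row)
    return labels, rows

# Two samples with p = 3 features.
sample = "+1 1:0.5 3:-1.2\n-1 2:2.0\n"
labels, rows = parse_libsvm(sample, p=3)
```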
Regression¶
Abalone¶
Body Fat¶
California Housing (cadata)¶
CPU Small¶
EUNITE 2001 Electricity¶
Boston Housing¶
Mackey-Glass (mg)¶
Auto MPG¶
Pyrimidines¶
Space GA¶
Triazines¶
Year Prediction MSD¶
Binary Classification¶
Australian Credit¶
Breast Cancer (LIBSVM)¶
COD-RNA¶
Colon Cancer¶
Diabetes (Pima)¶
Duke Breast Cancer¶
Epsilon¶
Fourclass¶
German Credit (numerical)¶
Gisette¶
Heart Disease¶
IJCNN1¶
Ionosphere¶
Leukemia¶
Liver Disorders¶
Madelon¶
Mushrooms¶
Phishing Websites¶
Skin/Non-Skin¶
Sonar¶
Splice¶
SVMguide1¶
SVMguide3¶
w8a¶
a9a¶
Covertype (binary)¶
Dataset Summary¶
| Dataset | n | p | Task | Runs |
|---|---|---|---|---|
| Abalone | 4,177 | 8 | Regression | 862 |
| Body Fat | 252 | 14 | Regression | 872 |
| California Housing (cadata) | 20,640 | 8 | Regression | 831 |
| CPU Small | 8,192 | 12 | Regression | 880 |
| EUNITE 2001 Electricity | 336 | 16 | Regression | 891 |
| Boston Housing | 506 | 13 | Regression | 872 |
| Mackey-Glass (mg) | 1,385 | 6 | Regression | 878 |
| Auto MPG | 392 | 7 | Regression | 870 |
| Pyrimidines | 74 | 27 | Regression | 890 |
| Space GA | 3,107 | 6 | Regression | 870 |
| Triazines | 186 | 60 | Regression | 884 |
| Year Prediction MSD | 80,000 | 90 | Regression | 437 |
| Australian Credit | 690 | 14 | Binary logistic | 781 |
| Breast Cancer (LIBSVM) | 683 | 10 | Binary logistic | 765 |
| COD-RNA | 59,535 | 8 | Binary logistic | 711 |
| Colon Cancer | 62 | 2,000 | Binary logistic | 600 |
| Diabetes (Pima) | 768 | 8 | Binary logistic | 771 |
| Duke Breast Cancer | 44 | 7,129 | Binary logistic | 594 |
| Epsilon | 80,000 | 2,000 | Binary logistic | 280 |
| Fourclass | 862 | 2 | Binary logistic | 761 |
| German Credit (numerical) | 1,000 | 24 | Binary logistic | 791 |
| Gisette | 6,000 | 5,000 | Binary logistic | 367 |
| Heart Disease | 270 | 13 | Binary logistic | 781 |
| IJCNN1 | 49,990 | 22 | Binary logistic | 631 |
| Ionosphere | 351 | 34 | Binary logistic | 777 |
| Leukemia | 38 | 7,129 | Binary logistic | 598 |
| Liver Disorders | 145 | 5 | Binary logistic | 771 |
| Madelon | 2,000 | 500 | Binary logistic | 794 |
| Mushrooms | 8,124 | 112 | Binary logistic | 777 |
| Phishing Websites | 11,055 | 68 | Binary logistic | 761 |
| Skin/Non-Skin | 80,000 | 3 | Binary logistic | 701 |
| Sonar | 208 | 60 | Binary logistic | 778 |
| Splice | 1,000 | 60 | Binary logistic | 801 |
| SVMguide1 | 3,089 | 4 | Binary logistic | 771 |
| SVMguide3 | 1,243 | 22 | Binary logistic | 775 |
| w8a | 49,749 | 300 | Binary logistic | 290 |
| a9a | 32,561 | 123 | Binary logistic | 645 |
| Covertype (binary) | 80,000 | 54 | Binary logistic | 623 |