StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics
This Git repository contains the code that accompanies the research paper "StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics". The details of the experiments and the research outcome are described in the paper.
Note: StackGenVis is optimized for standard resolutions such as 1440p/QHD (Quad High Definition) and 1080p. For lower resolutions, we recommend using the collapsible functionality of the top dark gray panels. Any other resolution might require manually adjusting your browser's zoom level to work properly.
Note: The tag paper-version matches the implementation at the time of the paper's publication. The current version might look significantly different depending on how much time has passed since then.
Note: As with any other software, the code is not bug-free. There might be limitations in the views and functionalities of the tool that could be addressed in a future code update.
Data Sets
All publicly available data sets used in the paper are in the data folder, formatted as comma-separated values (CSV).
Most of them are available online from the UCI Machine Learning Repository: Iris and Heart Disease. We also used a collection of data related to sentiment/stance detection in texts. This data set is not included due to permission issues, since it was parsed from well-known social media platforms by our group.
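As a quick way to inspect one of the CSV files, the following minimal sketch reads a file with Python's standard csv module. It assumes the file has a header row, which may not hold for every data set; the function name load_csv and the file name in the usage comment are hypothetical and not part of the repository.

```python
import csv
from pathlib import Path


def load_csv(path):
    """Return (header, rows) for a comma-separated file.

    Assumption: the first line of the file is a header row.
    """
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Skip blank lines, which csv.reader yields as empty lists.
        rows = [row for row in reader if row]
    return header, rows


# Usage (hypothetical file name; check the data folder for the real ones):
# header, rows = load_csv("data/example.csv")
# print(header, len(rows))
```

All values come back as strings, so numeric columns need an explicit conversion before any modeling.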
Requirements
For the backend:
- Python 3
- Flask
- Other packages: Flask-PyMongo, flask_cors, mlxtend, imblearn, joblib, numpy, scikit-learn, scikit-learn-extra, sk-dist, eli5, umap-learn, and pandas.
You can install all the backend requirements with the following command:
pip install -r requirements.txt
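After installing, it can help to confirm that the backend packages are importable. The sketch below is not part of the repository; the import names in BACKEND_MODULES are assumptions based on each package's usual convention (for example, scikit-learn imports as sklearn and umap-learn as umap), and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above; they differ from
# the pip names in several cases (e.g. Flask-PyMongo -> flask_pymongo).
BACKEND_MODULES = [
    "flask", "flask_pymongo", "flask_cors", "mlxtend", "imblearn",
    "joblib", "numpy", "sklearn", "sklearn_extra", "skdist",
    "eli5", "umap", "pandas",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Usage:
# print(missing_modules(BACKEND_MODULES))
```

An empty list suggests that the environment has every listed package; it does not verify that the installed versions match requirements.txt.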
For the frontend:
There is no need to install anything for the frontend, since all modules are in the repository.
Usage
Below is an example of how to get StackGenVis running, with Node.js serving the frontend and Python (Flask) serving the backend. The frontend is written in JavaScript/HTML, so it could be hosted on any other web server of your preference. The only hard requirement (currently) is that both frontend and backend must be running on the same machine.
# first terminal: hosting the visualization side (client)
# with Node.js
cd frontend
npm run dev
# second terminal: hosting the computational side (server)
FLASK_APP=run.py flask run
# (optional) recommendation: use insertMongo script to add a data set in Mongo database
# for Python3
python3 insertMongo.py
Then, open your browser and point it to localhost:8080. We recommend using an up-to-date version of Google Chrome.
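If the page does not load, a first step is to check whether both processes are listening. The following sketch tests a TCP port with the standard socket module. Port 8080 comes from the instructions above; port 5000 is only the default of the Flask development server and is an assumption here, since the backend may be configured differently. The helper name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage:
# print("frontend:", port_is_open("localhost", 8080))
# print("backend: ", port_is_open("localhost", 5000))  # assumed Flask default
```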
Hyper-Parameters per Algorithm
Base classifiers:
- K-Nearest Neighbor: {'n_neighbors': list(range(1, 25)), 'metric': ['chebyshev', 'manhattan', 'euclidean', 'minkowski'], 'algorithm': ['brute', 'kd_tree', 'ball_tree'], 'weights': ['uniform', 'distance']}
- Support Vector Machine: {'C': list(np.arange(0.1,4.43,0.11)), 'kernel': ['rbf','linear', 'poly', 'sigmoid']}
- Gaussian Naive Bayes: {'var_smoothing': list(np.arange(0.00000000001,0.0000001,0.0000000002))}
- Multilayer Perceptron: {'alpha': list(np.arange(0.00001,0.001,0.0002)), 'tol': list(np.arange(0.00001,0.001,0.0004)), 'max_iter': list(np.arange(100,200,100)), 'activation': ['relu', 'identity', 'logistic', 'tanh'], 'solver' : ['adam', 'sgd']}
- Logistic Regression: {'C': list(np.arange(0.5,2,0.075)), 'max_iter': list(np.arange(50,250,50)), 'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 'penalty': ['l2', 'none']}
- Linear Discriminant Analysis: {'shrinkage': list(np.arange(0,1,0.01)), 'solver': ['lsqr', 'eigen']}
- Quadratic Discriminant Analysis: {'reg_param': list(np.arange(0,1,0.02)), 'tol': list(np.arange(0.00001,0.001,0.0002))}
- Random Forests: {'n_estimators': list(range(60, 140)), 'criterion': ['gini', 'entropy']}
- Extra Trees: {'n_estimators': list(range(60, 140)), 'criterion': ['gini', 'entropy']}
- Adaptive Boosting: {'n_estimators': list(range(40, 80)), 'learning_rate': list(np.arange(0.1,2.3,1.1)), 'algorithm': ['SAMME.R', 'SAMME']}
- Gradient Boosting: {'n_estimators': list(range(85, 115)), 'learning_rate': list(np.arange(0.01,0.23,0.11)), 'criterion': ['friedman_mse', 'mse', 'mae']}
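To get a feel for how large these search spaces are, the sketch below counts the combinations in two of the grids above that need only built-in ranges and lists. It assumes every combination counts as one candidate model; whether StackGenVis enumerates a grid exhaustively or samples from it is not stated here. The helper grid_size is hypothetical.

```python
from math import prod


def grid_size(grid):
    """Number of distinct combinations in a {parameter: values} grid."""
    return prod(len(values) for values in grid.values())


# Grids copied from the list above.
knn_grid = {
    "n_neighbors": list(range(1, 25)),
    "metric": ["chebyshev", "manhattan", "euclidean", "minkowski"],
    "algorithm": ["brute", "kd_tree", "ball_tree"],
    "weights": ["uniform", "distance"],
}
rf_grid = {
    "n_estimators": list(range(60, 140)),
    "criterion": ["gini", "entropy"],
}

print(grid_size(knn_grid))  # 24 * 4 * 3 * 2 = 576
print(grid_size(rf_grid))   # 80 * 2 = 160
```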
Meta-learner:
- Logistic Regression with the default scikit-learn hyper-parameters. At the time of the paper's publication, the core default values were: C=1.0, max_iter=100, solver='lbfgs', and penalty='l2'.
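For readers unfamiliar with stacking, here is a minimal sketch of a stacked ensemble with a logistic regression meta-learner, evaluated on the Iris data set. It uses scikit-learn's StackingClassifier, which is not necessarily how StackGenVis builds its stacks; the choice of two base classifiers, the 70/30 split, and the function names are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def build_stack():
    """Two base classifiers feeding a logistic regression meta-learner."""
    base_models = [
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
    # Values listed above; the L2 penalty is left at its library default.
    meta_learner = LogisticRegression(C=1.0, max_iter=100, solver="lbfgs")
    # cv=5: the meta-learner is trained on out-of-fold base predictions.
    return StackingClassifier(
        estimators=base_models, final_estimator=meta_learner, cv=5
    )


def holdout_accuracy(seed=0):
    """Fit the stack on 70% of Iris and return accuracy on the other 30%."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = build_stack().fit(X_train, y_train)
    return model.score(X_test, y_test)


# Usage:
# print(holdout_accuracy())
```

Training the meta-learner on cross-validated predictions rather than in-sample ones keeps it from simply trusting whichever base model overfits the most.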
Corresponding Author
For any questions with regard to the implementation or the paper, feel free to contact Angelos Chatzimparmpas.