🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)

This model predicts hourly bike rental demand (the target column count) from structured historical records with weather and calendar/time features, using AutoGluon’s TabularPredictor (AutoML for tabular regression). The workflow is based on the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.

Repository: https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon

Model Details

  • Developed by: brej-29
  • Model type: AutoGluon TabularPredictor (tabular regression)
  • Target label: count
  • Problem type: regression
  • Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
  • Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)

Intended Use

  • Educational / portfolio demonstration of:
    • Kaggle-style regression workflow
    • AutoML with AutoGluon
    • Feature engineering from datetime fields
    • Hyperparameter optimization (HPO) experiments
  • Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset

Out of scope:

  • Production forecasting without monitoring, retraining strategy, and strong input validation
  • High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis

Training Data

Dataset: Kaggle “Bike Sharing Demand”

Typical columns include:

  • Features: datetime, season, holiday, workingday, weather, temp, atemp, humidity, windspeed
  • Leakage columns present in train but not in test: casual, registered
  • Target: count

Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared log error). The project tracks Kaggle submission scores alongside offline validation metrics.
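
For reference, a minimal RMSLE implementation (the metric compares log(1 + prediction) with log(1 + actual), so relative error dominates):

import numpy as np

# RMSLE as defined by the Kaggle competition
def rmsle(y_true, y_pred):
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))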

Preprocessing and Feature Engineering

  • datetime is parsed as a datetime type.
  • Leakage prevention:
    • The notebook sets ignored_columns = ["casual", "registered"] because they are not available in the Kaggle test set and would cause leakage if used.
  • Feature engineering experiment:
    • Additional time-derived features were created from datetime (see the sketch after this list):
      • year, month, day, hour
    • These were used in a follow-up training run to measure impact on performance.
  • AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
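
A minimal sketch of that feature engineering, assuming df is the Kaggle train or test DataFrame (the helper name is illustrative):

import pandas as pd

def add_datetime_features(df):
    # Parse the raw datetime column, then expose its calendar parts
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    return df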

Training Procedure

Base configuration used in the notebook (sketched in code after this list):

  • TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")
  • Preset: best_quality
  • Time limit: 600 seconds (10 minutes)
  • Bagging: enabled in best-quality preset (notebook run shows bagging with 8 folds in the fit summary)
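
A minimal sketch of that base run, assuming train is the loaded train.csv DataFrame (the leakage columns are dropped here for brevity; the notebook uses ignored_columns instead):

from autogluon.tabular import TabularPredictor

# Drop leakage columns not present in the Kaggle test set
train_no_leak = train.drop(columns=["casual", "registered"])

predictor = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train_no_leak,
    presets="best_quality",
    time_limit=600,  # 10 minutes
)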

Hyperparameter optimization (HPO) run (sketched in code after this list):

  • Search controlled via hyperparameter_tune_kwargs:
    • num_trials = 20
    • searcher = "auto"
    • scheduler = "local"
  • Hyperparameters were provided for:
    • GBM (LightGBM, including extra-trees style trials and a larger preset config)
    • XT (ExtraTrees)
    • XGB (XGBoost)
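
A minimal sketch of the HPO run; the search spaces below are illustrative assumptions, not the notebook’s exact values:

import autogluon.core as ag
from autogluon.tabular import TabularPredictor

# Illustrative hyperparameter spaces for the three model families
hyperparameters = {
    "GBM": [
        {"extra_trees": True, "ag_args": {"name_suffix": "XT"}},  # extra-trees style LightGBM
        {},          # default LightGBM
        "GBMLarge",  # larger preset config
    ],
    "XT": {"min_samples_leaf": ag.space.Int(1, 20)},
    "XGB": {
        "max_depth": ag.space.Int(4, 10),
        "learning_rate": ag.space.Real(0.01, 0.3, log=True),
    },
}

predictor_hpo = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train_no_leak,
    time_limit=600,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs={
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    },
)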

Evaluation

Important note about AutoGluon leaderboard scores (illustrated in code below):

  • AutoGluon’s leaderboard displays metrics in “higher is better” format.
  • For RMSE, the displayed score_val is the negative RMSE (sign-flipped), so you can interpret:
    • Validation RMSE ≈ absolute value of score_val
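
For example, assuming a fitted predictor as above:

# Recover positive RMSE values from the sign-flipped leaderboard scores
lb = predictor.leaderboard()       # returns a pandas DataFrame
lb["rmse_val"] = -lb["score_val"]  # score_val stores negated RMSE
print(lb[["model", "rmse_val"]])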

Offline validation (AutoGluon internal validation; best run from the notebook):

  • Best validation score_val: -39.953761 (root_mean_squared_error)
  • Interpreted validation RMSE: 39.953761

Kaggle public leaderboard (submissions generated from notebook):

  • Initial submission RMSLE: 1.42139
  • With added features submission RMSLE: 1.41560
  • With HPO submission RMSLE: 0.49145

How to Use

Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like AutogluonModels/<run_name>/) to your Hugging Face model repo.

Example inference pattern:

import pandas as pd
from huggingface_hub import snapshot_download
from autogluon.tabular import TabularPredictor

repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Download the whole repo snapshot (works well for AutoGluon folders)
local_dir = snapshot_download(repo_id=repo_id)

# Point this to the directory that contains the AutoGluon predictor artifacts
predictor = TabularPredictor.load(local_dir)

# Example input (use the training schema; values here are illustrative)
X = pd.DataFrame([{
    "datetime": "2012-12-19 17:00:00",
    "season": 4,
    "holiday": 0,
    "workingday": 1,
    "weather": 1,
    "temp": 10.0,
    "atemp": 12.0,
    "humidity": 60,
    "windspeed": 15.0
}])

# Parse datetime so the dtype matches what the predictor saw during training
X["datetime"] = pd.to_datetime(X["datetime"])

preds = predictor.predict(X)
# Demand counts are non-negative; clip when preparing a Kaggle submission
print(max(0.0, float(preds.iloc[0])))

If your trained model expects engineered columns (like year, month, day, hour), ensure you create them exactly the same way before calling predict().
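
For instance, by reusing the add_datetime_features helper sketched in the Preprocessing section:

# Apply the same transformation used at training time before predicting
X = add_datetime_features(X)
preds = predictor.predict(X)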

Input Requirements

  • Input must be a tabular dataframe (pandas DataFrame recommended); see the validation sketch after this list.
  • Required columns should match the Kaggle test schema used for training:
    • datetime, season, holiday, workingday, weather, temp, atemp, humidity, windspeed
  • Do not include the ignored leakage columns at inference:
    • casual, registered
  • If using engineered datetime columns in your final training run, ensure consistent feature generation:
    • year, month, day, hour
  • Datatypes:
    • numeric columns should be valid numeric types (int/float)
    • missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
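
A minimal validation helper reflecting these requirements (hypothetical, not part of the notebook):

import pandas as pd

REQUIRED = ["datetime", "season", "holiday", "workingday", "weather",
            "temp", "atemp", "humidity", "windspeed"]
LEAKAGE = ["casual", "registered"]

def check_inference_frame(X: pd.DataFrame) -> None:
    # Fail fast on schema mismatches instead of silently mispredicting
    missing = [c for c in REQUIRED if c not in X.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    leaked = [c for c in LEAKAGE if c in X.columns]
    if leaked:
        raise ValueError(f"Drop leakage columns before predict(): {leaked}")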

Bias, Risks, and Limitations

  • This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
  • Kaggle data can contain seasonal/holiday patterns that may not generalize.
  • RMSLE is computed on log-scaled values, so it penalizes under-prediction more heavily than over-prediction and weights relative rather than absolute error; depending on your application, you may need different objectives/metrics.
  • If datetime parsing or feature generation differs from training, predictions may be unreliable.

Environmental Impact

AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.

Technical Specifications

  • Framework: AutoGluon Tabular (TabularPredictor)
  • Task: Tabular regression
  • Eval metric used in training: root mean squared error (RMSE)
  • Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)

Model Card Authors

  • BrejBala

Contact

For questions/feedback, please open an issue on the GitHub repository: https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon
