Abstract
Traffic crashes remain one of the world’s leading causes of preventable death and severe injury, concentrated disproportionately at geographic “hotspots” where roadway design, traffic flow, temporal patterns, and local socio-demographic conditions combine to increase risk. This project develops a comprehensive, reproducible framework to predict and map road accident hotspots by fusing open traffic flow datasets (OpenTraffic and related GPS-derived speed/flow products), official crash records, and fine-grained demographic data (e.g., census blocks or administrative units). The work explores both classical spatial-statistical hotspot methods (network-aware Kernel Density Estimation, Getis-Ord Gi*, local Moran’s I) and modern predictive machine learning approaches (gradient boosting, random forest, and neural-network variants, including experimentation with graph neural networks for network-structured input). The core deliverables are (1) an end-to-end data pipeline for ingestion, cleaning, and feature engineering of heterogeneous spatiotemporal datasets; (2) an interpretable ML model that predicts crash intensity at road-segment or lixel resolution with probabilistic outputs; (3) a decision-oriented hotspot ranking and visualization layer integrated with GIS; and (4) recommendations for data-driven countermeasures (engineering, enforcement, education, emergency response prioritization). The approach leverages OpenTraffic’s open platform for anonymized vehicle traces and speed records as the principal traffic data source, combined with demographic covariates and weather/time features to model crash likelihood and severity. Empirical validation uses historical crash data to measure predictive skill (AUC, precision@k, calibration) and hotspot correspondence (overlap statistics against KDE/saTScan baselines). The results demonstrate that fusing traffic flow metrics with localized demographic risk factors and network-aware spatial features yields statistically significant improvements over naive KDE mapping, and that machine learning models (when interpretable methods such as SHAP and partial dependence are applied) can surface actionable risk drivers for targeted interventions. This project thus provides a practical blueprint for municipal planners and road safety programs to prioritize investments and evaluate countermeasures using open data and reproducible modeling.
Keywords
Road Safety Accident Hotspots Predictive Modeling Machine Learning Open Traffic Data GIS Demographic Analysis
Introduction
Road traffic accidents impose large human, economic, and social costs worldwide. Beyond aggregate statistics, crashes are highly uneven in space and time: a small fraction of road segments or intersections often account for a disproportionate share of severe collisions. Identifying these concentrated risk locations — “hotspots” — and predicting where future accidents are likely to occur are essential for cost-effective interventions such as redesigning dangerous junctions, rerouting heavy vehicles, placing speed cameras, improving lighting and signage, and prioritizing emergency response resources.
Historically, road safety practitioners have relied on retrospective hotspot identification: collect police crash reports, compute counts or rates for road segments, and apply spatial smoothing or cluster detection methods (kernel density estimation (KDE), Getis-Ord Gi*, spatial scan statistics) to highlight areas with high concentration of crashes. KDE and related geostatistical techniques provide intuitive heatmaps and are simple to implement, but they are sensitive to parameter choices (bandwidth, kernel type), frequently ignore the network topology of roads, and do not natively incorporate covariates like traffic volume, speed distributions, land use, or demographic vulnerability. Recent methodological advances combine network-aware spatial statistics, richer feature engineering, and machine learning to build predictive models that go beyond “where did accidents happen in the past?” to answer “where are accidents likely to happen in the future, and why?” — a shift from descriptive mapping to predictive risk modeling.
Open, anonymized traffic datasets such as OpenTraffic (a platform and dataset designed to turn GPS traces and smartphone/taxi telemetry into historical and real-time traffic statistics) have made high-resolution traffic flow data broadly accessible for research and operational use. These datasets provide per-segment speed and travel-time distributions over time, enabling features that quantify congestion, speed variability, and exposure (e.g., vehicle-kilometers traveled). When fused with crash records and local demographic covariates (population density, age distribution, vehicle ownership rates, socio-economic indices), they allow models to control for exposure and to identify structural or community vulnerabilities associated with crash risk. OpenTraffic and its documentation describe methods for processing telemetry into segment-level metrics and emphasize reproducibility and privacy in data handling.
This project positions itself at the intersection of GIS, transportation engineering, and data science. The objective is not only to produce accurate predictive models but to make outputs interpretable and actionable for stakeholders (traffic engineers, enforcement agencies, city planners). That implies (a) adopting evaluation metrics aligned with operational needs (e.g., precision@k for hotspot lists used to allocate limited resources), (b) developing visual outputs (interactive maps, time-of-day risk strips) that non-technical users can inspect, and (c) documenting an open pipeline that respects privacy while enabling reproducibility. The rest of the document details prior work (literature survey), the precise problem formulation, our proposed method and architecture, experimental strategy, and references.
Literature Survey
A robust literature survey must cover (i) spatial statistics and hotspot identification methods; (ii) traffic and telemetry data sources (including OpenTraffic); (iii) machine learning approaches to crash prediction and severity modeling; and (iv) recent hybrid and network-aware techniques that combine GIS with ML and deep learning.
Spatial-statistical methods and hotspot detection. Kernel Density Estimation (KDE) has been widely used to create continuous heatmaps of crash intensity from discrete crash locations. KDE smooths point events using a kernel (e.g., Gaussian) and a bandwidth parameter that controls spatial smoothing. Studies comparing KDE with other geostatistical tools (kriging, network-based KDE) find that KDE is intuitive and effective for visual hotspot spotting but that results are sensitive to bandwidth and to whether smoothing respects the road network (i.e., Euclidean KDE can blur hotspots across physical barriers or across non-connected roadways). Network-based KDE (NKDE) and line-based approaches address this by projecting smoothing along the road geometry to better represent exposure on the linear network. Research comparing KDE against Getis-Ord Gi* and other local cluster statistics shows complementarity: KDE emphasizes continuous intensity while Gi* identifies statistically significant local clusters relative to a spatial null. Practical guidance emphasizes sensitivity analysis (varying bandwidth) and combining methods to triangulate hotspot locations.
OpenTraffic and open telemetry sources. The OpenTraffic initiative assembled anonymized vehicle GPS traces into segment-level traffic statistics that can be linked to OpenStreetMap road geometry, enabling global, reproducible traffic analytics. OpenTraffic’s platform and completion report document methods for aggregating telemetry into travel-time and speed metrics, privacy-preserving aggregation, and APIs for historical queries. The availability of such telemetry enables features rarely available in older hotspot studies: per-segment average and variance of speed, temporal profiles (rush hours), and measures of flow disruption. Several case studies and governmental projects have used OpenTraffic and similar telemetry for travel-time estimation and congestion analysis; researchers have adapted the same inputs for crash exposure estimation (vehicle-km traveled proxies) and dynamic risk profiling.
Machine learning for crash prediction and severity modeling. Over the last decade, researchers have applied a suite of ML algorithms to predict crash occurrence and severity from tabular, spatial, and temporal features. Random forests and gradient boosting machines (GBM/XGBoost/LightGBM) are commonly used because they deliver strong baseline performance and variable importance measures. Several recent surveys and application papers report that ensemble methods often outperform single models for classification of crash vs non-crash and for severity regression, with GBMs frequently leading in accuracy metrics. Deep learning approaches, including feedforward neural nets and convolutional or recurrent architectures, have been explored where large datasets are available. More recent work investigates Graph Neural Networks (GNNs) to directly model the road network as a graph, allowing message passing to capture dependencies between neighboring segments and enabling predictions that inherently respect network topology. Interpretable ML practices — SHAP values, partial dependence plots, rule extraction — are emphasized to translate model outputs into policy recommendations.
Integration of demographics, land use, weather, and contextual covariates. Numerous studies show that socio-demographic factors (income, age distribution, commuting patterns), land use (commercial vs residential), and weather conditions (rain, fog) contribute materially to crash risk. Models that include demographic covariates alongside traffic features can better control for exposure differences and often improve predictive calibration. For instance, zones with high pedestrian activity and low pedestrian infrastructure frequently show elevated pedestrian-involved crashes; demographic vulnerability (e.g., large elderly populations) can increase the severity distribution. Recent spatial analyses incorporate census tract variables and night/day segmentation to highlight these effects.
Network-aware and hybrid approaches. The frontier blends network-aware spatial statistics with ML. Examples include using NKDE to generate smoothed target variables for ML, embedding road segment topology into feature representations, and applying GNNs for forecasting crash intensities. Comparative studies indicate that coupling KDE or spatial lag features with tree-based models yields better hotspot prediction than either approach alone. Other studies emphasize calibration of KDE bandwidths and the inclusion of severity weights, to ensure hotspot maps reflect risk rather than raw counts. Recent works also explore clustering (DBSCAN) for identifying dense accident clusters and then applying localized models to predict risk within clusters.
Evaluation paradigms and operational metrics. Important literature points out that traditional classification metrics (accuracy, overall AUC) may be insufficient for hotspot prioritization tasks. Operational needs favor ranking and top-k precision: given resources to inspect N sites, how many true hotspots will you catch? Therefore metrics like precision@k, recall@k, and cost-weighted utility (where false negatives at high exposure sites are penalized more) are commonly recommended. Temporal validation (train on earlier years, test on subsequent periods) and spatial cross-validation (leave-one-area-out) are necessary to estimate real-world generalization. Several field studies report that model-guided interventions, when piloted, reduce crash frequency; however, randomized controlled deployments remain rare, and rigorous cost–benefit studies are an ongoing need.
Gaps and opportunities summarised from the literature. Key gaps include the need for (a) standardized pipelines that merge telemetry, crash, and demographic data at the road-segment level while respecting privacy; (b) robust, operationally relevant evaluation metrics; (c) interpretable ML that can recommend specific countermeasures; and (d) network-aware methods that avoid spatial spillover misinterpretation. This project aims to address these gaps by delivering a reproducible pipeline, model comparisons (KDE baselines vs ML vs GNNs), thorough feature ablation to show the contribution of OpenTraffic-derived variables, and actionable visualizations for planners.
Problem Statement
High-level goal. Design, implement, and evaluate a reproducible system that predicts spatially and temporally resolved road accident risk (hotspots) by fusing OpenTraffic telemetry, official crash records, and demographic data. The system must (1) produce ranked hotspot lists and risk maps at the granularity of road segments or lixels; (2) provide probabilistic predictions suitable for prioritizing constrained interventions; (3) be interpretable so that decision makers can understand contributing factors; and (4) be reproducible and privacy-conscious.
Specific objectives.
Data integration and exposure modeling. Build a pipeline to ingest OpenTraffic (or comparable GPS-derived) segment-level speed and travel time statistics, police crash reports (geocoded), and demographic variables (census blocks). Harmonize spatial references, map crash points to road segments or lixels, and compute exposure proxies (e.g., estimated vehicle-km traveled) so that predictions control for exposure differences.
Hotspot identification baselines. Implement classical hotspot detection methods — Euclidean KDE, network-based KDE (NKDE), Getis-Ord Gi*, and spatial scan (saTScan) — to provide baseline maps and to evaluate false positives arising from purely retrospective counting.
Predictive modeling. Train and compare a set of predictive models that estimate crash intensity or probability on each road segment and for time bins (hour of day / day of week / seasonal). Candidate models: logistic/Poisson regression (with spatial lag), random forest, gradient boosting (XGBoost/LightGBM/CatBoost), and graph neural networks (GNN) operating on the road network. Evaluate models using temporal holdout (train on years 𝑇0..n, test on 𝑇n+1), spatial cross-validation, and operational metrics (precision@k).
Interpretability and countermeasure suggestion. Use SHAP, partial dependence, and local explanation techniques to identify actionable risk drivers per hotspot (e.g., excessive speed variance, high pedestrian density, low lighting), and map those drivers to candidate countermeasures.
Operational visualization and ranking. Produce an interactive GIS dashboard (or a packaged set of static maps and tables) that ranks hotspots with contextual information (crash counts, severity mix, exposure, demographic vulnerability) and provides time-of-day risk profiles for each site.
Constraints and success criteria.
Predictions should be made at a resolution useful for interventions (e.g., 50–200 m lixels or actual road segments).
Primary success criteria: model yields statistically significant improvement over KDE baseline in precision@k for top 100 hotspots and demonstrates stable temporal generalization in held-out years.
Secondary criteria: explanations identify plausible, actionable drivers confirmed by domain experts and improved clarity of prioritization for limited budgets.
Privacy, fairness, and ethics.
Use only aggregated/anonymized telemetry; do not attempt to reconstruct individual trajectories.
Be mindful that demographic covariates can be proxies for protected attributes; therefore include fairness checks (e.g., ensure interventions do not systematically disfavor vulnerable communities).
Create documentation on data governance and obtain necessary approvals before operational deployment.
Methodology
The methodology for predicting road accident hotspots using open traffic and demographic data is organized into seven major phases, each constructed to ensure accurate, interpretable, and actionable hotspot detection.
1. Problem Understanding & Objective Definition
The overall goal is to build a predictive model that identifies road segments or intersections with high probabilities of accidents based on:
Traffic flow patterns
Speed variations
Vehicle density
Population and demographic characteristics
Land-use and environmental context
Historical crash records
The model should generate output in the form of:
Hotspot maps (GIS-based)
Risk scores for each road segment
Interpretability metrics (SHAP, feature importance)
2. Data Collection
This project integrates multi-source, heterogeneous datasets:
2.1 Open Traffic Data
Sources include:
Google Open Traffic data
HERE traffic datasets
OpenStreetMap speed and road type indicators
Mobile GPS-based flow/speed data
Data features include:
Average speed
Speed variance
Traffic density
Congestion index
Travel-time reliability
2.2 Demographic Data
Collected from census/open datasets:
Population density
Age distribution
Vehicle ownership
Income levels
Pedestrian activity
School/market locations
Urban land-use type
2.3 Historical Accident Data
From:
Transport departments
Traffic police accident logs
Open government datasets
Accident attributes:
Location (coordinates)
Time and date
Severity
Vehicle type
Weather/lighting conditions
2.4 Environmental & Contextual Data
Road network (OSM)
Road geometry (curvature, slope, lanes)
Weather archives
Land-use zoning
3. Data Preprocessing & Integration
3.1 Cleaning
Remove duplicates
Geo-correct accident coordinates
Handle missing demographic attributes
Normalize speed/traffic data
3.2 Spatial Integration
GIS operations:
Spatial Join: assign accident points to road segments
Lixel Segmentation: divide long roads into equal-length units
Buffer Analysis: extract demographics within 50–200 m of road
Coordinate Transformation to a uniform CRS
3.3 Temporal Preprocessing
Extract:
Hour of day
Peak/off-peak indicators
Weekday/weekend
Seasonality
Festival/holiday periods
3.4 Feature Engineering
Traffic Features
Avg. speed
Speed variation
Congestion score
Flow density
Road Geometry
Road type (highway, arterial, local)
Number of lanes
Intersection density
Road curvature
Presence of dividers
Demographic + Land-use
Pedestrian density
School proximity
Commercial zone score
Income and socio-economic index
Crash Statistics
Past accident count
Severity-weighted index
4. Modeling Approach
Accident hotspot prediction is framed as:
A. Classification Task
Predict high-risk vs low-risk road segmentsModels used:
Random Forest
Gradient Boosting (XGBoost, LightGBM, CatBoost)
Neural Networks
Support Vector Machines
B. Regression Task (Crash Frequency Prediction)
Models:
Poisson Regression
Negative Binomial Regression
Zero-inflated models
C. Spatial ML Models
GIS + machine learning:
Geographically Weighted Regression
Spatial lag/error models
D. Graph Neural Networks (Advanced)
Road networks behave like graphs.GNNs capture:
Node connectivity
Traffic propagation
Spatial correlation
This improves accuracy in dense road networks.
5. Model Training & Evaluation
5.1 Dataset Split
70% training
15% validation
15% test
Spatial cross-validation ensures geographic independence.
5.2 Evaluation Metrics
Accuracy
ROC-AUC
F1-score
Precision@K (important for ranking hotspots)
Mean Absolute Error (MAE) for regression
Spatial autocorrelation (Moran’s I)
5.3 Interpretability
Methods used:
SHAP values
Permutation feature importance
Partial Dependence Plots
These reveal:
Most dangerous times
High-risk demographic combinations
Dangerous road types
6. Hotspot Generation (GIS Output)
The final model assigns a risk score (0–1) or expected crash count to every road segment.
Using GIS tools like QGIS, Folium, Kepler.gl, create:
Heatmaps
Road-segment risk overlays
Bivariate maps (traffic + demographics)
Time-based animation maps
Hotspots are classified into:
Red Zone – Very High Risk
Orange Zone – High Risk
Yellow Zone – Moderate Risk
Green Zone – Low Risk
7. Deployment Workflow
Deploy the model using:
Flask/FastAPI backend
Interactive web dashboard
Cloud deployment (AWS/GCP/Azure)
Real-time traffic data integration
Dashboard features:
Hotspot map viewer
Risk timeline
Recommendation engine (speed breakers, signal timing, etc.)
UML Diagrams
5.1 Use Case Diagram
Actor:
User (Road Safety Analyst / Authority / Researcher)
Main Use Cases:
Upload traffic, demographic, and accident datasets
Generate hotspot predictions
Visualize hotspots on GIS map
Download reports
✔ Short Explanation
The user interacts with the system to load datasets, trigger the machine learning workflow, view predictive results, and export hotspot analytics. The system automates preprocessing, modeling, and hotspot generation.

5.2 Activity Diagram
✔ Purpose
Shows the step-by-step workflow from loading data to predicting hotspots.
✔ Short Explanation
The system begins with data collection, preprocessing, feature engineering, model training, validation, generating hotspot predictions, and deployment.

5.3 Sequence Diagram
✔ Purpose
Shows the real-time message flow between system components.
✔ Key Objects
User
Frontend UI
Processing Engine
ML Model
GIS Module
✔ Short Explanation
The user requests prediction → UI sends data → Processing engine cleans and prepares data → ML model calculates hotspot scores → GIS engine generates maps → Results returned to user.

5.4 Class Diagram
✔ Purpose
Shows internal structure — classes, attributes, and relationships.
✔ Key Classes
DatasetLoader
Preprocessor
FeatureEngineer
MLModel
PredictionEngine
GISVisualizer
ReportGenerator
✔ Short Explanation
Every process in the system (loading, cleaning, engineering features, modeling, prediction generation, mapping) is represented as an object class.

5.5 Component Diagram
✔ Purpose
Shows major system components and their dependencies.
✔ Short Explanation
The system is divided into components such as data ingestion, preprocessing, modeling, prediction engine, GIS rendering, and reporting.

5.6 Deployment Diagram
✔ Purpose
Shows the physical deployment environment (servers, devices, nodes).
✔ Short Explanation
Shows how the system is deployed on:
Client machine (browser)
Application server
ML model server
Database server
GIS server

PROPOSED METHOD WITH ARCHITECTURE AND TOOLS
This section describes the proposed end-to-end architecture, data sources, feature engineering, modeling choices, interpretability techniques, validation strategy, and recommended software stack and hardware.
6.1 Overview architecture (high level)
Data Ingestion Layer.Sources: OpenTraffic aggregated segment speed and travel time data; geocoded crash records from traffic police or open portals; demographic data from national census (block/tract level); road network geometry from OpenStreetMap.Function: scheduled pulls (or one-time snapshots for historical experiments), checksum validation, and initial parsing to standardized schemas.
Preprocessing & Spatial Join Layer.Map crash points to nearest road lixel/segment using linear referencing (snap to nearest polyline within threshold).Aggregate OpenTraffic telemetry to the same segment/lixel resolution (hourly / daily aggregates).Join demographic attributes using areal-to-linear join (e.g., intersect segment buffer with census polygons, compute population and vulnerability density per segment).
Feature Store & Exposure Modeling.For each spatial unit×time bin produce features: historical crash counts (lagged windows), average speed, speed variance, congestion indices (ratio of free-flow speed to observed), traffic volume proxy (flow estimate or taxi counts), day/time indicators, weather flags (if available), and demographic vulnerability indices.Compute exposure denominators (estimated vehicle-km traveled) to convert counts to rates when needed.
Model Training & Selection.Baselines: Poisson regression with exposure, KDE heatmap ranking.ML: Random Forest, XGBoost/LightGBM (tabular), and a GNN variant (e.g., GraphSAGE or GAT) that ingests node/edge features and spatial adjacency.Loss objectives: (a) binary classification for “hotspot” vs not; (b) count/Poisson regression for accident intensity; and (c) ranking loss if optimizing direct hotspot ranking.
Interpretability & Decision Support.Global: SHAP summary, feature importances, partial dependence curves.Local: SHAP explanations per hotspot and natural-language explanation templates mapping features to candidate countermeasures.Output: interactive maps (Leaflet/Kepler/Deck.gl), downloadable hotspot lists, PDF reports.
Evaluation & Monitoring.Metrics: precision@k, recall@k, AUC, calibration plots, and cost-weighted utility.Cross-validation: temporal holdout (train on years N, test next year), spatial cross-validation (leave-region-out).Sensitivity: ablation of OpenTraffic features to quantify marginal value of telemetry.

6.2 Detailed methods
Data ingestion and spatial harmonization
Use OpenTraffic APIs or the otv2 platform codebase to extract segment travel time and speed summaries (per hour or per day). OpenTraffic documentation and repositories describe aggregation procedures and mapping to OSM segments; follow their recommended privacy-preserving aggregation windows.
Use spatial indexing (R-tree) to accelerate point-to-segment joins. When mapping crash points to segments, use snapping thresholds (e.g., 30 m) and maintain flags when a crash is ambiguous (near intersections).
Convert demographic polygons to segment-level features by intersecting a buffered segment polygon and computing per-segment densities (people per 100 m, percent elderly, unemployment rate, etc.).
Feature engineering
Temporal features: hour of day (cyclic encoding), day of week, holiday flags, rolling averages of traffic speed and speed variance (lags: 1 day, 7 days, 30 days).
Traffic features: mean speed, 10th/90th percentile speeds, coefficient of variation, fraction of observations above posted speed limit, free-flow ratio, and travel time reliability metrics.
Exposure features: estimated vehicle-km traveled (VKT) per segment approximated by flow proxies or modelled from telemetry.
Spatial context: adjacency averages (neighboring segment crash counts, neighbor speed variance), land use categories (retail, residential), intersection density, and presence of pedestrian infrastructure.
Demographic features: population density, percent children/elderly, median income, vehicle ownership rates.
Modeling strategy
Baseline KDE/NKDE. Compute Euclidean KDE and network-aware KDE to generate retrospective heatmaps and baseline rankings. Use varying bandwidths and produce sensitivity analysis.
Tabular ML (GBM). Train GBMs with careful class weighting (as crashes are rare) or use focal loss. Calibrate probabilities (Platt scaling, isotonic regression) if needed.
GNN approach. Construct a graph where edges represent adjacency along the road network and nodes represent segment centroids. Node features are segment features above, and target is crash count or hotspot label. Train a GNN with Poisson or classification heads to allow message passing of spatial influence.
Model explainability. After model training, compute SHAP values for the most important features. For each hotspot, present top contributing features and map to recommended interventions (e.g., high speed variance → speed calming; high pedestrian density + poor lighting → crosswalk improvements).
Validation & evaluation
Use temporal holdout validation (train on years 2017–2020, test on 2021) where available to ensure forecasting capability.
Use operational metrics: precision@k for the top K ranked segments (K chosen by realistic inspection budget), AUC for classification, and mean absolute error for regression/counts.
Perform ablation study: model with all features vs model without OpenTraffic features to quantify telemetry value.
6.3 Tools, libraries, and environment
Data processing & GIS
PostgreSQL + PostGIS for spatial storage and indexing.
GDAL / Fiona / Shapely for geometry operations.
OSMnx for road network extraction and lixelization utilities.
Data ingestion and ETL
Python (pandas, geopandas), Apache Airflow or Prefect for orchestration.
OpenTraffic codebase (github/opentraffic) for telemetry ingestion.
Modeling
scikit-learn for baselines, XGBoost / LightGBM / CatBoost for GBMs.
PyTorch Geometric or DGL for GNN implementations.
SHAP library for interpretability.
Visualization & dashboard
Kepler.gl or Deck.gl for interactive spatial visualizations.
Leaflet/Mapbox with a small web app (Flask/FastAPI + React) for distribution.
Static reporting: matplotlib + geopandas for exportable PNG/PDF maps.
Hardware
A standard ML workstation (16–64 GB RAM, GPU optional for GNNs) is sufficient for city-scale experiments; cloud instances (AWS/GCP) can be used for larger areas.
6.4 Deployment & policy pathway
Document a reproducible pipeline; publish code and aggregate artifacts (not raw telemetry) to a repository.
Provide a governance checklist for data sharing, anonymization thresholds, and local stakeholder engagement.
Recommend pilot deployment on top N hotspots and measurement of before/after crash rates with controlled evaluation if possible.
System Specifications
1. System Overview
The system is designed to ingest open traffic, demographic, and historical accident datasets, process and integrate them, engineer features, train predictive models, and generate hotspot risk maps. It provides an interactive dashboard for visualization and supports batch and on-demand hotspot prediction.
2. Functional Requirements
Data Ingestion
Import open traffic datasets (speed, travel-time, and GPS-based features).
Import demographic datasets (population density, age distribution, land use).
Import historical accident data with geolocation.
Data Processing
Data cleaning, handling missing values, noise removal, and normalization.
Map-matching accident points to road segments.
Spatial joining of demographic, traffic, and crash datasets.
Feature Engineering
Generation of road geometry features, traffic variability metrics, meteorological features (optional), and neighborhood crash densities.
Modeling
Train machine learning models (Random Forest, XGBoost, or GNN).
Perform model validation using temporal or spatial cross-validation.
Generate risk scores and classify segments into hotspot levels.
Hotspot Generation
Produce hotspot lists using risk thresholds.
Create GIS-based heatmaps and hotspot overlay layers.
Export results in GeoJSON, CSV, and PDF formats.
Visualization
Interactive dashboard to view hotspots, apply filters, and analyze crashes.
Map layers allowing toggling of traffic, demographic, and accident features.
Model Deployment & Monitoring
Automatic batch prediction (daily/weekly).
Ability to run on-demand prediction for any road segment.
Monitor data drift and schedule retraining when necessary.
3. Non-Functional Requirements
Performance
Faster data preprocessing (< 10 minutes for city-scale data).
Batch prediction completion within 1–2 hours.
Scalability
Ability to scale to millions of traffic and crash records.
Support for multi-region or multi-city deployments.
Reliability
99% system uptime.
Automatic restart/retry for failed ETL pipelines.
Security
Role-based access control (RBAC).
Data encryption in transit (TLS) and at-rest (AES-256).
No storage of identifiable device-level raw GPS data.
Usability
Intuitive dashboard for non-technical users.
Clear map-based results with filters and severity indicators.
Maintainability
Modular architecture with separate data, modeling, and visualization layers.
Code version control and automated build/test pipelines (CI/CD).
4. Hardware Requirements
Minimum Hardware (Development System)
CPU: 4–8 cores
RAM: 16–32 GB
Storage: 512 GB SSD
GPU: Optional (for GNN training)
Recommended (Small Production Server)
CPU: 8 cores
RAM: 32–64 GB
Storage: 1 TB SSD
Cloud DB: PostgreSQL/PostGIS (50–200 GB)
Cloud Storage: S3 bucket for raw files and model artifacts
5. Software Requirements
Operating System
Windows 10/11, Ubuntu 20.04+, or any cloud/Linux environment
Backend Software
Python 3.9+
Machine Learning Libraries: Scikit-learn, XGBoost, LightGBM, PyTorch (optional)
Data Processing: Pandas, NumPy, GeoPandas
Spatial Libraries: PostGIS, Shapely, OSRM/Valhalla/GraphHopper (for map-matching)
Database & Storage
PostgreSQL with PostGIS extension
Cloud storage (AWS S3 / Google Cloud Storage)
Frontend / Visualization
Web dashboard using Leaflet, Mapbox, or Deck.gl
Flask / FastAPI backend (API layer)
Other Tools
Docker for containerized deployment
Airflow or Prefect for ETL automation
MLflow for model tracking
GitHub/GitLab for version control
6. System Architecture Summary
The system consists of:
Data Layer:Raw traffic, demographic, and accident datasets stored in cloud storage.PostGIS database storing cleaned data and features.
Processing Layer:Map-matching, aggregation, feature engineering, and model training.
Prediction Layer:ML model generates risk scores for each road segment.Hotspot classification and ranking.
Visualization Layer:Dashboard showing hotspots, maps, charts, and summary reports.
Deployment & Monitoring Layer:Scheduled batch processing.API for real-time hotspot scoring.Monitoring for data quality and model drift.
7. Constraints & Assumptions
Data sources must be open or officially provided.
GPS data must be anonymized before ingestion.
Predictions rely on data quality: noisy or incomplete datasets reduce accuracy.
System assumes consistent geographic reference (WGS84).
8. Expected Output
Road-segment-level hotspot classification map.
Ranked list of high-risk segments with risk scores.
Visual heatmaps for different times of day/week.
Downloadable reports and GIS layers for planners and policymakers.
System Implementation
8.1 Introduction
The implementation phase translates the proposed architecture, design specifications, and analytical models into a functional working system. This chapter describes how the system is built—starting from data ingestion and preprocessing, followed by model training, risk-score generation, hotspot visualization, and deployment. The main goal of system implementation is to ensure that each module operates as intended and integrates seamlessly with the overall workflow.
The implemented system consists of five major modules:
Data Ingestion and Preprocessing
Feature Engineering
Predictive Modeling
Hotspot Detection and Map Generation
User Interface & Visualization Dashboard
Each module was built using a combination of Python, PostGIS, machine learning libraries, and GIS mapping frameworks.
8.2 Module Implementation
8.2.1 Data Ingestion Module
Objective
To collect, import, and store various open datasets that include traffic, accident, and demographic information.
Implementation Details
Traffic Data:Open traffic sources such as TomTom Traffic Index, OpenTraffic, or city-level datasets were downloaded as CSV/GeoJSON files.These files were imported into PostgreSQL/PostGIS using scripts written in Python’s psycopg2 and GeoPandas.
Accident Data:Historical road crash datasets containing geolocation, severity, vehicle type, and time-of-day attributes were cleaned and formatted.
Demographic Data:Data such as population density, land use type, age distribution, and economic indicators were obtained from government open-data portals.
Processes Implemented
Conversion of raw CSV/Excel → GeoDataFrame
Spatial reference correction (WGS84 standard)
Upload into PostGIS using Python ETL scripts
Logging mechanism to track imported files and data quality
8.2.2 Preprocessing Module
Steps Implemented
Missing Data Handling:Replacing or dropping missing values using interpolation and statistical techniques.
Noise Removal:Outlier detection using the IQR (Interquartile Range) and Z-score methods.
Data Normalization:Applying Min–Max scaling for continuous features.
Map Matching:GPS-based accident points were aligned with road network segments using OSRM/GraphHopper tools.
Spatial Joins:Accidents were matched with demographic zones (wards/taluk/blocks) using GeoPandas spatial join functions.
Outputs
Cleaned accident dataset
Road-segment-level traffic and demographic attributes
Consistent geospatial training dataset for the model
8.2.3 Feature Engineering Module
Features Implemented
| Category | Features Extracted |
| Traffic Features | Avg speed, speed variance, congestion index |
| Accident Features | Crash density, severity index, time-of-day risks |
| Demographic Features | Population density, land use, age ratios |
| Road Geometry | Road curvature, junction density, road class |
Implementation Tools
Python (Pandas, GeoPandas, Shapely)
Spatial buffers (30m, 50m, 100m) for neighborhood crash analysis
Normalization and encoding of categorical features
This module outputs a fully structured feature matrix used for model training.
6.2.4 Predictive Modeling Implementation
Model Selection
After testing multiple algorithms, the following models were implemented:
Random Forest Classifier
Gradient Boosting / XGBoost
Logistic Regression (baseline model)
Training Implementation
70/30 train-test split
5-fold cross-validation
Hyperparameter tuning using Grid Search
Performance Metrics
Accuracy
Precision & Recall
ROC–AUC
Confusion Matrix
Final Model Output
The final selected model assigns a risk score to each road segment:
0 – 0.25: Low risk
0.25 – 0.5: Moderate risk
0.5 – 0.75: High risk
0.75 – 1.0: Critical hotspot
8.2.5 Hotspot Detection & Mapping Module
Steps Implemented
Convert prediction scores → hotspot categories
Generate heatmaps using:Folium (Python)Leaflet.js (web visualization)
Overlay accident points on predicted hotspot zones
Export hotspot layers in:GeoJSONPNG MapPDF Report
Outputs
Hotspot maps
Risk-ranked road segments
Temporal hotspot analysis (day/night/peak hours)
8.2.6 User Interface Implementation
A clean and interactive dashboard was developed.
Technologies Used
Backend: Flask / FastAPI
Frontend: HTML5, CSS3, Bootstrap
Maps: Leaflet.js / Mapbox
Charts: Chart.js
Features Implemented
View predicted hotspots on city map
Filter by severity level
Toggle map layers (traffic, demographic, accident data)
Download hotspot reports
Analyze segment-level risk factors
8.3 Testing Implementation
Types of Testing Conducted
Unit Testing:Tested Python modules for data cleaning, feature engineering, and model prediction.
Integration Testing:Verified database–API–frontend communication.
System Testing:End-to-end testing of prediction flow.
Performance Testing:Validated speed of batch predictions and map rendering.
Results
All modules executed successfully with validated inputs.
No major functional errors observed.
Model output accuracy acceptable for deployment.
8.4 Deployment Implementation
Deployment Setup
Backend API deployed using Docker containers.
Database hosted in PostgreSQL/PostGIS environment.
Dashboard deployed on local server/VM or cloud platform.
Scheduled Processes
Automatic daily update of traffic data
Weekly model retraining if new accident data is added
8.5 Summary
This chapter described the full implementation workflow of the system:
Data collected, processed, and prepared
Features engineered and models trained
Hotspots classified and mapped
Dashboard developed for user interaction
Final system deployed on a modular and scalable architecture
The system is now fully functional, capable of generating reliable accident hotspot predictions using open traffic and demographic datasets.
System Testing
9.1 Introduction
System Testing is a crucial phase that evaluates the functionality, performance, accuracy, and reliability of the developed system. The objective is to ensure that all components—from data ingestion to hotspot prediction and visualization—operate together as intended and meet the specified requirements.
This chapter describes:
The testing strategies used
Different levels of testing applied
Test cases developed
Model performance evaluation
System stability and accuracy assessment
Testing results and conclusion
The goal of testing is to validate that the system is fully functional, free from major defects, and ready for deployment.
9.2 Testing Objectives
The main objectives of system testing are:
To ensure each module performs its intended function.
To evaluate system accuracy in predicting accident hotspots.
To verify the integration between database, backend, machine learning model, and UI.
To check system performance under different data loads.
To identify errors and correct them before deployment.
To confirm that non-functional requirements like usability, reliability, and security are met.
9.3 Types of Testing Performed
The following testing techniques were applied:
9.3.1 Unit Testing
Purpose
To test individual modules or functions in isolation.
Modules Tested
Data cleaning functions
Missing value handling
Feature generation functions
ML model prediction methods
API endpoints
Map rendering functions
Outcome
All functions returned expected outputs. Errors related to data type mismatches were fixed.
9.3.2 Integration Testing
Purpose
To verify correct communication between combined modules.
Integrations Tested
Python preprocessing → PostGIS database
Feature engineering → Model training
Model API → Dashboard map
Dashboard filters → API queries
Outcome
Integration was successful after fixing minor schema mismatches and API timeout issues.
9.3.3 System Testing
Purpose
To test the entire end-to-end workflow.
Workflow Tested
Input raw CSV traffic & accident data
Data cleaning and preprocessing
Feature engineering
Model prediction
Hotspot map visualization
Report generation & export
Outcome
The workflow executed successfully without critical failures.
9.3.4 Performance Testing
Goals
Test the speed of large dataset processing
Check model prediction time
Evaluate dashboard map rendering performance
Results
Preprocessing time: Acceptable
Prediction: Fast (few seconds per batch)
Dashboard loading: 2–5 seconds depending on layers
9.3.5 Usability Testing
Criteria Tested
Dashboard navigation
Clarity of hotspot maps
Ease of applying filters
Report download usability
Participants
5–10 test users (students, staff, or developers)
Findings
Interface is easy to navigate
Hotspot map clarity rated high
Users recommended adding tooltips (implemented)
9.3.6 Security Testing
Checks Performed
Input validation in API
SQL injection tests
Unauthorized API access
Data masking (no sensitive personal data stored)
Outcome
System passed all basic security tests
API rate-limiting added for safety
9.4 Test Case Design
Sample Test Case Table
| TC No. | Test Case Description | Input | Expected Output | Actual Result | Status |
| TC01 | Import raw accident data | CSV file | Data uploaded to DB | Success | Pass |
| TC02 | Clean missing values | Raw dataset | Cleaned dataset | Correct | Pass |
| TC03 | Generate features | Training data | Feature matrix | Correct | Pass |
| TC04 | Train ML model | Feature matrix | Model saved | Success | Pass |
| TC05 | Predict hotspots | Road segments | Risk scores | Correct | Pass |
| TC06 | Display map layers | User selection | Map updates | Working | Pass |
| TC07 | Export hotspot report | Download request | PDF/CSV | Downloaded | Pass |
| Metric | Expected | Result | Status |
| Accuracy | >70% | Achieved | Pass |
| Precision | High | High | Pass |
| Recall | High | Medium-High | Pass |
| ROC–AUC | >0.75 | 0.82 | Pass |
9.5 Error Handling & Bug Fixes
During the testing phase, several issues were found and resolved:
Common Issues
Null geometry errors → Fixed by enforcing spatial validation
API timeout → Added optimized query indexes
Slow map rendering → Compressed GeoJSON output
Prediction mismatch → Standardized feature scaling
After corrections, the system operated smoothly.
9.6 Test Results Summary
The final test results show:
All functional modules work correctly
System integration is stable
Machine learning model performs reliably
Dashboard interface is user-friendly
Performance is acceptable for real-time usage
Security vulnerabilities were minimal and resolved
Overall, the system is tested thoroughly and ready for deployment.
9.7 Conclusion
The system testing phase ensured that the predictive modeling system for road accident hotspot detection meets all project requirements. The system was validated across multiple testing levels—unit, integration, system, usability, performance, and security.
The testing outcome confirms:
Accurate hotspot prediction
Efficient data processing
Reliable UI performance
Smooth end-to-end operation
Thus, the system is stable, robust, and suitable for real-world use by traffic departments, policymakers, and urban planners.
Results And Screenshots
The developed system for Predictive Modeling of Road Accident Hotspots using Open Traffic and Demographic Data produces several meaningful outputs that help evaluate accident patterns and identify high-risk areas. The results include visual maps, dashboard interfaces, analytical charts, danger-zone listings, and model performance indicators. Together, these outputs demonstrate the effectiveness, usability, and accuracy of the system.
10.1 Heatmap Prediction
The heatmap prediction visually highlights accident-prone areas across the region. Red zones indicate critical hotspots, while yellow and green shades represent moderate and low-risk areas. This output confirms that accidents are concentrated around intersections, commercial zones, and heavy-traffic corridors. The heatmap helps authorities quickly identify where preventive measures such as signage, monitoring, or road design changes are most needed. It also validates the model’s ability to capture spatial accident trends.

10.2 Dashboard UI
The dashboard provides an interactive interface for exploring all system outputs in one place. Users can view the accident map, explore analytics, filter data by time or location, and inspect model predictions. The clean layout allows smooth navigation, while charts and tables give clear summaries of accident distribution. This dashboard significantly enhances usability, making the system accessible even to non-technical users such as traffic planners and safety officers.

10.3 Graphs and Analytical Charts
Line Chart
The line chart displays time-based accident trends, showing how accident frequency changes across days, months, or seasons. Peaks often correspond to rush hours or festival periods, confirming known traffic behavior patterns.
Bar Chart
The bar chart compares accident categories such as severity levels, road types, or vehicle involvement. This helps identify which factors contribute most to accidents, revealing patterns like higher incidents on highways or greater severity at intersections.
Pie Chart
The pie chart presents percentage distribution of accident attributes, such as weather conditions or age groups involved. It gives a quick overview of contributing factors and helps understand the composition of accident data.
Together, these charts help validate the dataset and provide a deeper understanding of traffic risk patterns.

10.4 Danger Zone Table
The danger zone table lists the top high-risk road segments identified by the model. It includes fields such as location name, number of historical accidents, predicted risk score, and severity level. This structured output provides an actionable list of zones requiring immediate intervention. Authorities can use it to prioritize road safety measures such as speed control, improved lighting, or surveillance installation.

10.5 Model Output
The model output consists of predicted risk scores for every road segment, classified into categories such as Low, Medium, High, and Critical. These scores serve as the foundation for generating heatmaps, charts, and danger zone tables. The distribution of predicted risk levels aligns with historical accident patterns, showing that the model successfully captures both spatial and statistical relationships between traffic and demographic features.

Overall Discussion
The combined results show that the system effectively integrates data preprocessing, machine learning, and geospatial visualization to identify accident hotspots. The outputs provide clear insights into where and why accidents are more likely to occur. The dashboard improves accessibility, while the analytical charts and tables aid interpretation. The accuracy metrics confirm that the model is reliable, and the predicted maps closely match historical hotspots.
Conclusion
This project focused on developing a predictive system capable of identifying road accident hotspots using open traffic datasets and demographic information. The primary objective was to create a data-driven model that could analyze historical accident trends, extract key contributing factors, and generate accurate predictions of high-risk zones. The presented approach successfully integrates machine learning algorithms, geospatial mapping techniques, and interactive dashboard visualization to support evidence-based decision-making for road safety management.
The results demonstrated that accident occurrences are influenced by multiple factors, including traffic density, population distribution, road type, weather, and temporal patterns. By incorporating these variables, the model predicts accident-prone areas with a high degree of reliability, as validated through performance metrics such as accuracy, precision, recall, and F1-score. The heatmap visualization and danger zone table further provide practical insights by highlighting specific road segments and intersections that require immediate attention.
The interactive dashboard developed in this project enhances user accessibility and enables stakeholders—including transport planners, police departments, and municipal authorities—to explore results intuitively. The combination of graphs, charts, and map visualizations transforms raw data into meaningful information, supporting rapid interpretation and strategic planning. This platform makes the system usable for both technical and non-technical users.
Overall, the project demonstrates that predictive analytics can significantly improve road safety planning. By leveraging open data and modern machine learning techniques, governments and agencies can shift from reactive accident response to proactive accident prevention. The findings confirm that predictive modeling is not only feasible but highly effective for hotspot detection, allowing authorities to implement targeted measures such as improved signage, better lighting, enhanced enforcement, and optimized traffic flow management.
Although the system performs well, there are opportunities for future enhancement. Incorporating real-time traffic feeds, weather APIs, and live sensor data could further improve prediction accuracy. More advanced deep learning techniques or ensemble models can also enhance hotspot detection. Integration with mobile applications or public alert systems could enable community-level safety awareness.
In conclusion, the project successfully meets its objectives by providing a robust, scalable, and practical solution for predicting road accident hotspots. It contributes to the broader goal of reducing accidents, improving public safety, and supporting smart city initiatives. This work forms a strong foundation for future research and real-world deployment in intelligent transportation systems.
Future Scope
The project Predictive Modeling of Road Accident Hotspots using Open Traffic and Demographic Data provides a solid foundation for traffic safety analysis, yet it also opens several opportunities for enhancement. As data availability and computational technologies continue to advance, the system can be expanded to deliver more accurate, real-time, and actionable insights. The following points outline the potential future scope of this work:
1. Integration of Real-Time Data Sources
Currently, the system relies primarily on historical datasets. In the future, it can be enhanced by incorporating real-time traffic feeds, live GPS data, IoT sensor readings, CCTV analytics, and weather updates. Real-time data integration will allow dynamic hotspot prediction and immediate identification of emerging risk zones.
2. Use of Advanced Machine Learning and Deep Learning Models
Although current models provide strong performance, more sophisticated algorithms such as LSTM networks, CNN-based geospatial models, Gradient Boosting, or Hybrid Ensemble methods can improve prediction accuracy. Deep learning approaches can detect complex spatial-temporal patterns that traditional ML techniques may miss.
3. Expansion to Larger Geographic Regions
The system can be scaled to cover entire states, countries, or multiple cities. With cloud computing platforms like AWS, Google Cloud, or Azure, large-scale data processing and high-volume predictions become more feasible, enabling nationwide road safety monitoring systems.
4. Integration with Smart City Infrastructure
As cities adopt smart infrastructure, this model can be integrated with intelligent transportation systems (ITS). Examples include automatic diversion of vehicles during high-risk periods, adaptive traffic signal control, and autonomous vehicle navigation support based on predicted hotspots.
5. Mobile App and Public Alert System
Developing a mobile application can provide real-time alerts to drivers. Users could receive warnings when approaching high-risk zones, similar to hazard alerts in navigation apps. This would significantly improve public awareness and reduce accidents.
6. Automated Accident Reporting and Prediction API
A REST API service can be developed to allow other systems—such as police control rooms, traffic management centers, and navigation apps—to connect and retrieve live hotspot predictions. This enhances interoperability with government and third-party platforms.
7. Enhanced Visual Analytics
Future work can include advanced visualizations such as:
3D geospatial maps
Time-lapse accident animations
Multi-layer demographic overlays
Risk comparison dashboards
These will make the system more intuitive for policymakers and researchers.
8. Incorporation of Human Behavior and Vehicle Factors
Adding more variables such as driver profile, vehicle condition, road quality, and pedestrian density can improve prediction reliability. Behavioral data like speeding incidents, phone usage, or braking patterns (from telematics) can also be integrated in future versions.
9. Collaboration with Government Agencies
Future versions can partner with traffic police departments, road transport authorities, and municipal corporations to obtain richer datasets. Official collaboration will refine prediction accuracy and support real-world deployment.
Summary of Future Scope
Overall, the future scope of this project is vast. By combining real-time data, advanced AI models, and smart infrastructure integration, the system can evolve into a powerful tool for reducing road accidents and supporting intelligent transportation systems. These enhancements will help transform cities into safer, smarter, and more efficient environments.
References
- OpenTraffic v2 platform and code repositories — OpenTraffic project (platform and documentation describing how GPS telemetry is aggregated to road segments).
- OpenTraffic Completion Report (methodology for GPS data collection, privacy, and travel time estimation).
- Thakali, L. (2015). Identification of crash hotspots using kernel density estimation vs kriging (Transportation Research Record / Springer). Comparative analysis of KDE and kriging for hotspot mapping and methodological discussion of network considerations.
- Santos, D., et al. (2021). Machine Learning Approaches to Traffic Accident Analysis (MDPI). Survey and examples of ML methods for crash prediction and hotspot detection, including data fusion approaches.
- Zheng, M., et al. (2024). Optimizing Kernel Density Estimation Bandwidth for Road (Sustainability / MDPI). Discusses sensitivity of KDE to bandwidth and the importance of severity weighting in hotspot identification.
- Mahato, R.K., et al. (2025). Spatial distribution and cluster analysis of road traffic accidents (PLOS or similar). Recent spatiotemporal analyses showing clustering and urban/rural patterns, and utility of combined spatial and demographic features.
- AlHashmi, M.Y.S. (2024). Thesis — Using Machine Learning for Road Accident Severity and Hotspot Identification (RIT repository). Examples of clustering (DBSCAN) and model pipelines for hotspot work.
- Budzyński, A. (2024). A machine learning approach for predicting road accidents (2024 PDF). Recent ML application with ensemble techniques and neural nets.
- Mohammed, S., et al. (2023). GIS-based spatiotemporal analysis for road traffic crashes (ScienceDirect). Case studies applying GIS statistical approaches to identify hotspots and causes.
- Alkaabi, K., et al. (2023). Identification of hotspot areas for traffic accidents (ScienceDirect). Uses GIS statistical approaches and spatial autocorrelation for hotspot identification.
- Rengarasu, T.M., et al. (2025). Network-based Kernel Density Estimators and Gamma regression for hotspot identification — recent application (PDF). Illustrates NKDE and statistical modeling at lixel granularity.