How do road characteristics influence the congestion factor in an urban road network?
Traffic congestion in urban areas is an ongoing issue influenced by numerous factors. In this analysis, we will use the R language and different network visualization methods to better understand the factors affecting traffic congestion in an urban network. Utilization of graph visualization, community detection, and statistical analysis will aid in identifying key determinants and inform evidence-based solutions aimed at mitigating congestion and improving overall transportation productivity.
CityTrans is a dataset of urban transportation systems, available on Kaggle. It is aggregated from multiple sources and includes historical data spanning different time periods. The dataset covers various aspects of the transportation network for analyzing and understanding urban transportation dynamics.
Version 1 repository:
*Unused files are excluded.
Install and load these libraries:
library(tidyverse)
library(igraph)
library(ggraph)
library(tidygraph)
library(ggplot2)
library(lmtest)
Install the CityTrans repository into your local working directory:
Load in these dataframes:
# Road Segments data variables: source, target, length, speed_limit, lanes
roadseg_df <- read_csv("road_segments.csv")
# Traffic Flow data variables: source, target, vehicles, speed, congestion_factor
traffic_df <- read_csv("traffic_flow_data.csv")
Create a network visualization of the road segments, using the source and target data to form nodes and edges.
roadseg_graph <- graph_from_data_frame(d = roadseg_df, directed = FALSE)
set.seed(7)
plot(roadseg_graph,
main = "Road Network Visualization",
vertex.label = NA,
edge.arrow.size = 0.5,
vertex.size = 5,
layout = layout_with_fr)
We can go one step further and highlight the road with the most connections:
# Compute degree centrality
degree_centrality <- degree(roadseg_graph)
most_connected_node <- which.max(degree_centrality)
adjacent_edges <- incident_edges(roadseg_graph, most_connected_node)
# Road 69 is the most connected road; note that it sits at index 63 of the degree_centrality vector
# Highlight the most connected node in red, otherwise in yellow
highlight <- rep("lightgoldenrodyellow", vcount(roadseg_graph))
highlight[most_connected_node] <- "red"
# Highlight the adjacent edges in red, and non-adjacents in black
edge_colors <- rep("black", ecount(roadseg_graph))
edge_colors[unlist(adjacent_edges)] <- "red"
# Thicken adjacent edges for visual clarity
edge_widths <- rep(1, ecount(roadseg_graph))
edge_widths[unlist(adjacent_edges)] <- 2
# Label the road name of the most connected node
vertex_labels <- rep("", vcount(roadseg_graph))
# Enable this to label the most-connected road node
#vertex_labels[most_connected_node] <- as.character(names(degree_centrality)[which.max(degree_centrality)])
# Plot with highlighted node and edges
set.seed(7)
plot(roadseg_graph,
main = "Road Network Visualization (Most-Connection Highlighted)",
vertex.label = vertex_labels,
edge.arrow.size = 0.5,
vertex.size = 5,
layout = layout_with_fr,
vertex.color = highlight,
edge.color = edge_colors,
edge.width = edge_widths)
# Legends
#legend_labels <- c("Road", "Most-Connected Road")
legend_labels <- c("Road", paste("Road", names(degree_centrality)[which.max(degree_centrality)]))
legend_title <- "Legend"
legend("bottomright",
legend = legend_labels,
title = legend_title,
pch = c(21),
lty = c(1),
pt.bg = c("lightgoldenrodyellow", "red"))
This graph reveals that Road 69 has the highest degree, 9. Because the graph is undirected, this means that 9 road segments connect to this road in total.
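As a minimal sketch of how degree is counted in an undirected graph, the toy network below applies the same `degree()` and `which.max()` calls used above. The nodes A to D and their segments are hypothetical and are not taken from CityTrans.

```r
library(igraph)

# Hypothetical toy road network: each row is one undirected segment
toy_edges <- data.frame(
  source = c("A", "A", "A", "B"),
  target = c("B", "C", "D", "C")
)
toy_graph <- graph_from_data_frame(d = toy_edges, directed = FALSE)

# Degree = number of segments touching each node
toy_degree <- degree(toy_graph)
print(toy_degree)

# Name of the most connected node
toy_hub <- names(which.max(toy_degree))
print(toy_hub)
```

Node A touches three segments, so it plays the role that Road 69 plays in the full network.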
We can further confirm this information by analyzing the top ten nodes with the highest degree:
# Betweenness is rescaled to 0-100 relative to the largest value; closeness is multiplied by 100 for readability
# Betweenness centrality
betweenness_centrality <- betweenness(roadseg_graph)
betweenness_percentage <- (betweenness_centrality / max(betweenness_centrality)) * 100
# Closeness centrality
closeness_centrality <- closeness(roadseg_graph) * 100
# Combine centrality measures into a data frame
centrality_df <- data.frame(
node = V(roadseg_graph)$name,
degree = degree_centrality,
betweenness = betweenness_percentage,
closeness = closeness_centrality
)
centrality_df <- centrality_df[order(centrality_df$degree, decreasing = TRUE),]
# Display the centrality measures
print(head(centrality_df, 10))
##    node degree betweenness closeness
## 69   69      9    87.36695 0.3636364
## 15   15      7    63.21845 0.3584229
## 20   20      7    79.74160 0.3144654
## 22   22      7    75.61862 0.3448276
## 56   56      7    83.56184 0.3460208
## 66   66      7    97.62126 0.3521127
## 70   70      7    67.34560 0.3184713
## 81   81      7    79.36368 0.3558719
## 93   93      7    94.02181 0.3508772
## 90   90      7    61.65184 0.3174603
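The data frame above confirms our finding. Furthermore, there are two other centrality measures that we can use to assess node importance: betweenness, which reflects how often a node lies on the shortest paths between other nodes, and closeness, which reflects how short the paths from a node to all other nodes are.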
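To make the betweenness and closeness columns more concrete, here is a small sketch on a hypothetical three-node chain (A, B, C are made-up names). It uses igraph's default, unnormalized definitions, which is an assumption about how the values above should be read.

```r
library(igraph)

# Hypothetical three-node chain: A - B - C
chain <- graph_from_data_frame(
  d = data.frame(source = c("A", "B"), target = c("B", "C")),
  directed = FALSE
)

# B lies on the only shortest path between A and C, so it alone has betweenness 1
chain_betweenness <- betweenness(chain)

# Closeness is the reciprocal of the summed distances to all other nodes:
# A and C: 1 / (1 + 2), B: 1 / (1 + 1)
chain_closeness <- closeness(chain)

print(chain_betweenness)
print(chain_closeness)
```

Because unnormalized closeness divides by a sum over every other node, it shrinks as a network grows, which is consistent with the small closeness values in the table even after multiplying by 100.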
While the visualization of the road network offers a broad overview of its structure, including road nodes and connections, its visual density creates clutter and limits more detailed analysis. We can explore this dataset with other visual models to gain insights from different angles.
Community detection algorithms can help identify clusters, or communities, within the road network that are more densely connected. That information can then be subjected to statistical analysis to reveal critical factors that influence overall traffic flow and congestion.
#Preparing Multilevel (Louvain Method), Edge Betweenness (Girvan-Newman Algorithm), and Fast Greedy (Clauset-Newman-Moore Algorithm)
community_multilevel <- cluster_louvain(roadseg_graph)
community_edge_betweenness <- cluster_edge_betweenness(roadseg_graph)
community_fast_greedy <- cluster_fast_greedy(roadseg_graph)
Calculating modularity score:
modularity_multilevel <- modularity(community_multilevel)
modularity_edge_betweenness <- modularity(community_edge_betweenness)
modularity_fast_greedy <- modularity(community_fast_greedy)
Plotting the three different community detection algorithms:
# Set up the plotting layout to have 3 plots in one row
par(mfrow = c(1, 3), mar = c(5, 4, 4, 2) + 0.1)
# Plot the road network with communities identified by Multilevel
set.seed(42)
plot(community_multilevel, roadseg_graph,
layout = layout_with_fr,
vertex.size = 5,
vertex.label = NA,
edge.color = "black",
main = "Multilevel")
mtext(paste("Modularity:", round(modularity_multilevel, 3)), side = 1, line = 4, adj = 0.5)
# Plot the road network with communities identified by Edge Betweenness
set.seed(42)
plot(community_edge_betweenness, roadseg_graph,
layout = layout_with_fr,
vertex.size = 5,
vertex.label = NA,
edge.color = "black",
main = "Edge-Betweenness")
mtext(paste("Modularity:", round(modularity_edge_betweenness, 3)), side = 1, line = 4, adj = 0.5)
# Plot the road network with communities identified by Fast Greedy
set.seed(42)
plot(community_fast_greedy, roadseg_graph,
layout = layout_with_fr,
vertex.size = 5,
vertex.label = NA,
edge.color = "black",
main = "Fast-Greedy")
mtext(paste("Modularity:", round(modularity_fast_greedy, 3)), side = 1, line = 4, adj = 0.5)
These graphs reveal the structure of the road network, showing multiple communities whose shaded regions overlap in the layout, although each algorithm assigns every node to exactly one community.
Modularity measures the strength of the community structure: values closer to 1.0 signify dense connections within communities and sparse connections between communities.
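As a rough illustration of what a modularity score measures, the sketch below builds a hypothetical network of two triangles joined by a single bridge and scores the obvious two-community split. The graph and the membership vector are assumptions made for the example, not CityTrans data.

```r
library(igraph)

# Hypothetical network: two triangles (A-B-C and D-E-F) joined by one bridge C-D
toy <- graph_from_literal(A-B, B-C, C-A, D-E, E-F, F-D, C-D)

# Assign each triangle to its own community, aligned to the vertex order
groups <- c(A = 1, B = 1, C = 1, D = 2, E = 2, F = 2)
toy_membership <- unname(groups[V(toy)$name])

# Modularity as computed by igraph
q_igraph <- modularity(toy, toy_membership)

# The same value by hand: for each community, the fraction of edges inside it
# minus the squared fraction of edge endpoints attached to it
m <- ecount(toy)  # 7 edges in total
q_manual <- 2 * (3 / m - (7 / (2 * m))^2)

print(c(igraph = q_igraph, manual = q_manual))
```

Both routes give 5/14, roughly 0.357. The single bridge between the triangles is what keeps the score well below 1.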
The modularity score result can be interpreted as follows:
Another visualization we can use marks each road segment by its congestion factor, a measure of the level of traffic congestion derived from historical traffic data.
Start with the same network visualization as before, but color-code each node by its congestion factor:
# Reset the plotting layout
par(mfrow = c(1, 1))
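# Average congestion factor of the segments touching each node (assumed aggregation), aligned to vertex order
endpoints <- c(as.character(traffic_df$source), as.character(traffic_df$target))
node_congestion <- tapply(rep(traffic_df$congestion_factor, 2), endpoints, mean)[V(roadseg_graph)$name]
# Nodes colored by congestion band: above 3, above 2 up to 3, and 2 or below
vertex_colors <- ifelse(node_congestion > 3, "firebrick2",
  ifelse(node_congestion > 2, "goldenrod1", "forestgreen"))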
# Plot traffic flow on the road network with colored nodes
set.seed(2)
plot(roadseg_graph,
edge.color = "black",
vertex.size = 5,
vertex.label = NA,
vertex.color = vertex_colors, # Use defined colors
layout = layout_with_fr,
main = "Traffic Flow by Congestion Visualization")
# Legend (bands match the thresholds used for the colors)
legend_labels <- c("0-2", "2-3", "3+")
legend_title <- "Congestion Level"
legend("bottomright",
legend = legend_labels,
title = legend_title,
pch = c(21),
lty = c(1),
pt.bg = c("forestgreen", "goldenrod1", "firebrick2"))
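The three congestion bands defined by the thresholds in the code above (up to 2, above 2 up to 3, and above 3) can also be produced with base R's `cut()`. This is a minimal sketch on made-up congestion values; the band labels are hypothetical.

```r
# Hypothetical congestion factors, chosen to sit on and around the thresholds
congestion <- c(1.4, 2.0, 2.5, 3.0, 3.7)

# cut() uses right-closed intervals by default: (-Inf, 2], (2, 3], (3, Inf)
congestion_band <- cut(congestion,
                       breaks = c(-Inf, 2, 3, Inf),
                       labels = c("up to 2", "2 to 3", "above 3"))

# Map each band to the palette used in the plots
band_colors <- c("up to 2" = "forestgreen",
                 "2 to 3" = "goldenrod1",
                 "above 3" = "firebrick2")
point_colors <- unname(band_colors[as.character(congestion_band)])

print(data.frame(congestion, congestion_band, point_colors))
```

A value of exactly 2 or 3 falls into the lower band, which matches the strict `>` comparisons in the nested `ifelse()` calls.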
As mentioned previously, the density of this network visualization makes it difficult to discern any worthwhile information at first glance.
Instead, let’s plot this dataset as a scatterplot:
# Define congestion bands: above 3, above 2 up to 3, and 2 or below
traffic_df$color <- ifelse(traffic_df$congestion_factor > 3, "3+",
  ifelse(traffic_df$congestion_factor > 2, "2 - 3", "0 - 2"))
# Plot traffic flow using ggplot2
ggplot(traffic_df,
aes(x = vehicles, y = congestion_factor, color = color)) +
geom_point(size = 3) +
scale_color_manual(values = c("3+" = "firebrick2",
"2 - 3" = "goldenrod1",
"0 - 2" = "forestgreen"),
name = "Congestion Level") +
labs(x = "Number of Vehicles",
y = "Congestion Factor",
title = "Road Congestion by Vehicles Count Visualization",
caption = "Data Source: CityTrans") +
theme_minimal() +
theme(legend.position = "bottom")
This scatterplot provides better insight into the distribution of congestion levels. Most roads have a congestion factor above 2, and vehicle counts are spread widely, from 0 to 100, across the network. Still, no clear trend or pattern is easy to discern.
To determine whether this observation is statistically significant or merely a coincidence, we need more rigorous statistical analysis.
Statistical analysis can help identify and understand the relationships between variables in our CityTrans dataset, such as the correlation between congestion level and the length of a road or the number of lanes it has.
To start, fit a linear regression model to the dataset. Since we have two datasets that share the same road segments (source and target pairs) but carry different attributes, we first merge them:
merged_data <- merge(roadseg_df, traffic_df, by = c("source", "target"))
# Subset merged dataset to include relevant columns
subset_data <- merged_data %>%
select(congestion_factor, length, speed_limit, lanes, vehicles, speed)
# Fit linear regression model (congestion factor as the response variable)
model <- lm(congestion_factor ~ length + speed_limit + lanes + vehicles + speed, data = subset_data)
# Acquiring other relevant statistics
congestion_mean <- mean(traffic_df$congestion_factor)
congestion_median <- median(traffic_df$congestion_factor)
# Output the regression model result
summary(model)
##
## Call:
## lm(formula = congestion_factor ~ length + speed_limit + lanes +
##     vehicles + speed, data = subset_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.98710 -0.27905 -0.03216  0.30354  1.05683
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.7785852  0.1850960  20.414   <2e-16 ***
## length       0.0090293  0.0099983   0.903   0.3676
## speed_limit  0.0032417  0.0035538   0.912   0.3628
## lanes       -0.4835076  0.0252054 -19.183   <2e-16 ***
## vehicles     0.0011724  0.0009725   1.206   0.2295
## speed       -0.0046606  0.0021130  -2.206   0.0286 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3964 on 194 degrees of freedom
## Multiple R-squared: 0.6719, Adjusted R-squared: 0.6634
## F-statistic: 79.45 on 5 and 194 DF, p-value: < 2.2e-16
# Outputs congestion factor's mean and median statistics
message(paste("Mean of Congestion Factor:", congestion_mean,
"\nMedian of Congestion Factor:", congestion_median))
## Mean of Congestion Factor: 2.75420312228909
## Median of Congestion Factor: 2.76843110181654
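The `merge()` call used above keeps only the (source, target) pairs that appear in both tables, so it is worth checking how many segments survive the join. The sketch below shows this on two hypothetical miniature tables; the column values are made up.

```r
# Hypothetical miniature versions of the two tables
toy_segments <- data.frame(source = c(1, 1, 2),
                           target = c(2, 3, 3),
                           lanes  = c(2, 3, 1))
toy_traffic <- data.frame(source = c(1, 2, 4),
                          target = c(2, 3, 5),
                          congestion_factor = c(2.9, 3.4, 1.8))

# merge() keeps only the (source, target) pairs present in both tables
toy_merged <- merge(toy_segments, toy_traffic, by = c("source", "target"))
print(toy_merged)

# Segments dropped because they have no matching traffic record
dropped <- nrow(toy_segments) - nrow(toy_merged)
print(dropped)
```

Comparing `nrow()` before and after the merge is a quick way to confirm that no road segments were silently lost.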
The summary output of our fitted regression model, which uses congestion factor as the response (dependent) variable, contains several key components for our analysis.
A residual is the difference between an observed value of the dependent variable and the value predicted by the model. In general, we want the residuals to be tightly spread around 0, which indicates small prediction errors.
The residual standard error of 0.3964 on 194 degrees of freedom means that predictions typically miss the observed congestion factor by about 0.4, so the predictors do not capture all of the variability. The adjusted R-squared of 0.6634 indicates that approximately 66.34% of the variability in the congestion factor is explained by this model.
The median residual is -0.03216, meaning that the model’s predictions are centered close to the observed values. There are, however, large residuals at both extremes: the model over-predicts the congestion factor by as much as 0.9871 and under-predicts it by as much as 1.05683. We can visualize this with a scatterplot:
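The adjusted R-squared can be recovered from the other figures in the summary output. The sketch below does the arithmetic with the reported multiple R-squared, residual degrees of freedom, and number of predictors; the variable names are chosen for this example.

```r
# Values reported in the summary output
r_squared <- 0.6719
residual_df <- 194                        # n - p - 1
n_predictors <- 5
n_obs <- residual_df + n_predictors + 1   # 200 observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / residual_df
print(round(adj_r_squared, 4))
```

The result rounds to 0.6634, in agreement with the summary output, and the small gap from 0.6719 shows that the penalty for five predictors is modest at this sample size.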
subset_data$residuals <- residuals(model)
# Row positions of the 10 largest positive residuals (under-predictions)
top_outliers <- head(order(subset_data$residuals, decreasing = TRUE), 10)
# Row positions of the 10 largest negative residuals (over-predictions)
bottom_outliers <- head(order(subset_data$residuals), 10)
# Color the highlighted rows red and all other rows black
point_colors <- ifelse(seq_len(nrow(subset_data)) %in% c(top_outliers, bottom_outliers), "red", "black")
# Create a residual plot using ggplot2
ggplot(subset_data, aes(x = congestion_factor, y = residuals)) +
  geom_point(color = point_colors) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Dashed reference line at y = 0
  labs(x = "Congestion Factor", y = "Residuals", title = "Congestion Factor Residual Plot (10 Top/Bottom Residuals Highlighted)")
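As a small check on the definition of a residual, the sketch below recomputes residuals by hand for a model fitted to R's built-in `cars` dataset. That dataset stands in for CityTrans purely for illustration.

```r
# Built-in dataset used purely for illustration (not CityTrans)
demo_model <- lm(dist ~ speed, data = cars)

# Residual = observed value minus fitted (predicted) value
manual_residuals <- cars$dist - fitted(demo_model)
print(all.equal(unname(manual_residuals), unname(residuals(demo_model))))

# With an intercept in the model, least-squares residuals average to zero
print(mean(residuals(demo_model)))
```

A positive residual therefore means the model under-predicted the observed value, and a negative one means it over-predicted.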
Taken from the previous fitted regression model:
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.7785852  0.1850960  20.414   <2e-16 ***
## length       0.0090293  0.0099983   0.903   0.3676
## speed_limit  0.0032417  0.0035538   0.912   0.3628
## lanes       -0.4835076  0.0252054 -19.183   <2e-16 ***
## vehicles     0.0011724  0.0009725   1.206   0.2295
## speed       -0.0046606  0.0021130  -2.206   0.0286 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this model, the congestion factor serves as the response variable, so each coefficient describes the estimated effect of one factor on it. The factors are as follows:
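One way to read these coefficients is to plug them into the fitted equation. The sketch below copies the estimates from the summary output and compares a hypothetical road with two and with three lanes. The helper `predict_congestion()` and the input values are made up for illustration, and their units are assumed to match the dataset's.

```r
# Coefficient estimates copied from the summary output
coefs <- c(intercept = 3.7785852, length = 0.0090293, speed_limit = 0.0032417,
           lanes = -0.4835076, vehicles = 0.0011724, speed = -0.0046606)

# Hypothetical helper that evaluates the fitted linear equation for one road
predict_congestion <- function(length, speed_limit, lanes, vehicles, speed) {
  unname(coefs["intercept"] +
           coefs["length"] * length +
           coefs["speed_limit"] * speed_limit +
           coefs["lanes"] * lanes +
           coefs["vehicles"] * vehicles +
           coefs["speed"] * speed)
}

# Same made-up road, once with two lanes and once with three
two_lane <- predict_congestion(length = 5, speed_limit = 50, lanes = 2,
                               vehicles = 50, speed = 40)
three_lane <- predict_congestion(length = 5, speed_limit = 50, lanes = 3,
                                 vehicles = 50, speed = 40)

print(c(two_lane = two_lane, three_lane = three_lane,
        difference = three_lane - two_lane))
```

Holding the other inputs fixed, each additional lane changes the predicted congestion factor by the lanes coefficient, about -0.48, which is far larger in magnitude than the effect of any other term.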
The fitted regression model shows the relationship between the response variable (congestion factor) and each independent variable, but what about the relationships among the independent variables themselves? To see the strength and direction of the linear relationship between any two numeric variables, we can compute a correlation matrix from the same data. Each coefficient ranges from -1 to 1: the sign indicates a negative or positive linear relationship, and values close to 0 indicate a weak or no linear relationship:
# Round to three decimals so the matrix stays readable
round(cor(subset_data), 3)
##                   congestion_factor length speed_limit  lanes vehicles  speed
## congestion_factor             1.000  0.080      -0.069 -0.813    0.132 -0.156
## length                        0.080  1.000      -0.007 -0.062   -0.018  0.051
## speed_limit                  -0.069 -0.007       1.000  0.085   -0.062  0.383
## lanes                        -0.813 -0.062       0.085  1.000   -0.105  0.093
## vehicles                      0.132 -0.018      -0.062 -0.105    1.000 -0.010
## speed                        -0.156  0.051       0.383  0.093   -0.010  1.000
## residuals                     0.573  0.000       0.000  0.000    0.000  0.000
##                   residuals
## congestion_factor     0.573
## length                0.000
## speed_limit           0.000
## lanes                 0.000
## vehicles              0.000
## speed                 0.000
## residuals             1.000
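The residuals row of this matrix follows from how least squares works: residuals are uncorrelated with every predictor by construction, and their correlation with the response equals the square root of one minus R-squared. The sketch below demonstrates both facts on R's built-in `mtcars` dataset, which is used here only as a stand-in.

```r
# Built-in dataset used purely for illustration (not CityTrans)
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
demo_residuals <- residuals(demo_model)

# Residuals are uncorrelated with every predictor (zero up to rounding error)
print(cor(demo_residuals, mtcars$wt))
print(cor(demo_residuals, mtcars$hp))

# Their correlation with the response equals sqrt(1 - R-squared)
demo_r_squared <- summary(demo_model)$r.squared
print(c(observed = cor(demo_residuals, mtcars$mpg),
        expected = sqrt(1 - demo_r_squared)))

# Applied to the R-squared reported for the congestion model
print(round(sqrt(1 - 0.6719), 4))
```

The last line gives 0.5728, which agrees with the correlation between `residuals` and `congestion_factor` in the matrix, so those entries carry no information beyond the model's R-squared.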
Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the independent variables. A large test statistic with a small p-value indicates that the residuals have non-constant variance, which can affect the reliability of the model’s estimates and inferences. We check for this with the Breusch-Pagan test:
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 9.4056, df = 5, p-value = 0.09394
The test statistic is 9.4056 (5 df), with a p-value of 0.09394. Since the p-value is above the conventional 0.05 threshold, there is no strong evidence that the residuals exhibit heteroscedasticity.
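The p-value reported by the Breusch-Pagan test can be reproduced from the statistic alone, assuming the usual chi-squared reference distribution with one degree of freedom per predictor. A minimal sketch:

```r
# Values reported by bptest(model)
bp_statistic <- 9.4056
bp_df <- 5

# Upper-tail probability of a chi-squared distribution with 5 degrees of freedom
bp_p_value <- pchisq(bp_statistic, df = bp_df, lower.tail = FALSE)
print(round(bp_p_value, 5))

# Statistic that would be needed to reject constant variance at the 5% level
critical_value <- qchisq(0.95, df = bp_df)
print(round(critical_value, 2))
```

The statistic falls short of the 5% critical value, which is another way of seeing why the test does not reject constant variance here.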
Another test we can do is to assess the normality of the residuals in our linear regression model. We can do this by performing the Shapiro-Wilk Normality test:
shapiro.test(residuals(model))
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.99424, p-value = 0.6358
The test statistic is 0.99424, with a high p-value of 0.6358. Thus, there is no strong evidence that the residuals deviate significantly from a normal distribution.
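To get a feel for how the Shapiro-Wilk p-value behaves, the sketch below runs the same test on two simulated samples, one drawn from a normal distribution and one from a strongly skewed exponential distribution. The samples and the seed are arbitrary choices for illustration.

```r
set.seed(7)

# Simulated residual-like samples (hypothetical, not CityTrans)
normal_sample <- rnorm(200)
skewed_sample <- rexp(200)

normal_test <- shapiro.test(normal_sample)
skewed_test <- shapiro.test(skewed_sample)

# A small p-value signals a departure from normality
print(c(normal = normal_test$p.value, skewed = skewed_test$p.value))
```

The skewed sample should yield a p-value far below 0.05, while a genuinely normal sample will usually, though not always, yield a large one, as the regression residuals did.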
In this analysis, we explored the factors influencing traffic congestion using the CityTrans dataset. Through graph visualization, community detection, and statistical analysis, we gained insights into the dynamics of traffic congestion, and also potential underlying determinants.
Our analysis reveals three key findings:
Overall, our analysis contributes to a better understanding of congestion in urban areas and provides valuable insights for policymakers, urban planners, and transportation authorities. By addressing the key factors revealed through these analyses, such as road design and management strategies, we can work towards alleviating congestion and improving the efficiency of urban transportation systems.
While our analysis provides valuable insights, it comes with noticeable limitations. One limitation of historical data is that it does not capture real-time fluctuations in traffic patterns, especially when later policies and changes have altered the physical condition of the road since the data was collected. Moreover, the analysis focuses on road characteristics for the sake of simplicity, overlooking external factors such as weather conditions and individual behavior, even though this information is available in the accompanying dataset within the same repository.
Another limitation is data transparency and accuracy. The dataset used in this analysis was recently published on Kaggle and sourced from a reputable author, but at the moment the author has not provided direct links or references to the traffic data sources. The only metadata the author reveals is that real-time and historical traffic flow information was recorded, the latter for traffic congestion data, and that time delays, weather, and air quality data were randomized. This makes it challenging to verify the accuracy and reliability of the underlying data, as researchers cannot independently verify the findings or assess the quality of the data beyond this repository and the author’s claims.
[1] CityTrans Dataset. Retrieved from Kaggle