Research Question

How do road characteristics influence the congestion factor in an urban road network?

Summary

Traffic congestion in urban areas is an ongoing issue influenced by numerous factors. In this analysis, we will use the R language and different network visualization methods to better understand the factors affecting traffic congestion in an urban network. Utilization of graph visualization, community detection, and statistical analysis will aid in identifying key determinants and inform evidence-based solutions aimed at mitigating congestion and improving overall transportation productivity.

About the Data

CityTrans is a dataset of cities’ transportation systems, available on Kaggle. It is aggregated from multiple sources and includes historical data spanning different time periods. The dataset covers various aspects of the transportation network for analyzing and understanding urban transportation dynamics.

Data Files Used

Version 1 repository:

  • road_segments.csv
  • traffic_flow_data.csv

*Unused files are excluded.

Data Features

  • Source: Starting point of a road segment.
  • Target: End point of the same road segment.
  • Length: The physical distance of the road segment, measured from source to target.
  • Speed Limit: The maximum allowed speed.
  • Lanes: The total number of lanes.
  • Vehicles: The count of vehicles observed using the road segment.
  • Congestion Factor: A measure indicating the level of traffic congestion, derived from historical traffic data for different time periods.

Preparing the Data

Install and load these libraries:

library(tidyverse)
library(igraph)
library(ggraph)
library(tidygraph)
library(ggplot2)
library(lmtest)

Download the CityTrans repository into the local working directory:

“CityTrans” by suraj (Kaggle)

Load in these dataframes:

# Road Segments data variables: source, target, length, speed_limit, lanes
roadseg_df <- read_csv("road_segments.csv")

# Traffic Flow data variables: source, target, vehicles, speed, congestion_factor
traffic_df <- read_csv("traffic_flow_data.csv")

Graph Visualization

Network Visualization I

Create a network visualization of the road segments, using the source and target data to form nodes and edges.

roadseg_graph <- graph_from_data_frame(d = roadseg_df, directed = FALSE)
set.seed(7)
plot(roadseg_graph, 
     main = "Road Network Visualization",
     vertex.label = NA,
     edge.arrow.size = 0.5,
     vertex.size = 5,
     layout = layout.fruchterman.reingold)

Network Visualization II

We can go one step further and highlight the road with the most connections, in-bound and out-bound:

# Compute degree centrality
degree_centrality <- degree(roadseg_graph)
most_connected_node <- which.max(degree_centrality)
adjacent_edges <- incident_edges(roadseg_graph, most_connected_node)
# Road 69 is the most connected road; note that it sits at index 63 of the degree_centrality vector

# Highlight the most connected node in red, otherwise in yellow
highlight <- rep("lightgoldenrodyellow", vcount(roadseg_graph))
highlight[most_connected_node] <- "red"

# Highlight the adjacent edges in red, and non-adjacents in black
edge_colors <- rep("black", ecount(roadseg_graph))
edge_colors[unlist(adjacent_edges)] <- "red"

# Thicken adjacent edges for visual clarity
edge_widths <- rep(1, ecount(roadseg_graph))
edge_widths[unlist(adjacent_edges)] <- 2

# Label the road name of the most connected node
vertex_labels <- rep("", vcount(roadseg_graph))

# Enable this to label the most-connected road node
#vertex_labels[most_connected_node] <- as.character(names(degree_centrality)[which.max(degree_centrality)])

# Plot with highlighted node and edges
set.seed(7)
plot(roadseg_graph, 
     main = "Road Network Visualization (Most-Connection Highlighted)",
     vertex.label = vertex_labels,
     edge.arrow.size = 0.5,
     vertex.size = 5,
     layout = layout.fruchterman.reingold,
     vertex.color = highlight,
     edge.color = edge_colors,
     edge.width = edge_widths)

# Legends
#legend_labels <- c("Road", "Most-Connected Road")
legend_labels <- c("Road", paste("Road", names(degree_centrality)[which.max(degree_centrality)]))
legend_title <- "Legend"
legend("bottomright", 
       legend = legend_labels,
       title = legend_title,
       pch = c(21),
       lty = c(1),
       pt.bg = c("lightgoldenrodyellow", "red"))

This graph reveals that Road 69 has the highest degree, 9. Because the graph is undirected, this means that nine connections in total, incoming and outgoing, are linked to this road.

We can further confirm this information by analyzing the top ten nodes with the highest degree:

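# Betweenness is rescaled so that the largest value in the network is 100%.
# Closeness is also multiplied by 100 for readability, but closeness() is unnormalized by default, so those values are not percentages.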

# Betweenness centrality
betweenness_centrality <- betweenness(roadseg_graph)
betweenness_percentage <- (betweenness_centrality / max(betweenness_centrality)) * 100

# Closeness centrality
closeness_centrality <- closeness(roadseg_graph) * 100

# Combine centrality measures into a data frame
centrality_df <- data.frame(
  node = V(roadseg_graph)$name,
  degree = degree_centrality,
  betweenness = betweenness_percentage,
  closeness = closeness_centrality
)

centrality_df <- centrality_df[order(centrality_df$degree, decreasing = TRUE),]


# Display the centrality measures
print(head(centrality_df, 10))
##    node degree betweenness closeness
## 69   69      9    87.36695 0.3636364
## 15   15      7    63.21845 0.3584229
## 20   20      7    79.74160 0.3144654
## 22   22      7    75.61862 0.3448276
## 56   56      7    83.56184 0.3460208
## 66   66      7    97.62126 0.3521127
## 70   70      7    67.34560 0.3184713
## 81   81      7    79.36368 0.3558719
## 93   93      7    94.02181 0.3508772
## 90   90      7    61.65184 0.3174603

The data frame above confirms our finding. Furthermore, there are two centrality variables that we can use to assess node importance:

  • Betweenness Centrality: The extent to which a node lies on the shortest paths between other nodes in the network. A higher value means that many shortest paths pass through this node.
  • Closeness Centrality: How close a node is to all other nodes in the network. A higher value means that this node can be reached quickly from the other nodes in the network.
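
As a minimal sketch of what degree and closeness measure, the example below computes both by hand on a toy star-shaped network. The edge list and node names are hypothetical, not taken from CityTrans, and base R is used instead of igraph so that every step is visible.

```r
# Toy undirected star: node "A" in the centre with four leaves (hypothetical data)
edges <- data.frame(source = c("A", "A", "A", "A"),
                    target = c("B", "C", "D", "E"),
                    stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$source, edges$target)))
n <- length(nodes)

# Degree: number of edges touching each node
degree_toy <- sapply(nodes, function(v) sum(edges$source == v) + sum(edges$target == v))

# All-pairs shortest path lengths (hop counts) via Floyd-Warshall
dist_mat <- matrix(Inf, n, n, dimnames = list(nodes, nodes))
diag(dist_mat) <- 0
for (i in seq_len(nrow(edges))) {
  dist_mat[edges$source[i], edges$target[i]] <- 1
  dist_mat[edges$target[i], edges$source[i]] <- 1
}
for (k in nodes) {
  for (i in nodes) {
    for (j in nodes) {
      dist_mat[i, j] <- min(dist_mat[i, j], dist_mat[i, k] + dist_mat[k, j])
    }
  }
}

# Raw closeness: 1 / (sum of distances to all other nodes)
closeness_raw <- 1 / rowSums(dist_mat)

# Normalized closeness: multiply by (n - 1) so values fall between 0 and 1
closeness_norm <- (n - 1) * closeness_raw

print(data.frame(degree = degree_toy, closeness_raw, closeness_norm))
```

The raw values shrink as the network grows, which is why the closeness column in the table above stays small even after being multiplied by 100. In igraph, `closeness(graph, normalized = TRUE)` returns the normalized version directly.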

While the visualization of the road network offers a broad overview of its structure, including road nodes and connections, its visual density creates clutter and limits more detailed analysis. We can explore this dataset with other visual models to gain insight from different angles.

Community Detection Clustering

Community detection algorithms can help identify clusters, or communities, within the road network that are more densely connected. That information can then be subjected to statistical analysis to reveal critical factors that influence overall traffic flow and congestion.

#Preparing Multilevel (Louvain Method), Edge Betweenness (Girvan-Newman Algorithm), and Fast Greedy (Clauset-Newman-Moore Algorithm)
community_multilevel <- cluster_louvain(roadseg_graph)
community_edge_betweenness <- cluster_edge_betweenness(roadseg_graph)
community_fast_greedy <- cluster_fast_greedy(roadseg_graph)

Calculating modularity score:

modularity_multilevel <- modularity(community_multilevel)
modularity_edge_betweenness <- modularity(community_edge_betweenness)
modularity_fast_greedy <- modularity(community_fast_greedy)

Plotting the three different community detection algorithms:

# Set up the plotting layout to have 3 plots in one row
par(mfrow = c(1, 3), mar = c(5, 4, 4, 2) + 0.1)

# Plot the road network with communities identified by Multilevel
set.seed(42)
plot(community_multilevel, roadseg_graph,
     layout = layout_with_fr,
     vertex.size = 5, 
     vertex.label = NA,
     edge.color = "black",
     main = "Multilevel")
mtext(paste("Modularity:", round(modularity_multilevel, 3)), side = 1, line = 4, adj = 0.5)

# Plot the road network with communities identified by Edge Betweenness
set.seed(42)
plot(community_edge_betweenness, roadseg_graph,
     layout = layout_with_fr,
     vertex.size = 5, 
     vertex.label = NA,
     edge.color = "black",
     main = "Edge-Betweenness")
mtext(paste("Modularity:", round(modularity_edge_betweenness, 3)), side = 1, line = 4, adj = 0.5)

# Plot the road network with communities identified by Fast Greedy
set.seed(42)
plot(community_fast_greedy, roadseg_graph,
     layout = layout_with_fr,
     vertex.size = 5, 
     vertex.label = NA,
     edge.color = "black",
     main = "Fast-Greedy")
mtext(paste("Modularity:", round(modularity_fast_greedy, 3)), side = 1, line = 4, adj = 0.5)

These graphs reveal the structure of the road network, showing multiple communities whose shaded regions overlap in the layout.

Modularity measures the strength of the community structure: values closer to 1.0 signify dense connections within communities and sparse connections between them.
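
To make the definition concrete, here is a minimal sketch that computes modularity by hand using the standard formula Q = 1/(2m) Σ (A_ij − k_i k_j / 2m) over same-community node pairs. The toy network (two triangles joined by a single bridge) and its community labels are hypothetical, not taken from CityTrans.

```r
# Toy undirected network: two triangles joined by one bridge edge (hypothetical data)
edges <- data.frame(
  source = c(1, 2, 1, 4, 5, 4, 3),
  target = c(2, 3, 3, 5, 6, 6, 4)
)
membership <- c(1, 1, 1, 2, 2, 2)  # community label for nodes 1..6

# Build the symmetric adjacency matrix
n <- length(membership)
adj <- matrix(0, n, n)
for (i in seq_len(nrow(edges))) {
  adj[edges$source[i], edges$target[i]] <- 1
  adj[edges$target[i], edges$source[i]] <- 1
}

m <- sum(adj) / 2   # number of edges
k <- rowSums(adj)   # node degrees
same_community <- outer(membership, membership, "==")

# Observed within-community edges minus the number expected at random
Q <- sum((adj - outer(k, k) / (2 * m)) * same_community) / (2 * m)
print(Q)
```

With real igraph objects, `modularity()` applied to a community object reports the same quantity for the partition that the algorithm found.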

The modularity score result can be interpreted as follows:

  • Multilevel (0.455): There is a strong community structure, and roads (network) can be meaningfully partitioned into major communities with relatively high internal connectivity (traffic flow regulation).
  • Edge-Betweenness (0.442): This algorithm forms communities by repeatedly removing the edges with the highest betweenness, that is, key roads whose removal would significantly impact the network’s connectivity and result in more traffic congestion. Its partition scores slightly lower than the other two.
  • Fast-Greedy (0.459): There is a strong community structure detected, with broad communities that reflect the primary groupings of roads (nodes) within the network.

Traffic Flow Visualization I

Another visualization that we can use is to mark each road segment based on its congestion factor, a measure indicating the level of traffic congestion, derived from historical traffic data for different time periods.

Start with the same network visualization as before, but employ color-coding based on the node’s congestion factor:

# Reset the plotting layout
par(mfrow = c(1, 1))

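# Nodes colored by congestion factor (> 3 red, > 2 yellow, otherwise green)
# Note: congestion_factor has one value per segment (row), so these colors are matched to vertices by position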
vertex_colors <- ifelse(traffic_df$congestion_factor > 3, "firebrick2",
                        ifelse(traffic_df$congestion_factor > 2, "goldenrod1", "forestgreen"))

# Plot traffic flow on the road network with colored nodes
set.seed(2)
plot(roadseg_graph, 
     edge.color = "black",
     vertex.size = 5,
     vertex.label = NA,
     vertex.color = vertex_colors,  # Use defined colors
     layout = layout.fruchterman.reingold,
     main = "Traffic Flow by Congestion Visualization")

# Legend
legend_labels <- c("0-2", "2-3", "3+")
legend_title <- "Congestion Level"
legend("bottomright", 
       legend = legend_labels,
       title = legend_title,
       pch = c(21),
       lty = c(1),
       pt.bg = c("forestgreen", "goldenrod1", "firebrick2"))
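
Because the congestion factor is recorded per road segment (edge) rather than per node, one possible refinement is to aggregate it to the node level before coloring. The sketch below is an assumption about how that could be done, not the method used for the figure above. It uses a hypothetical toy table, `traffic_toy`, and averages the congestion factor of all segments touching each node.

```r
# Toy segment table (hypothetical values); one row per road segment (edge)
traffic_toy <- data.frame(
  source = c("1", "1", "2", "3"),
  target = c("2", "3", "3", "4"),
  congestion_factor = c(1.5, 2.5, 3.5, 4.0),
  stringsAsFactors = FALSE
)

# Stack both endpoints so every segment contributes to both of its nodes
endpoint_long <- data.frame(
  node = c(traffic_toy$source, traffic_toy$target),
  congestion_factor = rep(traffic_toy$congestion_factor, times = 2),
  stringsAsFactors = FALSE
)

# Mean congestion factor of the segments touching each node
node_congestion <- vapply(
  split(endpoint_long$congestion_factor, endpoint_long$node),
  mean, numeric(1)
)

# Same thresholds as the plot above: > 3 red, > 2 yellow, otherwise green
node_colors <- ifelse(node_congestion > 3, "firebrick2",
                      ifelse(node_congestion > 2, "goldenrod1", "forestgreen"))
print(data.frame(node_congestion, node_colors))
```

With the objects from this report, the named vector could then be reordered with `node_colors[V(roadseg_graph)$name]` before being passed to `vertex.color`, so that each color lands on the node it was computed for.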

Traffic Flow Visualization II

As mentioned previously, the dense cluttering of this data through this network visualization model makes it difficult to discern any worthwhile information at first glance.

Instead, let’s transform this dataset into a scatterplot model:

# Define colors based on congestion factor
traffic_df$color <- ifelse(traffic_df$congestion_factor > 3, "3+",
                           ifelse(traffic_df$congestion_factor > 2, "2 - 3", "0 - 2"))

# Plot traffic flow using ggplot2
ggplot(traffic_df, 
       aes(x = vehicles, y = congestion_factor, color = color)) + 
  geom_point(size = 3) +
  scale_color_manual(values = c("3+" = "firebrick2", 
                                "2 - 3" = "goldenrod1", 
                                "0 - 2" = "forestgreen"),
                     name = "Congestion Level") +
  labs(x = "Number of Vehicles", 
       y = "Congestion Factor", 
       title = "Road Congestion by Vehicles Count Visualization",
       caption = "Data Source: CityTrans") +
  theme_minimal() +
  theme(legend.position = "bottom")

This scatterplot provides better insight into the distribution of congestion levels. Most roads fall above the lowest congestion band, and vehicle counts are spread widely, from 0 to 100, across the network. Still, no clear trend or pattern between vehicle count and congestion factor can be easily discerned.

However, to determine whether this observation is statistically significant or merely a coincidence, we need rigorous statistical analysis.

Statistical Analysis

Performing statistical analysis can help identify and understand the relationships between variables in our CityTrans dataset, such as the correlation between congestion level and the length of a road or the number of lanes it has.

To start, fit a linear regression model to the dataset. Since we have two different datasets sharing the same road segments (nodes) but with different attributes, we will merge them together:

merged_data <- merge(roadseg_df, traffic_df, by = c("source", "target"))

# Subset merged dataset to include relevant columns
subset_data <- merged_data %>%
  select(congestion_factor, length, speed_limit, lanes, vehicles, speed)

# Fit linear regression model (congestion factor as the response variable)
model <- lm(congestion_factor ~ length + speed_limit + lanes + vehicles + speed, data = subset_data)

# Acquiring other relevant statistics
congestion_mean <- mean(traffic_df$congestion_factor)
congestion_median <- median(traffic_df$congestion_factor)

# Output the regression model results
summary(model)
## 
## Call:
## lm(formula = congestion_factor ~ length + speed_limit + lanes + 
##     vehicles + speed, data = subset_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98710 -0.27905 -0.03216  0.30354  1.05683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7785852  0.1850960  20.414   <2e-16 ***
## length       0.0090293  0.0099983   0.903   0.3676    
## speed_limit  0.0032417  0.0035538   0.912   0.3628    
## lanes       -0.4835076  0.0252054 -19.183   <2e-16 ***
## vehicles     0.0011724  0.0009725   1.206   0.2295    
## speed       -0.0046606  0.0021130  -2.206   0.0286 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3964 on 194 degrees of freedom
## Multiple R-squared:  0.6719, Adjusted R-squared:  0.6634 
## F-statistic: 79.45 on 5 and 194 DF,  p-value: < 2.2e-16
# Outputs congestion factor's mean and median statistics
message(paste("Mean of Congestion Factor:", congestion_mean,
              "\nMedian of Congestion Factor:", congestion_median))
## Mean of Congestion Factor: 2.75420312228909 
## Median of Congestion Factor: 2.76843110181654

The summary output of our fitted regression model, with congestion factor as the response variable, contains several key components for our analysis.

Residual

A residual is the difference between an observed value of the dependent variable and the value predicted by the model. In general, we aim for a tight residual spread around 0, which indicates that the model’s predictions are not systematically off.

The residual standard error of 0.3964 on 194 degrees of freedom means that predictions typically miss the observed congestion factor by about 0.4 units. The variables do not capture all of the variability, but the adjusted R-squared of 0.6634 indicates that approximately 66.34% of the variability in the congestion factor can be explained by this model.

The median residual is -0.03216, meaning that the residuals are centered close to zero. There are, however, larger residuals at both extremes: the model over-predicts the congestion factor by as much as 0.9871 (the minimum residual) and under-predicts it by as much as 1.05683 (the maximum residual). We can visualize this with a scatterplot:

subset_data$residuals <- residuals(model)

# Identify top 10 positive outliers (under-predictions)
top_outliers <- head(order(subset_data$residuals, decreasing = TRUE), 10)

# Identify top 10 negative outliers (over-predictions)
bottom_outliers <- head(order(subset_data$residuals), 10)

# Create a residual plot using ggplot2
ggplot(subset_data, aes(x = congestion_factor, y = residuals)) +
  geom_point(color = ifelse(seq_len(nrow(subset_data)) %in% c(top_outliers, bottom_outliers), "red", "black")) + 
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a dashed line at y = 0
  labs(x = "Congestion Factor", y = "Residuals", title = "Congestion Factor Residual Plot (10 Top/Bottom Residuals Highlighted)")
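
The sign convention for residuals can be checked on a small example. The sketch below uses simulated data (an assumption; the values are not from CityTrans) to show that a residual is the observed value minus the fitted value, and it plots the residuals against the fitted values, which is the more conventional diagnostic view.

```r
# Simulated data (hypothetical), standing in for the merged CityTrans table
set.seed(1)
toy <- data.frame(lanes = rep(1:4, each = 10))
toy$congestion_factor <- 4 - 0.5 * toy$lanes + rnorm(nrow(toy), sd = 0.4)

toy_model <- lm(congestion_factor ~ lanes, data = toy)

# A residual is the observed value minus the fitted value
toy$fitted <- fitted(toy_model)
toy$residual <- toy$congestion_factor - toy$fitted

# Negative residual: fitted > observed (over-prediction)
# Positive residual: fitted < observed (under-prediction)
toy$direction <- ifelse(toy$residual < 0, "over-predicted", "under-predicted")
print(table(toy$direction))

# Residuals against fitted values, with a reference line at zero
plot(toy$fitted, toy$residual,
     xlab = "Fitted congestion factor", ylab = "Residual",
     main = "Residuals vs Fitted (toy data)")
abline(h = 0, lty = 2, col = "red")
```

Plotting against the fitted values avoids the upward slope that appears when residuals are plotted against the observed response, since the observed value contains the residual itself.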

Coefficients

Taken from the previous fitted regression model:

## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7785852  0.1850960  20.414   <2e-16 ***
## length       0.0090293  0.0099983   0.903   0.3676    
## speed_limit  0.0032417  0.0035538   0.912   0.3628    
## lanes       -0.4835076  0.0252054 -19.183   <2e-16 ***
## vehicles     0.0011724  0.0009725   1.206   0.2295    
## speed       -0.0046606  0.0021130  -2.206   0.0286 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this model, the congestion factor serves as the response (dependent) variable, so each coefficient describes how another factor relates to it. The results are as follows:

  • Intercept: estimate of 3.778 with a p-value of <0.001, indicating extremely strong significance. This is the predicted value of our dependent variable (congestion factor) when all independent variables are set to zero. It doesn’t make much sense as a standalone statistic, and therefore must be read alongside the other coefficients. A non-coefficient statistic that is easier to interpret is:
  • The congestion factor mean of 2.754 and median of 2.768. We report the median as well because it is robust to outliers, such as those observed in the preceding analysis of residuals. Together, these show the typical congestion level in the sampled dataset and suggest that congestion affects many roads.
  • Lanes: estimate of -0.4835076 with a p-value of <0.001, indicating extremely strong significance. The standard error of 0.025 is small relative to the estimate, so we can confidently infer that each additional lane is associated with a substantial reduction in the congestion factor, of about 0.48, with the other variables held constant.
  • Speed: estimate of -0.0046606 with a p-value of 0.0286, which is significant at the 5% level. The standard error of 0.002 is large relative to the estimate, so the effect is estimated less precisely, but higher observed average speed is associated with slightly lower congestion. This is a correlation and does not by itself imply causation: lower congestion may itself allow higher speeds.
  • Speed limit, vehicle count, and road length all have small positive estimates, which would suggest lower congestion for lower or shorter values, but their p-values (0.23 to 0.37) are well above 0.05, making them unreliable as explanatory variables for the congestion factor.
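
As a worked check on these interpretations, the sketch below rebuilds 95% confidence intervals from the reported estimates and standard errors, and computes the expected change in congestion factor for two additional lanes. The numbers are copied from the coefficient table above; the variable names are only illustrative.

```r
# Estimates and standard errors copied from the summary(model) output above
est <- c(lanes = -0.4835076, speed = -0.0046606)
se  <- c(lanes =  0.0252054, speed =  0.0021130)
df_resid <- 194

# Two-sided 95% confidence intervals: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# Expected change in congestion factor for two extra lanes, other variables held fixed
delta_two_lanes <- 2 * est[["lanes"]]
print(delta_two_lanes)
```

Both intervals lie entirely below zero, which is consistent with the two coefficients being significant at the 5% level. With the fitted model object available, `confint(model)` returns the same intervals for all coefficients.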

Correlation Matrix

The fitted regression model shows the relationship between the response variable (congestion factor) and each independent variable, but what about the relationships among the independent variables themselves? To see the strength and direction of the linear relationship between any two numeric variables, we can compute a correlation matrix from the same data as above. Values range from -1 to 1, indicating a negative or positive linear relationship, while values close to 0 indicate a weak or no linear relationship. Because the residuals column was added to subset_data for the residual plot, it appears in the matrix as well:

options(scipen = 999)
cor(subset_data)
##                   congestion_factor                    length
## congestion_factor        1.00000000  0.0803482312266214082630
## length                   0.08034823  1.0000000000000000000000
## speed_limit             -0.06855566 -0.0067472999815900392903
## lanes                   -0.81269287 -0.0616198799158781049257
## vehicles                 0.13171087 -0.0175869407624822289560
## speed                   -0.15597566  0.0512674858580876510739
## residuals                0.57282267  0.0000000000000001066947
##                                 speed_limit                      lanes
## congestion_factor -0.0685556602208557408495 -0.81269286945494578766613
## length            -0.0067472999815900392903 -0.06161987991587810492566
## speed_limit        1.0000000000000000000000  0.08527346449708933517897
## lanes              0.0852734644970893351790  1.00000000000000000000000
## vehicles          -0.0622245940424820814796 -0.10500949748596055677297
## speed              0.3830088452800300413692  0.09301196229960378980550
## residuals          0.0000000000000003989702  0.00000000000000009024039
##                                     vehicles                      speed
## congestion_factor  0.13171087346286755592750 -0.15597565859919718112003
## length            -0.01758694076248222895598  0.05126748585808765107386
## speed_limit       -0.06222459404248208147958  0.38300884528003004136920
## lanes             -0.10500949748596055677297  0.09301196229960378980550
## vehicles           1.00000000000000000000000 -0.01011177003758201471684
## speed             -0.01011177003758201471684  1.00000000000000000000000
## residuals          0.00000000000000001271505  0.00000000000000003474927
##                                   residuals
## congestion_factor 0.57282266918204027827954
## length            0.00000000000000010669469
## speed_limit       0.00000000000000039897025
## lanes             0.00000000000000009024039
## vehicles          0.00000000000000001271505
## speed             0.00000000000000003474927
## residuals         1.00000000000000000000000
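
Two features of this matrix follow from how least squares works: the residuals are uncorrelated with every predictor (the values around 1e-16 are floating-point zeros), and the correlation between the response and the residuals equals sqrt(1 − R²), here sqrt(1 − 0.6719) ≈ 0.5728. The sketch below illustrates both points, using the built-in mtcars data as a stand-in for subset_data (an assumption for illustration only), and rounds the matrix for readability.

```r
# Built-in mtcars data stands in for subset_data (an assumption for illustration)
toy_model <- lm(mpg ~ wt + hp, data = mtcars)
toy_data <- mtcars[, c("mpg", "wt", "hp")]
toy_data$residuals <- residuals(toy_model)

# Rounding keeps the matrix readable
cor_mat <- round(cor(toy_data), 3)
print(cor_mat)

# cor(response, residuals) should match sqrt(1 - R^2)
r_squared <- summary(toy_model)$r.squared
check <- c(from_cor = cor(toy_data$mpg, toy_data$residuals),
           from_r2  = sqrt(1 - r_squared))
print(check)
```

Applying `round(cor(subset_data), 3)` in the same way would give a more compact version of the matrix above.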

Residual Heteroscedasticity

Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the independent variables. A high test statistic with a small p-value indicates that the residuals have non-constant variance, which can affect the reliability of the model’s estimates and inferences. We check for this with the Breusch-Pagan test:

bptest(model)
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 9.4056, df = 5, p-value = 0.09394

The test statistic is 9.4056 (5 df) with a p-value of 0.09394. Since this is above 0.05, there is no strong evidence that the residuals exhibit heteroscedasticity, although the result would be significant at the 10% level.
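
The reported p-value can be reproduced from the test statistic alone, since the Breusch-Pagan statistic is compared against a chi-squared distribution with as many degrees of freedom as there are predictors. A minimal sketch, with the statistic and degrees of freedom copied from the output above:

```r
# Statistic and degrees of freedom copied from the bptest(model) output above
bp_stat <- 9.4056
bp_df <- 5

# Upper-tail chi-squared probability
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(round(bp_p, 5))

# Decision at two common significance levels
reject_at_5  <- bp_p < 0.05
reject_at_10 <- bp_p < 0.10
print(c(reject_at_5 = reject_at_5, reject_at_10 = reject_at_10))
```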

Residual Normality

Another test we can do is to assess the normality of the residuals in our linear regression model. We can do this by performing the Shapiro-Wilk Normality test:

shapiro.test(residuals(model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model)
## W = 0.99424, p-value = 0.6358

The test statistic is W = 0.99424 with a high p-value of 0.6358. Thus, there is no evidence that the residuals deviate significantly from a normal distribution.

Conclusion

In this analysis, we explored the factors influencing traffic congestion using the CityTrans dataset. Through graph visualization, community detection, and statistical analysis, we gained insights into the dynamics of traffic congestion, and also potential underlying determinants.

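The three key findings of our analysis are:

  • Road 69 is the most connected node in the network, with a degree of 9.
  • All three community detection algorithms find a similar community structure, with modularity scores between 0.442 and 0.459.
  • The number of lanes is the strongest factor associated with the congestion factor (estimate -0.48, p < 0.001), followed by observed speed (p = 0.0286); road length, speed limit, and vehicle count are not statistically significant.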

Overall, our analysis contributes to a better understanding of congestion in urban areas, and provides valuable insights for policymakers, urban planners, and transportation authorities. By addressing the key factors revealed through these analyses, such as road design and management strategies, we can work towards alleviating congestion and improving the efficiency of urban transportation systems.

Limitations

While our analysis provides valuable insights, it comes with noticeable limitations. One limitation of historical data is that it does not capture real-time fluctuations in traffic patterns, especially when later policies and changes have altered the physical condition of the roads after the data was collected. Moreover, the analysis emphasizes road characteristics for the sake of simplicity, overlooking external factors such as weather conditions and individual behavior, despite the availability of this information in the accompanying datasets within the same repository.

Another limitation is data transparency and accuracy. The dataset used in this analysis was recently published on Kaggle and comes from a reputable author, but at the moment the author has not provided direct links or references to the traffic data sources. The only metadata the author has revealed is that real-time and historical traffic flow information was recorded, the latter for the traffic congestion data, and that time delays, weather, and air quality data were randomized. This makes it challenging to verify the accuracy and reliability of the underlying data, as it limits the ability of researchers to independently verify the findings and assess the quality of the data beyond this repository and the author’s claims.

References

[1] CityTrans Dataset. Retrieved from Kaggle