Exploratory data analysis (EDA) relies on graphical summaries—boxplots, histograms, scatter plots—to reveal a dataset’s salient features before formal modeling. Yet modern data are increasingly recorded not as single scalar values but as intervals, histograms, or full empirical distributions. These richer objects are collectively known as symbolic data (Billard and Diday, 2006). For example, when individual-level measurements are aggregated by group, each variable naturally becomes an interval \([a, b]\) rather than a point value.
Conventional R graphics cannot natively accommodate interval-valued observations. ggInterval (formerly ggESDA) bridges this gap by extending the ggplot2 framework to visualize interval-valued symbolic data. The package provides a family of plot functions with a uniform interface:
ggInterval_<GRAPH_TYPE>(data, mapping = aes(...), ...)
where data is a symbolic data object and
mapping uses the standard ggplot2::aes()
syntax. Because most plot functions return ggplot2 objects,
users can freely add themes, scales, labels, and additional layers.
The package ships with several symbolic datasets. The most commonly used are:
facedata – 27 faces (9 individuals \(\times\) 3 replicates) with 6 interval-valued facial measurements (AD, BC, AH, DH, EH, GH).
Environment – 14 cities described by 17 variables, including both interval-valued and modal multi-valued variables.
oils – 8 types of oils with 4 interval-valued chemical properties.
data(facedata)
facedata
#> # A tibble: 27 × 6
#> AD BC AH DH
#> * <symblc_n> <symblc_n> <symblc_n> <symblc_n>
#> 1 [155.00 : 157.00] [58.00 : 61.01] [100.45 : 103.28] [105.00 : 107.30]
#> 2 [154.00 : 160.01] [57.00 : 64.00] [101.98 : 105.55] [104.35 : 107.30]
#> 3 [154.01 : 161.00] [57.00 : 63.00] [99.36 : 105.65] [101.04 : 109.04]
#> 4 [168.86 : 172.84] [58.55 : 63.39] [102.83 : 106.53] [122.38 : 124.52]
#> 5 [169.85 : 175.03] [60.21 : 64.38] [102.94 : 108.71] [120.24 : 124.52]
#> 6 [168.76 : 175.15] [61.40 : 63.51] [104.35 : 107.45] [120.93 : 125.18]
#> 7 [155.26 : 160.45] [53.15 : 60.21] [95.88 : 98.49] [91.68 : 94.37]
#> 8 [156.26 : 161.31] [51.09 : 60.07] [95.77 : 99.36] [91.21 : 96.83]
#> 9 [154.47 : 160.31] [55.08 : 59.03] [93.54 : 98.98] [90.43 : 96.43]
#> 10 [164.00 : 168.00] [55.01 : 60.03] [120.28 : 123.04] [117.52 : 121.02]
#> # ℹ 17 more rows
#> # ℹ 2 more variables: EH <symblc_n>, GH <symblc_n>
summary(facedata)
#> $symbolic_interval
#> AD BC AH DH
#> Min. [149.34 : 155.32] [50.36 : 55.23] [93.54 : 98.49] [90.43 : 94.37]
#> 1st Qu. [154.56 : 158.91] [53.60 : 59.08] [102.88 : 106.99] [105.18 : 111.07]
#> Median [163.00 : 167.07] [57.00 : 63.00] [115.26 : 119.60] [114.28 : 117.41]
#> Mean [162.90 : 162.90] [60.00 : 60.00] [113.17 : 113.17] [113.20 : 113.20]
#> 3rd Qu. [167.13 : 171.19] [61.22 : 65.04] [117.91 : 121.60] [117.10 : 121.72]
#> Max. [169.85 : 175.15] [66.03 : 69.01] [123.75 : 127.29] [124.08 : 127.78]
#> Std. [6.82 : 6.82] [4.42 : 4.42] [9.08 : 9.08] [9.48 : 9.48]
#> EH GH
#> Min. [49.41 : 54.64] [48.27 : 50.61]
#> 1st Qu. [54.65 : 58.49] [51.60 : 56.03]
#> Median [56.73 : 61.72] [55.32 : 60.46]
#> Mean [59.85 : 59.85] [57.69 : 57.69]
#> 3rd Qu. [60.96 : 65.80] [58.52 : 63.84]
#> Max. [63.89 : 69.07] [64.20 : 67.80]
#> Std. [4.04 : 4.04] [4.63 : 4.63]
classic2sym
Classical (scalar) data can be converted to symbolic interval data by
aggregating within groups. The classic2sym() function
supports several grouping strategies:
myIris <- classic2sym(iris, groupby = "Species")
myIris$intervalData
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa [4.30 : 5.80] [2.30 : 4.40] [1.00 : 1.90] [0.10 : 0.60]
#> versicolor [4.90 : 7.00] [2.00 : 3.40] [3.00 : 5.10] [1.00 : 1.80]
#> virginica [4.90 : 7.90] [2.20 : 3.80] [4.50 : 6.90] [1.40 : 2.50]
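The bounds printed above coincide with the per-species minimum and maximum of each column. A minimal base-R sketch of that aggregation follows, assuming plain min/max bounds; `to_interval` is a hypothetical helper written for illustration, not a ggInterval function, and the package's internals may differ.

```r
# Hypothetical helper (not part of ggInterval): per-group [min, max]
# bounds for a single numeric column.
to_interval <- function(x, group) {
  lower <- tapply(x, group, min)
  upper <- tapply(x, group, max)
  data.frame(
    lower = as.vector(lower),
    upper = as.vector(upper),
    row.names = names(lower)
  )
}

# One row per species, with lower and upper bounds of Sepal.Length
sepal_length <- to_interval(iris$Sepal.Length, iris$Species)
sepal_length
```

The same helper can be applied to each numeric column in turn to build a full table of intervals.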
The groupby argument accepts:
a column name such as "Species";
"kmeans" or "hclust" for unsupervised clustering (with k groups);
"customize" for user-supplied minimum and maximum data frames.
myIris_km <- classic2sym(iris, groupby = "kmeans", k = 5)
myIris_km$intervalData
#> # A tibble: 5 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <symblc_n> <symblc_n> <symblc_n> <symblc_n>
#> 1 [5.60 : 7.00] [2.20 : 3.40] [4.30 : 5.60] [1.20 : 2.40]
#> 2 [4.90 : 5.80] [3.30 : 4.40] [1.20 : 1.90] [0.10 : 0.60]
#> 3 [4.30 : 5.00] [2.30 : 3.60] [1.00 : 1.90] [0.10 : 0.30]
#> 4 [6.30 : 7.90] [2.50 : 3.80] [5.10 : 6.90] [1.60 : 2.50]
#> 5 [4.90 : 6.10] [2.00 : 3.00] [3.00 : 4.50] [1.00 : 1.70]
#> # ℹ 1 more variable: Species <symblc_m>
RSDA2sym
If you already have an RSDA symbolic_tbl object, wrap it
with RSDA2sym() so it can be used with all ggInterval plot
functions:
mySym <- RSDA2sym(Cardiological)
mySym$intervalData
ggInterval provides S3 methods for common statistical summaries on symbolic interval data.
summary() reports the minimum, quartiles, median, mean,
maximum, and standard deviation for each interval-valued variable:
summary(facedata)
#> $symbolic_interval
#> AD BC AH DH
#> Min. [149.34 : 155.32] [50.36 : 55.23] [93.54 : 98.49] [90.43 : 94.37]
#> 1st Qu. [154.56 : 158.91] [53.60 : 59.08] [102.88 : 106.99] [105.18 : 111.07]
#> Median [163.00 : 167.07] [57.00 : 63.00] [115.26 : 119.60] [114.28 : 117.41]
#> Mean [162.90 : 162.90] [60.00 : 60.00] [113.17 : 113.17] [113.20 : 113.20]
#> 3rd Qu. [167.13 : 171.19] [61.22 : 65.04] [117.91 : 121.60] [117.10 : 121.72]
#> Max. [169.85 : 175.15] [66.03 : 69.01] [123.75 : 127.29] [124.08 : 127.78]
#> Std. [6.82 : 6.82] [4.42 : 4.42] [9.08 : 9.08] [9.48 : 9.48]
#> EH GH
#> Min. [49.41 : 54.64] [48.27 : 50.61]
#> 1st Qu. [54.65 : 58.49] [51.60 : 56.03]
#> Median [56.73 : 61.72] [55.32 : 60.46]
#> Mean [59.85 : 59.85] [57.69 : 57.69]
#> 3rd Qu. [60.96 : 65.80] [58.52 : 63.84]
#> Max. [63.89 : 69.07] [64.20 : 67.80]
#> Std. [4.04 : 4.04] [4.63 : 4.63]
cor() and cov() compute correlation and covariance
matrices. Several methods are available for interval data, including
"centers", "B" (Billard), "BD"
(Billard–Diday), and "BG" (Bertrand–Goupil):
cor(facedata)
#> AD BC AH DH EH GH
#> AD 1.000000000 0.6882596 0.3770045 0.6305841 0.005217304 0.1873164
#> BC 0.688259575 1.0000000 0.2910128 0.4634647 0.193673951 0.2351438
#> AH 0.377004536 0.2910128 1.0000000 0.7062072 -0.376548510 -0.6085799
#> DH 0.630584078 0.4634647 0.7062072 1.0000000 -0.471592548 -0.2422946
#> EH 0.005217304 0.1936740 -0.3765485 -0.4715925 1.000000000 0.6889340
#> GH 0.187316425 0.2351438 -0.6085799 -0.2422946 0.688934015 1.0000000
cov(facedata)
#> AD BC AH DH EH GH
#> AD 46.4682449 20.719131 23.32847 40.76781 0.1435936 5.915365
#> BC 20.7191307 19.502140 11.66581 19.41128 3.4532110 4.810633
#> AH 23.3284745 11.665807 82.39925 60.79810 -13.8004527 -25.592151
#> DH 40.7678128 19.411276 60.79810 89.94822 -18.0581801 -10.645536
#> EH 0.1435936 3.453211 -13.80045 -18.05818 16.3012737 12.885936
#> GH 5.9153653 4.810633 -25.59215 -10.64554 12.8859360 21.461256
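As a rough illustration of the simplest of these options, the sketch below reduces each interval to its midpoint and correlates the midpoints. It assumes that this is what a centers-based method amounts to, which may not match the package's exact definition. The bounds are the first five AD and BC intervals from the facedata printout, so the result is not comparable to the full-data matrix above.

```r
# First five AD and BC intervals, copied from the facedata printout
ad_lower <- c(155.00, 154.00, 154.01, 168.86, 169.85)
ad_upper <- c(157.00, 160.01, 161.00, 172.84, 175.03)
bc_lower <- c(58.00, 57.00, 57.00, 58.55, 60.21)
bc_upper <- c(61.01, 64.00, 63.00, 63.39, 64.38)

# Reduce each interval to its midpoint
ad_center <- (ad_lower + ad_upper) / 2
bc_center <- (bc_lower + bc_upper) / 2

# Ordinary Pearson correlation and covariance of the midpoints;
# stats:: is spelled out in case cor()/cov() are masked by a package method
r_centers <- stats::cor(ad_center, bc_center)
v_centers <- stats::cov(ad_center, bc_center)
```

A midpoint-only estimate ignores the width of each interval, which is the information the other methods try to bring back in.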
scale() standardizes symbolic interval data (centering
and scaling), which can be useful before multivariate analyses:
facedata_scaled <- scale(facedata)
facedata_scaled
#> <ggInterval>
#> Public:
#> clone: function (deep = FALSE)
#> clusterResult: NULL
#> initialize: function (rawData = NULL, statisticsDF = NULL, intervalData = NULL,
#> intervalData: data.frame, symbolic_tbl
#> rawData: NULL
#> statisticsDF: list
#> Private:
#> invalidDataType: function ()
ggInterval_indexplot() displays the interval range of
each observation as a vertical bar. This is useful for spotting outliers
and comparing spreads across observations.
ggInterval_indexplot(facedata, aes(x = AD))
ggInterval_indexImage() replaces the margin bars of the
index plot with a color-coded strip. The column_condition
parameter controls whether colors represent column-wise or matrix-wise
conditions, and full_strip expands the color strip to the
full figure width.
ggInterval_indexImage(facedata, aes(AD),
column_condition = TRUE, full_strip = FALSE)
ggInterval_indexImage(facedata, aes(AD),
column_condition = TRUE, full_strip = TRUE) +
coord_flip()
ggInterval_boxplot() draws an interval-valued box plot,
in which nested rectangles summarize the distribution of the
interval endpoints across observations. Use
plotAll = TRUE to display all variables side by side.
ggInterval_boxplot(facedata, aes(AD))
ggInterval_boxplot(facedata, plotAll = TRUE)
ggInterval_hist() constructs a histogram from
interval-valued data. Two binning strategies are supported:
method = "equal-bin" (default): bins of equal width.
method = "unequal-bin": bin boundaries depend on the data distribution.
Note that ggInterval_hist() returns a list; use
$plot to extract the ggplot2 object.
ggInterval_hist(facedata, aes(x = AD), bins = 10,
method = "equal-bin")$plot
ggInterval_hist(facedata, aes(x = AD),
method = "unequal-bin")$plot
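One common way to build an equal-bin histogram from intervals is to spread each observation uniformly over its interval, so that a bin receives the fraction of the interval it covers. The base-R sketch below illustrates that idea with the first three AD intervals; the uniform-spread assumption and the bin edges are chosen for illustration and are not taken from the package source.

```r
# First three AD intervals from the facedata printout
lower <- c(155.00, 154.00, 154.01)
upper <- c(157.00, 160.01, 161.00)

# Illustrative equal-width bins: [154,156], [156,158], [158,160], [160,162]
breaks <- seq(154, 162, by = 2)
bin_lo <- head(breaks, -1)
bin_hi <- tail(breaks, -1)

# weights[b, i]: fraction of observation i's interval that lies in bin b
weights <- sapply(seq_along(lower), function(i) {
  overlap <- pmax(pmin(upper[i], bin_hi) - pmax(lower[i], bin_lo), 0)
  overlap / (upper[i] - lower[i])
})

# Bin frequency = sum of fractional contributions over observations
freq <- rowSums(weights)
freq
```

Because every interval lies inside the bin range, each observation contributes a total weight of one and the frequencies sum to the number of observations.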
ggInterval_MMplot() marks the minimum and maximum
endpoints of each observation’s interval, connected by a line segment.
This makes it easy to compare ranges across observations.
ggInterval_MMplot(facedata, aes(AD))
Use plotAll = TRUE to display all variables
together:
ggInterval_MMplot(facedata, plotAll = TRUE)
ggInterval_CRplot() plots each observation as a point in
a two-dimensional space where the x-axis is the center (midpoint) of the
interval and the y-axis is the range (spread).
ggInterval_CRplot(facedata, aes(AD))
ggInterval_CRplot(facedata, plotAll = TRUE)
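The two coordinates used by the center-range plot are simple functions of the interval bounds. A small base-R sketch, using the first three AD intervals from the facedata printout (the variable names are illustrative):

```r
# First three AD intervals from the facedata printout
lower <- c(155.00, 154.00, 154.01)
upper <- c(157.00, 160.01, 161.00)

# x-coordinate: midpoint of each interval
center <- (lower + upper) / 2

# y-coordinate: width of each interval
range_width <- upper - lower

data.frame(center = center, range = range_width)
```

Observations with similar centers but very different ranges sit in the same vertical band of the plot, which is what makes the display useful for comparing precision across observations.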
ggInterval_scatterplot() visualizes two interval-valued
variables simultaneously. Each observation is drawn as a rectangle whose
width and height represent the intervals on the x- and y-axes,
respectively.
ggInterval_scatterplot(facedata, aes(x = AD, y = BC))
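ggInterval_2Dhist() partitions the bivariate domain into
a grid and, for each cell, accumulates the share of every observation's
rectangle that overlaps it. The tabulated frequencies are therefore
fractional and sum to the number of observations. The
xBins and yBins parameters control the grid
resolution.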
ggInterval_2Dhist(facedata, aes(x = AD, y = BC), xBins = 10, yBins = 10)
#> $plot
#>
#> $`Table (AD, BC)`
#> [50:52.23] [52:54.09] [54:55.95] [56:57.82] [58:59.69]
#> [149:151.92] 0.017 0.183 0.334 0.339 0.191
#> [152:154.5] 0.086 0.359 0.511 0.414 0.269
#> [155:157.08] 0.414 1.067 1.162 0.727 1.19
#> [157:159.66] 0.178 0.395 0.522 0.575 0.608
#> [160:162.24] 0.041 0.088 0.132 0.193 0.216
#> [162:164.83] 0 0.003 0.176 0.251 0.187
#> [165:167.41] 0 0.004 0.383 0.618 0.481
#> [167:169.99] 0 0.004 0.243 0.343 0.404
#> [170:172.57] 0 0 0.001 0.001 0.165
#> [173:175.15] 0 0 0 0 0.016
#> Frequency of AD 0.736 2.103 3.464 3.461 3.727
#> Margin of AD 0.027 0.078 0.128 0.128 0.138
#> [60:61.55] [62:63.42] [63:65.28] [65:67.15] [67:69.01]
#> [149:151.92] 0.015 0 0 0 0
#> [152:154.5] 0.068 0.039 0.007 0 0
#> [155:157.08] 0.727 0.204 0.036 0 0
#> [157:159.66] 0.29 0.204 0.036 0 0
#> [160:162.24] 0.1 0.062 0.055 0.078 0.078
#> [162:164.83] 0.014 0 0.122 0.526 0.426
#> [165:167.41] 0.054 0.038 0.217 0.888 0.622
#> [167:169.99] 0.562 1.003 1.192 0.802 0.395
#> [170:172.57] 0.701 1.286 0.901 0.458 0.225
#> [173:175.15] 0.252 0.652 0.178 0 0
#> Frequency of AD 2.783 3.488 2.744 2.752 1.746
#> Margin of AD 0.103 0.129 0.102 0.102 0.065
#> Frequency of BC Margin of BC
#> [149:151.92] 1.079 0.04
#> [152:154.5] 1.753 0.065
#> [155:157.08] 5.527 0.205
#> [157:159.66] 2.808 0.104
#> [160:162.24] 1.043 0.039
#> [162:164.83] 1.705 0.063
#> [165:167.41] 3.305 0.122
#> [167:169.99] 4.948 0.183
#> [170:172.57] 3.738 0.138
#> [173:175.15] 1.098 0.041
#> Frequency of AD 27
#> Margin of AD 1
Here is the same plot for the oils dataset:
data(oils)
ggInterval_2Dhist(oils, aes(x = GRA, y = FRE), xBins = 5, yBins = 5)
#> $plot
#>
#> $`Table (GRA, FRE)`
#> [-27:-14] [-14:-1] [-1:12] [12:25] [25:38] Frequency of FRE
#> [1:0.87] 0 0 0 0.3 1.7 2
#> [1:0.89] 0 0 0 0 0 0
#> [1:0.91] 0 0 0 0 0 0
#> [1:0.92] 1 1.2 1 0 0 3.2
#> [1:0.94] 0.684 2.116 0 0 0 2.8
#> Frequency of GRA 1.684 3.316 1 0.3 1.7 8
#> Margin of GRA 0.211 0.414 0.125 0.038 0.212
#> Margin of FRE
#> [1:0.87] 0.25
#> [1:0.89] 0
#> [1:0.91] 0
#> [1:0.92] 0.4
#> [1:0.94] 0.35
#> Frequency of GRA
#> Margin of GRA 1
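The fractional entries in these tables are consistent with each observation's rectangle being shared among the grid cells it overlaps. The sketch below shows one way to compute such shares for a single rectangle, assuming a uniform spread within the rectangle; `overlap_1d` is a hypothetical helper and the grid edges are illustrative, not those used by the package.

```r
# Hypothetical helper: fraction of the interval [lo, hi] in each bin.
# Assumes hi > lo (a degenerate interval would divide by zero).
overlap_1d <- function(lo, hi, breaks) {
  left  <- pmax(lo, head(breaks, -1))
  right <- pmin(hi, tail(breaks, -1))
  pmax(right - left, 0) / (hi - lo)
}

# One observation: AD in [155, 157] and BC in [58, 61] (bounds rounded)
x_frac <- overlap_1d(155, 157, seq(150, 160, by = 2))
y_frac <- overlap_1d(58, 61, seq(56, 62, by = 2))

# Under the uniform assumption, cell shares are the outer product
cell_weights <- outer(x_frac, y_frac)
cell_weights
```

Summing such matrices over all observations gives a table whose grand total equals the number of observations, as in the outputs above (27 for facedata, 8 for oils).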
ggInterval_scatterMatrix() produces a pairwise scatter
plot matrix for all continuous interval variables in the dataset. Note
that this function returns a marrangeGrob object (from
gridExtra), not a ggplot2 object.
ggInterval_scatterMatrix(facedata[, 1:3])
ggInterval_2DhistMatrix() is the matrix analogue of
ggInterval_2Dhist(), showing 2D histograms for all variable
pairs.
ggInterval_2DhistMatrix(oils, xBins = 5, yBins = 5)
When plotAll = TRUE,
ggInterval_indexImage() produces a heatmap-style
visualization across all variables, providing an overview of the entire
dataset.
ggInterval_indexImage(facedata, plotAll = TRUE)
ggInterval_radarplot() displays multiple interval-valued
variables on radial axes. Each observation is represented by a polygon
(or rectangle) whose extent along each axis shows the interval range.
The plotPartial argument selects which observations to
display.
data(Environment)
ggInterval_radarplot(Environment[, 5:17],
plotPartial = 2,
showLegend = FALSE,
base_circle = TRUE,
base_lty = 2,
addText = FALSE) +
labs(title = "Environment: radar plot (default)")
The type = "rect" variant draws rectangles instead of
polygons:
ggInterval_radarplot(Environment[, 5:17],
plotPartial = 2,
type = "rect",
showLegend = FALSE,
base_circle = TRUE,
addText = FALSE) +
labs(title = "Environment: radar plot (rect)")
ggInterval_3Dscatterplot() visualizes three
interval-valued variables, rendering each observation as a cube-like
shape projected into two dimensions.
ggInterval_3Dscatterplot(facedata[1:5, ], aes(x = BC, y = EH, z = GH))
ggInterval_PCA() performs vertices-based PCA on
interval-valued data. Each interval observation is expanded to its
vertices (all \(2^p\) corner
combinations), PCA is applied, and the results are projected back to
interval form.
pca_result <- ggInterval_PCA(facedata, plot = FALSE)
pca_result$ggplotPCA
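The vertex expansion described above can be sketched in base R. The example uses two variables (so \(2^2 = 4\) vertices per observation) and three rows taken from the facedata printout; it then runs an ordinary prcomp() on the stacked vertices and takes the per-observation minimum and maximum score as the projected interval. This is an illustration of the general vertices approach under those assumptions, not the package's implementation.

```r
# Rows 1, 2 and 4 of facedata, variables AD and BC
lower <- rbind(c(155.00, 58.00), c(154.00, 57.00), c(168.86, 58.55))
upper <- rbind(c(157.00, 61.01), c(160.01, 64.00), c(172.84, 63.39))
colnames(lower) <- colnames(upper) <- c("AD", "BC")

# Expand each observation into all lower/upper corner combinations
vertex_list <- lapply(seq_len(nrow(lower)), function(i) {
  corners <- expand.grid(Map(c, lower[i, ], upper[i, ]))
  cbind(obs = i, corners)
})
vertices <- do.call(rbind, vertex_list)

# Classical PCA on the stacked vertices
pca <- prcomp(vertices[, c("AD", "BC")], center = TRUE, scale. = TRUE)
scores <- pca$x

# Project back to interval form: range of PC1 scores per observation
pc1_lower <- tapply(scores[, "PC1"], vertices$obs, min)
pc1_upper <- tapply(scores[, "PC1"], vertices$obs, max)
cbind(pc1_lower, pc1_upper)
```

The number of vertices doubles with every added variable, so this expansion becomes expensive as \(p\) grows.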
Setting poly = TRUE adds a convex-hull polygon
connecting the projected vertices for each observation:
pca_poly <- ggInterval_PCA(facedata, poly = TRUE, plot = FALSE)
pca_poly$ggplotPCA
PCA can also be applied to classical data once it has been converted with classic2sym():
myIris <- classic2sym(iris, groupby = "Species")
pca_iris <- ggInterval_PCA(myIris, plot = FALSE)
pca_iris$ggplotPCA
Because most ggInterval functions return standard
ggplot2 objects, you can customize plots with the full
range of ggplot2 features.
Themes and labels:
ggInterval_indexplot(facedata, aes(x = AD)) +
theme_minimal() +
labs(title = "Index plot of AD", x = "Observation", y = "AD")
Custom color scales:
p <- ggInterval_hist(facedata, aes(x = AD), bins = 10,
method = "equal-bin")$plot
p + scale_fill_manual(values = rainbow(10))
Adding reference lines:
ggInterval_CRplot(facedata, aes(AD)) +
geom_hline(yintercept = 5, linetype = "dashed", color = "red")
Note that ggInterval_scatterMatrix() returns a
marrangeGrob object, so ggplot2 + operators
cannot be applied to it directly.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.
Jiang, B.S. and Wu, H.M. (2025). ggInterval: an R package for visualizing interval-valued data using ggplot2. R package version 0.2.3, https://CRAN.R-project.org/package=ggInterval.