--- title: "Visualising the output of Monte Carlo methods with ggsmc" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Visualising the output of Monte Carlo methods with ggsmc} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` The `ggsmc` package uses `ggplot2` to display the results of importance sampling (IS), sequential Monte Carlo (SMC) or ensemble-based algorithms. Each algorithm outputs a collection of points, usually evolved through a sequence of target distributions, which for IS and SMC are weighted. ## Data format To use this package the algorithm output must be in [tidy format](https://r4ds.had.co.nz/tidy-data.html), where each dimension of each parameter for each particle lies in a distinct row in a data frame. The data `sir_cwna_model` provides an example of valid input to the plotting functions. ```{r sir_data} library(ggsmc) data(sir_cwna_model) head(sir_cwna_model) ``` This data contains the output of a particle filter (PF) applied to a target tracking problem. The state $x=(x_1,x_2)^T$ tracked using the PF is two-dimensional, consisting of the (one-dimensional) position and velocity of the target. The value of the first particle for the first target distribution is given by $(x_1,x_2)^T=(3.0163,4.6682)^T$. In tidy format, this is stored on two rows of the data frame, where the `Dimension` column gives the index of the state `ParameterName`: e.g. in the row where `ParameterName=="x"` and `Dimension==2`, the `Value` column gives the value of $x_2$. If you are unfamiliar with it, this way of storing data with its many repeated values might seem wasteful. Its strength is that this format can be used consistently across different situations, allowing the use of general purpose packages for processing and, in our case, plotting the data. If your data is in the more standard matrix format for Monte Carlo output (i.e. one parameter vector per row), then you can use the `matrix2tidy` function included in the package to convert to the required format your algorithm output for each target distribution. For this function you need to supply your algorithm `output` in matrix format, a name for the `parameter` that is represented in the output, an integer index for the `target` to which the output corresponds and the `log_weights` of each particle if using an IS or SMC algorithm. If using an algorithm that iterates over multiple targets, the `matrix2tidy` function should be called for each target, then the output of each call stacked together using `rbind`. The `sir_cwna_model` data contains more columns than are required to use the plotting functions in this package. The columns required by all functions are: - `Target`, which indexes the target distribution, taking a different integer values for each target. - `Particle`, which indexes the particles, taking a different integer value for each particle. - `ParameterName`, which uses a string to name each parameter. - `Dimension`, which indexes the dimension of the parameter, taking a different integer value for each dimension. - `Value`, which stores the numerical value for the particle, target, parameter and dimension specified by the other columns. For the output of an IS or SMC algorithm, we may additionally include a `LogWeight` column to store the log of the (normalised) weight of the particle. If this column is not found in the data, each particle will be assigned an equal weight. ## Histograms and densities The `plot_histogram` and `plot_density` functions may be used to plot, respectively, a histogram or density of the marginal distribution of one dimension of one parameter. If the `LogWeight` column is present in the data a weighted histogram/density will be used. For these functions, a `parameter` (string) and `dimension` (integer) argument need to be used. If the `target` variable is set, the function will plot the marginal histogram/density for the specified target, parameter and dimension. If no target is specified, the points for all points for the specified parameter and dimension will be used for the plot. The `plot_histogram` function also takes a `bins` argument, which may be used to specify the number of bins for the histogram (if this argument is not specified, the default value in `ggplot2` is used). For an example, we look at the 20th target of the `sir_cwna_model` data using both a histogram and a density. ```{r histogram, fig.dim = c(6.666666, 5)} plot_histogram(sir_cwna_model, parameter = "x", dimension = 1, target = 20, bins = 20) ``` ```{r density, fig.dim = c(6.666666, 5)} plot_density(sir_cwna_model, parameter = "x", dimension = 1, target = 20) ``` ## Scatter plots The `plot_scatter` function can we used to creates a scatter plot of the Monte Carlo representation of the joint distribution between two parameters, or two dimensions of the same parameter. We specify the parameter and dimension for the x-axis using the arguments `x_parameter` and `x_dimension` respectively. The `target` variable plays the same role as for `plot_histogram` and `plot_density`. If a `LogWeight` column is present in the data, the size of the points in the scatter plot will be used to represent the particle weights. We illustrate this plot again on the 20th target of the `sir_cwna_model` data. ```{r scatter, fig.dim = c(6.666666, 5)} plot_scatter(sir_cwna_model, x_parameter = "x", x_dimension = 1, y_parameter = "x", y_dimension = 2, target = 20, alpha = 0.5, max_size = 3) ``` Note that in this plot we have used two additional arguments, `alpha` and `max_size`, to adjust the look of the plot. In this example we have very few importance points, I found the default value of `alpha = 0.1` (the transparency of the points from 0 to 1) to be too low, and the default `max_size = 1` (governing the size of the points) to be too small. ## Genealogies To more clearly understand the output of an SMC or ensemble-based algorithm it can be useful to visualise the evolution of the particles over time. The function `plot_genealogy` may be used for this purpose, for one dimension of one parameter. We again use the `sir_cwna_model` data to illustrate this function, showing the evolution of the PF's estimate of the target position over time. ```{r genealogy, fig.dim = c(6.666666, 5)} plot_genealogy(sir_cwna_model, parameter = "x", dimension = 1, use_initial_points = FALSE, vertical = FALSE, alpha_lines = 0, alpha_points = 0.05, arrows = FALSE) ``` The first few arguments are the same as those used for the `plot_density` function. We have used several additional arguments the alter the plot: - `use_initial_points` (default TRUE) governs the inclusion (or otherwise) of the initial unweighted points drawn from the proposal used to initialise the algorithm. In this case we chose to omit these points so that we show only the estimate of the [filtering distribution](https://en.wikipedia.org/wiki/Particle_filter#The_filtering_problem) at each time. - `vertical` (default TRUE) controls the orientation of the figure. - `alpha_lines` changes the transparency (from 0, transparent, to 1, solid) of the lines connecting corresponding points between successive targets. Here we choose to omit these lines by making them invisible. - `alpha_points` changes the transparency (from 0, transparent, to 1, solid) of the points (whose size is given by the `LogWeights` column if included in the data. - `arrows` (default TRUE) determines if arrows are included on the lines in the plot (omitted in this plot). To illustrate an alternative configuration of a genealogy plot, we use output from a different algorithm: an [SMC sampler](https://www.stats.ox.ac.uk/~doucet/delmoral_doucet_jasra_sequentialmontecarlosamplersJRSSB.pdf) applied to a sequence of targets on parameter $\theta$ that begins with a Gaussian distribution, and ends with a two-component mixture of Gaussians. ```{r genealogy2, fig.dim = c(6.666666, 5)} data(mixture_25_particles) plot_genealogy(mixture_25_particles, parameter = "θ", dimension = 1, alpha_lines = 0.2, alpha_points = 0.4) ``` For an SMC algorithm, due to the [resampling step](https://en.wikipedia.org/wiki/Particle_filter#Sequential_Importance_Resampling_(SIR)) the index of the ancestor of particle $i$ for each target is not likely to be $i$. To produce a plot with lines that connect each particle with its ancestor, we need an additional column in the data named `AncestorIndex`. ## Time series For some algorithms each Monte Carlo point represents a time series, where the index that thus far has been represented by the column `Dimension` can be thought of as indexing time. To illustrate this we plot simulations from a stochastic [Lotka-Volterra model](https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equations) produced during a run of an [approximate Bayesian computation (ABC)](https://en.wikipedia.org/wiki/Approximate_Bayesian_computation) algorithm. These are contained in the data `lv_output`. ```{r time_series, fig.dim = c(6.666666, 5)} data(lv_output) plot_time_series(lv_output, parameters = c("X","Y"), alpha = 0.5, ylimits=c(0,600)) ``` In this plot the weight (given by `LogWeight`) of each time series is represented by the width the line. We make the lines more visible by choosing `alpha = 0.5` compared to the default 0.1, and change the limits of the y-axis by using `ylimits=c(0,1000)`. There is only one target, so the `target` argument does not need to be specified. The `max_line_width` argument (not used here) can be used to scale the width of all lines in the plot. ## Modifying figures All figures are created using `ggplot2`, so can be modified using other functions from this package. We now demonstrate this by adding the true target position onto a plot of the `sir_cwna_model` data, and changing its style. We also demonstrate the ability to automatically generate a title for the figure. ```{r modified_genealogy, fig.dim = c(6.666666, 5)} data(cwna_data) plot_genealogy(sir_cwna_model, parameter = "x", dimension = 1, use_initial_points = FALSE, vertical = FALSE, alpha_lines = 0, alpha_points = 0.05, arrows = FALSE, default_title = TRUE) + ggplot2::geom_line(data=cwna_data,ggplot2::aes(x=Index,y=Position),colour="red",inherit.aes = FALSE) + ggplot2::theme_minimal() + ggplot2::theme(legend.position="none") ``` ## Animating plots The histogram, density, scatter and time series plots can all be animated, to show how the output changes over a sequence of target distributions. Below we show the code for generating (looping) 10 second animations (set by the `duration` argument) showing how the `sir_cwna_model` data changes over the sequence of 20 targets. ```{r animated_histogram, fig.dim = c(6.666666, 5), results='hide'} animate_histogram(sir_cwna_model, parameter = "x", dimension = 1, bins = 20, duration = 10) ``` ```{r animated_density, fig.dim = c(6.666666, 5), results='hide'} animate_density(sir_cwna_model, parameter = "x", dimension = 1, duration = 10) ``` ```{r animated_scatter, fig.dim = c(6.666666, 5), results='hide'} animate_scatter(sir_cwna_model, x_parameter = "x", x_dimension = 1, y_parameter = "x", y_dimension = 2, alpha = 0.5, max_size = 3, duration = 10) ``` For time series, there are two possible animations. One, `animate_time_series`, shows how the time series particles evolve over a sequence of (non-time ordered) targets. The other, `animate_reveal_time_series`, illustrated below on `lv_output`, animates the time series to show how the population of particles evolves over time. ```{r animated_time_series, fig.dim = c(6.666666, 5), results='hide'} data(`lv_output`) animate_reveal_time_series(lv_output, parameters = c("X","Y"), alpha = 0.5, ylimits=c(0,600), duration = 10) ```