--- title: "Example analysis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{example_analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(GWlasso) ``` # What is a Geographically Weighted Model ? Imagine you want to predict an environmental variable $Y$ on one site based on the distribution of species of some kind you observed in this site. You will fit a model on data, gathered around the world, associating species distribution and $Y$. Now, the problem is that species are not ubiquitous, and the distribution of species, although varying according to $Y$, also varies according to latitude and longitude. This means that a species $A$ may be strongly linked to $Y$ in England but not at all in Bulgaria for instance. Or a species $B$ could be completely absent in some areas. Geographically weighted models tackle these issues by taking into account the spatial heterogeneity of species distribution. It relies on two parameters : - the bandwidth - the distance between sampling sites The **bandwidth** determine the number of samples around our sample of interest used to fit the model. It can be of two kind : - fixed : the bandwidth is then the *radius* around the sample of interest into which the samples must be to used to fit the model. - adaptive : the bandwidth is then the $n$ closest sample around the sample of interest that will be used to fit the model # What is Lasso ? Without giving much details, let's say that lasso is a regression method that seeks to have the best accuracy while discarding variables that are not really informative for the prediction (the linera coefficient for these variables will be shrunk to zero). It is especially useful in ecology to tackle the non ubiquitous nature of species. With each local regression, the species can have different informative values depending on the area they're found. Lasso will take care of that. # What to run ## data In our example, we will use dataset from Amesbury (2016). Load it up with. ```{r Amesbury} data(Amesbury) ``` ## Setting the bandwith One key element to fit geographically weighted models is the bandwidth. There is no rule of thumb, as it depends on the distribution of the samples used to fit the model on one hand, and the biogeographical properties of the investigated species. Nevertheless, if the samples used to fit the model are somewhat geographically evenly distributed, a fixed bandwidth will do the job. To select the bandwidth you need : - a distance matrix - a species abundance matrix (or data frame). It should be transformed in a suitable format for regressions, such as relative abundance or hellinger transformation. See [vegan::decostand()] - a vector containing the environmental variable to predict. The function *gwl_bw_estimation* run a bruteforce algorithm to select the bandwidth that produces the smallest $RMSE$. 
In the following example:

- *Amesbury$coords* is a data frame with two columns containing the latitude and longitude of the samples
- *Amesbury$spe.df* is a data frame containing species as columns and samples as rows
- *Amesbury$WTD* is the environmental variable of interest, to reconstruct
- for the other arguments, please refer to the functions' help pages

```{r getbw, eval = FALSE, include = TRUE}
# compute the distance matrix
distance_matrix <- compute_distance_matrix(Amesbury$coords, add.noise = TRUE)

# run the bandwidth selection algorithm
bw_choice <- gwl_bw_estimation(x.var = Amesbury$spe.df,
                               y.var = Amesbury$WTD,
                               dist.mat = distance_matrix,
                               adaptive = TRUE,
                               adptbwd.thresh = 0.1,
                               kernel = "bisquare",
                               alpha = 1,
                               progress = TRUE,
                               n = 100)
```

This command is **extremely long to run**: for a dataset of 1100 samples and 45 species it takes approximately 24 hours. The time can be reduced by playing with the $n$ parameter, which sets the number of bandwidths to test. Parallelizing the code is a work in progress.

## Fitting a model

Once you have chosen a bandwidth, you can proceed. If you do not wish to run *gwl_bw_estimation()*, you can eyeball a bandwidth, relying on your ecological expertise. Let's say you decide to go for an adaptive bandwidth of 120: for each sample, a local model will be fitted with the 120 closest samples.

```{r fitgwl, eval = FALSE, include = TRUE}
# compute the distance matrix
distance_matrix <- compute_distance_matrix(Amesbury$coords, add.noise = TRUE)

# fit the geographically weighted lasso
my.gwl.fit <- gwl_fit(bw = 120,
                      x.var = Amesbury$spe.df,
                      y.var = Amesbury$WTD,
                      dist.mat = distance_matrix,
                      adaptive = TRUE,
                      kernel = "bisquare",
                      alpha = 1,
                      progress = TRUE)

plot(my.gwl.fit)
```

The graph rendered by this chunk shows, for each species and each local model, whether the coefficient was set to zero or not (remember that in a GWL there is one local model per sample). The upper panel shows the mean of the y.var values of the samples used to fit each local model. In ecology, this may help you grasp the indicative value of a taxon for a given ecological variable.

## Making predictions

Once you have fitted your model, you can of course use it like any other model in R to make predictions:

```{r pred, include = TRUE, eval = FALSE}
my_predicted_values <- predict(my.gwl.fit,
                               newdata = Amesbury$spe.df,
                               newcoords = Amesbury$coords)

plot(my_predicted_values ~ Amesbury$WTD)
abline(0, 1, col = "red")
```
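Beyond the visual check against the 1:1 line, you may want a single error figure. The sketch below computes the RMSE of the predictions against the observed values; it assumes that *predict()* returns a numeric vector aligned with the rows of the new data, as in the plot above.

```{r rmse, eval = FALSE, include = TRUE}
# root mean square error between predictions and observations
rmse <- sqrt(mean((my_predicted_values - Amesbury$WTD)^2, na.rm = TRUE))
rmse
```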