---
title: "1 - Data Preparation"
author: "Marina Papadopoulou"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{1 - Data Preparation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## 1.1 Input data - trackdf

The `swaRmverse` package uses the [`trackdf`](https://github.com/swarm-lab/trackdf) package to standardize the input dataset. Data are expected to be trajectories (id, x, y, t) generated by GPS or video tracking. First, let's load some example data that ships with `trackdf`:

```{r message=FALSE, warning=FALSE}
library(swaRmverse)

raw <- read.csv(system.file("extdata/video/01.csv", package = "trackdf"))
raw <- raw[!raw$ignore, ]
head(raw)

```

## 1.2 Transform data

`trackdf` takes one vector for each positional time series (x, y), along with a vector of ids and a vector of times. Time is transformed to the date-time POSIXct format. Without additional information, the package uses UTC as the time zone, the current time as the origin of the experiment, and 1 second as the sampling step (the time between observations). If your _t_ column corresponds to real time (and not frames or sampling steps, e.g., _c(1, 2, 3, 4)_), the period does not have to be specified. For more details, see https://swarm-lab.github.io/trackdf/index.html. For now, let's specify these attributes and create our main dataset (as a data frame):

```{r message=FALSE, warning=FALSE}

data_df <- set_data_format(raw_x = raw$x,
                          raw_y = raw$y,
                          raw_t = raw$frame,
                          raw_id = raw$track_fixed,
                          origin = "2020-02-1 12:00:21",
                          period = "0.04S",
                          tz = "America/New_York"
                          )

head(data_df)
```
Note that a `set` column has been added to the dataset. `swaRmverse` uses this column as the main unit for grouping the tracks into separate events. By default, the day of data collection is used.
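
To make the roles of `origin` and `period` more concrete, the conversion from frame numbers to timestamps can be sketched in base R. This is only an illustration, assuming that each frame advances time by one period from the origin. It is not the internal code of `trackdf` or `swaRmverse`, and the variable names (`frames`, `period_s`, `timestamps`, `day`) are made up for the example.

```{r}
# Illustrative sketch only: not the internal implementation of trackdf.
frames <- c(1, 2, 3, 4)
origin <- as.POSIXct("2020-02-01 12:00:21", tz = "America/New_York")
period_s <- 0.04 # assumed sampling step in seconds (25 frames per second)

# Assumption: frame n is recorded n periods after the origin.
timestamps <- origin + frames * period_s

# Seconds elapsed since the origin for each frame
elapsed <- as.numeric(difftime(timestamps, origin, units = "secs"))
round(elapsed, 2)

# Calendar day of each observation (in the time zone of the data),
# which is the kind of value used as the default grouping unit
day <- format(timestamps, "%Y-%m-%d")
unique(day)
```

Because all four frames fall on the same calendar day, they would end up in a single group under the default behaviour described above.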

## 1.3 Multi-species or multi-context data

As mentioned above, `swaRmverse` uses the date as the default unit of data organization. However, if several separate observations are conducted on the same day, or if an additional label is needed, such as context or species, extra information can be passed to the `set_data_format()` function. For instance, let's assume that the dataset contains data from 2 different contexts:

```{r message=FALSE, warning=FALSE}
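# dummy column: the first half of the rows is "ctx1", the rest is "ctx2"
# (computed this way so the lengths also match when nrow(raw) is odd)
half <- floor(nrow(raw) / 2)
raw$context <- c(rep("ctx1", half), rep("ctx2", nrow(raw) - half))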

```

We can give any additional vector to the function and it will be combined with the date column as a set:

```{r message=FALSE, warning=FALSE}

data_df <- set_data_format(raw_x = raw$x,
                          raw_y = raw$y,
                          raw_t = raw$frame,
                          raw_id = raw$track_fixed,
                          origin = "2020-02-1 12:00:21",
                          period = "0.04 seconds",
                          tz = "America/New_York",
                          raw_context = raw$context
                          )

head(data_df)
```
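
The idea of combining the date with an extra label can be sketched on a toy example in base R. This is a hypothetical illustration: the underscore separator and the variable names (`toy_t`, `toy_context`, `toy_set`) are assumptions made for the example, and the exact labels produced by `set_data_format()` may look different.

```{r}
# Illustrative sketch only: the separator and label format are assumptions,
# not necessarily what set_data_format() produces.
toy_t <- as.POSIXct(c("2020-02-01 12:00:21",
                      "2020-02-01 12:00:22",
                      "2020-02-02 09:00:00"),
                    tz = "America/New_York")
toy_context <- c("ctx1", "ctx2", "ctx1")

# Combine the calendar day with the context label into one grouping key
toy_set <- paste(format(toy_t, "%Y-%m-%d"), toy_context, sep = "_")
toy_set

# Number of observations per group
table(toy_set)
```

In this toy example, two observations made on the same day but in different contexts fall into different groups, which is the purpose of the additional label.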

With this dataset, we can move on to analyzing the collective motion in the data.