---
title: "Getting started with tidyclust"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with tidyclust}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyclust)
```

## Introduction

tidyclust provides a unified, tidy interface to clustering models, following the same design patterns as [parsnip](https://parsnip.tidymodels.org/). It lets you swap clustering algorithms by changing a single line, and integrates seamlessly with the rest of the tidymodels ecosystem (recipes, workflows, tune).

## The tidyclust workflow

Every tidyclust analysis follows the same four steps:

1. **Create a model specification**: choose the algorithm and its parameters.
2. **Fit the specification**: train the model on data.
3. **Extract results**: get cluster assignments, centroids, and summaries.
4. **Evaluate**: use built-in metrics to assess cluster quality.

## K-means example

### 1. Create a specification

```{r}
kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats")
kmeans_spec
```

### 2. Fit to data

```{r}
set.seed(1234)
kmeans_fit <- fit(kmeans_spec, ~., data = mtcars)
kmeans_fit
```

### 3. Extract results

`extract_cluster_assignment()` returns the cluster label for each training observation:

```{r}
extract_cluster_assignment(kmeans_fit)
```

`extract_centroids()` returns the location (mean) of each cluster:

```{r}
extract_centroids(kmeans_fit)
```

`predict()` assigns new observations to clusters:

```{r}
predict(kmeans_fit, new_data = mtcars[1:5, ])
```

`augment()` appends the cluster assignment to the original data:

```{r}
augment(kmeans_fit, new_data = mtcars)
```

### 4. Evaluate

tidyclust provides several cluster quality metrics:

```{r}
sse_within_total(kmeans_fit, mtcars)
sse_ratio(kmeans_fit, mtcars)
silhouette_avg(kmeans_fit, mtcars)
```

Lower `sse_within_total()` and `sse_ratio()` indicate tighter clusters.
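
To make the two SSE metrics concrete, the sketch below recomputes them by hand with base R on a plain `stats::kmeans()` fit. It assumes squared Euclidean distance and is only an illustration of the quantities involved, not tidyclust's internal code; the names `x`, `km`, `wss`, and `tss` are introduced here for the example.

```{r}
# Illustrative base-R sketch (assumes squared Euclidean distance).
set.seed(1234)
x <- as.matrix(mtcars)
km <- stats::kmeans(x, centers = 3)

# Within-cluster SSE: squared distance from each row to its own centroid.
wss <- sum((x - km$centers[km$cluster, ])^2)

# Total SSE: squared distance from each row to the overall column means.
tss <- sum(sweep(x, 2, colMeans(x))^2)

wss        # agrees with km$tot.withinss
wss / tss  # between 0 and 1; smaller means tighter clusters
```
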
Higher `silhouette_avg()` (maximum 1) indicates better-separated clusters.

## Hierarchical clustering example

The same workflow applies to `hier_clust()`. The dendrogram is cut into `num_clusters` clusters at fit time:

```{r}
hclust_spec <- hier_clust(num_clusters = 3) |>
  set_engine("stats")

hclust_fit <- fit(hclust_spec, ~., data = mtcars)

extract_cluster_assignment(hclust_fit)
extract_centroids(hclust_fit)
```

## Tidymodels integration

tidyclust works with the broader tidymodels ecosystem. For example, you can preprocess data with a recipe and bundle it with a model in a workflow:

```{r}
library(recipes)
library(workflows)

rec <- recipe(~., data = mtcars) |>
  step_normalize(all_predictors())

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(k_means(num_clusters = 3))

wf_fit <- fit(wf, data = mtcars)
augment(wf_fit, new_data = mtcars)
```

## Next steps

- Learn about tuning the number of clusters in `vignette("tuning_and_metrics", package = "tidyclust")`.
- Explore k-means options in `vignette("k_means", package = "tidyclust")`.
- Explore hierarchical clustering in `vignette("hier_clust", package = "tidyclust")`.
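
As a small preview of the tuning topic, the sketch below refits the k-means specification for several values of `num_clusters` and collects `sse_ratio()` for each. It is a manual loop rather than the tune-based approach covered in the tuning vignette, and it assumes the metric result is a tibble with an `.estimate` column; `ks` and `ratios` are names introduced for this example.

```{r}
library(tidyclust)

set.seed(1234)
ks <- 2:6  # candidate numbers of clusters (arbitrary choice for illustration)

# Fit one k-means model per candidate and keep its SSE ratio.
ratios <- vapply(ks, function(k) {
  spec <- k_means(num_clusters = k) |>
    set_engine("stats")
  fit_k <- fit(spec, ~., data = mtcars)
  # Assumes the metric tibble has an `.estimate` column.
  sse_ratio(fit_k, mtcars)$.estimate
}, numeric(1))

data.frame(num_clusters = ks, sse_ratio = ratios)
```

The ratio generally shrinks as clusters are added, so look for the point where further increases stop paying off rather than for the minimum.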