Title: | Topological Data Analysis: Mapper Algorithm |
---|---|
Description: | The Mapper algorithm from Topological Data Analysis, the steps are as follows 1. Define a filter (lens) function on the data. 2. Perform clustering within each level set. 3. Generate a complex from the clustering results. |
Authors: | ChiChien Wang [aut, cre, trl], Paul Pearson [ctb], Daniel Muellner [ctb], Gurjeet Singh [ctb] |
Maintainer: | ChiChien Wang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2024-10-21 07:27:11 UTC |
Source: | https://github.com/kennywang112/mapperalgo |
Cut the hierarchical clustering tree to define clusters
cluster_cutoff_at_first_empty_bin(heights, diam, num_bins_when_clustering)
cluster_cutoff_at_first_empty_bin(heights, diam, num_bins_when_clustering)
heights |
Heights of the clusters. |
diam |
Diameter of the clusters. |
num_bins_when_clustering |
Number of bins when clustering. |
The cutoff height for the clusters.
Cover points based on intervals and overlap
cover_points( lsfi, filter_min, interval_width, percent_overlap, filter_values, num_intervals )
cover_points( lsfi, filter_min, interval_width, percent_overlap, filter_values, num_intervals )
lsfi |
Level set flat index. |
filter_min |
Minimum filter value. |
interval_width |
Width of the interval. |
percent_overlap |
Percentage overlap between intervals. |
filter_values |
The filter values to be analyzed. |
num_intervals |
Number of intervals. |
Indices of points in the range.
This function calculates the total within-cluster sum of squares (WSS) for a range of cluster numbers and identifies the best number of clusters (k) based on the elbow method.
find_best_k_for_kmeans(dist_object, max_clusters = 10)
find_best_k_for_kmeans(dist_object, max_clusters = 10)
dist_object |
A distance matrix or data frame containing the data to be clustered. |
max_clusters |
The maximum number of clusters to test for k-means. Default is 10. |
The optimal number of clusters (k) based on the elbow method.
The Mapper algorithm is a method for topological data analysis that provides a way to visualize the structure of high-dimensional data. The Mapper algorithm is a generalization of the Reeb graph construction, which is a method for visualizing the topology of scalar fields.
MapperAlgo(filter_values, intervals, percent_overlap, num_bins_when_clustering, methods)
MapperAlgo(filter_values, intervals, percent_overlap, num_bins_when_clustering, methods)
filter_values |
A data frame or matrix of the data to be analyzed. |
intervals |
An integer specifying the number of intervals to divide the filter values into. |
percent_overlap |
An integer specifying the percentage of overlap between consecutive intervals. |
num_bins_when_clustering |
An integer specifying the number of bins to use when clustering the data. |
methods |
A character string specifying the clustering method to use. The default is "hierarchical". |
An adjacency matrix and other components of the Mapper graph, including:
adjacency |
An adjacency matrix of the Mapper graph. |
num_vertices |
The number of vertices in the Mapper graph. |
level_of_vertex |
A vector specifying the level of each vertex. |
points_in_vertex |
A list of the indices of the points in each vertex. |
points_in_level_set |
A list of the indices of the points in each level set. |
vertices_in_level_set |
A list of the indices of the vertices in each level set. |
ChiChien Wang
The original paper on the Mapper algorithm is: G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007. This code is based on Paul Pearson's implementation of the Mapper algorithm in R, optimized for speed and memory usage. You can install using the following command: devtools::install_github("paultpearson/TDAmapper")
library(igraph) library(networkD3) data("iris") mapper <- MapperAlgo( filter_values = iris[,1:4], intervals = 4, percent_overlap = 50, num_bins_when_clustering = 30, methods = "hierarchical") graph <- graph.adjacency(mapper$adjacency, mode="undirected") l = length(V(graph)) Mode <- function(x) { ux <- unique(x) ux[which.max(tabulate(match(x, ux)))] } # Distribution of specific variable in each vertex - Majority vote var.maj.vertex <- c() filter.vertex <- c() for (i in 1:l){ points.in.vertex <- mapper$points_in_vertex[[i]] Mode.in.vertex <- Mode(iris$Species[points.in.vertex]) var.maj.vertex <- c(var.maj.vertex, as.character(Mode.in.vertex)) } # Size vertex.size <- rep(0, l) for (i in 1:l){ points.in.vertex <- mapper$points_in_vertex[[i]] vertex.size[i] <- length(mapper$points_in_vertex[[i]]) } MapperNodes <- mapperVertices(mapper, 1:nrow(iris)) MapperNodes$var.maj.vertex <- as.factor(var.maj.vertex) MapperNodes$Nodesize <- vertex.size MapperLinks <- mapperEdges(mapper) forceNetwork(Nodes = MapperNodes, Links = MapperLinks, Target = "Linktarget", Value = "Linkvalue", NodeID = "Nodename", Nodesize = "Nodesize", Group = "var.maj.vertex", opacity = 1, zoom = TRUE, linkDistance = 10, charge = -10, legend = TRUE)
library(igraph) library(networkD3) data("iris") mapper <- MapperAlgo( filter_values = iris[,1:4], intervals = 4, percent_overlap = 50, num_bins_when_clustering = 30, methods = "hierarchical") graph <- graph.adjacency(mapper$adjacency, mode="undirected") l = length(V(graph)) Mode <- function(x) { ux <- unique(x) ux[which.max(tabulate(match(x, ux)))] } # Distribution of specific variable in each vertex - Majority vote var.maj.vertex <- c() filter.vertex <- c() for (i in 1:l){ points.in.vertex <- mapper$points_in_vertex[[i]] Mode.in.vertex <- Mode(iris$Species[points.in.vertex]) var.maj.vertex <- c(var.maj.vertex, as.character(Mode.in.vertex)) } # Size vertex.size <- rep(0, l) for (i in 1:l){ points.in.vertex <- mapper$points_in_vertex[[i]] vertex.size[i] <- length(mapper$points_in_vertex[[i]]) } MapperNodes <- mapperVertices(mapper, 1:nrow(iris)) MapperNodes$var.maj.vertex <- as.factor(var.maj.vertex) MapperNodes$Nodesize <- vertex.size MapperLinks <- mapperEdges(mapper) forceNetwork(Nodes = MapperNodes, Links = MapperLinks, Target = "Linktarget", Value = "Linkvalue", NodeID = "Nodename", Nodesize = "Nodesize", Group = "var.maj.vertex", opacity = 1, zoom = TRUE, linkDistance = 10, charge = -10, legend = TRUE)
This function generates the edges of the Mapper graph by analyzing the adjacency matrix. It returns a data frame with source and target vertices that are connected by edges.
mapperEdges(m)
mapperEdges(m)
m |
The Mapper output object that contains the adjacency matrix and other graph components. |
A data frame containing the source (Linksource
), target (Linktarget
), and edge values (Linkvalue
) for the graph's edges.
This function generates the vertices of the Mapper graph, including their labels and groupings. It returns a data frame with the vertex names, the group each vertex belongs to, and the size of each vertex.
mapperVertices(m, pt_labels)
mapperVertices(m, pt_labels)
m |
The Mapper output object that contains information about the vertices and level sets. |
pt_labels |
A vector of point labels to be assigned to the points in each vertex. |
A data frame containing the vertex names (Nodename
), group information (Nodegroup
), and vertex sizes (Nodesize
).
Perform clustering within a level set
perform_clustering( points_in_this_level, filter_values, num_bins_when_clustering, methods, max_kmeans_clusters = 10, eps = 0.5, minPts = 5, num_clusters = 5 )
perform_clustering( points_in_this_level, filter_values, num_bins_when_clustering, methods, max_kmeans_clusters = 10, eps = 0.5, minPts = 5, num_clusters = 5 )
points_in_this_level |
Points in the current level set. |
filter_values |
The filter values. |
num_bins_when_clustering |
Number of bins when clustering. |
methods |
Specify the clustering method to be used, e.g., "hclust" or "kmeans". |
max_kmeans_clusters |
Maximum number of clusters when using k-means clustering. |
eps |
The maximum distance between two samples for one to be considered as in the neighborhood of the other. |
minPts |
The number of samples in a neighborhood for a point to be considered as a core point. |
num_clusters |
Number of clusters when using PAM clustering. |
A list containing the number of vertices, external indices, and internal indices.
Construct adjacency matrix of the simplicial complex
simplcial_complex( filter_values, vertex_index, num_levelsets, num_intervals, vertices_in_level_set, points_in_vertex )
simplcial_complex( filter_values, vertex_index, num_levelsets, num_intervals, vertices_in_level_set, points_in_vertex )
filter_values |
A matrix of filter values. |
vertex_index |
The number of vertices. |
num_levelsets |
The total number of level sets. |
num_intervals |
A vector representing the number of intervals for each filter. |
vertices_in_level_set |
A list where each element contains the vertices corresponding to each level set. |
points_in_vertex |
A list where each element contains the points corresponding to each vertex. |
An adjacency matrix representing the simplicial complex.
Convert level set multi-index (lsmi) to flat index (lsfi)
to_lsfi(lsmi, num_intervals)
to_lsfi(lsmi, num_intervals)
lsmi |
Level set multi-index. |
num_intervals |
Number of intervals. |
A flat index corresponding to the multi-index.
Convert level set flat index (lsfi) to multi-index (lsmi)
to_lsmi(lsfi, num_intervals)
to_lsmi(lsfi, num_intervals)
lsfi |
Level set flat index. |
num_intervals |
Number of intervals. |
A multi-index corresponding to the flat index.