# 🦋 LSDIPro SS2025

## 📄 [SilkMoth: An Efficient Method for Finding Related Sets](https://doi.org/10.14778/3115404.3115413)

A project inspired by the SilkMoth paper, exploring efficient techniques for related set discovery.

---

## 👥 Team Members

- **Andreas Wilms**
- **Sarra Daknou**
- **Amina Iqbal**
- **Jakob Berschneider**

---

## 📊 Experiments & Results

➡️ [**See Experiments**](experiments/README.md)

---

## 🧪 Interactive Demo

Follow our **step-by-step Jupyter Notebook demo** for a hands-on understanding of SilkMoth:

📓 [**Open demo_example.ipynb**](demo_example.ipynb)

---

## Table of Contents

- [1. Large Scale Data Integration Project (LSDIPro)](#1-large-scale-data-integration-project-lsdipro)
- [2. What is SilkMoth? 🐛](#2-what-is-silkmoth)
- [3. The Problem 🧩](#3-the-problem)
- [4. SilkMoth’s Solution 🚀](#4-silkmoths-solution)
- [5. Core Pipeline Steps 🔁](#5-core-pipeline-steps)
  - [5.1 Tokenization](#51-tokenization)
  - [5.2 Inverted Index Construction](#52-inverted-index-construction)
  - [5.3 Signature Generation](#53-signature-generation)
  - [5.4 Candidate Selection](#54-candidate-selection)
  - [5.5 Refinement Filters](#55-refinement-filters)
  - [5.6 Verification via Maximum Matching](#56-verification-via-maximum-matching)
- [6. Modes of Operation 🧪](#6-modes-of-operation-)
- [7. Supported Similarity Functions 📐](#7-supported-similarity-functions-)
- [8. Installing from Source](#8-installing-from-source)
- [9. Experiment Results](#9-experiment-results)

---

## 1. Large Scale Data Integration Project (LSDIPro)

As part of the university project LSDIPro, our team implemented the SilkMoth paper in Python. The course focuses on large-scale data integration, where student groups reproduce and extend research prototypes. The project emphasizes scalable algorithm design, evaluation, and handling heterogeneous data at scale.

---

## 2. What is SilkMoth?

**SilkMoth** is a system designed to efficiently discover related sets in large collections of data, even when the elements within those sets are only approximately similar. This is especially important in **data integration**, **data cleaning**, and **information retrieval**, where messy or inconsistent data is common.

---

## 3. The Problem

Determining whether two sets are related, for example, whether two database columns should be joined, often involves comparing their elements using **similarity functions** (not just exact matches). A powerful approach models this as a **bipartite graph** and finds the **maximum matching score** between elements. However, this method is **computationally expensive** (`O(n³)` per pair), making it impractical for large datasets.

---

## 4. SilkMoth’s Solution

SilkMoth tackles this with a three-step approach:

1. **Signature Generation**: Creates a compact signature for each set such that any related set must share at least one signature token.
2. **Pruning**: Filters out unrelated sets early, reducing the number of candidates.
3. **Verification**: Applies the costly matching metric only to the remaining candidates, producing the same results as brute force, but faster.

---

## 5. Core Pipeline Steps

![Figure 1: SILKMOTH Framework Overview](figures/Pipeline.png)

*Figure 1. SILKMOTH pipeline framework. Source: Deng et al., "SILKMOTH: An Efficient Method for Finding Related Sets with Maximum Matching Constraints", VLDB 2017. Licensed under CC BY-NC-ND 4.0.*

### [5.1 Tokenization](pages/tokenizer.md)

Each element in every set is tokenized according to the selected similarity function (see the sketch after this list):

- **Jaccard Similarity**: Elements are split into whitespace-delimited tokens.
- **Edit Similarity**: Elements are split into overlapping `q`-grams (e.g., 3-grams).
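
The following minimal sketch illustrates both tokenization schemes. Function names and the lowercasing step are illustrative assumptions, not the project's actual API:

```python
def whitespace_tokens(element: str) -> set[str]:
    """Tokenization used with Jaccard similarity: whitespace-delimited tokens."""
    # Lowercasing is just one possible normalization choice.
    return set(element.lower().split())


def qgram_tokens(element: str, q: int = 3) -> set[str]:
    """Tokenization used with edit similarity: overlapping q-grams."""
    s = element.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


print(whitespace_tokens("New York City"))  # {'new', 'york', 'city'}
print(qgram_tokens("york"))                # {'yor', 'ork'}
```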
### [5.2 Inverted Index Construction](pages/inverted_index.md)

An **inverted index** is built from the reference set `R`, mapping each token to the list of `(set, element)` pairs in which it occurs. This allows fast lookup of candidate sets that share tokens with a query.

### [5.3 Signature Generation](pages/signature_generator.md)

A **signature** is a subset of tokens selected from each set such that:

- Any related set must share at least one signature token.
- The signature size is minimized to reduce the candidate space.

Signature selection heuristics (e.g., cost/value greedy ranking) approximate the optimal valid signature, which is NP-complete to compute exactly.

### [5.4 Candidate Selection](pages/candidate_selector.md)

For each set `R`, retrieve from the inverted index all sets `S` that share at least one token with `R`’s signature. These become **candidate sets** for further evaluation.

### [5.5 Refinement Filters](pages/candidate_selector.md)

Two filters reduce false positives among the candidates:

- **Check Filter**: Uses an upper bound on the similarity to eliminate sets that cannot reach the threshold.
- **Nearest Neighbor Filter**: Approximates the maximum matching score using the nearest-neighbor similarity of each element in `R`.

### [5.6 Verification via Maximum Matching](pages/verifier.md)

For the remaining candidates, compute the **maximum weighted bipartite matching** between the elements of `R` and `S`, using the similarity function as edge weights. Sets whose matching score meets or exceeds the threshold `δ` are considered **related**. A minimal sketch of this step is shown at the end of this README.

---

## 6. Modes of Operation 🧪

- **Discovery Mode**: Compare all pairs of sets to find all related pairs.
  *Use case:* Finding related columns in databases.
- **Search Mode**: Given a reference set, find all sets related to it.
  *Use case:* Schema matching or entity deduplication.

---

## 7. Supported Similarity Functions 📐

- **Jaccard Similarity**
- **Edit Similarity** (Levenshtein-based)
- Optional minimum similarity threshold `α` on element comparisons.

---

## 8. Installing from Source

1. Run `pip install src/` to install the package.

---

## 9. Experiment Results

[📊 See Experiments and Results](experiments/README.md)
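
---

To make the verification step from Section 5.6 concrete, here is a minimal, self-contained sketch of maximum weighted bipartite matching with Jaccard edge weights, using `scipy.optimize.linear_sum_assignment`. The helper names and the normalization by `|R|` are assumptions for illustration, not necessarily the project's API or the paper's exact relatedness definition:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def matching_score(R: list[set[str]], S: list[set[str]]) -> float:
    """Maximum weighted bipartite matching between the elements of R and S."""
    weights = np.array([[jaccard(r, s) for s in S] for r in R])
    # SciPy solves a minimum-cost assignment, so negate the weights to maximize.
    rows, cols = linear_sum_assignment(-weights)
    return weights[rows, cols].sum()


def is_related(R: list[set[str]], S: list[set[str]], delta: float) -> bool:
    # One possible normalization (by |R|); the paper discusses several variants.
    return matching_score(R, S) / len(R) >= delta


# Toy example: elements are already tokenized into sets of tokens.
R = [{"new", "york"}, {"san", "francisco"}]
S = [{"new", "york", "city"}, {"san", "francisco", "bay"}]
print(is_related(R, S, delta=0.6))  # True (score ≈ 1.33, normalized ≈ 0.67)
```

SilkMoth's point is to avoid running this expensive step for most pairs: the signature, check, and nearest-neighbor filters prune candidates first, and only the survivors are verified this way.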