Andreas Wilms 306fce9b53 init
Update README.md
2025-09-09 19:45:44 +02:00

🦋 LSDIPro SS2025

A project inspired by the SilkMoth paper, exploring efficient techniques for related set discovery.


👥 Team Members

  • Andreas Wilms
  • Sarra Daknou
  • Amina Iqbal
  • Jakob Berschneider

📊 Experiments & Results

➡️ See Experiments


🧪 Interactive Demo

Follow our step-by-step Jupyter Notebook demo for a hands-on understanding of SilkMoth.

📓 Open demo_example.ipynb


📘 Project Documentation


1. Large Scale Data Integration Project (LSDIPro)

As part of the university project LSDIPro, our team implemented the SilkMoth paper in Python. The course focuses on large-scale data integration, where student groups reproduce and extend research prototypes.
The project emphasizes scalable algorithm design, evaluation, and handling heterogeneous data at scale.


2. What is SilkMoth?

SilkMoth is a system designed to efficiently discover related sets in large collections of data, even when the elements within those sets are only approximately similar.
This is especially important in data integration, data cleaning, and information retrieval, where messy or inconsistent data is common.


3. The Problem

Determining whether two sets are related, for example, whether two database columns should be joined, often involves comparing their elements using similarity functions (not just exact matches).
A powerful approach models this as a bipartite graph over the elements of the two sets and computes the maximum matching score between them. However, this is computationally expensive (O(n³) per pair of sets, where n is the number of elements per set), which makes it impractical to apply to every pair in a large dataset.


4. SilkMoth's Solution

SilkMoth tackles this with a three-step approach:

  1. Signature Generation: Creates compact signatures for each set, ensuring related sets share signature parts.
  2. Pruning: Filters out unrelated sets early, reducing candidates.
  3. Verification: Applies the costly matching metric only to the remaining candidates, returning the same results as a brute-force comparison at lower cost.

5. Core Pipeline Steps

Figure 1: SILKMOTH Framework Overview

Figure 1. SILKMOTH pipeline framework. Source: Deng et al., "SILKMOTH: An Efficient Method for Finding Related Sets with Maximum Matching Constraints", VLDB 2017. Licensed under CC BY-NC-ND 4.0.

5.1 Tokenization

Each element in every set is tokenized based on the selected similarity function:

  • Jaccard Similarity: Elements are split into whitespace-delimited tokens.
  • Edit Similarity: Elements are split into overlapping q-grams (e.g., 3-grams).
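
The two tokenization schemes above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the handling of strings shorter than q is an assumption.

```python
def whitespace_tokens(element: str) -> set[str]:
    """Tokens for Jaccard similarity: split on runs of whitespace."""
    return set(element.split())


def qgram_tokens(element: str, q: int = 3) -> set[str]:
    """Tokens for edit similarity: all overlapping substrings of length q."""
    if len(element) < q:
        # Assumption: a string shorter than q is kept as a single token.
        return {element} if element else set()
    return {element[i:i + q] for i in range(len(element) - q + 1)}


print(sorted(whitespace_tokens("77 Mass Ave")))  # ['77', 'Ave', 'Mass']
print(sorted(qgram_tokens("Boston")))            # ['Bos', 'ost', 'sto', 'ton']
```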

5.2 Inverted Index Construction

An inverted index is built over the collection of sets to be searched, mapping each token to the list of (set, element) pairs in which it occurs.
This allows fast lookup of candidate sets sharing tokens with a query.
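
A minimal sketch of such an index, assuming whitespace tokens and using list positions as set and element identifiers. The function name and the example data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(sets: list[list[str]]) -> dict[str, list[tuple[int, int]]]:
    """Map each whitespace token to the (set id, element id) pairs containing it."""
    index = defaultdict(list)
    for set_id, elements in enumerate(sets):
        for elem_id, element in enumerate(elements):
            # Deduplicate tokens so each (set, element) pair is listed once per token.
            for token in sorted(set(element.split())):
                index[token].append((set_id, elem_id))
    return dict(index)


collection = [
    ["77 Mass Ave", "5th St Boston"],
    ["77 Massachusetts Ave", "Fifth St Boston"],
]
index = build_inverted_index(collection)
print(index["Boston"])  # [(0, 1), (1, 1)]
```

The length of a token's list is also a natural measure of how expensive that token is to use in a signature, since every entry is a potential candidate.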

5.3 Signature Generation

A signature is a subset of tokens selected from each set such that:

  • Any related set must share at least one signature token.
  • Signature size is minimized to reduce candidate space.

Signature selection heuristics (e.g., cost/value greedy ranking) approximate the optimal valid signature, which is NP-complete to compute exactly.
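
The greedy idea can be sketched as below. This is a simplified illustration under stated assumptions, not the project's implementation: it assumes Jaccard similarity on token sets, takes the cost of a token to be the length of its inverted list (supplied here as a hypothetical token_cost map), and uses the following validity test: if a set shares no signature token with R, each element r can contribute at most (|r| - k) / |r| to the matching score, where k is the number of signature tokens in r. Once the sum of these bounds falls below the required score θ, any related set must share a signature token.

```python
def greedy_signature(elements: list[set[str]], token_cost: dict[str, int], theta: float) -> set[str]:
    """Greedily pick tokens by cost/value until the bound drops below theta."""
    sizes = [len(e) for e in elements]
    covered = [0] * len(elements)  # number of signature tokens in each element
    candidates = set().union(*elements)
    signature: set[str] = set()

    def bound() -> float:
        # Best matching score reachable by a set sharing no signature token.
        return sum((n - c) / n for n, c in zip(sizes, covered) if n)

    def ratio(token: str) -> float:
        # Value: how much selecting this token lowers the bound.
        value = sum(1 / n for e, n in zip(elements, sizes) if token in e)
        return token_cost.get(token, 0) / value

    while candidates and bound() >= theta:
        best = min(sorted(candidates), key=ratio)  # sorted() makes ties deterministic
        candidates.remove(best)
        signature.add(best)
        for i, e in enumerate(elements):
            if best in e:
                covered[i] += 1
    return signature


R = [{"77", "Mass", "Ave"}, {"5th", "St", "Boston"}]
cost = {"77": 1, "Mass": 1, "Ave": 5, "5th": 1, "St": 6, "Boston": 4}
sig = greedy_signature(R, cost, theta=0.7 * len(R))
print(sorted(sig))
```

Rare tokens are preferred because they pull few candidates out of the inverted index while still tightening the bound.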

5.4 Candidate Selection

For each set R, retrieve from the inverted index all sets S that share at least one token with R's signature. These become the candidate sets for further evaluation.
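
A minimal sketch of this lookup, assuming an inverted index shaped as token -> list of (set id, element id) pairs. The index contents and the function name are made up for illustration.

```python
def select_candidates(signature: set[str], index: dict[str, list[tuple[int, int]]]) -> set[int]:
    """Return the ids of all sets containing at least one signature token."""
    candidates = set()
    for token in signature:
        for set_id, _elem_id in index.get(token, []):
            candidates.add(set_id)
    return candidates


# Hypothetical inverted index: token -> (set id, element id) pairs.
index = {
    "77": [(0, 0), (1, 0)],
    "Boston": [(0, 1), (1, 1), (2, 0)],
    "Berlin": [(3, 0)],
}
print(sorted(select_candidates({"77", "Boston"}, index)))  # [0, 1, 2]
```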

5.5 Refinement Filters

Two filters reduce false positives among candidates:

  • Check Filter: Uses an upper bound on similarity to eliminate candidate sets that cannot reach the threshold.
  • Nearest Neighbor Filter: Approximates maximum matching score using nearest neighbor similarity for each element in R.
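
The nearest neighbor idea can be sketched as follows (the check filter is not shown). In any matching, an element of R is paired with at most one element of S, so its contribution cannot exceed its best similarity to any element of S. Summing these maxima gives an upper bound on the matching score, and a candidate whose bound is below the required score can be dropped without running the matching. The sketch assumes Jaccard similarity on token sets; all names are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbor_bound(r_elements: list[set[str]], s_elements: list[set[str]]) -> float:
    """Sum, over the elements of R, of the best similarity to any element of S."""
    return sum(
        max((jaccard(r, s) for s in s_elements), default=0.0)
        for r in r_elements
    )


def passes_nn_filter(r_elements, s_elements, theta: float) -> bool:
    """Keep the candidate only if the upper bound can still reach theta."""
    return nearest_neighbor_bound(r_elements, s_elements) >= theta


R = [{"77", "Mass", "Ave"}, {"5th", "St", "Boston"}]
S = [{"77", "Massachusetts", "Ave"}, {"Fifth", "St", "Boston"}]
print(nearest_neighbor_bound(R, S))         # 1.0
print(passes_nn_filter(R, S, theta=1.4))    # False
```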

5.6 Verification via Maximum Matching

For each remaining candidate, compute the maximum weighted bipartite matching between the elements of R and S, using the similarity function as edge weights.
Sets meeting or exceeding threshold δ are considered related.
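
The verification step can be illustrated with a small exact solver. This sketch uses a bitmask dynamic program rather than the O(n³) Hungarian algorithm, so it is only suitable for small sets. The normalization of the score by |R| before comparing with δ is an assumption made for the example, as are all function names.

```python
from functools import lru_cache


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_matching_score(r_elements, s_elements, sim=jaccard) -> float:
    """Exact maximum weighted bipartite matching score (exponential in |S|)."""
    weights = [[sim(r, s) for s in s_elements] for r in r_elements]
    m = len(s_elements)

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == len(weights):
            return 0.0
        score = best(i + 1, used)  # option: leave element i of R unmatched
        for j in range(m):
            if not used & (1 << j):  # element j of S is still free
                score = max(score, weights[i][j] + best(i + 1, used | (1 << j)))
        return score

    return best(0, 0)


def is_related(r_elements, s_elements, delta: float) -> bool:
    # Assumption: the score is normalized by |R| before comparing with delta.
    return max_matching_score(r_elements, s_elements) / len(r_elements) >= delta


R = [{"a", "b"}, {"a", "c"}]
S = [{"a", "b"}, {"a", "b", "c"}]
print(round(max_matching_score(R, S), 3))  # 1.667
print(is_related(R, S, delta=0.8))         # True
```

For larger sets, a polynomial-time assignment solver such as scipy.optimize.linear_sum_assignment with maximize=True would be the more practical choice.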


6. Modes of Operation 🧪

  • Discovery Mode: Compare all pairs of sets to find all related pairs.
    Use case: Finding related columns in databases.

  • Search Mode: Given a reference set, find all related sets.
    Use case: Schema matching or entity deduplication.
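
The difference between the two modes can be shown with a brute-force reference sketch that skips signatures and filters entirely. The relatedness function here is a deliberately simple placeholder (exact-match Jaccard overlap of the two sets), not the maximum matching metric, and all names are hypothetical.

```python
from itertools import combinations


def relatedness(r: set[str], s: set[str]) -> float:
    """Placeholder metric: Jaccard overlap of the two sets, exact matches only."""
    union = r | s
    return len(r & s) / len(union) if union else 0.0


def search(reference: set[str], collection: list[set[str]], delta: float) -> list[int]:
    """Search mode: ids of all sets in the collection related to one reference set."""
    return [i for i, s in enumerate(collection) if relatedness(reference, s) >= delta]


def discover(collection: list[set[str]], delta: float) -> list[tuple[int, int]]:
    """Discovery mode: all related pairs within the collection."""
    return [
        (i, j)
        for (i, r), (j, s) in combinations(enumerate(collection), 2)
        if relatedness(r, s) >= delta
    ]


collection = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}]
print(search({"a", "b"}, collection, delta=0.5))  # [0, 1]
print(discover(collection, delta=0.5))            # [(0, 1)]
```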


7. Supported Similarity Functions 📐

  • Jaccard Similarity
  • Edit Similarity (Levenshtein-based)
  • Optional minimum similarity threshold α on element comparisons.
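
A minimal sketch of these functions. Two points are assumptions rather than facts about this repository: edit similarity is normalized here as 1 - LD / max(|x|, |y|), which is one common choice, and scores below α are treated as 0. The function names are hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity on whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    # Assumption: normalize the distance by the longer string's length.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def with_alpha(sim, alpha: float):
    """Wrap a similarity function so that scores below alpha count as 0."""
    def thresholded(a: str, b: str) -> float:
        score = sim(a, b)
        return score if score >= alpha else 0.0
    return thresholded


print(jaccard_similarity("77 Mass Ave", "77 Massachusetts Ave"))  # 0.5
print(levenshtein("kitten", "sitting"))                           # 3
print(with_alpha(edit_similarity, 0.6)("kitten", "sitting"))      # 0.0
```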

8. Installing from Source

  1. Run pip install src/ to install

9. Experiment Results

📊 See Experiments and Results
