Andreas Wilms
2025-09-08 19:05:42 +02:00
commit d85c1c86df
153 changed files with 140246 additions and 0 deletions

15
.gitignore vendored Normal file

@@ -0,0 +1,15 @@
.venv/
__pycache__/
silkmoth.egg-info/
build/
dist/
site/
reference_sets_inclusion_dependency.json
reference_sets_inclusion_dependency_reduction.json
source_sets_inclusion_dependency.json
webtable_schemas_sets_500k.json
github_webtable_schemas_sets_500k.json
.vscode/
silkmoth_env/

157
README.md Normal file

@@ -0,0 +1,157 @@
# 🦋 LSDIPro SS2025
## 📄 [SilkMoth: An Efficient Method for Finding Related Sets](https://doi.org/10.14778/3115404.3115413)
A project inspired by the SilkMoth paper, exploring efficient techniques for related set discovery.
---
## 👥 Team Members
- **Andreas Wilms**
- **Sarra Daknou**
- **Amina Iqbal**
- **Jakob Berschneider**
---
## 📊 Experiments & Results
➡️ [**See Experiments**](experiments/README.md)
---
## 📚 Check out our documentation site
👉 [andre-devv.github.io/LSDIPro-SilkMoth](https://andre-devv.github.io/LSDIPro-SilkMoth/)
---
## 🧪 Interactive Demo
Follow our **step-by-step Jupyter Notebook demo** for a hands-on understanding of SilkMoth.
📓 [**Open demo_example.ipynb**](demo_example.ipynb)
---
# 📘 Project Documentation
## Table of Contents
- [1. Large Scale Data Integration Project (LSDIPro)](#1-large-scale-data-integration-project-lsdipro)
- [2. What is SilkMoth? 🐛](#2-what-is-silkmoth)
- [3. The Problem 🧩](#3-the-problem)
- [4. SilkMoth's Solution 🚀](#4-silkmoths-solution)
- [5. Core Pipeline Steps 🔁](#5-core-pipeline-steps)
- [5.1 Tokenization](#51-tokenization)
- [5.2 Inverted Index Construction](#52-inverted-index-construction)
- [5.3 Signature Generation](#53-signature-generation)
- [5.4 Candidate Selection](#54-candidate-selection)
- [5.5 Refinement Filters](#55-refinement-filters)
- [5.6 Verification via Maximum Matching](#56-verification-via-maximum-matching)
- [6. Modes of Operation 🧪](#6-modes-of-operation-)
- [7. Supported Similarity Functions 📐](#7-supported-similarity-functions-)
- [8. Installing from Source](#8-installing-from-source)
- [9. Experiment Results](#9-experiment-results)
---
## 1. Large Scale Data Integration Project (LSDIPro)
As part of the university project LSDIPro, our team implemented the SilkMoth paper in Python. The course focuses on large-scale data integration, where student groups reproduce and extend research prototypes.
The project emphasizes scalable algorithm design, evaluation, and handling heterogeneous data at scale.
---
## 2. What is SilkMoth?
**SilkMoth** is a system designed to efficiently discover related sets in large collections of data, even when the elements within those sets are only approximately similar.
This is especially important in **data integration**, **data cleaning**, and **information retrieval**, where messy or inconsistent data is common.
---
## 3. The Problem
Determining whether two sets are related, for example, whether two database columns should be joined, often involves comparing their elements using **similarity functions** (not just exact matches).
A powerful approach models this as a **bipartite graph** and finds the **maximum matching score** between elements. However, this method is **computationally expensive** (`O(n³)` per pair), making it impractical for large datasets.
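
To make the matching score concrete, here is a minimal sketch that scores two tiny sets by trying every one-to-one assignment. It is a brute-force illustration, not the `O(n³)` algorithm mentioned above, and the names `jaccard` and `matching_score` are hypothetical helpers rather than part of the `silkmoth` package.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-token sets of two strings."""
    x, y = set(a.split()), set(b.split())
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def matching_score(R: list[str], S: list[str]) -> float:
    """Best one-to-one assignment score, found by trying every assignment.

    Exponential in the set size, so only usable for toy inputs.
    """
    small, large = (R, S) if len(R) <= len(S) else (S, R)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(jaccard(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return best


R = ["77 Mass Ave Boston MA", "5th St 02115 Seattle WA"]
S = ["5th St 02115 Seattle WA", "77 Mass Ave MA"]

# The best assignment pairs R[0] with S[1] and R[1] with S[0].
score = matching_score(R, S)
print(f"matching score: {score:.3f}")
```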
---
## 4. SilkMoth's Solution
SilkMoth tackles this with a three-step approach:
1. **Signature Generation**: Creates compact signatures for each set, ensuring related sets share signature parts.
2. **Pruning**: Filters out unrelated sets early, reducing candidates.
3. **Verification**: Applies the costly matching metric only to the remaining candidates, returning the same results as brute force in less time.
---
## 5. Core Pipeline Steps
![Figure 1: SILKMOTH Framework Overview](docs/figures/Pipeline.png)
*Figure 1. SILKMOTH pipeline framework. Source: Deng et al., "SILKMOTH: An Efficient Method for Finding Related Sets with Maximum Matching Constraints", VLDB 2017. Licensed under CC BY-NC-ND 4.0.*
### 5.1 Tokenization
Each element in every set is tokenized based on the selected similarity function:
- **Jaccard Similarity**: Elements are split into whitespace-delimited tokens.
- **Edit Similarity**: Elements are split into overlapping `q`-grams (e.g., 3-grams).
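
The two tokenization schemes can be sketched as follows. The function names are hypothetical, and the handling of strings shorter than `q` is an assumption made for this illustration; the package's `Tokenizer` may behave differently.

```python
def whitespace_tokens(element: str) -> set[str]:
    """Tokens for Jaccard similarity: split on whitespace."""
    return set(element.split())


def qgrams(element: str, q: int = 3) -> set[str]:
    """Tokens for edit similarity: all overlapping substrings of length q.

    Assumption: a non-empty string shorter than q yields itself as one token.
    """
    if len(element) < q:
        return {element} if element else set()
    return {element[i:i + q] for i in range(len(element) - q + 1)}


print(whitespace_tokens("77 Mass Ave Boston MA"))
print(qgrams("Boston", q=3))
```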
### 5.2 Inverted Index Construction
An **inverted index** is built over the collection of source sets `S` to map each token to a list of `(set, element)` pairs in which it occurs.
This allows fast lookup of candidate sets sharing tokens with a query.
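
A minimal sketch of such an index, assuming whitespace tokens and a plain dictionary as the backing structure. `build_inverted_index` is a hypothetical helper, not the package's `InvertedIndex` class; the data is the example from the demo notebook.

```python
from collections import defaultdict


def build_inverted_index(tokenized_sets):
    """Map each token to the (set_id, element_id) pairs in which it occurs."""
    index = defaultdict(list)
    for set_id, elements in enumerate(tokenized_sets):
        for elem_id, tokens in enumerate(elements):
            for token in tokens:
                index[token].append((set_id, elem_id))
    return dict(index)


source_sets = [
    ["Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115"],
    ["77 Boston MA", "77 5th St Boston 02115", "77 Mass Ave 02115 Seattle"],
    ["77 Mass Ave 5th Boston MA", "Mass Ave Chicago IL", "77 Mass Ave St"],
    ["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"],
]
tokenized = [[set(s.split()) for s in S] for S in source_sets]
index = build_inverted_index(tokenized)

# Every (set, element) position that contains the token "Mass"
print(index["Mass"])
```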
### 5.3 Signature Generation
A **signature** is a subset of tokens selected from each set such that:
- Any related set must share at least one signature token.
- Signature size is minimized to reduce candidate space.
Signature selection heuristics (e.g., cost/value greedy ranking) approximate the optimal valid signature, because computing it exactly is NP-complete.
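
The greedy cost/value idea can be sketched as below. This is a simplified reading of the weighted scheme for Jaccard similarity under set containment, not the package's `SignatureGenerator`. It assumes that a set sharing no signature token with `R` can score at most the sum over elements `r_i` of `(|r_i| - |k_i|) / |r_i|`, where `k_i` is the part of the signature inside `r_i`, and that the signature is valid once this bound drops below `δ · |R|`. The cost of a token is taken to be its inverted-list length, and ties are broken alphabetically; both are assumptions of this sketch.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def greedy_signature(tokenized_R, token_cost, delta):
    """Pick tokens by ascending cost/value until the bound falls below delta * |R|."""
    theta = Fraction(str(delta)) * len(tokenized_R)
    # value[t]: how much selecting t lowers the bound (1/|r| per element containing t)
    value = defaultdict(Fraction)
    for r in tokenized_R:
        for t in r:
            value[t] += Fraction(1, len(r))
    ranked = sorted(value, key=lambda t: (Fraction(token_cost.get(t, 0)) / value[t], t))
    signature, bound = [], Fraction(len(tokenized_R))  # empty signature: bound is |R|
    for t in ranked:
        if bound < theta:
            break
        signature.append(t)
        bound -= value[t]
    return signature, bound


reference_set = ["77 Mass Ave Boston MA", "5th St 02115 Seattle WA", "77 5th St Chicago IL"]
source_sets = [
    ["Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115"],
    ["77 Boston MA", "77 5th St Boston 02115", "77 Mass Ave 02115 Seattle"],
    ["77 Mass Ave 5th Boston MA", "Mass Ave Chicago IL", "77 Mass Ave St"],
    ["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"],
]
tokenized_R = [set(r.split()) for r in reference_set]
# Cost of a token: number of source elements that contain it
token_cost = Counter(t for S in source_sets for s in S for t in set(s.split()))

signature, bound = greedy_signature(tokenized_R, token_cost, delta=0.7)
print(signature, float(bound))
```

Exact fractions are used so that ties in the cost/value ratio are not decided by floating-point rounding.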
### 5.4 Candidate Selection
For each reference set `R`, retrieve from the inverted index all sets `S` sharing at least one token with `R`'s signature. These become **candidate sets** for further evaluation.
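
As a rough sketch, candidate selection is a union over the inverted lists of the signature tokens. `select_candidates` is a hypothetical helper, and the small `index` below is a hand-written excerpt of an index over the demo notebook's source sets, included only to keep the example self-contained.

```python
def select_candidates(signature, index):
    """Ids of all sets that contain at least one signature token."""
    return {set_id for token in signature for set_id, _ in index.get(token, [])}


# Excerpt of an inverted index: token -> [(set_id, element_id), ...]
index = {
    "Chicago": [(2, 1)],
    "IL": [(2, 1)],
    "WA": [(3, 1)],
    "5th": [(0, 1), (0, 2), (1, 1), (2, 0), (3, 1), (3, 2)],
}
signature = ["Chicago", "WA", "IL", "5th"]

candidates = select_candidates(signature, index)
print(sorted(candidates))
```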
### 5.5 Refinement Filters
Two filters reduce false positives among candidates:
- **Check Filter**: Uses an upper bound on similarity to eliminate sets below threshold.
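- **Nearest Neighbor Filter**: Bounds the maximum matching score from above by summing, for each element in `R`, the similarity to its nearest neighbor in the candidate set, and prunes candidates whose bound falls below the threshold.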
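
Both filters can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the package's `CandidateSelector`: the check filter keeps a candidate if some element `r_i` has a partner sharing one of its signature tokens with similarity at least `(|r_i| - |k_i|) / |r_i|`, and the nearest-neighbor bound is computed from scratch instead of reusing earlier results. The data and signature are taken from the demo notebook.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def check_filter(R, S, signature):
    """Return {i: best_sim} for elements r_i that pass their local bound.

    An empty result means the candidate S can be pruned.
    """
    passed = {}
    for i, r in enumerate(R):
        k = r & signature  # signature tokens inside r_i
        if not k:
            continue
        bound = (len(r) - len(k)) / len(r)
        sims = [jaccard(r, s) for s in S if s & k]
        if sims and max(sims) >= bound:
            passed[i] = max(sims)
    return passed


def nn_upper_bound(R, S):
    """Sum of nearest-neighbor similarities; never below the matching score."""
    return sum(max((jaccard(r, s) for s in S), default=0.0) for r in R)


R = [set(r.split()) for r in
     ["77 Mass Ave Boston MA", "5th St 02115 Seattle WA", "77 5th St Chicago IL"]]
sets = [[set(s.split()) for s in S] for S in [
    ["Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115"],
    ["77 Boston MA", "77 5th St Boston 02115", "77 Mass Ave 02115 Seattle"],
    ["77 Mass Ave 5th Boston MA", "Mass Ave Chicago IL", "77 Mass Ave St"],
    ["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"],
]]
signature = {"Chicago", "WA", "IL", "5th"}
delta = 0.7

after_check = [j for j, S in enumerate(sets) if check_filter(R, S, signature)]
after_nn = [j for j in after_check if nn_upper_bound(R, sets[j]) >= delta * len(R)]
print(after_check, after_nn)
```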
### 5.6 Verification via Maximum Matching
Compute **maximum weighted bipartite matching** between elements of `R` and `S` for remaining candidates using the similarity function as edge weights.
Sets meeting or exceeding threshold `δ` are considered **related**.
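
A minimal verification sketch, assuming set containment is the matching score divided by `|R|`. The matching is solved exactly with a memoized search over subsets of `S`, which is a stand-in chosen for brevity and only suitable for small sets; it is not the package's `Verifier`, and all function names are hypothetical.

```python
from functools import lru_cache


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_matching_score(R, S):
    """Exact maximum-weight bipartite matching via search over used-S bitmasks."""

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == len(R):
            return 0.0
        score = best(i + 1, used)  # option: leave R[i] unmatched
        for j in range(len(S)):
            if not used & (1 << j):
                score = max(score, jaccard(R[i], S[j]) + best(i + 1, used | (1 << j)))
        return score

    return best(0, 0)


def verify(R, candidates, delta):
    """Keep candidates whose containment score (matching / |R|) reaches delta."""
    results = []
    for j, S in candidates.items():
        relatedness = max_matching_score(R, S) / len(R)
        if relatedness >= delta:
            results.append((j, relatedness))
    return results


def tokenize(elements):
    return [frozenset(e.split()) for e in elements]


R = tokenize(["77 Mass Ave Boston MA", "5th St 02115 Seattle WA", "77 5th St Chicago IL"])
candidates = {
    0: tokenize(["Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115"]),
    3: tokenize(["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"]),
}

results = verify(R, candidates, delta=0.7)
for j, score in results:
    print(f"S[{j}] is related with score {score:.3f}")
```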
---
## 6. Modes of Operation 🧪
- **Discovery Mode**: Compare all pairs of sets to find all related pairs.
*Use case:* Finding related columns in databases.
- **Search Mode**: Given a reference set, find all related sets.
*Use case:* Schema matching or entity deduplication.
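
The difference between the two modes is only in which pairs are examined, as this sketch shows. `is_related` stands for any relatedness test; the toy version below (share of exactly equal elements) is a hypothetical placeholder, not the SilkMoth metric. Ordered pairs are used in discovery because containment is not symmetric.

```python
from itertools import permutations


def search(reference, sets, is_related):
    """Search mode: indices of all sets related to one reference set."""
    return [j for j, S in enumerate(sets) if is_related(reference, S)]


def discover(sets, is_related):
    """Discovery mode: all ordered pairs (i, j) with sets[i] related to sets[j]."""
    return [(i, j) for i, j in permutations(range(len(sets)), 2)
            if is_related(sets[i], sets[j])]


def toy_is_related(R, S, delta=0.5):
    """Placeholder metric: fraction of R's elements found verbatim in S."""
    return len(set(R) & set(S)) / len(R) >= delta


sets = [["a", "b"], ["a", "b", "c"], ["x"]]
found = search(["a", "b"], sets, toy_is_related)
pairs = discover(sets, toy_is_related)
print(found, pairs)
```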
---
## 7. Supported Similarity Functions 📐
- **Jaccard Similarity**
- **Edit Similarity** (Levenshtein-based)
- Optional minimum similarity threshold `α` on element comparisons.
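
For orientation, the similarity functions can be sketched as follows. The normalisation of edit similarity by the longer string is one common choice and an assumption here; the package's `edit_similarity` may normalise differently. `apply_alpha` assumes that similarities below `α` are treated as zero.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def apply_alpha(sim: float, alpha: float) -> float:
    """Treat element similarities below alpha as 0 (assumed semantics)."""
    return sim if sim >= alpha else 0.0


print(jaccard_similarity({"77", "Mass", "Ave"}, {"77", "Mass", "St"}))
print(edit_similarity("kitten", "sitting"))
print(apply_alpha(0.3, alpha=0.5))
```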
---
## 8. Installing from Source
1. From the repository root, run `pip install src/` to install the `silkmoth` package.
---
## 9. Experiment Results
[📊 See Experiments and Results](experiments/README.md)

823
demo_example.ipynb Normal file

@@ -0,0 +1,823 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c9f89a47",
"metadata": {},
"source": [
"## SilkMoth Demo"
]
},
{
"cell_type": "markdown",
"id": "2ca15800",
"metadata": {},
"source": [
"### Related Set Discovery task under Set-Containment using Jaccard Similarity"
]
},
{
"cell_type": "markdown",
"id": "ea6ce5fb",
"metadata": {},
"source": [
"Import of all required modules:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "bdd1b92c",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"src\")\n",
"\n",
"from silkmoth.tokenizer import Tokenizer\n",
"from silkmoth.inverted_index import InvertedIndex\n",
"from silkmoth.signature_generator import SignatureGenerator\n",
"from silkmoth.candidate_selector import CandidateSelector\n",
"from silkmoth.verifier import Verifier\n",
"from silkmoth.silkmoth_engine import SilkMothEngine\n",
"\n",
"\n",
"from silkmoth.utils import jaccard_similarity, contain, edit_similarity, similar, SigType\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from IPython.display import display, Markdown\n",
"\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "bf6bf1f5",
"metadata": {},
"source": [
"Define the example dataset from the SilkMoth paper (reference set **R** and source sets **S**):\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "598a4bbf",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Reference set (R):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- R[0]: “77 Mass Ave Boston MA”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- R[1]: “5th St 02115 Seattle WA”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- R[2]: “77 5th St Chicago IL”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"**Source sets (S):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Location Dataset\n",
"reference_set = [\n",
"    '77 Mass Ave Boston MA',\n",
"    '5th St 02115 Seattle WA',\n",
"    '77 5th St Chicago IL'\n",
"]\n",
"\n",
"# Address Dataset\n",
"source_sets = [\n",
"    ['Mass Ave St Boston 02115','77 Mass 5th St Boston','77 Mass Ave 5th 02115'],\n",
"    ['77 Boston MA','77 5th St Boston 02115','77 Mass Ave 02115 Seattle'],\n",
"    ['77 Mass Ave 5th Boston MA','Mass Ave Chicago IL','77 Mass Ave St'],\n",
"    ['77 Mass Ave MA','5th St 02115 Seattle WA','77 5th St Boston Seattle']\n",
"]\n",
"\n",
"# thresholds & q\n",
"δ = 0.7\n",
"α = 0.0\n",
"q = 3\n",
"\n",
"display(Markdown(\"**Reference set (R):**\"))\n",
"for i, r in enumerate(reference_set):\n",
"    display(Markdown(f\"- R[{i}]: “{r}”\"))\n",
"display(Markdown(\"**Source sets (S):**\"))\n",
"for j, S in enumerate(source_sets):\n",
"    display(Markdown(f\"- S[{j}]: “{' | '.join(S)}”\"))"
]
},
{
"cell_type": "markdown",
"id": "a50b350a",
"metadata": {},
"source": [
"### 1. Tokenization\n",
"Tokenize each element of R and each S using Jaccard Similarity (whitespace tokens)\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "55e7b5d0",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Tokenized Reference set (R):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of R[0]: {'Ave', 'MA', '77', 'Boston', 'Mass'}"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of R[1]: {'5th', 'Seattle', 'St', 'WA', '02115'}"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of R[2]: {'77', '5th', 'IL', 'St', 'Chicago'}"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"**Tokenized Source sets (S):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[0]: [{'Ave', 'Boston', 'St', 'Mass', '02115'}, {'77', 'Boston', '5th', 'St', 'Mass'}, {'Ave', '77', '5th', 'Mass', '02115'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[1]: [{'Boston', 'MA', '77'}, {'77', 'Boston', '5th', 'St', '02115'}, {'Ave', '77', 'Seattle', 'Mass', '02115'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[2]: [{'Ave', 'MA', '77', 'Boston', '5th', 'Mass'}, {'IL', 'Ave', 'Mass', 'Chicago'}, {'St', 'Ave', 'Mass', '77'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[3]: [{'Ave', 'Mass', '77', 'MA'}, {'5th', 'Seattle', 'St', 'WA', '02115'}, {'77', 'Boston', '5th', 'Seattle', 'St'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizer = Tokenizer(jaccard_similarity, q)\n",
"tokenized_R = tokenizer.tokenize(reference_set)\n",
"tokenized_S = [tokenizer.tokenize(S) for S in source_sets]\n",
"\n",
"display(Markdown(\"**Tokenized Reference set (R):**\"))\n",
"for i, toks in enumerate(tokenized_R):\n",
"    display(Markdown(f\"- Tokens of R[{i}]: {toks}\"))\n",
"\n",
"display(Markdown(\"**Tokenized Source sets (S):**\"))\n",
"for i, toks in enumerate(tokenized_S):\n",
"    display(Markdown(f\"- Tokens of S[{i}]: {toks}\"))"
]
},
{
"cell_type": "markdown",
"id": "e17b807b",
"metadata": {},
"source": [
"### 2. Build Inverted Index\n",
"Builds an inverted index on the tokenized source sets and shows an example lookup."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "22c7d1d6",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Index built over 4 source sets."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Example: token “Mass” appears in [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"index = InvertedIndex(tokenized_S)\n",
"display(Markdown(f\"- Index built over {len(source_sets)} source sets.\"))\n",
"display(Markdown(f\"- Example: token “Mass” appears in {index.get_indexes('Mass')}\"))\n"
]
},
{
"cell_type": "markdown",
"id": "cc17daac",
"metadata": {},
"source": [
"### 3. Signature Generation"
]
},
{
"cell_type": "markdown",
"id": "1c48bac2",
"metadata": {},
"source": [
"Generates the weighted signature for R given δ, α (here α=0), using Jaccard Similarity."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "a36be65c",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Selected signature tokens: **['Chicago', 'WA', 'IL', '5th']**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sig_gen = SignatureGenerator()\n",
"signature = sig_gen.get_signature(\n",
"    tokenized_R, index,\n",
"    delta=δ, alpha=α,\n",
"    sig_type=SigType.WEIGHTED,\n",
"    sim_fun=jaccard_similarity,\n",
"    q=q\n",
")\n",
"display(Markdown(f\"- Selected signature tokens: **{signature}**\"))"
]
},
{
"cell_type": "markdown",
"id": "938be3e2",
"metadata": {},
"source": [
"### 4. Initial Candidate Selection\n",
"\n",
"Looks up each signature token in the inverted index to form the candidate set.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "58017e27",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Candidate set indices: **[0, 1, 2, 3]**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cand_sel = CandidateSelector(\n",
"    similarity_func=jaccard_similarity,\n",
"    sim_metric=contain,\n",
"    related_thresh=δ,\n",
"    sim_thresh=α,\n",
"    q=q\n",
")\n",
"\n",
"initial_cands = cand_sel.get_candidates(signature, index, len(tokenized_R))\n",
"display(Markdown(f\"- Candidate set indices: **{sorted(initial_cands)}**\"))\n",
"for j in sorted(initial_cands):\n",
"    display(Markdown(f\" - S[{j}]: “{' | '.join(source_sets[j])}”\"))"
]
},
{
"cell_type": "markdown",
"id": "d633e5f9",
"metadata": {},
"source": [
"### 5. Check Filter\n",
"Prunes candidates by ensuring each matched element passes the local similarity bound.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "9a2bfdeb",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Surviving after check filter:** **[0, 1, 3]**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"S[0] matched:"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" → Best sim: **0.429** | Matched elements: **1**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"S[1] matched:"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" → Best sim: **0.429** | Matched elements: **1**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"S[3] matched:"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[1] “5th St 02115 Seattle WA” → sim = 1.000"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" → Best sim: **1.000** | Matched elements: **2**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"filtered_cands, match_map = cand_sel.check_filter(\n",
"    tokenized_R, set(signature), initial_cands, index\n",
")\n",
"display(Markdown(f\"**Surviving after check filter:** **{sorted(filtered_cands)}**\"))\n",
"for j in sorted(filtered_cands):\n",
"    display(Markdown(f\"S[{j}] matched:\"))\n",
"    # Look up the matches once; a missing entry means no element matched\n",
"    matches = match_map.get(j, {})\n",
"    for r_idx, sim in matches.items():\n",
"        sim_text = f\"{sim:.3f}\"\n",
"        display(Markdown(f\" • R[{r_idx}] “{reference_set[r_idx]}” → sim = {sim_text}\"))\n",
"\n",
"    if matches:\n",
"        best_sim = max(matches.values())\n",
"        num_matches = len(matches)\n",
"        display(Markdown(f\" → Best sim: **{best_sim:.3f}** | Matched elements: **{num_matches}**\"))\n",
"    else:\n",
"        display(Markdown(\"No elements passed similarity checks.\"))\n"
]
},
{
"cell_type": "markdown",
"id": "cc37bb7f",
"metadata": {},
"source": [
"### 6. Nearest-Neighbor Filter\n",
"\n",
"Further prunes via nearest-neighbor upper bounds on total matching score.\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "aa9b7a63",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Surviving after NN filter: **[3]**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nn_filtered = cand_sel.nn_filter(\n",
"    tokenized_R, set(signature), filtered_cands,\n",
"    index, threshold=δ, match_map=match_map\n",
")\n",
"display(Markdown(f\"- Surviving after NN filter: **{sorted(nn_filtered)}**\"))\n",
"for j in nn_filtered:\n",
"    display(Markdown(f\" - S[{j}]: “{' | '.join(source_sets[j])}”\"))\n"
]
},
{
"cell_type": "markdown",
"id": "8638f83a",
"metadata": {},
"source": [
"### 7. Verification\n",
"\n",
"Runs the bipartite max-matching on the remaining candidates and outputs the final related sets.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "ebdf20fe",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Final related sets (score ≥ 0.7):"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle” → **0.743**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"verifier = Verifier(δ, contain, jaccard_similarity, sim_thresh=α, reduction=False)\n",
"results = verifier.get_related_sets(tokenized_R, nn_filtered, index)\n",
"\n",
"if results:\n",
"    display(Markdown(f\"Final related sets (score ≥ {δ}):\"))\n",
"    for j, score in results:\n",
"        display(Markdown(f\" • S[{j}]: “{' | '.join(source_sets[j])}” → **{score:.3f}**\"))\n",
"else:\n",
"    display(Markdown(\"- No sets passed verification.\"))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "silkmoth_env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

BIN
docs/ImplementationPlan.pdf Normal file

Binary file not shown.

3
docs/README.md Normal file

@@ -0,0 +1,3 @@
The initial draft of the SilkMoth system and process was created using Draw.io. Refer to the file `SilkMoth.drawio` and its exported image, `SilkMoth.png`.
For a detailed implementation plan, refer to `plan.tex` and `ImplementationPlan.pdf`.

406
docs/SilkMoth.drawio Normal file

@@ -0,0 +1,406 @@
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" version="26.2.14">
<diagram name="Page-1" id="a6IaXev5Jbf4Zx6BKyVR">
<mxGraphModel dx="3390" dy="2158" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" background="#ffffff" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="rYVZWEPrfZzp95ZC9z8C-159" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-1" target="rYVZWEPrfZzp95ZC9z8C-3">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="280" y="265" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-1" value="&lt;i&gt;R&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="196.75" y="30" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-153" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-2" target="rYVZWEPrfZzp95ZC9z8C-152">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-2" value="&lt;i&gt;S&lt;/i&gt; = {S1, S2, S3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-686" y="245" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-38" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=1;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-3" target="rYVZWEPrfZzp95ZC9z8C-36">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-3" value="Tokenize R" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="27.5" y="240" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-131" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-140">
<mxGeometry relative="1" as="geometry">
<mxPoint x="29.40000000000009" y="-149.99999999999977" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-132" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-136">
<mxGeometry relative="1" as="geometry">
<mxPoint x="198.29999999999973" y="-149.99999999999977" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-6" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="100" y="-170" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-167" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-14" target="rYVZWEPrfZzp95ZC9z8C-164">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-14" value="relatedness&amp;nbsp;&lt;div&gt;threshold &lt;span class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.6944em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0379em;&quot; class=&quot;mord mathnormal&quot;&gt;δ&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-511" y="675" width="200" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-155" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-22" target="rYVZWEPrfZzp95ZC9z8C-26">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-156" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-22" target="rYVZWEPrfZzp95ZC9z8C-24">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-22" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="100" y="-370" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-178" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-24" target="rYVZWEPrfZzp95ZC9z8C-152">
<mxGeometry relative="1" as="geometry">
<mxPoint x="-430" y="140" as="targetPoint" />
<Array as="points">
<mxPoint x="-390" y="-350" />
<mxPoint x="-390" y="40" />
<mxPoint x="-410" y="40" />
<mxPoint x="-410" y="70" />
<mxPoint x="-394" y="70" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-180" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-24" target="rYVZWEPrfZzp95ZC9z8C-6">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="-60" y="-220" />
<mxPoint x="120" y="-220" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-24" value="Jaccard&lt;div&gt;(whitespace words)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-144.5" y="-370" width="180" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-179" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-26" target="rYVZWEPrfZzp95ZC9z8C-152">
<mxGeometry relative="1" as="geometry">
<mxPoint x="-420" y="190" as="targetPoint" />
<Array as="points">
<mxPoint x="282" y="-400" />
<mxPoint x="-390" y="-400" />
<mxPoint x="-390" y="40" />
<mxPoint x="-410" y="40" />
<mxPoint x="-410" y="70" />
<mxPoint x="-394" y="70" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-181" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-26" target="rYVZWEPrfZzp95ZC9z8C-6">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="282" y="-220" />
<mxPoint x="120" y="-220" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-26" value=" Edit Similarity&lt;div&gt;(q-gram)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="196.75" y="-370" width="170" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-28" value="similarity&amp;nbsp;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot;&gt;threshold&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot; class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.4306em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0037em;&quot; class=&quot;mord mathnormal&quot;&gt;α&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot; class=&quot;katex&quot;&gt;&lt;span style=&quot;margin-right: 0.0037em;&quot; class=&quot;mord mathnormal&quot;&gt;baseline = 0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-291" y="-370" width="190" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-40" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-36" target="rYVZWEPrfZzp95ZC9z8C-39">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-168" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-36" target="rYVZWEPrfZzp95ZC9z8C-47">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-36" value="R Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="22.5" y="340" width="165" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-45" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-39" target="rYVZWEPrfZzp95ZC9z8C-44">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-39" value="Inverted Index Creation" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="27.5" y="505" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-69" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-44" target="rYVZWEPrfZzp95ZC9z8C-67">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-44" value="Inverted Index" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="320" y="500" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-170" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-47" target="rYVZWEPrfZzp95ZC9z8C-169">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-47" value="Signature Generation R&lt;div&gt;(weighted)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="280" y="335" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-68" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-63" target="rYVZWEPrfZzp95ZC9z8C-67">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-63" value="S Signatures" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="320" y="665" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-154" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-66" target="rYVZWEPrfZzp95ZC9z8C-22">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-66" value="Start" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.start_2;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="90" y="-530" width="60" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-67" value="Candidate Selection" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="555" y="670" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-71" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;Candidates&lt;div&gt;&lt;br&gt;&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="800" y="665" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-72" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-67" target="rYVZWEPrfZzp95ZC9z8C-71">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-103" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73" target="rYVZWEPrfZzp95ZC9z8C-87">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-73" value="Check Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="1110" y="670" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-100" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-73">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-105" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-100">
<mxGeometry x="-0.2471" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-107" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-106">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-108" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-107">
<mxGeometry x="-0.3013" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-77" value="Refinement" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="960" y="655" width="80" height="80" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-109" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-106">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="1398" y="835" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-87" value="NN Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="1320" y="670" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-99" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-71" target="rYVZWEPrfZzp95ZC9z8C-77">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-116" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-106" target="rYVZWEPrfZzp95ZC9z8C-115">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-106" value="Verification" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="922.5" y="810" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-110" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-71">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1350" y="620" as="sourcePoint" />
<mxPoint x="1000" y="600" as="targetPoint" />
<Array as="points">
<mxPoint x="1398" y="600" />
<mxPoint x="845" y="600" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-114" value="update" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-110">
<mxGeometry x="0.5022" y="-2" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-113" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1180" y="670" as="sourcePoint" />
<mxPoint x="1188" y="600" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-118" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-117">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-121" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-118">
<mxGeometry x="0.0133" y="-3" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-122" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-119">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-123" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-122">
<mxGeometry x="-0.2333" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-115" value="use triangle optimization" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="955" y="920" width="90" height="100" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-124" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-117" target="rYVZWEPrfZzp95ZC9z8C-119">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="1198" y="1095" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-117" value="Triangle Optimization" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="1120" y="945" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-127" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-119" target="rYVZWEPrfZzp95ZC9z8C-126">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-119" value="Create Bipartite Matching Graph" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="931.25" y="1070" width="137.5" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-120" value="" style="endArrow=none;dashed=1;html=1;dashPattern=1 3;strokeWidth=2;rounded=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" target="rYVZWEPrfZzp95ZC9z8C-71">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="930" y="1095" as="sourcePoint" />
<mxPoint x="845" y="730" as="targetPoint" />
<Array as="points">
<mxPoint x="845" y="1095" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-125" value="using" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-120">
<mxGeometry x="-0.2193" y="1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-185" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-126" target="rYVZWEPrfZzp95ZC9z8C-182">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-126" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;Related Sets&lt;/div&gt;&lt;div&gt;(R, S)&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="955" y="1160" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-128" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="965" y="1420" width="70" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-138" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-136" target="rYVZWEPrfZzp95ZC9z8C-1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-136" value="RELATED SET&amp;nbsp;&lt;div&gt;SEARCH (target)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="199.25" y="-175" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-143" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-139" target="rYVZWEPrfZzp95ZC9z8C-142">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-144" value="x = 1" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-143">
<mxGeometry x="-0.24" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-139" value="&lt;i&gt;R&lt;/i&gt; = {R1, R2, R3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-115" y="-70" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-141" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-140" target="rYVZWEPrfZzp95ZC9z8C-139">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-140" value="RELATED SET&amp;nbsp;&lt;div&gt;DISCOVERY (general)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-110" y="-175" width="150" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-146" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-142" target="rYVZWEPrfZzp95ZC9z8C-145">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-142" value="Take SET Rx" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-87.5" y="30" width="105" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-160" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-145" target="rYVZWEPrfZzp95ZC9z8C-3">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-145" value="&lt;i&gt;Rx&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-115" y="120" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-149" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-142">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-191" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-149">
<mxGeometry x="-0.4171" y="2" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-148" value="x &amp;lt;= R.length" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-260" y="20" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-150" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-250" y="120" width="70" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-151" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-150">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-192" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-151">
<mxGeometry x="-0.1457" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-162" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-152" target="rYVZWEPrfZzp95ZC9z8C-161">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-152" value="Tokenize S" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-471" y="240" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-163" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-161" target="rYVZWEPrfZzp95ZC9z8C-39">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="-179" y="530" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-165" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-161" target="rYVZWEPrfZzp95ZC9z8C-164">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-161" value="S Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-476" y="340" width="165" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-164" value="Signature Generation S&lt;div&gt;(weighted)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-256" y="670" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-166" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-164" target="rYVZWEPrfZzp95ZC9z8C-63">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-171" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-169" target="rYVZWEPrfZzp95ZC9z8C-67">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-169" value="R Signature" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="550" y="340" width="165" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-173" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=1;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-172" target="rYVZWEPrfZzp95ZC9z8C-47">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-172" value="relatedness&amp;nbsp;&lt;div&gt;threshold &lt;span class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.6944em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0379em;&quot; class=&quot;mord mathnormal&quot;&gt;δ&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="257.5" y="404" width="200" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-187" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-186">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-190" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-187">
<mxGeometry x="-0.184" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-182" value="DISCOVERY&lt;div&gt;Mode&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="945" y="1270" width="110" height="80" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-183" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-128">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-184" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-183">
<mxGeometry x="-0.1257" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-188" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-186" target="rYVZWEPrfZzp95ZC9z8C-148">
<mxGeometry relative="1" as="geometry">
<mxPoint x="-760" y="50" as="targetPoint" />
<Array as="points">
<mxPoint x="-759" y="1310" />
<mxPoint x="-759" y="50" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-186" value="Increment x" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="715" y="1285" width="137.5" height="50" as="geometry" />
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>

BIN
docs/SilkMoth.png Normal file

Binary file not shown.

Size: 250 KiB

494
docs/SilkMoth_v2.drawio Normal file
@@ -0,0 +1,494 @@
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36" version="24.8.6">
<diagram name="Page-1" id="a6IaXev5Jbf4Zx6BKyVR">
<mxGraphModel dx="3785" dy="2313" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" background="#ffffff" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-120" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-1" target="rYVZWEPrfZzp95ZC9z8C-3">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="280" y="150" />
<mxPoint x="111" y="150" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-1" value="&lt;i&gt;R&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="190.75" y="-80" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-122" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-3">
<mxGeometry relative="1" as="geometry">
<mxPoint x="110.5" y="340" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-3" value="Tokenize R" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="33" y="240" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-131" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-140" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="33.40000000000009" y="-354.9999999999998" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-132" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-136" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="202.29999999999973" y="-354.9999999999998" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-6" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="104" y="-375" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-123" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-36">
<mxGeometry relative="1" as="geometry">
<mxPoint x="105.5" y="420" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-36" value="R Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="23" y="343" width="165" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-102" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-66" target="W6bMp2RoBO1kHS_2JlRQ-100">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-66" value="Start" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.start_2;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="90" y="-610" width="60" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-67" value="Candidate Selection" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1002" y="99" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-71" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;Candidates&lt;div&gt;&lt;br&gt;&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1395" y="-45" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-103" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73" target="rYVZWEPrfZzp95ZC9z8C-87" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-73" value="Check Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1705" y="-40" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-100" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-73" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-105" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-100" vertex="1" connectable="0">
<mxGeometry x="-0.2471" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-107" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-106" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-108" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-107" vertex="1" connectable="0">
<mxGeometry x="-0.3013" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-77" value="Refinement" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1555" y="-55" width="80" height="80" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-109" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-106" edge="1">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="1993" y="125" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-87" value="NN Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1915" y="-40" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-99" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-71" target="rYVZWEPrfZzp95ZC9z8C-77" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-116" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-106" target="rYVZWEPrfZzp95ZC9z8C-115" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-106" value="Verification" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1517.5" y="100" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-110" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-71" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1945" y="-90" as="sourcePoint" />
<mxPoint x="1595" y="-110" as="targetPoint" />
<Array as="points">
<mxPoint x="1993" y="-110" />
<mxPoint x="1440" y="-110" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-114" value="update" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-110" vertex="1" connectable="0">
<mxGeometry x="0.5022" y="-2" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-113" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1775" y="-40" as="sourcePoint" />
<mxPoint x="1783" y="-110" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-118" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-117" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-121" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-118" vertex="1" connectable="0">
<mxGeometry x="0.0133" y="-3" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-122" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-119" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-123" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-122" vertex="1" connectable="0">
<mxGeometry x="-0.2333" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-115" value="Use triangle optimization?" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1550" y="210" width="90" height="100" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-124" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-117" target="rYVZWEPrfZzp95ZC9z8C-119" edge="1">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="1793" y="385" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-117" value="Triangle Optimization" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1715" y="235" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-63" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-119" target="W6bMp2RoBO1kHS_2JlRQ-64">
<mxGeometry relative="1" as="geometry">
<mxPoint x="1595" y="460" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-119" value="Create Bipartite Matching Graph" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1526.25" y="360" width="137.5" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-120" value="" style="endArrow=none;dashed=1;html=1;dashPattern=1 3;strokeWidth=2;rounded=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" target="rYVZWEPrfZzp95ZC9z8C-71" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1525" y="385" as="sourcePoint" />
<mxPoint x="1440" y="20" as="targetPoint" />
<Array as="points">
<mxPoint x="1440" y="385" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-125" value="using" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-120" vertex="1" connectable="0">
<mxGeometry x="-0.2193" y="1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-185" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-126" target="rYVZWEPrfZzp95ZC9z8C-182" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-126" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;Related SETS&lt;/div&gt;&lt;div&gt;(R,S)&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1550" y="600" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-128" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1560" y="860" width="70" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-83" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-136" target="W6bMp2RoBO1kHS_2JlRQ-82">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-136" value="RELATED SET&amp;nbsp;&lt;div&gt;SEARCH (target)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="203.25" y="-380" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-143" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-139" target="rYVZWEPrfZzp95ZC9z8C-142" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-144" value="x = 1" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-143" vertex="1" connectable="0">
<mxGeometry x="-0.24" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-139" value="&lt;i&gt;R&lt;/i&gt; = {R1, R2, R3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="-121" y="-180" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-85" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-140" target="W6bMp2RoBO1kHS_2JlRQ-81">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-140" value="RELATED SET&amp;nbsp;&lt;div&gt;DISCOVERY (general)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="-106" y="-380" width="150" height="50" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-146" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-142" target="rYVZWEPrfZzp95ZC9z8C-145" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-142" value="Take SET Rx" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="-93.5" y="-80" width="105" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-119" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-145" target="rYVZWEPrfZzp95ZC9z8C-3">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-145" value="&lt;i&gt;Rx&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="-121" y="10" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-149" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-142" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-191" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-149" vertex="1" connectable="0">
<mxGeometry x="-0.4171" y="2" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-148" value="x &amp;lt;= R.length" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="-266" y="-90" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-150" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="-256" y="10" width="70" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-151" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-150" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-192" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-151" vertex="1" connectable="0">
<mxGeometry x="-0.1457" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-136" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.25;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-169" target="rYVZWEPrfZzp95ZC9z8C-67">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-169" value="R Signature" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="694.08" y="230" width="165" height="40" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-187" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-186" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-190" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-187" vertex="1" connectable="0">
<mxGeometry x="-0.184" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-182" value="DISCOVERY&lt;div&gt;Mode&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1540" y="710" width="110" height="80" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-183" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-128" edge="1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-184" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-183" vertex="1" connectable="0">
<mxGeometry x="-0.1257" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-155" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-186" target="rYVZWEPrfZzp95ZC9z8C-148">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="-610" y="750" />
<mxPoint x="-610" y="-60" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="rYVZWEPrfZzp95ZC9z8C-186" value="Increment x" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
<mxGeometry x="1310" y="725" width="137.5" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-139" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-23">
<mxGeometry relative="1" as="geometry">
<mxPoint x="270.75000000000045" y="530" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-141" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-139">
<mxGeometry x="-0.2758" y="-1" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-152" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-23" target="rYVZWEPrfZzp95ZC9z8C-169">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="271" y="250" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-153" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-152">
<mxGeometry x="-0.873" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-23" value="&lt;font style=&quot;font-size: 8px;&quot;&gt;α = 0?&lt;/font&gt;" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="241.25" y="418" width="59" height="60" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-124" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-25">
<mxGeometry relative="1" as="geometry">
<mxPoint x="240" y="448" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-25" value="Weighted Signature Generation R" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="23.5" y="423" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-31" value="Sim-thresh Signature Scheme" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="195.75" y="534" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-143" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-41" target="W6bMp2RoBO1kHS_2JlRQ-54">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-144" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-143">
<mxGeometry x="0.1529" y="2" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-149" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-41" target="W6bMp2RoBO1kHS_2JlRQ-146">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="550" y="345" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-150" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-149">
<mxGeometry x="-0.8299" y="2" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-41" value="&lt;font style=&quot;font-size: 9px;&quot;&gt;Optimization?&lt;/font&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="510.00000000000006" y="520" width="80" height="80" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-147" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-47">
<mxGeometry relative="1" as="geometry">
<mxPoint x="771.5799999999999" y="370" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-47" value="Dichotomy Signature Scheme" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="694.08" y="440" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-148" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-53" target="W6bMp2RoBO1kHS_2JlRQ-146">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="910" y="345" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-53" value="Skyline Signature Scheme" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="860" y="535" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-59" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-54" target="W6bMp2RoBO1kHS_2JlRQ-53">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-145" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-54" target="W6bMp2RoBO1kHS_2JlRQ-47">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-54" value="OR" style="rhombus;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="745.33" y="534.5" width="52.5" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-64" value="&amp;nbsp;&lt;font style=&quot;font-size: 9px;&quot;&gt;relatedness&amp;nbsp;&lt;/font&gt;&lt;div&gt;&lt;font style=&quot;font-size: 9px;&quot;&gt;≥ δ&lt;/font&gt;&lt;/div&gt;" style="rhombus;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="1555" y="450" width="80" height="80" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-65" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-64" target="rYVZWEPrfZzp95ZC9z8C-126">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-67" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-65">
<mxGeometry x="-0.1543" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-68" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-64" target="rYVZWEPrfZzp95ZC9z8C-128">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="1689" y="490" />
<mxPoint x="1689" y="880" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-69" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-68">
<mxGeometry x="0.4329" y="-3" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-114" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-78">
<mxGeometry relative="1" as="geometry">
<mxPoint x="490" y="-160" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-78" value="&lt;i&gt;S&lt;/i&gt; = {S1, S2, S3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="39" y="-180" width="160" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-90" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-81" target="W6bMp2RoBO1kHS_2JlRQ-78">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-81" value="AND" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-51" y="-290" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-89" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-82" target="W6bMp2RoBO1kHS_2JlRQ-78">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-82" value="AND" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="260.75" y="-290" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-93" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.563;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-82" target="rYVZWEPrfZzp95ZC9z8C-1">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-94" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.563;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-81" target="rYVZWEPrfZzp95ZC9z8C-139">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-103" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-95" target="rYVZWEPrfZzp95ZC9z8C-6">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="-50" y="-420" />
<mxPoint x="124" y="-420" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-95" value="Jaccard&lt;div&gt;(whitespace words)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-130.5" y="-490" width="180" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-104" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-96" target="rYVZWEPrfZzp95ZC9z8C-6">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="306" y="-420" />
<mxPoint x="124" y="-420" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-96" value="Edit Similarity&lt;div&gt;(q-gram)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="221" y="-490" width="170" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-97" value="similarity threshold α&lt;div&gt;baseline = 0&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="-277" y="-490" width="190" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-98" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-100" target="W6bMp2RoBO1kHS_2JlRQ-96">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-99" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-100" target="W6bMp2RoBO1kHS_2JlRQ-95">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-100" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="100" y="-490" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-108" value="Inverted Index Creation" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="500" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-137" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-109">
<mxGeometry relative="1" as="geometry">
<mxPoint x="1000.0000000000005" y="130" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-109" value="Inverted Index" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="532.5" y="100" width="90" height="60" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-115" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-111">
<mxGeometry relative="1" as="geometry">
<mxPoint x="568.5" y="-100" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-111" value="Tokenize S" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="491" y="-189" width="155" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-112" value="S Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
<mxGeometry x="490" y="-99" width="165" height="40" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-117" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.462;entryY=-0.033;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-112" target="W6bMp2RoBO1kHS_2JlRQ-108">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-118" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-108" target="W6bMp2RoBO1kHS_2JlRQ-109">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-138" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-31" target="W6bMp2RoBO1kHS_2JlRQ-41">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-146" value="" style="rhombus;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="745.33" y="320" width="50" height="50" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-151" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.54;entryY=-0.088;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-146" target="rYVZWEPrfZzp95ZC9z8C-169">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-154" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-67" target="rYVZWEPrfZzp95ZC9z8C-71">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="1090" y="-15" />
</Array>
</mxGeometry>
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>

BIN
docs/SilkMoth_v2.png Normal file

Binary file not shown.


BIN
docs/figures/Pipeline.png Normal file

Binary file not shown.


99
docs/plan.tex Normal file

@@ -0,0 +1,99 @@
\documentclass[a4paper]{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{pgfgantt}
\usepackage{hyperref}
\title{Implementation Plan - Student Project SilkMoth}
\date{April 2025}
\begin{document}
\maketitle
Figure~\ref{fig:plan} shows a more detailed version of our initial project plan. Some tasks may take longer or finish earlier than the plan assumes, and we will adjust the plan as our resources require. We aim to parallelize the implementation tasks whenever possible. The project is split into three phases:
\begin{enumerate}
\item \textbf{(17.04 -- 15.05)} - Core Pipeline
\begin{itemize}
\item Get a common understanding of the system
\item Implement the main components without major optimization
\item Prepare a small data set to test correctness and larger data sets for the evaluation phase
\item Goal: Runnable code for at least the base case (single search pass, similarity threshold $\alpha = 0$, similarity function $\phi = \texttt{Jac}$)
\end{itemize}
\item \textbf{(16.05 -- 12.06)} - Extended Framework
\begin{itemize}
\item Improve the core pipeline
\item Refinement and optimization
\item Support for discovery mode, $\alpha \neq 0$, $\phi = \texttt{Eds}$, and $\phi = \texttt{NEds}$
\item Goal: Most features should be finalized and ready for expert review
\end{itemize}
\item \textbf{(13.06 -- 24.07)} - Evaluation
\begin{itemize}
\item Improve the system based on the review feedback and finalize the remaining functionality
\item Implement the applications to conduct experiments
\item Visualize experiment results
\item Write report/documentation
\item Consider bonus improvements, e.g., additional data sets such as GitTables\footnote{\url{https://gittables.github.io/}} or additional similarity functions such as Hamming similarity\footnote{\url{https://en.wikipedia.org/wiki/Hamming_distance}}
\item Goal: Presentation and submission of the final system
\end{itemize}
\end{enumerate}
\begin{figure}[b!]
\begin{ganttchart}[
vgrid, hgrid,
x unit=0.5cm,
y unit title=0.75cm,
y unit chart=0.5cm,
title height=1,
milestone left shift=.1,
milestone right shift=-.1,
group left shift=0,
group right shift=0,
group peaks tip position=0,
group peaks height=0.2,
title label font=\small,
bar label font=\small,
group label font=\small\bfseries,
milestone label font=\small\itshape,
]{1}{14}
\gantttitle[]{Project Plan [weeks]}{14} \\
\gantttitlelist{1,...,14}{1} \\
\ganttgroup{Milestone 1: Core Pipeline}{1}{4} \\
\ganttbar{Understand SilkMoth}{1}{1} \\
\ganttbar{System design of core pipeline}{2}{2} \\
\ganttbar{Data collection/preparation}{2}{4} \\
\ganttbar{Tokenizer}{3}{4} \\
\ganttbar{Inverted Index}{3}{4} \\
\ganttbar{Signature Generator}{3}{4} \\
\ganttbar{Maximum Matching Verification}{3}{4} \\
\ganttmilestone{Milestone 1 done}{4} \\
\ganttgroup{Milestone 2: Extended Framework}{5}{8} \\
\ganttbar{Discovery Mode}{5}{6} \\
\ganttbar{Check Filter}{5}{6} \\
\ganttbar{Nearest Neighbor Filter}{6}{7} \\
\ganttbar{Triangle Optimization}{6}{7} \\
\ganttbar{Support for $\alpha \neq 0$}{6}{8}\\
\ganttbar{Edit Similarity}{7}{8}\\
\ganttbar{Prepare for Experiments}{7}{8}\\
\ganttbar{Prepare for expert review}{8}{8} \\
\ganttmilestone{Milestone 2 done}{8} \\
\ganttgroup{Milestone 3: Evaluation}{9}{14} \\
\ganttbar{Improve system using feedback}{9}{9} \\
\ganttbar{Experiments: Inclusion Dependency}{9}{12} \\
\ganttbar{Experiments: String Matching}{9}{12} \\
\ganttbar{Experiments: Schema Matching}{9}{12} \\
\ganttbar{(Bonus)}{11}{12} \\
\ganttbar[bar/.append style={fill=gray, solid}]{Finalize Visualization and Documentation}{12}{14} \\
\ganttbar[bar/.append style={fill=gray, solid}]{Preparing presentation}{13}{14} \\
\ganttmilestone{Milestone 3 done}{14} \\
\ganttmilestone{Project done}{14}
\end{ganttchart}
\caption{Implementation plan. Week 1 starts on 17.04.2025.}
\label{fig:plan}
\end{figure}
\end{document}

8
docu/README.md Normal file

@@ -0,0 +1,8 @@
### Generating the Documentation Page
To generate a [documentation page](https://berscjak.github.io/) from the source code with mkdocs, run the following from the repository root directory:
```
pip install mkdocs "mkdocstrings[python]" mkdocs-awesome-pages-plugin
mkdocs serve
```

823
docu/demo_example.ipynb Normal file

@@ -0,0 +1,823 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c9f89a47",
"metadata": {},
"source": [
"## SilkMoth Demo"
]
},
{
"cell_type": "markdown",
"id": "2ca15800",
"metadata": {},
"source": [
"### Related Set Discovery task under SetContainment using Jaccard Similarity"
]
},
{
"cell_type": "markdown",
"id": "ea6ce5fb",
"metadata": {},
"source": [
"Import of all required modules:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "bdd1b92c",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"src\")\n",
"\n",
"from silkmoth.tokenizer import Tokenizer\n",
"from silkmoth.inverted_index import InvertedIndex\n",
"from silkmoth.signature_generator import SignatureGenerator\n",
"from silkmoth.candidate_selector import CandidateSelector\n",
"from silkmoth.verifier import Verifier\n",
"from silkmoth.silkmoth_engine import SilkMothEngine\n",
"\n",
"\n",
"from silkmoth.utils import jaccard_similarity, contain, edit_similarity, similar, SigType\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from IPython.display import display, Markdown\n",
"\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "bf6bf1f5",
"metadata": {},
"source": [
"Define example related dataset from \"SilkMoth\" paper (reference set **R** and source sets **S**)\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "598a4bbf",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Reference set (R):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- R[0]: “77 Mass Ave Boston MA”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- R[1]: “5th St 02115 Seattle WA”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- R[2]: “77 5th St Chicago IL”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"**Source sets (S):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Location Dataset\n",
"reference_set = [\n",
" '77 Mass Ave Boston MA',\n",
" '5th St 02115 Seattle WA',\n",
" '77 5th St Chicago IL'\n",
"]\n",
"\n",
"# Address Dataset\n",
"source_sets = [\n",
" ['Mass Ave St Boston 02115','77 Mass 5th St Boston','77 Mass Ave 5th 02115'],\n",
" ['77 Boston MA','77 5th St Boston 02115','77 Mass Ave 02115 Seattle'],\n",
" ['77 Mass Ave 5th Boston MA','Mass Ave Chicago IL','77 Mass Ave St'],\n",
" ['77 Mass Ave MA','5th St 02115 Seattle WA','77 5th St Boston Seattle']\n",
"]\n",
"\n",
"# thresholds & q\n",
"δ = 0.7\n",
"α = 0.0\n",
"q = 3\n",
"\n",
"display(Markdown(\"**Reference set (R):**\"))\n",
"for i, r in enumerate(reference_set):\n",
" display(Markdown(f\"- R[{i}]: “{r}”\"))\n",
"display(Markdown(\"**Source sets (S):**\"))\n",
"for j, S in enumerate(source_sets):\n",
" display(Markdown(f\"- S[{j}]: “{' | '.join(S)}”\"))"
]
},
{
"cell_type": "markdown",
"id": "a50b350a",
"metadata": {},
"source": [
"### 1. Tokenization\n",
"Tokenize each element of R and each S using Jaccard Similarity (whitespace tokens)\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "55e7b5d0",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Tokenized Reference set (R):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of R[0]: {'Ave', 'MA', '77', 'Boston', 'Mass'}"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of R[1]: {'5th', 'Seattle', 'St', 'WA', '02115'}"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of R[2]: {'77', '5th', 'IL', 'St', 'Chicago'}"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"**Tokenized Source sets (S):**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[0]: [{'Ave', 'Boston', 'St', 'Mass', '02115'}, {'77', 'Boston', '5th', 'St', 'Mass'}, {'Ave', '77', '5th', 'Mass', '02115'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[1]: [{'Boston', 'MA', '77'}, {'77', 'Boston', '5th', 'St', '02115'}, {'Ave', '77', 'Seattle', 'Mass', '02115'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[2]: [{'Ave', 'MA', '77', 'Boston', '5th', 'Mass'}, {'IL', 'Ave', 'Mass', 'Chicago'}, {'St', 'Ave', 'Mass', '77'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Tokens of S[3]: [{'Ave', 'Mass', '77', 'MA'}, {'5th', 'Seattle', 'St', 'WA', '02115'}, {'77', 'Boston', '5th', 'Seattle', 'St'}]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizer = Tokenizer(jaccard_similarity, q)\n",
"tokenized_R = tokenizer.tokenize(reference_set)\n",
"tokenized_S = [tokenizer.tokenize(S) for S in source_sets]\n",
"\n",
"display(Markdown(\"**Tokenized Reference set (R):**\"))\n",
"for i, toks in enumerate(tokenized_R):\n",
" display(Markdown(f\"- Tokens of R[{i}]: {toks}\"))\n",
"\n",
"display(Markdown(\"**Tokenized Source sets (S):**\"))\n",
"for i, toks in enumerate(tokenized_S):\n",
" display(Markdown(f\"- Tokens of S[{i}]: {toks}\"))"
]
},
{
"cell_type": "markdown",
"id": "e17b807b",
"metadata": {},
"source": [
"### 2. Build Inverted Index\n",
"Builds an inverted index on the tokenized source sets and shows an example lookup."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "22c7d1d6",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Index built over 4 source sets."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"- Example: token “Mass” appears in [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"index = InvertedIndex(tokenized_S)\n",
"display(Markdown(f\"- Index built over {len(source_sets)} source sets.\"))\n",
"display(Markdown(f\"- Example: token “Mass” appears in {index.get_indexes('Mass')}\"))\n"
]
},
{
"cell_type": "markdown",
"id": "cc17daac",
"metadata": {},
"source": [
"### 3. Signature Generation"
]
},
{
"cell_type": "markdown",
"id": "1c48bac2",
"metadata": {},
"source": [
"Generates the weighted signature for R given δ, α (here α=0), using Jaccard Similarity."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "a36be65c",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Selected signature tokens: **['Chicago', 'WA', 'IL', '5th']**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sig_gen = SignatureGenerator()\n",
"signature = sig_gen.get_signature(\n",
" tokenized_R, index,\n",
" delta=δ, alpha=α,\n",
" sig_type=SigType.WEIGHTED,\n",
" sim_fun=jaccard_similarity,\n",
" q=q\n",
")\n",
"display(Markdown(f\"- Selected signature tokens: **{signature}**\"))"
]
},
{
"cell_type": "markdown",
"id": "938be3e2",
"metadata": {},
"source": [
"### 4. Initial Candidate Selection\n",
"\n",
"Looks up each signature token in the inverted index to form the candidate set.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "58017e27",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Candidate set indices: **[0, 1, 2, 3]**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cand_sel = CandidateSelector(\n",
" similarity_func=jaccard_similarity,\n",
" sim_metric=contain,\n",
" related_thresh=δ,\n",
" sim_thresh=α,\n",
" q=q\n",
")\n",
"\n",
"initial_cands = cand_sel.get_candidates(signature, index, len(tokenized_R))\n",
"display(Markdown(f\"- Candidate set indices: **{sorted(initial_cands)}**\"))\n",
"for j in sorted(initial_cands):\n",
" display(Markdown(f\" - S[{j}]: “{' | '.join(source_sets[j])}”\"))"
]
},
{
"cell_type": "markdown",
"id": "d633e5f9",
"metadata": {},
"source": [
"### 5. Check Filter\n",
"Prunes candidates by ensuring each matched element passes the local similarity bound.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "9a2bfdeb",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Surviving after check filter:** **[0, 1, 3]**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"S[0] matched:"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" → Best sim: **0.429** | Matched elements: **1**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"S[1] matched:"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" → Best sim: **0.429** | Matched elements: **1**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"S[3] matched:"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[1] “5th St 02115 Seattle WA” → sim = 1.000"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" → Best sim: **1.000** | Matched elements: **2**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"filtered_cands, match_map = cand_sel.check_filter(\n",
" tokenized_R, set(signature), initial_cands, index\n",
")\n",
"display(Markdown(f\"**Surviving after check filter:** **{sorted(filtered_cands)}**\"))\n",
"for j in sorted(filtered_cands):\n",
" display(Markdown(f\"S[{j}] matched:\"))\n",
" for r_idx, sim in match_map[j].items():\n",
" sim_text = f\"{sim:.3f}\"\n",
" display(Markdown(f\" • R[{r_idx}] “{reference_set[r_idx]}” → sim = {sim_text}\"))\n",
" \n",
" matches = match_map.get(j, {})\n",
" if matches:\n",
" best_sim = max(matches.values())\n",
" num_matches = len(matches)\n",
" display(Markdown(f\" → Best sim: **{best_sim:.3f}** | Matched elements: **{num_matches}**\"))\n",
" else:\n",
" display(Markdown(f\"No elements passed similarity checks.\"))\n"
]
},
{
"cell_type": "markdown",
"id": "cc37bb7f",
"metadata": {},
"source": [
"### 6. NearestNeighbor Filter\n",
"\n",
"Further prunes via nearestneighbor upper bounds on total matching score.\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "aa9b7a63",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"- Surviving after NN filter: **[3]**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nn_filtered = cand_sel.nn_filter(\n",
" tokenized_R, set(signature), filtered_cands,\n",
" index, threshold=δ, match_map=match_map\n",
")\n",
"display(Markdown(f\"- Surviving after NN filter: **{sorted(nn_filtered)}**\"))\n",
"for j in nn_filtered:\n",
" display(Markdown(f\" - S[{j}]: “{' | '.join(source_sets[j])}”\"))\n"
]
},
{
"cell_type": "markdown",
"id": "8638f83a",
"metadata": {},
"source": [
"### 7. Verification\n",
"\n",
"Runs the bipartite maxmatching on the remaining candidates and outputs the final related sets.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "ebdf20fe",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Final related sets (score ≥ 0.7):"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
" • S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle” → **0.743**"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"verifier = Verifier(δ, contain, jaccard_similarity, sim_thresh=α, reduction=False)\n",
"results = verifier.get_related_sets(tokenized_R, nn_filtered, index)\n",
"\n",
"if results:\n",
" display(Markdown(f\"Final related sets (score ≥ {δ}):\"))\n",
" for j, score in results:\n",
" display(Markdown(f\" • S[{j}]: “{' | '.join(source_sets[j])}” → **{score:.3f}**\"))\n",
"else:\n",
" display(Markdown(\"- No sets passed verification.\"))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "silkmoth_env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

155
docu/experiments/README.md Normal file

@@ -0,0 +1,155 @@
### 🧪 Running the Experiments
This project includes multiple experiments to evaluate the performance and accuracy of our Python implementation of **SilkMoth**.
---
#### 📊 1. Experiment Types
You can replicate and customize the following types of experiments using different configurations (e.g., filters, signature strategies, reduction techniques):
- **String Matching (DBLP Publication Titles)**
- **Schema Matching (WebTables)**
- **Inclusion Dependency Discovery (WebTable Columns)**
Detailed descriptions of each experiment can be found in the original SilkMoth paper.
---
#### 📦 2. WebSchema Inclusion Dependency Setup
To run the **WebSchema + Inclusion Dependency** experiments:
1. Download the pre-extracted dataset from
[📥 this link](https://tubcloud.tu-berlin.de/s/D4ngEfdn3cJ3pxF).
2. Place the `.json` files in the `data/webtables/` directory
*(create the folder if it does not exist)*.
---
#### 🚀 3. Running the Experiments
To execute the core experiments from the paper:
```bash
python run.py
```
### 📈 4. Results Overview
We compared our results with those presented in the original SilkMoth paper.
Although exact reproduction is not possible due to language differences (Python vs C++) and dataset variations, overall **performance trends align well**.
All the results can be found in the folder `results`.
The **left** diagrams are from the paper and the **right** are ours.
> 💡 *Recent performance enhancements leverage `scipy`'s C-accelerated matching, replacing the original `networkx`-based approach.
> Unless otherwise specified, the diagrams shown were generated using the `networkx` implementation.*
---
### 🔍 Inclusion Dependency
> **Goal**: Check if each reference set is contained within source sets.
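
To make the goal concrete, here is a minimal sketch of a containment score in the sense of the SilkMoth paper: the maximum-weight bipartite matching between the elements of R and S, divided by |R|. It is an illustration, not this project's implementation. The function names (`jaccard`, `matching_score`, `containment`, `tokenize`) are hypothetical, and the exhaustive search over permutations stands in for a real matching solver, so it is only practical for tiny sets. The data is the address example from the demo notebook.

```python
from itertools import permutations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_score(ref: list, src: list) -> float:
    """Maximum-weight bipartite matching score, found by exhaustive search."""
    # Pad the source side with empty sets so each reference element has a partner.
    padded = src + [set()] * max(0, len(ref) - len(src))
    return max(
        sum(jaccard(r, padded[j]) for r, j in zip(ref, chosen))
        for chosen in permutations(range(len(padded)), len(ref))
    )


def containment(ref: list, src: list) -> float:
    """Matching score normalised by the number of reference elements."""
    return matching_score(ref, src) / len(ref)


def tokenize(elements: list) -> list:
    """Whitespace tokenization, one token set per element."""
    return [set(e.split()) for e in elements]


R = tokenize(["77 Mass Ave Boston MA", "5th St 02115 Seattle WA", "77 5th St Chicago IL"])
S3 = tokenize(["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"])

score = containment(R, S3)
print(f"{score:.3f}")  # 0.743, the value the demo notebook reports for S[3]
```

With a relatedness threshold of δ = 0.7, this pair counts as related. The experiments avoid computing the matching for most pairs by filtering candidates first.
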
**Filter Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_filter.png" alt="Original Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_filter_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
**Signature Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_sig.png" alt="Original Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_sig_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
**Reduction Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_red.png" alt="Original Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_reduction_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
**Scalability**
<p align="center">
<img src="silkmoth_results/inclusion_dep_scal.png" alt="Original Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_scalability_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
---
### 🔍 Schema Matching (WebTables)
> **Goal**: Detect related set pairs within a single source set.
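
As a rough illustration of this discovery setting, the sketch below compares every unordered pair of sets within one collection and reports the pairs whose set similarity reaches δ. It assumes the set-similarity definition m / (|R| + |S| − m), where m is the maximum matching score, and it checks all pairs exhaustively without any SilkMoth filtering. The toy schemas and all function names are made up for illustration and are not part of this project's API.

```python
from itertools import combinations, permutations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_score(left: list, right: list) -> float:
    """Maximum-weight bipartite matching score, found by exhaustive search."""
    small, large = sorted((left, right), key=len)
    return max(
        sum(jaccard(x, large[j]) for x, j in zip(small, chosen))
        for chosen in permutations(range(len(large)), len(small))
    )


def set_similarity(left: list, right: list) -> float:
    """Matching score m normalised as m / (|left| + |right| - m)."""
    m = matching_score(left, right)
    denom = len(left) + len(right) - m
    return m / denom if denom else 0.0


def discover(collection: list, delta: float) -> list:
    """Return (i, j, score) for every pair with set similarity >= delta."""
    related = []
    for (i, a), (j, b) in combinations(enumerate(collection), 2):
        score = set_similarity(a, b)
        if score >= delta:
            related.append((i, j, score))
    return related


# Toy schemas (hypothetical): each schema is a set of column names.
schemas = [
    ["first name", "last name", "email"],
    ["first name", "last name", "email address"],
    ["product id", "price", "stock"],
]
tokenized = [[set(col.split()) for col in schema] for schema in schemas]

pairs = discover(tokenized, delta=0.7)
for i, j, score in pairs:
    print(f"schemas {i} and {j}: {score:.3f}")
```

Only the first two schemas are reported here. The number of pairs grows quadratically with the collection size, which is why the signature and filter stages matter in the experiments.
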
**Filter Comparison**
<p align="center">
<img src="silkmoth_results/schema_matching_filter.png" alt="Original Result" width="45%" />
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.png" alt="Our Result" width="45%" />
</p>
**Signature Comparison**
<p align="center">
<img src="silkmoth_results/schema_matching_sig.png" alt="Original Result" width="45%" />
<img src="results/schema_matching/schema_matching_sig_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
**Scalability**
<p align="center">
<img src="silkmoth_results/schema_matching_scal.png" alt="Original Result" width="45%" />
<img src="results/schema_matching/schema_matching_scalability_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
---
### 🔍 String Matching (DBLP Publication Titles)
> **Goal**: Detect related titles within the dataset using the extended SilkMoth pipeline
> based on **edit similarity** and **q-gram** tokenization.
> These runs use the SciPy-based matching.
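
The two ingredients named above can be sketched as follows. This is a simplified illustration rather than the project's tokenizer or similarity code: `qgrams`, `levenshtein`, and `normalized_edit_similarity` are hypothetical names, and the normalization 1 − LD / max(|x|, |y|) is one common choice that may differ from the variant used in the experiments.

```python
def qgrams(text: str, q: int = 3) -> list:
    """Overlapping substrings of length q; strings of length <= q yield themselves."""
    if len(text) <= q:
        return [text]
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(a: str, b: str) -> float:
    """1 - LD(a, b) / max(|a|, |b|); two empty strings count as identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(qgrams("silkmoth", 3))
print(levenshtein("kitten", "sitting"))
print(f"{normalized_edit_similarity('kitten', 'sitting'):.3f}")
```

Splitting titles into q-grams lets an inverted index find strings that share substrings even when whole words differ, while the edit similarity scores the surviving pairs.
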
**Filter Comparison**
<p align="center">
<img src="silkmoth_results/string_matching_filter.png" alt="Original Result" width="45%" />
<img src="results/string_matching/10k-set-size/string_matching_filter_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
**Signature Comparison**
<p align="center">
<img src="silkmoth_results/string_matching_sig.png" alt="Original Result" width="45%" />
<img src="results/string_matching/10k-set-size/string_matching_sig_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
**Scalability**
<p align="center">
<img src="silkmoth_results/string_matching_scal.png" alt="Original Result" width="45%" />
<img src="results/string_matching/string_matching_scalability_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
---
### 🔍 Additional: Inclusion Dependency SilkMoth Filter compared with no SilkMoth
> The experiments above compare SilkMoth configurations only. How does SilkMoth compare to a
> brute-force approach that skips the pipeline entirely? The graph below
> shows the filter experiment alongside brute-force bipartite matching without any
> filtering or optimization. The results show a dramatic improvement
> in runtime when using SilkMoth.
<img src="results/inclusion_dependency/inclusion_dependency_filter_combined_raw_experiment_α=0.5.png" alt="WebTables Result" />
---
### 🔍 Additional: Schema Matching with GitHub WebTables
> Similar to Schema Matching, this experiment uses a GitHub WebTable as a fixed reference set and matches it against other sets. The goal is to evaluate SilkMoth's performance across different domains.
**Left:** Matching with one reference set.
**Right:** Matching with WebTable Corpus and GitHub WebTable datasets.
The results show no significant difference, indicating consistent behavior across varying datasets.
<p align="center">
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.5.png" alt="WebTables Result" width="45%" />
<img src="results/schema_matching/github_webtable_schema_matching_experiment_α=0.5.png" alt="GitHub Table Result" width="45%" />
</p>


@@ -0,0 +1,64 @@
import csv

from experiments.utils import plot_elapsed_times

labels = []
elapsed_times = []


def read_csv_add_data(filename, labels, elapsed_times):
    """Collect elapsed times per label from rows with similarity threshold 0.5."""
    with open(filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        times = []
        current_label = None
        for row in reader:
            sim_thresh = float(row[0])
            label = row[4]
            elapsed = float(row[5])
            if sim_thresh == 0.5:
                if current_label != label:
                    # New label group started
                    if times:
                        # Save times of previous label if not empty
                        elapsed_times.append(times)
                    times = [elapsed]
                    current_label = label
                else:
                    times.append(elapsed)
                # When 4 times collected (one per related threshold), append and reset
                if len(times) == 4:
                    elapsed_times.append(times)
                    times = []
                    current_label = None
                if label not in labels:
                    labels.append(label)
        # In case last label times were not appended
        if times:
            elapsed_times.append(times)


# Read first CSV (brute-force matching without the SilkMoth pipeline)
read_csv_add_data('inclusion_dependency/raw_matching_experiment_results.csv', labels, elapsed_times)
# Read second CSV (SilkMoth filter experiment)
read_csv_add_data('inclusion_dependency/inclusion_dependency_filter_experiment_results.csv', labels, elapsed_times)

print("Labels:", labels)
print("Elapsed Times:", elapsed_times)

# Then plot
file_name_prefix = "inclusion_dependency_filter_combined_raw"
folder_path = ""
_ = plot_elapsed_times(
    related_thresholds=[0.7, 0.75, 0.8, 0.85],
    elapsed_times_list=elapsed_times,
    fig_text=f"{file_name_prefix} (α = 0.5)",
    legend_labels=labels,
    file_name=f"{folder_path}{file_name_prefix}_experiment_α=0.5.png"
)

BIN
docu/figures/Pipeline.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 230 KiB

151
docu/index.md Normal file
View File

@@ -0,0 +1,151 @@
# 🦋 LSDIPro SS2025
## 📄 [SilkMoth: An Efficient Method for Finding Related Sets](https://doi.org/10.14778/3115404.3115413)
A project inspired by the SilkMoth paper, exploring efficient techniques for related set discovery.
---
## 👥 Team Members
- **Andreas Wilms**
- **Sarra Daknou**
- **Amina Iqbal**
- **Jakob Berschneider**
---
## 📊 Experiments & Results
➡️ [**See Experiments**](experiments/README.md)
---
## 🧪 Interactive Demo
Follow our **step-by-step Jupyter Notebook demo** for a hands-on understanding of SilkMoth.
📓 [**Open demo_example.ipynb**](demo_example.ipynb)
---
## Table of Contents
- [1. Large Scale Data Integration Project (LSDIPro)](#1-large-scale-data-integration-project-lsdipro)
- [2. What is SilkMoth? 🐛](#2-what-is-silkmoth)
- [3. The Problem 🧩](#3-the-problem)
- [4. SilkMoth's Solution 🚀](#4-silkmoths-solution)
- [5. Core Pipeline Steps 🔁](#5-core-pipeline-steps)
- [5.1 Tokenization](#51-tokenization)
- [5.2 Inverted Index Construction](#52-inverted-index-construction)
- [5.3 Signature Generation](#53-signature-generation)
- [5.4 Candidate Selection](#54-candidate-selection)
- [5.5 Refinement Filters](#55-refinement-filters)
- [5.6 Verification via Maximum Matching](#56-verification-via-maximum-matching)
- [6. Modes of Operation 🧪](#6-modes-of-operation-)
- [7. Supported Similarity Functions 📐](#7-supported-similarity-functions-)
- [8. Installing from Source](#8-installing-from-source)
- [9. Experiment Results](#9-experiment-results)
---
## 1. Large Scale Data Integration Project (LSDIPro)
As part of the university project LSDIPro, our team implemented the SilkMoth paper in Python.
The course focuses on large-scale data integration, where student groups reproduce and extend research prototypes.
The project emphasizes scalable algorithm design, evaluation, and handling heterogeneous data at scale.
---
## 2. What is SilkMoth?
**SilkMoth** is a system designed to efficiently discover related sets in large collections of data, even when the elements within those sets are only approximately similar.
This is especially important in **data integration**, **data cleaning**, and **information retrieval**, where messy or inconsistent data is common.
---
## 3. The Problem
Determining whether two sets are related, for example, whether two database columns should be joined, often involves comparing their elements using **similarity functions** (not just exact matches).
A powerful approach models this as a **bipartite graph** and finds the **maximum matching score** between elements. However, this method is **computationally expensive** (`O(n³)` per pair), making it impractical for large datasets.
---
## 4. SilkMoth's Solution
SilkMoth tackles this with a three-step approach:
1. **Signature Generation**: Creates a compact signature for each set such that any related set must share part of it.
2. **Pruning**: Filters out unrelated sets early, reducing the number of candidates.
3. **Verification**: Applies the costly matching metric only to the remaining candidates, returning the same results as brute force in less time.
---
## 5. Core Pipeline Steps
![Figure 1: SILKMOTH Framework Overview](figures/Pipeline.png)
*Figure 1. SILKMOTH pipeline framework. Source: Deng et al., "SILKMOTH: An Efficient Method for Finding Related Sets with Maximum Matching Constraints", VLDB 2017. Licensed under CC BY-NC-ND 4.0.*
### [5.1 Tokenization](pages/tokenizer.md)
Each element in every set is tokenized based on the selected similarity function:
- **Jaccard Similarity**: Elements are split into whitespace-delimited tokens.
- **Edit Similarity**: Elements are split into overlapping `q`-grams (e.g., 3-grams).
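
The two tokenization modes can be illustrated with a minimal sketch. The function names `whitespace_tokens` and `qgrams` are hypothetical and are not taken from the project's `tokenizer` module; how strings shorter than `q` are handled is an assumption made here for illustration.

```python
def whitespace_tokens(element: str) -> set[str]:
    """Split an element into whitespace-delimited tokens (Jaccard setting)."""
    return set(element.split())


def qgrams(element: str, q: int = 3) -> list[str]:
    """Return the overlapping q-grams of an element (edit-similarity setting)."""
    if q < 1:
        raise ValueError("q must be at least 1")
    if len(element) < q:
        # Assumption: a string shorter than q is kept as a single token.
        return [element] if element else []
    return [element[i:i + q] for i in range(len(element) - q + 1)]


print(sorted(whitespace_tokens("77 Mass Ave")))  # ['77', 'Ave', 'Mass']
print(qgrams("silk", q=3))                       # ['sil', 'ilk']
```
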
### [5.2 Inverted Index Construction](pages/inverted_index.md)
An **inverted index** is built over the collection of source sets `S`, mapping each token to a list of `(set, element)` pairs in which it occurs.
This allows fast lookup of the candidate sets that share tokens with a reference set `R`.
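
As a rough sketch of the idea, assuming whitespace tokenization and plain integer ids for sets and elements (the helper `build_inverted_index` is hypothetical and not the project's `inverted_index` module):

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Map each token to the (set_id, element_id) pairs in which it occurs."""
    index = defaultdict(list)
    for set_id, elements in enumerate(sets):
        for elem_id, element in enumerate(elements):
            # Deduplicate tokens so each element is listed once per token.
            for token in set(element.split()):
                index[token].append((set_id, elem_id))
    return index


indexed_sets = [["77 Mass Ave", "5th St"], ["Mass Ave Boston"]]
index = build_inverted_index(indexed_sets)
print(index["Mass"])  # [(0, 0), (1, 0)]
print(index["St"])    # [(0, 1)]
```
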
### [5.3 Signature Generation](pages/signature_generator.md)
A **signature** is a subset of tokens selected from each set such that:
- Any related set must share at least one signature token.
- Signature size is minimized to reduce candidate space.
Signature selection heuristics (e.g., cost/value greedy ranking) approximate the optimal valid signature, which is NP-complete to compute exactly.
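
To make the validity condition concrete, here is a simplified sketch under stated assumptions: elements are compared with Jaccard similarity on whitespace tokens, no element threshold `α` is applied, and a set counts as related when its matching score reaches `δ · |R|`. If a set shares no token with the signature `K`, each element `r` can contribute at most `|r \ K| / |r|` to the matching score, so `K` is valid when these bounds sum to less than `δ · |R|`. The function name `is_valid_signature` is hypothetical.

```python
def is_valid_signature(reference_set, signature, delta):
    """Return True if every set related to reference_set must share a token with signature."""
    theta = delta * len(reference_set)
    bound = 0.0
    for element in reference_set:
        tokens = set(element.split())
        if tokens:
            # Best Jaccard score this element can reach without any signature token.
            bound += len(tokens - signature) / len(tokens)
    return bound < theta


reference = ["77 Mass Ave", "5th St"]
print(is_valid_signature(reference, {"Mass"}, delta=0.7))         # False
print(is_valid_signature(reference, {"Mass", "5th"}, delta=0.7))  # True
```

A greedy heuristic would keep adding tokens (preferring those with short inverted lists) until this check passes.
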
### [5.4 Candidate Selection](pages/candidate_selector.md)
For each set `R`, retrieve from the inverted index all sets `S` sharing at least one token with `R`'s signature. These become **candidate sets** for further evaluation.
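
The lookup itself is a union over the inverted lists of the signature tokens. A minimal sketch, assuming the index maps tokens to `(set_id, element_id)` pairs (`select_candidates` is a hypothetical helper, not the project's `candidate_selector` API):

```python
def select_candidates(signature, inverted_index):
    """Collect the ids of all sets that contain at least one signature token."""
    candidates = set()
    for token in signature:
        for set_id, _elem_id in inverted_index.get(token, []):
            candidates.add(set_id)
    return candidates


inverted_index = {
    "Mass": [(0, 0), (1, 0)],
    "5th": [(0, 1)],
    "Boston": [(1, 0)],
    "Main": [(2, 0)],
}
print(sorted(select_candidates({"Mass", "5th"}, inverted_index)))  # [0, 1]
```
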
### [5.5 Refinement Filters](pages/candidate_selector.md)
Two filters reduce false positives among candidates:
- **Check Filter**: Uses an upper bound on similarity to eliminate sets below threshold.
- **Nearest Neighbor Filter**: Bounds the maximum matching score from above by summing, for each element in `R`, the similarity to its nearest neighbor in `S`.
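
The nearest neighbor bound can be sketched as follows. Because a matching assigns each element of `R` to at most one element of `S`, the sum of nearest-neighbor similarities can never be smaller than the matching score. The sketch assumes Jaccard similarity on whitespace tokens and a relatedness condition of the form `score ≥ δ · |R|`; all function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace token sets of two elements."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def nn_upper_bound(reference_set, candidate_set):
    """Sum, over the reference elements, of the best similarity to any candidate element."""
    return sum(
        max((jaccard(r, s) for s in candidate_set), default=0.0)
        for r in reference_set
    )


def passes_nn_filter(reference_set, candidate_set, delta):
    """Keep the candidate only if the upper bound can still reach the threshold."""
    return nn_upper_bound(reference_set, candidate_set) >= delta * len(reference_set)


R = ["77 Mass Ave", "5th St"]
S = ["77 Mass Ave Boston", "Main St"]
print(passes_nn_filter(R, S, delta=0.7))  # False, so S is pruned without matching
print(passes_nn_filter(R, S, delta=0.5))  # True, so S goes on to verification
```
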
### [5.6 Verification via Maximum Matching](pages/verifier.md)
Compute **maximum weighted bipartite matching** between elements of `R` and `S` for remaining candidates using the similarity function as edge weights.
Sets meeting or exceeding threshold `δ` are considered **related**.
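
For intuition, the verification step can be sketched with an exhaustive search over all assignments. This is only practical for tiny sets and is not how the project computes the matching; the sketch assumes Jaccard similarity on whitespace tokens and normalizes the score by `|R|` as a containment-style metric would. All names are hypothetical.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace token sets of two elements."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def max_matching_score(reference_set, candidate_set):
    """Exhaustive maximum weighted bipartite matching (tiny inputs only)."""
    small, large = sorted((reference_set, candidate_set), key=len)
    best = 0.0
    # Try every way of assigning the smaller side to distinct elements of the larger side.
    for assignment in permutations(large, len(small)):
        score = sum(jaccard(x, y) for x, y in zip(small, assignment))
        best = max(best, score)
    return best


def is_related(reference_set, candidate_set, delta):
    """Containment-style check: matching score divided by |R| must reach delta."""
    return max_matching_score(reference_set, candidate_set) / len(reference_set) >= delta


R = ["77 Mass Ave", "5th St"]
S = ["77 Mass Ave Boston", "Main St"]
print(is_related(R, S, delta=0.5))  # True
print(is_related(R, S, delta=0.7))  # False
```

A polynomial-time assignment solver replaces the permutation loop in any realistic setting.
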
---
## 6. Modes of Operation 🧪
- **Discovery Mode**: Compare all pairs of sets to find all related pairs.
*Use case:* Finding related columns in databases.
- **Search Mode**: Given a reference set, find all related sets.
*Use case:* Schema matching or entity deduplication.
---
## 7. Supported Similarity Functions 📐
- **Jaccard Similarity**
- **Edit Similarity** (Levenshtein-based)
- Optional minimum similarity threshold `α` on element comparisons.
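
The sketch below illustrates both element similarity functions and the effect of `α`. It is not the project's `silkmoth.utils` code: the function names are hypothetical, and the edit similarity uses one common normalization (`1 - distance / longer length`), which may differ from the formula the project uses.

```python
def jaccard_tokens(x: str, y: str) -> float:
    """Jaccard similarity on whitespace tokens."""
    tx, ty = set(x.split()), set(y.split())
    union = tx | ty
    return len(tx & ty) / len(union) if union else 0.0


def levenshtein(x: str, y: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cx != cy),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_similarity(x: str, y: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(x), len(y))
    return 1.0 - levenshtein(x, y) / longest if longest else 1.0


def apply_alpha(score: float, alpha: float) -> float:
    """Treat element similarities below alpha as 0."""
    return score if score >= alpha else 0.0


print(jaccard_tokens("77 Mass Ave", "77 Mass Ave Boston"))  # 0.75
print(levenshtein("kitten", "sitting"))                     # 3
print(apply_alpha(0.4, alpha=0.5))                          # 0.0
```
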
---
## 8. Installing from Source
1. Run `pip install src/` to install
---
## 9. Experiment Results
[📊 See Experiments and Results](experiments/README.md)

View File

@@ -0,0 +1,4 @@
::: silkmoth.candidate_selector
    rendering:
      show_signature: true
      show_source: true

View File

@@ -0,0 +1,4 @@
::: silkmoth.inverted_index
    rendering:
      show_signature: true
      show_source: true

View File

@@ -0,0 +1,4 @@
::: silkmoth.signature_generator
    rendering:
      show_signature: true
      show_source: true

View File

@@ -0,0 +1,4 @@
::: silkmoth.silkmoth_engine
    rendering:
      show_signature: true
      show_source: true

4
docu/pages/tokenizer.md Normal file
View File

@@ -0,0 +1,4 @@
::: silkmoth.tokenizer
    rendering:
      show_signature: true
      show_source: true

4
docu/pages/utils.md Normal file
View File

@@ -0,0 +1,4 @@
::: silkmoth.utils
    rendering:
      show_signature: true
      show_source: true

4
docu/pages/verifier.md Normal file
View File

@@ -0,0 +1,4 @@
::: silkmoth.verifier
    rendering:
      show_signature: true
      show_source: true

20
docu/write_modules.py Normal file
View File

@@ -0,0 +1,20 @@
import glob
import os

MODULES = glob.glob("src/silkmoth/*.py")
OUT_DIR = "docu/pages"
os.makedirs(OUT_DIR, exist_ok=True)

# Write one mkdocstrings stub page per module of the package
for path in MODULES:
    name = os.path.splitext(os.path.basename(path))[0]
    if name == "__init__":
        continue
    doc_path = os.path.join(OUT_DIR, f"{name}.md")
    with open(doc_path, "w") as f:
        f.write("::: silkmoth." + name + "\n")
        f.write("    rendering:\n")
        f.write("      show_signature: true\n")
        f.write("      show_source: true\n")

155
experiments/README.md Normal file
View File

@@ -0,0 +1,155 @@
### 🧪 Running the Experiments
This project includes multiple experiments to evaluate the performance and accuracy of our Python implementation of **SilkMoth**.
---
#### 📊 1. Experiment Types
You can replicate and customize the following types of experiments using different configurations (e.g., filters, signature strategies, reduction techniques):
- **String Matching (DBLP Publication Titles)**
- **Schema Matching (WebTables)**
- **Inclusion Dependency Discovery (WebTable Columns)**
Exact descriptions can be found in the official paper.
---
#### 📦 2. WebSchema Inclusion Dependency Setup
To run the **WebSchema + Inclusion Dependency** experiments:
1. Download the pre-extracted dataset from
[📥 this link](https://tubcloud.tu-berlin.de/s/D4ngEfdn3cJ3pxF).
2. Place the `.json` files in the `data/webtables/` directory
*(create the folder if it does not exist)*.
---
#### 🚀 3. Running the Experiments
To execute the core experiments from the paper:
```bash
python run.py
```
### 📈 4. Results Overview
We compared our results with those presented in the original SilkMoth paper.
Although exact reproduction is not possible due to language differences (Python vs C++) and dataset variations, overall **performance trends align well**.
All the results can be found in the folder `results`.
The **left** diagrams are from the paper and the **right** are ours.
> 💡 *Recent performance enhancements use `scipy`'s C-accelerated matching, which replaces the original `networkx`-based approach.
> Unless otherwise specified, the diagrams shown were generated with the `networkx` implementation.*
---
### 🔍 Inclusion Dependency
> **Goal**: Check if each reference set is contained within source sets.
**Filter Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_filter.png" alt="Original Result (paper)" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_filter_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
**Signature Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_sig.png" alt="Original Result (paper)" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_sig_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
**Reduction Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_red.png" alt="Original Result (paper)" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_reduction_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
**Scalability**
<p align="center">
<img src="silkmoth_results/inclusion_dep_scal.png" alt="Original Result (paper)" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_scalability_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
---
### 🔍 Schema Matching (WebTables)
> **Goal**: Detect related set pairs within a single source set.
**Filter Comparison**
<p align="center">
<img src="silkmoth_results/schema_matching_filter.png" alt="Original Result (paper)" width="45%" />
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.png" alt="Our Result" width="45%" />
</p>
**Signature Comparison**
<p align="center">
<img src="silkmoth_results/schema_matching_sig.png" alt="Original Result (paper)" width="45%" />
<img src="results/schema_matching/schema_matching_sig_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
**Scalability**
<p align="center">
<img src="silkmoth_results/schema_matching_scal.png" alt="Original Result (paper)" width="45%" />
<img src="results/schema_matching/schema_matching_scalability_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
---
### 🔍 String Matching (DBLP Publication Titles)
> **Goal**: Detect related titles within the dataset using the extended SilkMoth pipeline,
> based on **edit similarity** and **q-gram** tokenization.
> SciPy was used here.
**Filter Comparison**
<p align="center">
<img src="silkmoth_results/string_matching_filter.png" alt="Original Result (paper)" width="45%" />
<img src="results/string_matching/string_matching_filter_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
**Signature Comparison**
<p align="center">
<img src="silkmoth_results/string_matching_sig.png" alt="Original Result (paper)" width="45%" />
<img src="results/string_matching/10k-set-size/string_matching_sig_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
**Scalability**
<p align="center">
<img src="silkmoth_results/string_matching_scal.png" alt="Original Result (paper)" width="45%" />
<img src="results/string_matching/string_matching_scalability_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
---
### 🔍 Additional: Inclusion Dependency SilkMoth Filter compared with no SilkMoth
> In this analysis, we focus exclusively on SilkMoth. But how does it compare to a
> brute-force approach that skips the SilkMoth pipeline entirely? The graph below
> shows the Filter run alongside the brute-force bipartite matching method without any
> optimization pipeline. The results clearly demonstrate a dramatic improvement
> in runtime efficiency when using SilkMoth.
<img src="results/inclusion_dependency/inclusion_dependency_filter_combined_raw_experiment_α=0.5.png" alt="WebTables Result" />
---
### 🔍 Additional: Schema Matching with GitHub WebTables
> Similar to Schema Matching, this experiment uses a GitHub WebTable as a fixed reference set and matches it against other sets. The goal is to evaluate SilkMoth's performance across different domains.
**Left:** Matching with one reference set.
**Right:** Matching with WebTable Corpus and GitHub WebTable datasets.
The results show no significant difference, indicating consistent behavior across varying datasets.
<p align="center">
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.5.png" alt="WebTables Result" width="45%" />
<img src="results/schema_matching/github_webtable_schema_matching_experiment_α=0.5.png" alt="GitHub Table Result" width="45%" />
</p>

View File

File diff suppressed because it is too large Load Diff

View File

174
experiments/data_loader.py Normal file
View File

@@ -0,0 +1,174 @@
import json
import random
import os
import pandas as pd
from utils import *


class DataLoader:
    def __init__(self, data_path):
        self.data_path = data_path
        self.files = os.listdir(data_path)

    def load_webtable_columns_randomized(self, reference_set_amount: int, source_set_amount: int) -> tuple[list, list]:
        """
        Get randomized reference sets and source sets of webtable columns.
        Reference sets are subsets of the source sets.
        Only columns with 4 or more different elements are considered.
        Only columns with non-numeric, non-empty values are considered.

        Args:
            reference_set_amount (int): Number of reference sets to return.
            source_set_amount (int): Number of source sets to return.

        Returns:
            tuple: A tuple containing a list of reference sets and a list of source sets.
        """
        # Basic validation of input parameters
        if len(self.files) == 0:
            raise ValueError("data_path does not contain any files")
        if reference_set_amount < 1 or source_set_amount < 2:
            raise ValueError("reference_set_amount must be at least 1 and source_set_amount must be at least 2")
        if reference_set_amount >= source_set_amount:
            raise ValueError("reference_set_amount must be smaller than source_set_amount")
        if reference_set_amount > len(self.files):
            raise ValueError("reference_set_amount must not exceed the number of files in data_path")
        if source_set_amount > len(self.files):
            raise ValueError("source_set_amount must not exceed the number of files in data_path")

        # Randomly select the files from which source sets are drawn
        source_set_nums = random.sample(range(len(self.files)), source_set_amount)

        # Pick source_set_amount of columns which have at least 4 different elements
        source_sets = []
        while len(source_sets) < source_set_amount:
            # Pick a random number from the source_set_nums
            source_set_num = random.choice(source_set_nums)
            file_path = os.path.join(self.data_path, self.files[source_set_num])
            try:
                with open(file_path, 'r', encoding='utf-8') as file:
                    json_data = json.load(file)
                    if "relation" in json_data and isinstance(json_data["relation"], list):
                        # pick random column
                        col = random.randint(0, len(json_data["relation"]) - 1)
                        col = json_data["relation"][col]
                        # Check if the column has at least 4 different elements and contains no numeric values
                        if len(set(col)) >= 4:
                            if all(not is_convertible_to_number(value) and len(value) > 0 for value in col):
                                # Add the column to the source sets
                                source_sets.append(col)
                                print(f"Source set number {len(source_sets)} loaded")
            except Exception as e:
                raise ValueError(f"Error loading JSON file: {e}")

        # Randomly select reference sets from the source sets
        reference_sets = random.sample(source_sets, reference_set_amount)
        return reference_sets, source_sets

    def load_webtable_reference_sets_element_restriction(self, source_set: list, element_restriction: int) -> list:
        """
        Get 1000 reference sets of webtable columns, sampled with replacement,
        that satisfy an element restriction.
        The restriction is the minimal number of elements allowed in a reference set.

        Args:
            source_set (list): The source sets (columns) to sample the reference sets from.
            element_restriction (int): The minimal number of elements in each reference set.

        Returns:
            list: A list of reference sets.
        """
        if element_restriction < 1:
            raise ValueError("element_restriction must be at least 1")
        # Guard against an endless loop when no column is large enough
        if not any(len(col) >= element_restriction for col in source_set):
            raise ValueError("no column in source_set satisfies element_restriction")

        reference_sets = []
        while len(reference_sets) < 1000:
            # Randomly select a column from the source set
            col = random.choice(source_set)
            # Check if the column has at least element_restriction elements
            if len(col) >= element_restriction:
                reference_sets.append(col)
                print(f"Reference set number {len(reference_sets)} loaded")
        return reference_sets

    def load_webtable_schemas_randomized(self, set_amount: int) -> list:
        if set_amount < 2:
            raise ValueError("set_amount must be at least 2")

        # Random sequence of table numbers
        table_nums = random.sample(range(len(self.files)), len(self.files))
        schema_sets = []
        i = 0
        while len(schema_sets) < set_amount and i < len(table_nums):
            try:
                # Load the schema for the current table number
                schema = self.load_single_webtable_schema(table_nums[i])
                schema_sets.append(schema)
                print(f"Schema set number {len(schema_sets)} loaded")
                i += 1
            except ValueError as e:
                print(f"Skipping table number {table_nums[i]} due to error: {e}")
                i += 1
        return schema_sets

    def load_single_webtable_schema(self, reference_set_num: int) -> list:
        # Load the webtable schema for the given reference set number
        if reference_set_num < 0 or reference_set_num >= len(self.files):
            raise IndexError("reference_set_num is out of range")

        # Get the file at the specified position
        file_path = os.path.join(self.data_path, self.files[reference_set_num])

        # Load and return the JSON content
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                json_data = json.load(file)
                if "relation" in json_data and isinstance(json_data["relation"], list):
                    schema = [relation[0] for relation in json_data["relation"]]
                    if len(schema) == 0:
                        raise ValueError("Schema is empty")
                    if all(not is_convertible_to_number(col) for col in schema):
                        # remove "" empty strings from the schema
                        schema = [col for col in schema if len(col) > 0]
                        if len(schema) == 0:
                            raise ValueError("Schema contains only empty strings")
                        return schema
                    else:
                        raise ValueError("Schema contains numeric values")
                else:
                    raise ValueError("JSON does not contain a valid 'relation' key or it is not a list")
        except Exception as e:
            raise ValueError(f"Error loading JSON file: {e}")

    def load_dblp_titles(self, data_path: str) -> list:
        """
        Load DBLP paper titles from a CSV file.

        Args:
            data_path (str): Path to CSV file containing a column 'title'.

        Returns:
            list: A list of title strings.
        """
        if not os.path.exists(data_path):
            raise FileNotFoundError(f"DBLP CSV file not found: {data_path}")
        df = pd.read_csv(data_path)
        if "title" not in df.columns:
            raise ValueError("CSV must contain a 'title' column")
        titles = df["title"].dropna().tolist()
        return titles

469
experiments/experiments.py Normal file
View File

@@ -0,0 +1,469 @@
import time
from math import floor

from silkmoth.utils import contain, jaccard_similarity
from silkmoth.verifier import Verifier
from silkmoth.tokenizer import Tokenizer
# SilkMothEngine, SigType and edit_similarity were imported from both `silkmoth` and
# `src.silkmoth`; the later `src.` imports took effect, so only those are kept.
from src.silkmoth.silkmoth_engine import SilkMothEngine
from src.silkmoth.utils import SigType, edit_similarity
from utils import *
def run_experiment_filter_schemes(related_thresholds, similarity_thresholds, labels, source_sets, reference_sets,
                                  sim_metric, sim_func, is_search, file_name_prefix, folder_path):
    """
    Run SilkMoth for every combination of similarity threshold, label, and
    relatedness threshold, then save the runtimes to CSV and plot them.

    Parameters
    ----------
    related_thresholds : list[float]
        Thresholds for determining relatedness between sets.
    similarity_thresholds : list[float]
        Thresholds (α) for the similarity between elements.
    labels : list[str]
        Labels indicating the type of setting applied (e.g., "NO FILTER", "CHECK FILTER", "WEIGHTED").
    source_sets : list
        The sets to be compared against the reference sets or against themselves.
    reference_sets : list
        The sets used as the reference for comparison.
    sim_metric : callable
        The metric function used to evaluate relatedness between sets.
    sim_func : callable
        The function used to calculate similarity scores between elements.
    is_search : bool
        Flag indicating whether to perform a search operation or discovery.
    file_name_prefix : str
        Prefix for naming output files generated during the experiment.
    folder_path: str
        Path to the folder where results will be saved.
    """
    # Calculate index time and RAM usage for the SilkMothEngine
    in_index_time_start = time.time()
    initial_ram = measure_ram_usage()

    # Initialize the SilkMothEngine (this builds the inverted index)
    silk_moth_engine = SilkMothEngine(
        related_thresh=0,
        source_sets=source_sets,
        sim_metric=sim_metric,
        sim_func=sim_func,
        sim_thresh=0,
        is_check_filter=False,
        is_nn_filter=False,
    )
    in_index_time_end = time.time()
    final_ram = measure_ram_usage()
    in_index_elapsed_time = in_index_time_end - in_index_time_start
    in_index_ram_usage = final_ram - initial_ram
    print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")

    for sim_thresh in similarity_thresholds:
        # Check if the similarity function is edit similarity
        if sim_func == edit_similarity:
            # calc the maximum possible q-gram size based on sim_thresh
            upper_bound_q = sim_thresh/(1 - sim_thresh)
            q = floor(upper_bound_q)
            print(f"Using q = {q} for edit similarity with sim_thresh = {sim_thresh}")
            print(f"Rebuilding Inverted Index with q = {q}...")
            silk_moth_engine.set_q(q)

        elapsed_times_final = []
        silk_moth_engine.set_alpha(sim_thresh)
        for label in labels:
            elapsed_times = []
            for idx, related_thresh in enumerate(related_thresholds):
                print(
                    f"\nRunning SilkMoth {file_name_prefix} with α = {sim_thresh}, θ = {related_thresh}, label = {label}")

                # checks for filter runs
                if label == "CHECK FILTER":
                    silk_moth_engine.is_check_filter = True
                    silk_moth_engine.is_nn_filter = False
                elif label == "NN FILTER":
                    silk_moth_engine.is_check_filter = False
                    silk_moth_engine.is_nn_filter = True
                else:  # NO FILTER
                    silk_moth_engine.is_check_filter = False
                    silk_moth_engine.is_nn_filter = False

                # checks for signature scheme runs
                if label == SigType.WEIGHTED:
                    silk_moth_engine.set_signature_type(SigType.WEIGHTED)
                elif label == SigType.SKYLINE:
                    silk_moth_engine.set_signature_type(SigType.SKYLINE)
                elif label == SigType.DICHOTOMY:
                    silk_moth_engine.set_signature_type(SigType.DICHOTOMY)

                silk_moth_engine.set_related_threshold(related_thresh)

                # Measure the time taken to search for related sets
                time_start = time.time()

                # Used for search to see how many candidates were found and how many were removed
                candidates_amount = 0
                candidates_after = 0
                related_sets_found = 0
                if is_search:
                    for ref_id, ref_set in enumerate(reference_sets):
                        related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
                            ref_set)
                        candidates_amount += candidates_amount_temp
                        candidates_after += candidates_removed_temp
                        related_sets_found += len(related_sets_temp)
                else:
                    # If not searching, we are discovering sets
                    silk_moth_engine.discover_sets(source_sets)

                time_end = time.time()
                elapsed_time = time_end - time_start
                elapsed_times.append(elapsed_time)

                # Create a new data dictionary for each iteration
                if is_search:
                    data_overall = {
                        "similarity_threshold": sim_thresh,
                        "related_threshold": related_thresh,
                        "reference_set_amount": len(reference_sets),
                        "source_set_amount": len(source_sets),
                        "label": label,
                        "elapsed_time": round(elapsed_time, 3),
                        "inverted_index_time": round(in_index_elapsed_time, 3),
                        "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                        "candidates_amount": candidates_amount,
                        "candidates_amount_after_filtering": candidates_after,
                        "related_sets_found": related_sets_found,
                    }
                else:
                    data_overall = {
                        "similarity_threshold": sim_thresh,
                        "related_threshold": related_thresh,
                        "source_set_amount": len(source_sets),
                        "label": label,
                        "elapsed_time": round(elapsed_time, 3),
                        "inverted_index_time": round(in_index_elapsed_time, 3),
                        "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                    }

                # Save results to a CSV file
                save_experiment_results_to_csv(
                    results=data_overall,
                    file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
                )
            elapsed_times_final.append(elapsed_times)

        # One plot per similarity threshold, with one curve per label
        _ = plot_elapsed_times(
            related_thresholds=related_thresholds,
            elapsed_times_list=elapsed_times_final,
            fig_text=f"{file_name_prefix} (α = {sim_thresh})",
            legend_labels=labels,
            file_name=f"{folder_path}{file_name_prefix}_experiment_α={sim_thresh}.png"
        )
def run_reduction_experiment(related_thresholds, similarity_threshold, labels, source_sets, reference_sets,
                             sim_metric, sim_func, is_search, file_name_prefix, folder_path):
    """
    Compare runtimes with and without reduction, using the dichotomy signature scheme.

    Parameters
    ----------
    related_thresholds : list[float]
        Thresholds for determining relatedness between sets.
    similarity_threshold : float
        Threshold (α) for the similarity between elements.
    labels : list[str]
        Labels indicating the type of setting applied ("REDUCTION" or "NO REDUCTION").
    source_sets : list
        The sets to be compared against the reference sets or against themselves.
    reference_sets : list
        The sets used as the reference for comparison.
    sim_metric : callable
        The metric function used to evaluate relatedness between sets.
    sim_func : callable
        The function used to calculate similarity scores between elements.
    is_search : bool
        Flag indicating whether to perform a search operation or discovery.
    file_name_prefix : str
        Prefix for naming output files generated during the experiment.
    folder_path: str
        Path to the folder where results will be saved.
    """
    in_index_time_start = time.time()
    initial_ram = measure_ram_usage()

    # Initialize the SilkMothEngine (this builds the inverted index)
    silk_moth_engine = SilkMothEngine(
        related_thresh=0,
        source_sets=source_sets,
        sim_metric=sim_metric,
        sim_func=sim_func,
        sim_thresh=similarity_threshold,
        is_check_filter=False,
        is_nn_filter=False,
    )
    # use dichotomy signature scheme for this experiment
    silk_moth_engine.set_signature_type(SigType.DICHOTOMY)
    in_index_time_end = time.time()
    final_ram = measure_ram_usage()
    in_index_elapsed_time = in_index_time_end - in_index_time_start
    in_index_ram_usage = final_ram - initial_ram
    print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")

    elapsed_times_final = []
    for label in labels:
        if label == "REDUCTION":
            silk_moth_engine.set_reduction(True)
        elif label == "NO REDUCTION":
            silk_moth_engine.set_reduction(False)

        elapsed_times = []
        for idx, related_thresh in enumerate(related_thresholds):
            print(
                f"\nRunning SilkMoth {file_name_prefix} with α = {similarity_threshold}, θ = {related_thresh}, label = {label}")
            silk_moth_engine.set_related_threshold(related_thresh)

            # Measure the time taken to search for related sets
            time_start = time.time()

            # Used for search to see how many candidates were found and how many were removed
            candidates_amount = 0
            candidates_after = 0
            if is_search:
                for ref_id, ref_set in enumerate(reference_sets):
                    related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
                        ref_set)
                    candidates_amount += candidates_amount_temp
                    candidates_after += candidates_removed_temp
            else:
                # If not searching, we are discovering sets
                silk_moth_engine.discover_sets(source_sets)

            time_end = time.time()
            elapsed_time = time_end - time_start
            elapsed_times.append(elapsed_time)

            # Create a new data dictionary for each iteration
            if is_search:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "reference_set_amount": len(reference_sets),
                    "source_set_amount": len(source_sets),
                    "label": label,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                    "candidates_amount": candidates_amount,
                    "candidates_amount_after_filtering": candidates_after,
                }
            else:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "source_set_amount": len(source_sets),
                    "label": label,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                }

            # Save results to a CSV file
            save_experiment_results_to_csv(
                results=data_overall,
                file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
            )
        elapsed_times_final.append(elapsed_times)

    _ = plot_elapsed_times(
        related_thresholds=related_thresholds,
        elapsed_times_list=elapsed_times_final,
        fig_text=f"{file_name_prefix} (α = {similarity_threshold})",
        legend_labels=labels,
        file_name=f"{folder_path}{file_name_prefix}_experiment_α={similarity_threshold}.png"
    )
def run_scalability_experiment(related_thresholds, similarity_threshold, set_sizes, source_sets, reference_sets,
sim_metric, sim_func, is_search, file_name_prefix, folder_path):
"""
Parameters
----------
related_thresholds : list[float]
Thresholds for determining relatedness between sets.
similarity_threshold : float
Thresholds for measuring similarity between sets.
set_sizes : list[int]
Sizes of the sets to be used in the experiment.
source_sets : list[]
The sets to be compared against the reference sets or against itself.
reference_sets : list[]
The sets used as the reference for comparison.
sim_metric : callable
The metric function used to evaluate similarity between sets.
sim_func : callable
The function used to calculate similarity scores.
is_search : bool
Flag indicating whether to perform a search operation or discovery.
file_name_prefix : str
Prefix for naming output files generated during the experiment.
folder_path: str
Path to the folder where results will be saved.
"""
    elapsed_times_final = []
    for idx, related_thresh in enumerate(related_thresholds):
        elapsed_times = []
        for size in set_sizes:
            in_index_time_start = time.time()
            initial_ram = measure_ram_usage()
            # Initialize the SilkMothEngine (this builds the inverted index)
            silk_moth_engine = SilkMothEngine(
                related_thresh=0,
                source_sets=source_sets[:size],
                sim_metric=sim_metric,
                sim_func=sim_func,
                sim_thresh=similarity_threshold,
                is_check_filter=True,
                is_nn_filter=True,
            )
            in_index_time_end = time.time()
            final_ram = measure_ram_usage()
            in_index_elapsed_time = in_index_time_end - in_index_time_start
            in_index_ram_usage = final_ram - initial_ram
            print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")
            print(
                f"\nRunning SilkMoth {file_name_prefix} with α = {similarity_threshold}, θ = {related_thresh}, set_size = {size}")
            silk_moth_engine.set_related_threshold(related_thresh)
            # Measure the time taken to search for related sets
            time_start = time.time()
            if sim_func == edit_similarity:
                # Calculate the maximum possible q-gram size based on sim_thresh
                upper_bound_q = similarity_threshold / (1 - similarity_threshold)
                q = floor(upper_bound_q)
                print(f"Using q = {q} for edit similarity with sim_thresh = {similarity_threshold}")
                print(f"Rebuilding Inverted Index with q = {q}...")
                silk_moth_engine.set_q(q)
            # Used for search to see how many candidates were found and how many remain after filtering
            candidates_amount = 0
            candidates_after = 0
            if is_search:
                for ref_id, ref_set in enumerate(reference_sets):
                    related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
                        ref_set)
                    candidates_amount += candidates_amount_temp
                    candidates_after += candidates_removed_temp
            else:
                # If not searching, we are discovering sets
                silk_moth_engine.discover_sets(source_sets[:size])
            time_end = time.time()
            elapsed_time = time_end - time_start
            elapsed_times.append(elapsed_time)
            # Create a new data dictionary for each iteration
            if is_search:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "reference_set_amount": len(reference_sets),
                    "source_set_amount": len(source_sets[:size]),
                    "set_size": size,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                    "candidates_amount": candidates_amount,
                    "candidates_amount_after_filtering": candidates_after,
                }
            else:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "source_set_amount": len(source_sets[:size]),
                    "set_size": size,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                }
            # Save results to a CSV file
            save_experiment_results_to_csv(
                results=data_overall,
                file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
            )
            del silk_moth_engine
        elapsed_times_final.append(elapsed_times)
    # Create legend labels based on the related thresholds
    adjusted_legend_labels = [f"θ = {rt}" for rt in related_thresholds]
    # The x-axis shows the number of sets in units of 100,000
    adjusted_set_sizes = [size / 100_000 for size in set_sizes]
    _ = plot_elapsed_times(
        related_thresholds=adjusted_set_sizes,
        elapsed_times_list=elapsed_times_final,
        fig_text=f"{file_name_prefix} (α = {similarity_threshold})",
        legend_labels=adjusted_legend_labels,
        file_name=f"{folder_path}{file_name_prefix}_experiment_α={similarity_threshold}.png",
        xlabel="Number of Sets (in 100ks)",
    )


def run_matching_without_silkmoth_inc_dep(source_sets, reference_sets, related_thresholds, similarity_threshold,
                                          sim_metric, sim_fun, file_name_prefix, folder_path):
    """Brute-force baseline: verify every (reference, source) pair without SilkMoth filtering."""
    tokenizer = Tokenizer(sim_func=sim_fun)
    for related_thresh in related_thresholds:
        verifier = Verifier(sim_thresh=similarity_threshold, related_thresh=related_thresh,
                            sim_metric=sim_metric, sim_func=sim_fun, reduction=False)
        related_sets = []
        time_start = time.time()
        for ref in reference_sets:
            for source in source_sets:
                # Skip pairs where the reference has more elements than the source
                if len(ref) > len(source):
                    continue
                relatedness = verifier.get_relatedness(tokenizer.tokenize(ref), tokenizer.tokenize(source))
                if relatedness >= related_thresh:
                    related_sets.append((source, relatedness))
        time_end = time.time()
        elapsed_time = time_end - time_start
        data_overall = {
            "similarity_threshold": similarity_threshold,
            "related_threshold": related_thresh,
            "source_set_amount": len(source_sets),
            "reference_set_amount": len(reference_sets),
            "label": "RAW MATCH",
            "elapsed_time": round(elapsed_time, 3),
            "matches_found": len(related_sets)
        }
        # Save results to a CSV file
        save_experiment_results_to_csv(
            results=data_overall,
            file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
        )
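
The q-gram size rule that the experiment code applies for edit similarity, q = floor(α / (1 - α)), can be checked in isolation. The following is a minimal sketch and not part of the repository: the helper name `max_q_for_edit_similarity` and its range check are assumptions added for illustration.

```python
from math import floor


def max_q_for_edit_similarity(sim_thresh: float) -> int:
    """Hypothetical helper: q-gram size derived from the similarity threshold.

    Mirrors the expression floor(sim_thresh / (1 - sim_thresh)) used in the
    experiment code. The range check is an assumption added here, because
    the expression divides by zero when sim_thresh equals 1.
    """
    if not 0 <= sim_thresh < 1:
        raise ValueError("sim_thresh must lie in [0, 1)")
    return floor(sim_thresh / (1 - sim_thresh))


if __name__ == "__main__":
    for alpha in (0.25, 0.5, 0.75):
        print(f"alpha = {alpha}: q = {max_q_for_edit_similarity(alpha)}")
```

For α = 0.5 the rule gives q = 1 and for α = 0.75 it gives q = 3. Thresholds below 0.5 give q = 0; how `set_q` treats that value is not shown in this section.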


@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering,related_sets_found
0.0,0.7,1000,500000,NO FILTER,1036.548,49.107,7727.559,3006749,3006749,986715
0.0,0.75,1000,500000,NO FILTER,871.225,49.107,7727.559,2673348,2673348,964206
0.0,0.8,1000,500000,NO FILTER,695.528,49.107,7727.559,2273416,2273416,934002
0.0,0.85,1000,500000,NO FILTER,548.878,49.107,7727.559,1907985,1907985,879744
0.0,0.7,1000,500000,CHECK FILTER,980.124,49.107,7727.559,3006749,2852034,986715
0.0,0.75,1000,500000,CHECK FILTER,789.947,49.107,7727.559,2673348,2531660,964206
0.0,0.8,1000,500000,CHECK FILTER,590.707,49.107,7727.559,2273416,2107346,934002
0.0,0.85,1000,500000,CHECK FILTER,427.982,49.107,7727.559,1907985,1728877,879744
0.0,0.7,1000,500000,NN FILTER,533.776,49.107,7727.559,3006749,2547,2535
0.0,0.75,1000,500000,NN FILTER,448.358,49.107,7727.559,2673348,2394,2382
0.0,0.8,1000,500000,NN FILTER,359.112,49.107,7727.559,2273416,1077,1077
0.0,0.85,1000,500000,NN FILTER,268.529,49.107,7727.559,1907985,1037,1037
0.25,0.7,1000,500000,NO FILTER,1038.225,49.107,7727.559,3006749,3006749,984756
0.25,0.75,1000,500000,NO FILTER,866.06,49.107,7727.559,2673348,2673348,963792
0.25,0.8,1000,500000,NO FILTER,693.589,49.107,7727.559,2273416,2273416,933799
0.25,0.85,1000,500000,NO FILTER,545.784,49.107,7727.559,1907985,1907985,878482
0.25,0.7,1000,500000,CHECK FILTER,975.103,49.107,7727.559,3006749,2852028,984756
0.25,0.75,1000,500000,CHECK FILTER,787.87,49.107,7727.559,2673348,2531660,963792
0.25,0.8,1000,500000,CHECK FILTER,589.608,49.107,7727.559,2273416,2107346,933799
0.25,0.85,1000,500000,CHECK FILTER,426.222,49.107,7727.559,1907985,1728877,878482
0.25,0.7,1000,500000,NN FILTER,573.448,49.107,7727.559,3006749,2544,2532
0.25,0.75,1000,500000,NN FILTER,483.1,49.107,7727.559,2673348,2394,2382
0.25,0.8,1000,500000,NN FILTER,385.999,49.107,7727.559,2273416,1077,1077
0.25,0.85,1000,500000,NN FILTER,288.687,49.107,7727.559,1907985,1037,1037
0.5,0.7,1000,500000,NO FILTER,1031.681,49.107,7727.559,3006749,3006749,975892
0.5,0.75,1000,500000,NO FILTER,867.694,49.107,7727.559,2673348,2673348,951793
0.5,0.8,1000,500000,NO FILTER,693.398,49.107,7727.559,2273416,2273416,931599
0.5,0.85,1000,500000,NO FILTER,546.702,49.107,7727.559,1907985,1907985,875833
0.5,0.7,1000,500000,CHECK FILTER,971.71,49.107,7727.559,3006749,2848668,975892
0.5,0.75,1000,500000,CHECK FILTER,783.145,49.107,7727.559,2673348,2529966,951793
0.5,0.8,1000,500000,CHECK FILTER,585.346,49.107,7727.559,2273416,2106355,931599
0.5,0.85,1000,500000,CHECK FILTER,424.629,49.107,7727.559,1907985,1728640,875833
0.5,0.7,1000,500000,NN FILTER,573.046,49.107,7727.559,3006749,2544,2532
0.5,0.75,1000,500000,NN FILTER,482.035,49.107,7727.559,2673348,2394,2382
0.5,0.8,1000,500000,NN FILTER,385.754,49.107,7727.559,2273416,1077,1077
0.5,0.85,1000,500000,NN FILTER,288.24,49.107,7727.559,1907985,1037,1037
0.75,0.7,1000,500000,NO FILTER,1032.605,49.107,7727.559,3006749,3006749,973885
0.75,0.75,1000,500000,NO FILTER,866.218,49.107,7727.559,2673348,2673348,949627
0.75,0.8,1000,500000,NO FILTER,693.19,49.107,7727.559,2273416,2273416,929232
0.75,0.85,1000,500000,NO FILTER,548.07,49.107,7727.559,1907985,1907985,875163
0.75,0.7,1000,500000,CHECK FILTER,960.003,49.107,7727.559,3006749,2838145,973885
0.75,0.75,1000,500000,CHECK FILTER,773.8,49.107,7727.559,2673348,2519134,949627
0.75,0.8,1000,500000,CHECK FILTER,577.671,49.107,7727.559,2273416,2100303,929232
0.75,0.85,1000,500000,CHECK FILTER,417.292,49.107,7727.559,1907985,1725354,875163
0.75,0.7,1000,500000,NN FILTER,544.018,49.107,7727.559,3006749,2544,2532
0.75,0.75,1000,500000,NN FILTER,463.915,49.107,7727.559,2673348,2394,2382
0.75,0.8,1000,500000,NN FILTER,378.184,49.107,7727.559,2273416,1077,1077
0.75,0.85,1000,500000,NN FILTER,285.8,49.107,7727.559,1907985,1040,1040
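
One way to read the filter columns above is to express `candidates_amount_after_filtering` as a fraction of `candidates_amount`. A minimal sketch, using values copied from three of the rows above (α = 0.0, θ = 0.7); the helpers `pruned_fraction` and `summarize` are hypothetical and not part of the repository.

```python
import csv
import io

# Values copied from three rows of the results above (alpha = 0.0, theta = 0.7),
# reduced to the columns needed here.
SAMPLE = """label,candidates_amount,candidates_amount_after_filtering
NO FILTER,3006749,3006749
CHECK FILTER,3006749,2852034
NN FILTER,3006749,2547
"""


def pruned_fraction(before: int, after: int) -> float:
    """Share of candidates removed by a filter (hypothetical helper)."""
    return 1 - after / before


def summarize(csv_text: str) -> dict[str, float]:
    """Map each label to the fraction of candidates its filter removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["label"]: pruned_fraction(
            int(row["candidates_amount"]),
            int(row["candidates_amount_after_filtering"]),
        )
        for row in reader
    }


if __name__ == "__main__":
    for label, fraction in summarize(SAMPLE).items():
        print(f"{label}: {fraction:.2%} of candidates pruned")
```

On these three rows the check filter removes about 5% of the candidates, while the nearest-neighbour filter removes more than 99.9%.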

Binary file not shown.


@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering,related_sets_found
0.0,0.7,200,500000,NO FILTER,6753.593,49.277,7720.887,622080,622080,233513
0.0,0.75,200,500000,NO FILTER,6812.967,49.277,7720.887,575078,575078,223644
0.0,0.8,200,500000,NO FILTER,4953.635,49.277,7720.887,479650,479650,221376
0.0,0.85,200,500000,NO FILTER,4212.413,49.277,7720.887,423078,423078,196944
0.0,0.7,200,500000,CHECK FILTER,3835.233,49.277,7720.887,622080,589307,233513
0.0,0.75,200,500000,CHECK FILTER,3348.061,49.277,7720.887,575078,549687,223644
0.0,0.8,200,500000,CHECK FILTER,2414.995,49.277,7720.887,479650,438680,221376
0.0,0.85,200,500000,CHECK FILTER,1874.261,49.277,7720.887,423078,393028,196944
0.0,0.7,200,500000,NN FILTER,126.601,49.277,7720.887,622080,615,603
0.0,0.75,200,500000,NN FILTER,108.886,49.277,7720.887,575078,332,320
0.0,0.8,200,500000,NN FILTER,80.436,49.277,7720.887,479650,1,1
0.0,0.85,200,500000,NN FILTER,59.824,49.277,7720.887,423078,1,1
0.25,0.7,200,500000,NO FILTER,2191.216,49.277,7720.887,622080,622080,232290
0.25,0.75,200,500000,NO FILTER,1915.087,49.277,7720.887,575078,575078,223444
0.25,0.8,200,500000,NO FILTER,1544.113,49.277,7720.887,479650,479650,221284
0.25,0.85,200,500000,NO FILTER,1354.29,49.277,7720.887,423078,423078,196116
0.25,0.7,200,500000,CHECK FILTER,1809.643,49.277,7720.887,622080,589307,232290
0.25,0.75,200,500000,CHECK FILTER,1548.963,49.277,7720.887,575078,549687,223444
0.25,0.8,200,500000,CHECK FILTER,1277.618,49.277,7720.887,479650,438680,221284
0.25,0.85,200,500000,CHECK FILTER,1111.088,49.277,7720.887,423078,393028,196116
0.25,0.7,200,500000,NN FILTER,131.183,49.277,7720.887,622080,615,603
0.25,0.75,200,500000,NN FILTER,114.192,49.277,7720.887,575078,332,320
0.25,0.8,200,500000,NN FILTER,84.253,49.277,7720.887,479650,1,1
0.25,0.85,200,500000,NN FILTER,62.864,49.277,7720.887,423078,1,1
0.5,0.7,200,500000,NO FILTER,1682.409,49.277,7720.887,622080,622080,230903
0.5,0.75,200,500000,NO FILTER,1491.797,49.277,7720.887,575078,575078,222613
0.5,0.8,200,500000,NO FILTER,1250.727,49.277,7720.887,479650,479650,219875
0.5,0.85,200,500000,NO FILTER,1083.762,49.277,7720.887,423078,423078,195759
0.5,0.7,200,500000,CHECK FILTER,1436.208,49.277,7720.887,622080,588701,230903
0.5,0.75,200,500000,CHECK FILTER,1250.22,49.277,7720.887,575078,549178,222613
0.5,0.8,200,500000,CHECK FILTER,1023.904,49.277,7720.887,479650,438258,219875
0.5,0.85,200,500000,CHECK FILTER,893.938,49.277,7720.887,423078,392937,195759
0.5,0.7,200,500000,NN FILTER,129.51,49.277,7720.887,622080,615,603
0.5,0.75,200,500000,NN FILTER,112.158,49.277,7720.887,575078,332,320
0.5,0.8,200,500000,NN FILTER,83.434,49.277,7720.887,479650,1,1
0.5,0.85,200,500000,NN FILTER,62.648,49.277,7720.887,423078,1,1
0.75,0.7,200,500000,NO FILTER,1447.675,49.277,7720.887,622080,622080,230497
0.75,0.75,200,500000,NO FILTER,1270.052,49.277,7720.887,575078,575078,222063
0.75,0.8,200,500000,NO FILTER,1039.89,49.277,7720.887,479650,479650,219411
0.75,0.85,200,500000,NO FILTER,879.273,49.277,7720.887,423078,423078,195601
0.75,0.7,200,500000,CHECK FILTER,1193.541,49.277,7720.887,622080,586297,230497
0.75,0.75,200,500000,CHECK FILTER,1023.672,49.277,7720.887,575078,546701,222063
0.75,0.8,200,500000,CHECK FILTER,825.541,49.277,7720.887,479650,436782,219411
0.75,0.85,200,500000,CHECK FILTER,704.52,49.277,7720.887,423078,391809,195601
0.75,0.7,200,500000,NN FILTER,120.522,49.277,7720.887,622080,615,603
0.75,0.75,200,500000,NN FILTER,107.657,49.277,7720.887,575078,332,320
0.75,0.8,200,500000,NN FILTER,78.897,49.277,7720.887,479650,1,1
0.75,0.85,200,500000,NN FILTER,57.66,49.277,7720.887,423078,1,1

Binary file not shown.


Binary file not shown.


Binary file not shown.


Binary file not shown.


@@ -0,0 +1,2 @@
experiment name,elem/set,tokens/elem
Inclusion Dependency,17.81003,25.41035090901026

@@ -0,0 +1,17 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
0.0,0.7,200,500000,REDUCTION,6283.871,45.782,7700.914,622080,622080
0.0,0.75,200,500000,REDUCTION,5651.069,45.782,7700.914,575078,575078
0.0,0.8,200,500000,REDUCTION,4170.768,45.782,7700.914,479650,479650
0.0,0.85,200,500000,REDUCTION,3514.723,45.782,7700.914,423078,423078
0.0,0.7,200,500000,NO REDUCTION,6771.001,45.782,7700.914,622080,622080
0.0,0.75,200,500000,NO REDUCTION,6117.305,45.782,7700.914,575078,575078
0.0,0.8,200,500000,NO REDUCTION,4573.585,45.782,7700.914,479650,479650
0.0,0.85,200,500000,NO REDUCTION,3894.681,45.782,7700.914,423078,423078
0.0,0.7,200,500000,REDUCTION,6142.242,49.376,7721.383,622080,622080
0.0,0.75,200,500000,REDUCTION,5495.346,49.376,7721.383,575078,575078
0.0,0.8,200,500000,REDUCTION,4061.815,49.376,7721.383,479650,479650
0.0,0.85,200,500000,REDUCTION,3429.474,49.376,7721.383,423078,423078
0.0,0.7,200,500000,NO REDUCTION,6622.959,49.376,7721.383,622080,622080
0.0,0.75,200,500000,NO REDUCTION,5960.971,49.376,7721.383,575078,575078
0.0,0.8,200,500000,NO REDUCTION,4489.11,49.376,7721.383,479650,479650
0.0,0.85,200,500000,NO REDUCTION,3794.505,49.376,7721.383,423078,423078

Binary file not shown.


@@ -0,0 +1,21 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
0.5,0.7,200,100000,100000,69.222,11.405,1554.535,134576,46830
0.5,0.7,200,200000,200000,134.718,23.409,1659.543,254379,93573
0.5,0.7,200,300000,300000,206.136,32.782,1791.512,373007,139377
0.5,0.7,200,400000,400000,275.559,51.827,2040.961,499998,186205
0.5,0.7,200,500000,500000,353.944,51.169,2027.262,622080,233091
0.5,0.75,200,100000,100000,64.988,5.539,0.254,124611,45115
0.5,0.75,200,200000,200000,126.721,24.159,192.152,236137,90048
0.5,0.75,200,300000,300000,193.126,32.91,2217.562,347108,134199
0.5,0.75,200,400000,400000,259.254,50.945,1535.723,462815,179223
0.5,0.75,200,500000,500000,328.0,59.734,2526.176,575078,224315
0.5,0.8,200,100000,100000,59.984,5.544,0.77,104812,44549
0.5,0.8,200,200000,200000,123.595,23.419,-229.445,202489,88907
0.5,0.8,200,300000,300000,183.55,37.277,2302.273,300462,132525
0.5,0.8,200,400000,400000,239.431,45.86,1268.406,386895,176985
0.5,0.8,200,500000,500000,311.525,58.657,2716.348,479650,221057
0.5,0.85,200,100000,100000,56.371,9.486,-151.641,87451,39657
0.5,0.85,200,200000,200000,108.674,23.698,-889.457,171938,79056
0.5,0.85,200,300000,300000,164.616,33.799,2748.523,251392,117969
0.5,0.85,200,400000,400000,220.908,45.263,805.023,331901,157572
0.5,0.85,200,500000,500000,281.56,65.197,3474.547,423078,197145
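
Several `inverted_index_ram_usage` values above are negative (for example -229.445). One plausible explanation, stated here as an assumption because `measure_ram_usage` is not shown in this section, is that the value is the difference of two process-level readings, which can drop when memory is released between them. The sketch below shows an alternative based on Python's standard `tracemalloc` module, which reports peak Python-level allocations. The names `measure_peak_mib` and `build_toy_index` are hypothetical and not part of the repository.

```python
import tracemalloc
from collections import defaultdict


def measure_peak_mib(build):
    """Run build() and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = build()
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def build_toy_index():
    """Toy stand-in for index construction: token -> list of set ids."""
    index = defaultdict(list)
    for set_id in range(10_000):
        for token in (set_id % 97, set_id % 101):
            index[token].append(set_id)
    return index


if __name__ == "__main__":
    index, peak_mib = measure_peak_mib(build_toy_index)
    print(f"{len(index)} tokens, peak {peak_mib:.2f} MiB")
```

A peak value cannot be negative, but `tracemalloc` adds runtime overhead and only tracks allocations made through Python's allocator, so its numbers are not directly comparable to the values in the table.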

Binary file not shown.


@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
0.0,0.7,200,500000,SigType.WEIGHTED,6915.71,47.599,7701.59,622080,622080
0.0,0.75,200,500000,SigType.WEIGHTED,6230.769,47.599,7701.59,575078,575078
0.0,0.8,200,500000,SigType.WEIGHTED,4633.178,47.599,7701.59,479650,479650
0.0,0.85,200,500000,SigType.WEIGHTED,3948.011,47.599,7701.59,423078,423078
0.0,0.7,200,500000,SigType.SKYLINE,6839.554,47.599,7701.59,622080,622080
0.0,0.75,200,500000,SigType.SKYLINE,6156.19,47.599,7701.59,575078,575078
0.0,0.8,200,500000,SigType.SKYLINE,4601.987,47.599,7701.59,479650,479650
0.0,0.85,200,500000,SigType.SKYLINE,3921.286,47.599,7701.59,423078,423078
0.0,0.7,200,500000,SigType.DICHOTOMY,6824.442,47.599,7701.59,622080,622080
0.0,0.75,200,500000,SigType.DICHOTOMY,6158.089,47.599,7701.59,575078,575078
0.0,0.8,200,500000,SigType.DICHOTOMY,4601.877,47.599,7701.59,479650,479650
0.0,0.85,200,500000,SigType.DICHOTOMY,3923.695,47.599,7701.59,423078,423078
0.25,0.7,200,500000,SigType.WEIGHTED,1990.666,47.599,7701.59,622080,622080
0.25,0.75,200,500000,SigType.WEIGHTED,1722.451,47.599,7701.59,575078,575078
0.25,0.8,200,500000,SigType.WEIGHTED,1438.235,47.599,7701.59,479650,479650
0.25,0.85,200,500000,SigType.WEIGHTED,1264.852,47.599,7701.59,423078,423078
0.25,0.7,200,500000,SigType.SKYLINE,1989.546,47.599,7701.59,622080,622080
0.25,0.75,200,500000,SigType.SKYLINE,1719.169,47.599,7701.59,575078,575078
0.25,0.8,200,500000,SigType.SKYLINE,1440.077,47.599,7701.59,479650,479650
0.25,0.85,200,500000,SigType.SKYLINE,1267.701,47.599,7701.59,423078,423078
0.25,0.7,200,500000,SigType.DICHOTOMY,2046.949,47.599,7701.59,622270,622270
0.25,0.75,200,500000,SigType.DICHOTOMY,1966.499,47.599,7701.59,575268,575268
0.25,0.8,200,500000,SigType.DICHOTOMY,1485.458,47.599,7701.59,479650,479650
0.25,0.85,200,500000,SigType.DICHOTOMY,1436.847,47.599,7701.59,423078,423078
0.5,0.7,200,500000,SigType.WEIGHTED,1767.439,47.599,7701.59,622080,622080
0.5,0.75,200,500000,SigType.WEIGHTED,1565.259,47.599,7701.59,575078,575078
0.5,0.8,200,500000,SigType.WEIGHTED,1160.579,47.599,7701.59,479650,479650
0.5,0.85,200,500000,SigType.WEIGHTED,1014.452,47.599,7701.59,423078,423078
0.5,0.7,200,500000,SigType.SKYLINE,1589.081,47.599,7701.59,622054,622054
0.5,0.75,200,500000,SigType.SKYLINE,1393.117,47.599,7701.59,575050,575050
0.5,0.8,200,500000,SigType.SKYLINE,1154.931,47.599,7701.59,479622,479622
0.5,0.85,200,500000,SigType.SKYLINE,1025.061,47.599,7701.59,423078,423078
0.5,0.7,200,500000,SigType.DICHOTOMY,2777.528,47.599,7701.59,936785,936785
0.5,0.75,200,500000,SigType.DICHOTOMY,2340.389,47.599,7701.59,888736,888736
0.5,0.8,200,500000,SigType.DICHOTOMY,1678.145,47.599,7701.59,673929,673929
0.5,0.85,200,500000,SigType.DICHOTOMY,1374.518,47.599,7701.59,517483,517483
0.75,0.7,200,500000,SigType.WEIGHTED,1354.402,47.599,7701.59,622080,622080
0.75,0.75,200,500000,SigType.WEIGHTED,1187.603,47.599,7701.59,575078,575078
0.75,0.8,200,500000,SigType.WEIGHTED,971.469,47.599,7701.59,479650,479650
0.75,0.85,200,500000,SigType.WEIGHTED,822.075,47.599,7701.59,423078,423078
0.75,0.7,200,500000,SigType.SKYLINE,1303.676,47.599,7701.59,594466,594466
0.75,0.75,200,500000,SigType.SKYLINE,1152.405,47.599,7701.59,560020,560020
0.75,0.8,200,500000,SigType.SKYLINE,932.283,47.599,7701.59,467458,467458
0.75,0.85,200,500000,SigType.SKYLINE,816.709,47.599,7701.59,420962,420962
0.75,0.7,200,500000,SigType.DICHOTOMY,5710.524,47.599,7701.59,2410732,2410732
0.75,0.75,200,500000,SigType.DICHOTOMY,5072.603,47.599,7701.59,2145096,2145096
0.75,0.8,200,500000,SigType.DICHOTOMY,4403.341,47.599,7701.59,1739362,1739362
0.75,0.85,200,500000,SigType.DICHOTOMY,2735.424,47.599,7701.59,1078937,1078937

Binary file not shown (size: 200 KiB).

Binary file not shown (size: 207 KiB).

Binary file not shown (size: 207 KiB).

Binary file not shown (size: 159 KiB).

@@ -0,0 +1,5 @@
similarity_threshold,related_threshold,source_set_amount,reference_set_amount,label,elapsed_time,matches_found
0.5,0.7,500000,200,RAW MATCH,6945.364,230903
0.5,0.75,500000,200,RAW MATCH,6965.759,222613
0.5,0.8,500000,200,RAW MATCH,6974.576,219875
0.5,0.85,500000,200,RAW MATCH,7011.368,195759

@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
0.0,0.7,60000,60000,NO FILTER,3321.166,2.336,115.465,3055067,3055067
0.0,0.75,60000,60000,NO FILTER,1997.976,2.336,115.465,2321584,2321584
0.0,0.8,60000,60000,NO FILTER,1226.647,2.336,115.465,1265300,1265300
0.0,0.85,60000,60000,NO FILTER,530.302,2.336,115.465,642202,642202
0.0,0.7,60000,60000,CHECK FILTER,3766.567,2.336,115.465,3055067,2464704
0.0,0.75,60000,60000,CHECK FILTER,2241.664,2.336,115.465,2321584,1780582
0.0,0.8,60000,60000,CHECK FILTER,1371.372,2.336,115.465,1265300,936432
0.0,0.85,60000,60000,CHECK FILTER,2052.574,2.336,115.465,642202,523745
0.0,0.7,60000,60000,NN FILTER,1752.545,2.336,115.465,3055067,0
0.0,0.75,60000,60000,NN FILTER,1410.607,2.336,115.465,2321584,0
0.0,0.8,60000,60000,NN FILTER,817.098,2.336,115.465,1265300,0
0.0,0.85,60000,60000,NN FILTER,450.277,2.336,115.465,642202,0
0.25,0.7,60000,60000,NO FILTER,4295.794,2.336,115.465,3055067,3055067
0.25,0.75,60000,60000,NO FILTER,1973.377,2.336,115.465,2321584,2321584
0.25,0.8,60000,60000,NO FILTER,1212.983,2.336,115.465,1265300,1265300
0.25,0.85,60000,60000,NO FILTER,522.616,2.336,115.465,642202,642202
0.25,0.7,60000,60000,CHECK FILTER,3200.851,2.336,115.465,3055067,2455726
0.25,0.75,60000,60000,CHECK FILTER,1889.267,2.336,115.465,2321584,1770634
0.25,0.8,60000,60000,CHECK FILTER,1147.932,2.336,115.465,1265300,928712
0.25,0.85,60000,60000,CHECK FILTER,498.44,2.336,115.465,642202,522759
0.25,0.7,60000,60000,NN FILTER,122.104,2.336,115.465,3055067,0
0.25,0.75,60000,60000,NN FILTER,88.259,2.336,115.465,2321584,0
0.25,0.8,60000,60000,NN FILTER,49.714,2.336,115.465,1265300,0
0.25,0.85,60000,60000,NN FILTER,23.838,2.336,115.465,642202,0
0.5,0.7,60000,60000,NO FILTER,3272.056,2.336,115.465,3055067,3055067
0.5,0.75,60000,60000,NO FILTER,1961.328,2.336,115.465,2321584,2321584
0.5,0.8,60000,60000,NO FILTER,1200.994,2.336,115.465,1265300,1265300
0.5,0.85,60000,60000,NO FILTER,511.108,2.336,115.465,642202,642202
0.5,0.7,60000,60000,CHECK FILTER,3183.991,2.336,115.465,3055067,2437997
0.5,0.75,60000,60000,CHECK FILTER,1875.468,2.336,115.465,2321584,1756738
0.5,0.8,60000,60000,CHECK FILTER,1137.157,2.336,115.465,1265300,918967
0.5,0.85,60000,60000,CHECK FILTER,488.508,2.336,115.465,642202,517859
0.5,0.7,60000,60000,NN FILTER,120.567,2.336,115.465,3055067,0
0.5,0.75,60000,60000,NN FILTER,87.173,2.336,115.465,2321584,0
0.5,0.8,60000,60000,NN FILTER,49.292,2.336,115.465,1265300,0
0.5,0.85,60000,60000,NN FILTER,23.97,2.336,115.465,642202,0
0.75,0.7,60000,60000,NO FILTER,3085.617,2.336,115.465,3055067,3055067
0.75,0.75,60000,60000,NO FILTER,1788.559,2.336,115.465,2321584,2321584
0.75,0.8,60000,60000,NO FILTER,1046.714,2.336,115.465,1265300,1265300
0.75,0.85,60000,60000,NO FILTER,481.793,2.336,115.465,642202,642202
0.75,0.7,60000,60000,CHECK FILTER,2991.745,2.336,115.465,3055067,2428269
0.75,0.75,60000,60000,CHECK FILTER,1699.433,2.336,115.465,2321584,1750589
0.75,0.8,60000,60000,CHECK FILTER,983.657,2.336,115.465,1265300,916628
0.75,0.85,60000,60000,CHECK FILTER,458.081,2.336,115.465,642202,516012
0.75,0.7,60000,60000,NN FILTER,119.557,2.336,115.465,3055067,0
0.75,0.75,60000,60000,NN FILTER,86.338,2.336,115.465,2321584,0
0.75,0.8,60000,60000,NN FILTER,48.63,2.336,115.465,1265300,0
0.75,0.85,60000,60000,NN FILTER,23.63,2.336,115.465,642202,0

Binary file not shown (size: 198 KiB).

Binary file not shown (size: 164 KiB).

Binary file not shown (size: 171 KiB).

Binary file not shown (size: 173 KiB).

@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.0,0.7,60000,NO FILTER,5210.037,1.383,95.605
0.0,0.75,60000,NO FILTER,4654.41,1.383,95.605
0.0,0.8,60000,NO FILTER,3891.372,1.383,95.605
0.0,0.85,60000,NO FILTER,3561.118,1.383,95.605
0.0,0.7,60000,CHECK FILTER,5374.941,1.383,95.605
0.0,0.75,60000,CHECK FILTER,4772.542,1.383,95.605
0.0,0.8,60000,CHECK FILTER,4004.38,1.383,95.605
0.0,0.85,60000,CHECK FILTER,3653.843,1.383,95.605
0.0,0.7,60000,NN FILTER,3889.903,1.383,95.605
0.0,0.75,60000,NN FILTER,3739.136,1.383,95.605
0.0,0.8,60000,NN FILTER,3609.17,1.383,95.605
0.0,0.85,60000,NN FILTER,3517.33,1.383,95.605
0.25,0.7,60000,NO FILTER,5157.674,1.383,95.605
0.25,0.75,60000,NO FILTER,4621.14,1.383,95.605
0.25,0.8,60000,NO FILTER,3905.856,1.383,95.605
0.25,0.85,60000,NO FILTER,3598.239,1.383,95.605
0.25,0.7,60000,CHECK FILTER,5331.451,1.383,95.605
0.25,0.75,60000,CHECK FILTER,4769.428,1.383,95.605
0.25,0.8,60000,CHECK FILTER,4042.779,1.383,95.605
0.25,0.85,60000,CHECK FILTER,3709.669,1.383,95.605
0.25,0.7,60000,NN FILTER,3910.54,1.383,95.605
0.25,0.75,60000,NN FILTER,3760.587,1.383,95.605
0.25,0.8,60000,NN FILTER,3644.443,1.383,95.605
0.25,0.85,60000,NN FILTER,3558.579,1.383,95.605
0.5,0.7,60000,NO FILTER,5143.478,1.383,95.605
0.5,0.75,60000,NO FILTER,4670.328,1.383,95.605
0.5,0.8,60000,NO FILTER,3917.002,1.383,95.605
0.5,0.85,60000,NO FILTER,3556.487,1.383,95.605
0.5,0.7,60000,CHECK FILTER,5279.287,1.383,95.605
0.5,0.75,60000,CHECK FILTER,4749.58,1.383,95.605
0.5,0.8,60000,CHECK FILTER,4009.224,1.383,95.605
0.5,0.85,60000,CHECK FILTER,3659.874,1.383,95.605
0.5,0.7,60000,NN FILTER,3897.174,1.383,95.605
0.5,0.75,60000,NN FILTER,3771.733,1.383,95.605
0.5,0.8,60000,NN FILTER,3657.094,1.383,95.605
0.5,0.85,60000,NN FILTER,3553.523,1.383,95.605
0.75,0.7,60000,NO FILTER,5107.903,1.383,95.605
0.75,0.75,60000,NO FILTER,4582.298,1.383,95.605
0.75,0.8,60000,NO FILTER,3889.505,1.383,95.605
0.75,0.85,60000,NO FILTER,3559.531,1.383,95.605
0.75,0.7,60000,CHECK FILTER,5254.747,1.383,95.605
0.75,0.75,60000,CHECK FILTER,4722.922,1.383,95.605
0.75,0.8,60000,CHECK FILTER,3977.968,1.383,95.605
0.75,0.85,60000,CHECK FILTER,3635.288,1.383,95.605
0.75,0.7,60000,NN FILTER,3874.915,1.383,95.605
0.75,0.75,60000,NN FILTER,3786.562,1.383,95.605
0.75,0.8,60000,NN FILTER,3901.219,1.383,95.605
0.75,0.85,60000,NN FILTER,3541.992,1.383,95.605

Binary file not shown (size: 193 KiB).

Binary file not shown (size: 193 KiB).

Binary file not shown (size: 189 KiB).

Binary file not shown (size: 188 KiB).

@@ -0,0 +1,2 @@
experiment name,elem/set,tokens/elem
Schema Matching,4.839676,7.059130404597332

@@ -0,0 +1,21 @@
similarity_threshold,related_threshold,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.0,0.7,12000,12000,162.511,1.149,10.633
0.0,0.7,24000,24000,629.266,0.912,-14.359
0.0,0.7,36000,36000,1448.696,1.047,-3.805
0.0,0.7,48000,48000,2589.084,0.36,8.324
0.0,0.7,60000,60000,4018.602,1.276,30.07
0.0,0.75,12000,12000,156.237,0.079,0.0
0.0,0.75,24000,24000,601.804,0.166,0.0
0.0,0.75,36000,36000,1391.051,0.258,14.434
0.0,0.75,48000,48000,2485.407,1.142,23.73
0.0,0.75,60000,60000,3865.861,1.259,20.078
0.0,0.8,12000,12000,150.844,0.075,0.0
0.0,0.8,24000,24000,579.687,0.169,0.0
0.0,0.8,36000,36000,1337.54,0.259,6.953
0.0,0.8,48000,48000,2393.576,0.365,29.129
0.0,0.8,60000,60000,3731.672,1.298,29.992
0.0,0.85,12000,12000,146.417,0.077,0.0
0.0,0.85,24000,24000,565.317,0.903,-2.0
0.0,0.85,36000,36000,1303.856,1.025,7.91
0.0,0.85,48000,48000,2328.478,1.158,11.004
0.0,0.85,60000,60000,3636.522,1.285,28.184

Binary file not shown (size: 248 KiB).

@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.0,0.7,60000,SigType.WEIGHTED,5355.864,1.44,96.559
0.0,0.75,60000,SigType.WEIGHTED,4770.741,1.44,96.559
0.0,0.8,60000,SigType.WEIGHTED,4016.552,1.44,96.559
0.0,0.85,60000,SigType.WEIGHTED,3652.589,1.44,96.559
0.0,0.7,60000,SigType.SKYLINE,5320.789,1.44,96.559
0.0,0.75,60000,SigType.SKYLINE,4754.873,1.44,96.559
0.0,0.8,60000,SigType.SKYLINE,3993.905,1.44,96.559
0.0,0.85,60000,SigType.SKYLINE,3637.896,1.44,96.559
0.0,0.7,60000,SigType.DICHOTOMY,5314.17,1.44,96.559
0.0,0.75,60000,SigType.DICHOTOMY,4747.451,1.44,96.559
0.0,0.8,60000,SigType.DICHOTOMY,3987.966,1.44,96.559
0.0,0.85,60000,SigType.DICHOTOMY,3639.406,1.44,96.559
0.25,0.7,60000,SigType.WEIGHTED,5286.204,1.44,96.559
0.25,0.75,60000,SigType.WEIGHTED,4740.2,1.44,96.559
0.25,0.8,60000,SigType.WEIGHTED,3988.353,1.44,96.559
0.25,0.85,60000,SigType.WEIGHTED,3621.661,1.44,96.559
0.25,0.7,60000,SigType.SKYLINE,5272.151,1.44,96.559
0.25,0.75,60000,SigType.SKYLINE,4793.404,1.44,96.559
0.25,0.8,60000,SigType.SKYLINE,4270.868,1.44,96.559
0.25,0.85,60000,SigType.SKYLINE,3897.66,1.44,96.559
0.25,0.7,60000,SigType.DICHOTOMY,5280.093,1.44,96.559
0.25,0.75,60000,SigType.DICHOTOMY,4728.997,1.44,96.559
0.25,0.8,60000,SigType.DICHOTOMY,3971.004,1.44,96.559
0.25,0.85,60000,SigType.DICHOTOMY,3612.607,1.44,96.559
0.5,0.7,60000,SigType.WEIGHTED,5191.199,1.44,96.559
0.5,0.75,60000,SigType.WEIGHTED,4656.862,1.44,96.559
0.5,0.8,60000,SigType.WEIGHTED,3920.386,1.44,96.559
0.5,0.85,60000,SigType.WEIGHTED,3580.435,1.44,96.559
0.5,0.7,60000,SigType.SKYLINE,5180.493,1.44,96.559
0.5,0.75,60000,SigType.SKYLINE,4622.431,1.44,96.559
0.5,0.8,60000,SigType.SKYLINE,3871.093,1.44,96.559
0.5,0.85,60000,SigType.SKYLINE,3525.577,1.44,96.559
0.5,0.7,60000,SigType.DICHOTOMY,5112.984,1.44,96.559
0.5,0.75,60000,SigType.DICHOTOMY,4605.999,1.44,96.559
0.5,0.8,60000,SigType.DICHOTOMY,3876.706,1.44,96.559
0.5,0.85,60000,SigType.DICHOTOMY,3526.946,1.44,96.559
0.75,0.7,60000,SigType.WEIGHTED,5031.754,1.44,96.559
0.75,0.75,60000,SigType.WEIGHTED,4539.266,1.44,96.559
0.75,0.8,60000,SigType.WEIGHTED,3854.313,1.44,96.559
0.75,0.85,60000,SigType.WEIGHTED,3529.814,1.44,96.559
0.75,0.7,60000,SigType.SKYLINE,5037.338,1.44,96.559
0.75,0.75,60000,SigType.SKYLINE,4546.784,1.44,96.559
0.75,0.8,60000,SigType.SKYLINE,3843.47,1.44,96.559
0.75,0.85,60000,SigType.SKYLINE,3524.44,1.44,96.559
0.75,0.7,60000,SigType.DICHOTOMY,5252.169,1.44,96.559
0.75,0.75,60000,SigType.DICHOTOMY,4699.463,1.44,96.559
0.75,0.8,60000,SigType.DICHOTOMY,3928.414,1.44,96.559
0.75,0.85,60000,SigType.DICHOTOMY,3565.332,1.44,96.559

Binary file not shown (size: 207 KiB).

Binary file not shown (size: 211 KiB).

Binary file not shown (size: 219 KiB).

Binary file not shown (size: 210 KiB).

@@ -0,0 +1,13 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.8,0.7,10000,NO FILTER,3180.351,0.686,63.961
0.8,0.75,10000,NO FILTER,2729.108,0.686,63.961
0.8,0.8,10000,NO FILTER,2185.09,0.686,63.961
0.8,0.85,10000,NO FILTER,1542.041,0.686,63.961
0.8,0.7,10000,CHECK FILTER,2329.334,0.686,63.961
0.8,0.75,10000,CHECK FILTER,2012.022,0.686,63.961
0.8,0.8,10000,CHECK FILTER,1609.739,0.686,63.961
0.8,0.85,10000,CHECK FILTER,1140.994,0.686,63.961
0.8,0.7,10000,NN FILTER,448.129,0.686,63.961
0.8,0.75,10000,NN FILTER,388.975,0.686,63.961
0.8,0.8,10000,NN FILTER,315.568,0.686,63.961
0.8,0.85,10000,NN FILTER,232.207,0.686,63.961

Binary file not shown (size: 159 KiB).

@@ -0,0 +1,13 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.8,0.7,10000,SigType.WEIGHTED,3215.981,0.686,64.16
0.8,0.75,10000,SigType.WEIGHTED,2754.485,0.686,64.16
0.8,0.8,10000,SigType.WEIGHTED,2201.524,0.686,64.16
0.8,0.85,10000,SigType.WEIGHTED,1558.372,0.686,64.16
0.8,0.7,10000,SigType.SKYLINE,3200.56,0.686,64.16
0.8,0.75,10000,SigType.SKYLINE,2757.303,0.686,64.16
0.8,0.8,10000,SigType.SKYLINE,55.38,0.686,64.16
0.8,0.85,10000,SigType.SKYLINE,20.134,0.686,64.16
0.8,0.7,10000,SigType.DICHOTOMY,3151.663,0.686,64.16
0.8,0.75,10000,SigType.DICHOTOMY,2613.546,0.686,64.16
0.8,0.8,10000,SigType.DICHOTOMY,52.873,0.686,64.16
0.8,0.85,10000,SigType.DICHOTOMY,19.331,0.686,64.16

Binary file not shown (size: 199 KiB).

@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.7,0.7,5000,NO FILTER,3145.41,0.391,28.309
0.7,0.75,5000,NO FILTER,2687.395,0.391,28.309
0.7,0.8,5000,NO FILTER,2244.686,0.391,28.309
0.7,0.85,5000,NO FILTER,1650.297,0.391,28.309
0.7,0.7,5000,CHECK FILTER,4118.279,0.391,28.309
0.7,0.75,5000,CHECK FILTER,3601.918,0.391,28.309
0.7,0.8,5000,CHECK FILTER,2874.443,0.391,28.309
0.7,0.85,5000,CHECK FILTER,2044.612,0.391,28.309
0.7,0.7,5000,NN FILTER,630.678,0.391,28.309
0.7,0.75,5000,NN FILTER,562.722,0.391,28.309
0.7,0.8,5000,NN FILTER,483.175,0.391,28.309
0.7,0.85,5000,NN FILTER,394.221,0.391,28.309
0.75,0.7,5000,NO FILTER,2189.373,0.391,28.309
0.75,0.75,5000,NO FILTER,1891.061,0.391,28.309
0.75,0.8,5000,NO FILTER,1516.5,0.391,28.309
0.75,0.85,5000,NO FILTER,1073.123,0.391,28.309
0.75,0.7,5000,CHECK FILTER,2222.872,0.391,28.309
0.75,0.75,5000,CHECK FILTER,1913.937,0.391,28.309
0.75,0.8,5000,CHECK FILTER,1542.112,0.391,28.309
0.75,0.85,5000,CHECK FILTER,1086.385,0.391,28.309
0.75,0.7,5000,NN FILTER,304.748,0.391,28.309
0.75,0.75,5000,NN FILTER,265.773,0.391,28.309
0.75,0.8,5000,NN FILTER,217.404,0.391,28.309
0.75,0.85,5000,NN FILTER,162.876,0.391,28.309
0.8,0.7,5000,NO FILTER,858.698,0.391,28.309
0.8,0.75,5000,NO FILTER,745.085,0.391,28.309
0.8,0.8,5000,NO FILTER,596.28,0.391,28.309
0.8,0.85,5000,NO FILTER,421.34,0.391,28.309
0.8,0.7,5000,CHECK FILTER,636.886,0.391,28.309
0.8,0.75,5000,CHECK FILTER,550.521,0.391,28.309
0.8,0.8,5000,CHECK FILTER,443.218,0.391,28.309
0.8,0.85,5000,CHECK FILTER,313.208,0.391,28.309
0.8,0.7,5000,NN FILTER,120.012,0.391,28.309
0.8,0.75,5000,NN FILTER,103.497,0.391,28.309
0.8,0.8,5000,NN FILTER,85.033,0.391,28.309
0.8,0.85,5000,NN FILTER,62.035,0.391,28.309
0.85,0.7,5000,NO FILTER,446.251,0.391,28.309
0.85,0.75,5000,NO FILTER,386.611,0.391,28.309
0.85,0.8,5000,NO FILTER,309.98,0.391,28.309
0.85,0.85,5000,NO FILTER,217.511,0.391,28.309
0.85,0.7,5000,CHECK FILTER,364.622,0.391,28.309
0.85,0.75,5000,CHECK FILTER,323.038,0.391,28.309
0.85,0.8,5000,CHECK FILTER,263.697,0.391,28.309
0.85,0.85,5000,CHECK FILTER,184.893,0.391,28.309
0.85,0.7,5000,NN FILTER,72.101,0.391,28.309
0.85,0.75,5000,NN FILTER,62.971,0.391,28.309
0.85,0.8,5000,NN FILTER,51.582,0.391,28.309
0.85,0.85,5000,NN FILTER,35.586,0.391,28.309

Some files were not shown because too many files have changed in this diff.