init
Update README.md
15
.gitignore
vendored
Normal file
@@ -0,0 +1,15 @@
.venv/
__pycache__/
silkmoth.egg-info/
build/
dist/
site/
reference_sets_inclusion_dependency.json
reference_sets_inclusion_dependency_reduction.json
source_sets_inclusion_dependency.json
webtable_schemas_sets_500k.json
github_webtable_schemas_sets_500k.json

.vscode/

silkmoth_env/
152
README.md
Normal file
@@ -0,0 +1,152 @@
# 🦋 LSDIPro SS2025

## 📄 [SilkMoth: An Efficient Method for Finding Related Sets](https://doi.org/10.14778/3115404.3115413)

A project inspired by the SilkMoth paper, exploring efficient techniques for related set discovery.

---

## 👥 Team Members
- **Andreas Wilms**
- **Sarra Daknou**
- **Amina Iqbal**
- **Jakob Berschneider**

---

## 📊 Experiments & Results
➡️ [**See Experiments**](experiments/README.md)

---

## 🧪 Interactive Demo

Follow our **step-by-step Jupyter Notebook demo** for a hands-on understanding of SilkMoth.

📓 [**Open demo_example.ipynb**](demo_example.ipynb)

---

# 📘 Project Documentation

## Table of Contents

- [1. Large Scale Data Integration Project (LSDIPro)](#1-large-scale-data-integration-project-lsdipro)
- [2. What is SilkMoth? 🐛](#2-what-is-silkmoth)
- [3. The Problem 🧩](#3-the-problem)
- [4. SilkMoth’s Solution 🚀](#4-silkmoths-solution)
- [5. Core Pipeline Steps 🔁](#5-core-pipeline-steps)
  - [5.1 Tokenization](#51-tokenization)
  - [5.2 Inverted Index Construction](#52-inverted-index-construction)
  - [5.3 Signature Generation](#53-signature-generation)
  - [5.4 Candidate Selection](#54-candidate-selection)
  - [5.5 Refinement Filters](#55-refinement-filters)
  - [5.6 Verification via Maximum Matching](#56-verification-via-maximum-matching)
- [6. Modes of Operation 🧪](#6-modes-of-operation-)
- [7. Supported Similarity Functions 📐](#7-supported-similarity-functions-)
- [8. Installing from Source](#8-installing-from-source)
- [9. Experiment Results](#9-experiment-results)

---

## 1. Large Scale Data Integration Project (LSDIPro)

As part of the university project LSDIPro, our team implemented the approach described in the SilkMoth paper in Python. The course focuses on large-scale data integration, where student groups reproduce and extend research prototypes.
The project emphasizes scalable algorithm design, evaluation, and handling heterogeneous data at scale.

---

## 2. What is SilkMoth?

**SilkMoth** is a system designed to efficiently discover related sets in large collections of data, even when the elements within those sets are only approximately similar.
This is especially important in **data integration**, **data cleaning**, and **information retrieval**, where messy or inconsistent data is common.

---

## 3. The Problem

Determining whether two sets are related, for example, whether two database columns should be joined, often involves comparing their elements using **similarity functions** (not just exact matches).
A powerful approach models this as a **bipartite graph** whose edges are weighted by element similarity and uses the **maximum matching score** between the two sets as the relatedness measure. However, this method is **computationally expensive** (`O(n³)` per pair of sets), making it impractical for large datasets.

---

## 4. SilkMoth’s Solution

SilkMoth tackles this with a three-step approach:

1. **Signature Generation**: Creates a compact signature for each set such that related sets are guaranteed to share at least one signature token.
2. **Pruning**: Filters out unrelated sets early, reducing the number of candidates.
3. **Verification**: Applies the costly matching metric only to the remaining candidates, producing the same results as the brute-force approach but faster.

---

## 5. Core Pipeline Steps



*Figure 1. SILKMOTH pipeline framework. Source: Deng et al., "SILKMOTH: An Efficient Method for Finding Related Sets with Maximum Matching Constraints", VLDB 2017. Licensed under CC BY-NC-ND 4.0.*

### 5.1 Tokenization

Each element in every set is tokenized based on the selected similarity function:
- **Jaccard Similarity**: Elements are split into whitespace-delimited tokens.
- **Edit Similarity**: Elements are split into overlapping `q`-grams (e.g., 3-grams).
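
As a rough illustration of the two tokenization styles, the sketch below uses two hypothetical helpers, `whitespace_tokens` and `qgram_tokens`. They are not the project's `Tokenizer` API, and the handling of strings shorter than `q` is an assumption made only for this example.

```python
def whitespace_tokens(element: str) -> set[str]:
    """Jaccard-style tokenization: split an element on whitespace."""
    return set(element.split())


def qgram_tokens(element: str, q: int = 3) -> set[str]:
    """Edit-similarity-style tokenization: overlapping q-grams of the element."""
    if len(element) < q:
        # Assumption for this sketch: a string shorter than q is one token.
        return {element}
    return {element[i:i + q] for i in range(len(element) - q + 1)}


print(sorted(whitespace_tokens("77 Mass Ave Boston MA")))
print(sorted(qgram_tokens("Boston", q=3)))  # ['Bos', 'ost', 'sto', 'ton']
```

Sets are used here so that repeated tokens within one element count only once, which matches how Jaccard similarity treats its inputs.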

### 5.2 Inverted Index Construction

An **inverted index** is built over the collection of source sets `S`, mapping each token to the list of `(set, element)` pairs in which it occurs.
This allows fast lookup of candidate sets sharing tokens with a query.
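
The index structure can be sketched with a plain dictionary. The function name `build_inverted_index` is hypothetical and not the project's `InvertedIndex` class; the source sets are the ones used in `demo_example.ipynb`, tokenized on whitespace.

```python
from collections import defaultdict


def build_inverted_index(tokenized_sets):
    """Map each token to the (set_id, element_id) pairs in which it occurs."""
    index = defaultdict(list)
    for set_id, elements in enumerate(tokenized_sets):
        for element_id, tokens in enumerate(elements):
            for token in tokens:
                index[token].append((set_id, element_id))
    return index


# Source sets taken from demo_example.ipynb.
source_sets = [
    ["Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115"],
    ["77 Boston MA", "77 5th St Boston 02115", "77 Mass Ave 02115 Seattle"],
    ["77 Mass Ave 5th Boston MA", "Mass Ave Chicago IL", "77 Mass Ave St"],
    ["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"],
]
tokenized_sets = [[set(element.split()) for element in s] for s in source_sets]
index = build_inverted_index(tokenized_sets)

# Posting list for one token: which (set, element) pairs contain "Mass"?
print(index["Mass"])
```

The posting list printed for `Mass` agrees with the lookup shown in the demo notebook.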

### 5.3 Signature Generation

A **signature** is a subset of tokens selected from each reference set such that:
- Any related set must share at least one signature token.
- Signature size is minimized to reduce candidate space.

Signature selection heuristics (e.g., cost/value greedy ranking) approximate the optimal valid signature, because computing it exactly is NP-complete.
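
The validity requirement can be made concrete for one setting. The sketch below assumes Jaccard similarity, the set-containment metric (a set is related if its matching score is at least `δ·|R|`), and `α = 0`. Under those assumptions, an element `r` that has `k` of its tokens in the signature can reach a similarity of at most `(|r| − k) / |r|` with any element that shares no signature token, so a signature is valid when these bounds sum to less than `δ·|R|`. The function name `signature_is_valid` is hypothetical; this is our reading of the weighted scheme, not the project's `SignatureGenerator`.

```python
def signature_is_valid(tokenized_R, signature, delta):
    """Check the validity condition sketched above (Jaccard, containment, alpha = 0)."""
    signature = set(signature)
    bound = 0.0
    for tokens in tokenized_R:
        # Best similarity this element can still reach without signature tokens.
        bound += len(tokens - signature) / len(tokens)
    return bound < delta * len(tokenized_R)


# Reference set from demo_example.ipynb, tokenized on whitespace.
R = [
    set("77 Mass Ave Boston MA".split()),
    set("5th St 02115 Seattle WA".split()),
    set("77 5th St Chicago IL".split()),
]

# Signature reported in the demo: bounds 1.0 + 0.6 + 0.4 = 2.0 < 0.7 * 3.
print(signature_is_valid(R, {"Chicago", "WA", "IL", "5th"}, delta=0.7))  # True
# Dropping "5th" raises the bound to 1.0 + 0.8 + 0.6 = 2.4, so it is no longer valid.
print(signature_is_valid(R, {"Chicago", "WA", "IL"}, delta=0.7))  # False
```

A greedy heuristic would add tokens (preferring those with short posting lists) until this check passes.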

### 5.4 Candidate Selection

For each reference set `R`, retrieve from the inverted index all sets `S` sharing at least one token with `R`’s signature. These become **candidate sets** for further evaluation.
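
A minimal sketch of this lookup, assuming the index is a mapping from token to `(set_id, element_id)` pairs. The function `select_candidates` is hypothetical, and the small index below is a hand-written excerpt of the postings for the demo data, not output of the project's code.

```python
def select_candidates(signature, index):
    """Collect the ids of all sets that contain at least one signature token."""
    candidates = set()
    for token in signature:
        for set_id, _element_id in index.get(token, []):
            candidates.add(set_id)
    return candidates


# Excerpt of an inverted index: token -> [(set_id, element_id), ...].
index = {
    "5th": [(0, 1), (0, 2), (1, 1), (2, 0), (3, 1), (3, 2)],
    "WA": [(3, 1)],
    "Chicago": [(2, 1)],
    "IL": [(2, 1)],
}

print(sorted(select_candidates(["Chicago", "WA", "IL", "5th"], index)))  # [0, 1, 2, 3]
```

Tokens that do not occur in the index simply contribute no candidates.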

### 5.5 Refinement Filters

Two filters reduce false positives among candidates:
- **Check Filter**: Uses an upper bound on the similarity to eliminate candidate sets that cannot reach the threshold.
- **Nearest Neighbor Filter**: Upper-bounds the maximum matching score by summing, for each element in `R`, the similarity to its nearest neighbor in the candidate set, and prunes candidates whose bound stays below the threshold.
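
The nearest-neighbor bound can be sketched as follows, assuming Jaccard similarity, the set-containment threshold `δ·|R|`, and non-empty candidate sets. The helpers `jaccard`, `nn_upper_bound`, and `nn_filter` are hypothetical and much simpler than the project's `CandidateSelector`, which also reuses information from the check filter.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nn_upper_bound(R, S):
    """Sum of each reference element's best similarity to any element of S.

    A matching pairs every element of R with at most one element of S,
    so the matching score can never exceed this sum.
    """
    return sum(max(jaccard(r, s) for s in S) for r in R)


def nn_filter(R, candidates, delta):
    """Keep only the candidates whose upper bound reaches delta * |R|."""
    threshold = delta * len(R)
    return [cid for cid, S in candidates.items() if nn_upper_bound(R, S) >= threshold]


def tokens(element):
    return set(element.split())


# Reference set and two of the source sets from demo_example.ipynb.
R = [tokens(e) for e in ("77 Mass Ave Boston MA", "5th St 02115 Seattle WA", "77 5th St Chicago IL")]
candidates = {
    0: [tokens(e) for e in ("Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115")],
    3: [tokens(e) for e in ("77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle")],
}

print(nn_filter(R, candidates, delta=0.7))  # [3]
```

For candidate 0 the bound is roughly 1.11, well below the threshold of 2.1, so it is pruned without running any matching.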

### 5.6 Verification via Maximum Matching

Compute the **maximum weighted bipartite matching** between the elements of `R` and `S` for each remaining candidate, using the similarity function as edge weights.
Sets whose resulting relatedness score meets or exceeds the threshold `δ` are considered **related**.
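
For very small sets, the verification step can be illustrated by brute force. The sketch below tries every assignment instead of using a polynomial-time matching algorithm, so it is only meant to show what is being computed. The helper names are hypothetical, and the containment score (matching score divided by `|R|`) is our reading of the set-containment metric used in the demo.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_matching_score(R, S):
    """Maximum weighted bipartite matching by exhaustive search (tiny inputs only)."""
    small, large = (R, S) if len(R) <= len(S) else (S, R)
    best = 0.0
    for assignment in permutations(range(len(large)), len(small)):
        score = sum(jaccard(small[i], large[j]) for i, j in enumerate(assignment))
        best = max(best, score)
    return best


def containment(R, S):
    """Relatedness under set containment: matching score normalized by |R|."""
    return max_matching_score(R, S) / len(R)


def tokens(element):
    return set(element.split())


# Reference set and source set S[3] from demo_example.ipynb.
R = [tokens(e) for e in ("77 Mass Ave Boston MA", "5th St 02115 Seattle WA", "77 5th St Chicago IL")]
S = [tokens(e) for e in ("77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle")]

print(f"{containment(R, S):.3f}")  # 0.743
```

The score agrees with the value reported for `S[3]` in the demo notebook and exceeds `δ = 0.7`, so the set is reported as related.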

---

## 6. Modes of Operation 🧪

- **Discovery Mode**: Compare all pairs of sets to find all related pairs.
  *Use case:* Finding related columns in databases.

- **Search Mode**: Given a reference set, find all related sets.
  *Use case:* Schema matching or entity deduplication.
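
The difference between the two modes is mainly in what goes in and what comes out. The sketch below shows this with hypothetical functions `search` and `discover` and a placeholder relatedness check `toy_is_related`; it deliberately omits signatures and filters, so it is not how the project's engine is implemented.

```python
def search(reference, sources, is_related):
    """Search mode: indices of all source sets related to one reference set."""
    return [j for j, source in enumerate(sources) if is_related(reference, source)]


def discover(sets, is_related):
    """Discovery mode: all ordered pairs (i, j), i != j, where sets[i] is related to sets[j]."""
    pairs = []
    for i, reference in enumerate(sets):
        for j in search(reference, sets, is_related):
            if i != j:
                pairs.append((i, j))
    return pairs


def toy_is_related(reference, source, delta=0.5):
    # Placeholder: exact-match containment, standing in for the real pipeline.
    return len(set(reference) & set(source)) / len(set(reference)) >= delta


sets = [["a", "b"], ["a", "b", "c"], ["x"]]
print(search(["a", "b"], sets, toy_is_related))  # [0, 1]
print(discover(sets, toy_is_related))            # [(0, 1), (1, 0)]
```

Discovery mode can be seen as running search mode once per set, with every set acting as the reference in turn.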

---

## 7. Supported Similarity Functions 📐

- **Jaccard Similarity**
- **Edit Similarity** (Levenshtein-based)
- Optional minimum similarity threshold `α` on element comparisons (similarities below `α` are treated as 0).
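
The following sketch shows one possible form of these functions. The names `jaccard`, `levenshtein`, `edit_sim`, and `apply_alpha` are hypothetical stand-ins, not the functions in `silkmoth.utils`, and normalizing the edit distance by the longer string's length is an assumption; the paper and the project may normalize differently.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def levenshtein(x: str, y: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_sim(x: str, y: str) -> float:
    """Edit similarity in [0, 1]; normalization by max length is an assumption."""
    if not x and not y:
        return 1.0
    return 1.0 - levenshtein(x, y) / max(len(x), len(y))


def apply_alpha(similarity: float, alpha: float) -> float:
    """Treat element similarities below alpha as 0."""
    return similarity if similarity >= alpha else 0.0


print(jaccard({"77", "Mass"}, {"77", "Ave"}))
print(levenshtein("kitten", "sitting"))  # 3
print(apply_alpha(edit_sim("Boston", "Bostn"), alpha=0.9))  # 0.0
```

With `α = 0`, as in the demo notebook, `apply_alpha` leaves every similarity unchanged.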

---

## 8. Installing from Source

1. Run `pip install src/` from the repository root to install the package.

---

## 9. Experiment Results

[📊 See Experiments and Results](experiments/README.md)
823
demo_example.ipynb
Normal file
@@ -0,0 +1,823 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c9f89a47",
   "metadata": {},
   "source": [
    "## SilkMoth Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ca15800",
   "metadata": {},
   "source": [
    "### Related Set Discovery task under Set‑Containment using Jaccard Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea6ce5fb",
   "metadata": {},
   "source": [
    "Import of all required modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "bdd1b92c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"src\")\n",
    "\n",
    "from silkmoth.tokenizer import Tokenizer\n",
    "from silkmoth.inverted_index import InvertedIndex\n",
    "from silkmoth.signature_generator import SignatureGenerator\n",
    "from silkmoth.candidate_selector import CandidateSelector\n",
    "from silkmoth.verifier import Verifier\n",
    "from silkmoth.silkmoth_engine import SilkMothEngine\n",
    "\n",
    "\n",
    "from silkmoth.utils import jaccard_similarity, contain, edit_similarity, similar, SigType\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "from IPython.display import display, Markdown\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf6bf1f5",
   "metadata": {},
   "source": [
    "Define the example dataset from the SilkMoth paper (reference set **R** and source sets **S**):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "598a4bbf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Reference set (R):**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- R[0]: “77 Mass Ave Boston MA”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- R[1]: “5th St 02115 Seattle WA”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- R[2]: “77 5th St Chicago IL”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Source sets (S):**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Location Dataset\n",
    "reference_set = [\n",
    "    '77 Mass Ave Boston MA',\n",
    "    '5th St 02115 Seattle WA',\n",
    "    '77 5th St Chicago IL'\n",
    "]\n",
    "\n",
    "# Address Dataset\n",
    "source_sets = [\n",
    "    ['Mass Ave St Boston 02115','77 Mass 5th St Boston','77 Mass Ave 5th 02115'],\n",
    "    ['77 Boston MA','77 5th St Boston 02115','77 Mass Ave 02115 Seattle'],\n",
    "    ['77 Mass Ave 5th Boston MA','Mass Ave Chicago IL','77 Mass Ave St'],\n",
    "    ['77 Mass Ave MA','5th St 02115 Seattle WA','77 5th St Boston Seattle']\n",
    "]\n",
    "\n",
    "# thresholds & q\n",
    "δ = 0.7\n",
    "α = 0.0\n",
    "q = 3\n",
    "\n",
    "display(Markdown(\"**Reference set (R):**\"))\n",
    "for i, r in enumerate(reference_set):\n",
    "    display(Markdown(f\"- R[{i}]: “{r}”\"))\n",
    "display(Markdown(\"**Source sets (S):**\"))\n",
    "for j, S in enumerate(source_sets):\n",
    "    display(Markdown(f\"- S[{j}]: “{' | '.join(S)}”\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a50b350a",
   "metadata": {},
   "source": [
    "### 1. Tokenization\n",
    "Tokenize each element of R and of each source set in S for Jaccard Similarity (whitespace-delimited tokens).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "55e7b5d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Tokenized Reference set (R):**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of R[0]: {'Ave', 'MA', '77', 'Boston', 'Mass'}"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of R[1]: {'5th', 'Seattle', 'St', 'WA', '02115'}"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of R[2]: {'77', '5th', 'IL', 'St', 'Chicago'}"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Tokenized Source sets (S):**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of S[0]: [{'Ave', 'Boston', 'St', 'Mass', '02115'}, {'77', 'Boston', '5th', 'St', 'Mass'}, {'Ave', '77', '5th', 'Mass', '02115'}]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of S[1]: [{'Boston', 'MA', '77'}, {'77', 'Boston', '5th', 'St', '02115'}, {'Ave', '77', 'Seattle', 'Mass', '02115'}]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of S[2]: [{'Ave', 'MA', '77', 'Boston', '5th', 'Mass'}, {'IL', 'Ave', 'Mass', 'Chicago'}, {'St', 'Ave', 'Mass', '77'}]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Tokens of S[3]: [{'Ave', 'Mass', '77', 'MA'}, {'5th', 'Seattle', 'St', 'WA', '02115'}, {'77', 'Boston', '5th', 'Seattle', 'St'}]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "tokenizer = Tokenizer(jaccard_similarity, q)\n",
    "tokenized_R = tokenizer.tokenize(reference_set)\n",
    "tokenized_S = [tokenizer.tokenize(S) for S in source_sets]\n",
    "\n",
    "display(Markdown(\"**Tokenized Reference set (R):**\"))\n",
    "for i, toks in enumerate(tokenized_R):\n",
    "    display(Markdown(f\"- Tokens of R[{i}]: {toks}\"))\n",
    "\n",
    "display(Markdown(\"**Tokenized Source sets (S):**\"))\n",
    "for i, toks in enumerate(tokenized_S):\n",
    "    display(Markdown(f\"- Tokens of S[{i}]: {toks}\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e17b807b",
   "metadata": {},
   "source": [
    "### 2. Build Inverted Index\n",
    "Builds an inverted index on the tokenized source sets and shows an example lookup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "22c7d1d6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "- Index built over 4 source sets."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "- Example: token “Mass” appears in [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "index = InvertedIndex(tokenized_S)\n",
    "display(Markdown(f\"- Index built over {len(source_sets)} source sets.\"))\n",
    "display(Markdown(f\"- Example: token “Mass” appears in {index.get_indexes('Mass')}\"))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc17daac",
   "metadata": {},
   "source": [
    "### 3. Signature Generation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c48bac2",
   "metadata": {},
   "source": [
    "Generates the weighted signature for R given δ, α (here α=0), using Jaccard Similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "a36be65c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "- Selected signature tokens: **['Chicago', 'WA', 'IL', '5th']**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sig_gen = SignatureGenerator()\n",
    "signature = sig_gen.get_signature(\n",
    "    tokenized_R, index,\n",
    "    delta=δ, alpha=α,\n",
    "    sig_type=SigType.WEIGHTED,\n",
    "    sim_fun=jaccard_similarity,\n",
    "    q=q\n",
    ")\n",
    "display(Markdown(f\"- Selected signature tokens: **{signature}**\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "938be3e2",
   "metadata": {},
   "source": [
    "### 4. Initial Candidate Selection\n",
    "\n",
    "Looks up each signature token in the inverted index to form the candidate set.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "58017e27",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "- Candidate set indices: **[0, 1, 2, 3]**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  - S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  - S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  - S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "cand_sel = CandidateSelector(\n",
    "    similarity_func=jaccard_similarity,\n",
    "    sim_metric=contain,\n",
    "    related_thresh=δ,\n",
    "    sim_thresh=α,\n",
    "    q=q\n",
    ")\n",
    "\n",
    "initial_cands = cand_sel.get_candidates(signature, index, len(tokenized_R))\n",
    "display(Markdown(f\"- Candidate set indices: **{sorted(initial_cands)}**\"))\n",
    "for j in sorted(initial_cands):\n",
    "    display(Markdown(f\"  - S[{j}]: “{' | '.join(source_sets[j])}”\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d633e5f9",
   "metadata": {},
   "source": [
    "### 5. Check Filter\n",
    "Prunes candidates by ensuring each matched element passes the local similarity bound.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "9a2bfdeb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Surviving after check filter:** **[0, 1, 3]**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "S[0] matched:"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "    • R[2] “77 5th St Chicago IL” → sim = 0.429"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  → Best sim: **0.429** | Matched elements: **1**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "S[1] matched:"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "    • R[2] “77 5th St Chicago IL” → sim = 0.429"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  → Best sim: **0.429** | Matched elements: **1**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "S[3] matched:"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "    • R[1] “5th St 02115 Seattle WA” → sim = 1.000"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "    • R[2] “77 5th St Chicago IL” → sim = 0.429"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  → Best sim: **1.000** | Matched elements: **2**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "filtered_cands, match_map = cand_sel.check_filter(\n",
    "    tokenized_R, set(signature), initial_cands, index\n",
    ")\n",
    "display(Markdown(f\"**Surviving after check filter:** **{sorted(filtered_cands)}**\"))\n",
    "for j in sorted(filtered_cands):\n",
    "    display(Markdown(f\"S[{j}] matched:\"))\n",
    "    for r_idx, sim in match_map[j].items():\n",
    "        sim_text = f\"{sim:.3f}\"\n",
    "        display(Markdown(f\"    • R[{r_idx}] “{reference_set[r_idx]}” → sim = {sim_text}\"))\n",
    "    \n",
    "    matches = match_map.get(j, {})\n",
    "    if matches:\n",
    "        best_sim = max(matches.values())\n",
    "        num_matches = len(matches)\n",
    "        display(Markdown(f\"  → Best sim: **{best_sim:.3f}** | Matched elements: **{num_matches}**\"))\n",
    "    else:\n",
    "        display(Markdown(\"No elements passed similarity checks.\"))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc37bb7f",
   "metadata": {},
   "source": [
    "### 6. Nearest‑Neighbor Filter\n",
    "\n",
    "Further prunes via nearest‑neighbor upper bounds on total matching score.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "aa9b7a63",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "- Surviving after NN filter: **[3]**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "nn_filtered = cand_sel.nn_filter(\n",
    "    tokenized_R, set(signature), filtered_cands,\n",
    "    index, threshold=δ, match_map=match_map\n",
    ")\n",
    "display(Markdown(f\"- Surviving after NN filter: **{sorted(nn_filtered)}**\"))\n",
    "for j in nn_filtered:\n",
    "    display(Markdown(f\"  - S[{j}]: “{' | '.join(source_sets[j])}”\"))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8638f83a",
   "metadata": {},
   "source": [
    "### 7. Verification\n",
    "\n",
    "Runs the bipartite max‑matching on the remaining candidates and outputs the final related sets.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "ebdf20fe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Final related sets (score ≥ 0.7):"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "  • S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle” → **0.743**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "verifier = Verifier(δ, contain, jaccard_similarity, sim_thresh=α, reduction=False)\n",
    "results = verifier.get_related_sets(tokenized_R, nn_filtered, index)\n",
    "\n",
    "if results:\n",
    "    display(Markdown(f\"Final related sets (score ≥ {δ}):\"))\n",
    "    for j, score in results:\n",
    "        display(Markdown(f\"  • S[{j}]: “{' | '.join(source_sets[j])}” → **{score:.3f}**\"))\n",
    "else:\n",
    "    display(Markdown(\"- No sets passed verification.\"))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "silkmoth_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
BIN
docs/ImplementationPlan.pdf
Normal file
3
docs/README.md
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
The initial draft of the SilkMoth system and process was created using Draw.io. Refer to the file `SilkMoth.drawio` and its exported image, `SilkMoth.png`.
|
||||||
|
|
||||||
|
For a detailed implementation plan, refer to `plan.tex` and `ImplementationPlan.pdf`.
|
||||||
406
docs/SilkMoth.drawio
Normal file
@@ -0,0 +1,406 @@
|
|||||||
|
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" version="26.2.14">
|
||||||
|
<diagram name="Page-1" id="a6IaXev5Jbf4Zx6BKyVR">
|
||||||
|
<mxGraphModel dx="3390" dy="2158" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" background="#ffffff" math="0" shadow="0">
|
||||||
|
<root>
|
||||||
|
<mxCell id="0" />
|
||||||
|
<mxCell id="1" parent="0" />
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-159" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-1" target="rYVZWEPrfZzp95ZC9z8C-3">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="280" y="265" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-1" value="&lt;i&gt;R&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="196.75" y="30" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-153" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-2" target="rYVZWEPrfZzp95ZC9z8C-152">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-2" value="&lt;i&gt;S&lt;/i&gt; = {S1, S2, S3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-686" y="245" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-38" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=1;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-3" target="rYVZWEPrfZzp95ZC9z8C-36">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-3" value="Tokenize R" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="27.5" y="240" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-131" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-140">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="29.40000000000009" y="-149.99999999999977" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-132" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-136">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="198.29999999999973" y="-149.99999999999977" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-6" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="100" y="-170" width="40" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-167" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-14" target="rYVZWEPrfZzp95ZC9z8C-164">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-14" value="relatedness&amp;nbsp;&lt;div&gt;threshold &lt;span class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.6944em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0379em;&quot; class=&quot;mord mathnormal&quot;&gt;δ&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-511" y="675" width="200" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-155" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-22" target="rYVZWEPrfZzp95ZC9z8C-26">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-156" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-22" target="rYVZWEPrfZzp95ZC9z8C-24">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-22" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="100" y="-370" width="40" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-178" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-24" target="rYVZWEPrfZzp95ZC9z8C-152">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="-430" y="140" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="-390" y="-350" />
|
||||||
|
<mxPoint x="-390" y="40" />
|
||||||
|
<mxPoint x="-410" y="40" />
|
||||||
|
<mxPoint x="-410" y="70" />
|
||||||
|
<mxPoint x="-394" y="70" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-180" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-24" target="rYVZWEPrfZzp95ZC9z8C-6">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="-60" y="-220" />
|
||||||
|
<mxPoint x="120" y="-220" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-24" value="Jaccard&lt;div&gt;(whitespace words)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-144.5" y="-370" width="180" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-179" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-26" target="rYVZWEPrfZzp95ZC9z8C-152">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="-420" y="190" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="282" y="-400" />
|
||||||
|
<mxPoint x="-390" y="-400" />
|
||||||
|
<mxPoint x="-390" y="40" />
|
||||||
|
<mxPoint x="-410" y="40" />
|
||||||
|
<mxPoint x="-410" y="70" />
|
||||||
|
<mxPoint x="-394" y="70" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-181" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-26" target="rYVZWEPrfZzp95ZC9z8C-6">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="282" y="-220" />
|
||||||
|
<mxPoint x="120" y="-220" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-26" value=" Edit Similarity&lt;div&gt;(q-gram)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="196.75" y="-370" width="170" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-28" value="similarity&amp;nbsp;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot;&gt;threshold&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot; class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.4306em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0037em;&quot; class=&quot;mord mathnormal&quot;&gt;α&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot; class=&quot;katex&quot;&gt;&lt;span style=&quot;margin-right: 0.0037em;&quot; class=&quot;mord mathnormal&quot;&gt;baseline = 0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-291" y="-370" width="190" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-40" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-36" target="rYVZWEPrfZzp95ZC9z8C-39">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-168" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-36" target="rYVZWEPrfZzp95ZC9z8C-47">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-36" value="R Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="22.5" y="340" width="165" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-45" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-39" target="rYVZWEPrfZzp95ZC9z8C-44">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-39" value="Inverted Index Creation" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="27.5" y="505" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-69" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-44" target="rYVZWEPrfZzp95ZC9z8C-67">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-44" value="Inverted Index" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="320" y="500" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-170" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-47" target="rYVZWEPrfZzp95ZC9z8C-169">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-47" value="Signature Generation R<div>(weighted)</div>" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="280" y="335" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-68" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-63" target="rYVZWEPrfZzp95ZC9z8C-67">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-63" value="S Signatures" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="320" y="665" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-154" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-66" target="rYVZWEPrfZzp95ZC9z8C-22">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-66" value="Start" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.start_2;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="90" y="-530" width="60" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-67" value="Candidate Selection" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="555" y="670" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-71" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;Candidates&lt;div&gt;&lt;br&gt;&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="800" y="665" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-72" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-67" target="rYVZWEPrfZzp95ZC9z8C-71">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-103" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73" target="rYVZWEPrfZzp95ZC9z8C-87">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-73" value="Check Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="1110" y="670" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-100" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-73">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-105" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-100">
|
||||||
|
<mxGeometry x="-0.2471" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-107" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-106">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-108" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-107">
|
||||||
|
<mxGeometry x="-0.3013" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-77" value="Refinement" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="960" y="655" width="80" height="80" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-109" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-106">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1398" y="835" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-87" value="NN Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="1320" y="670" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-99" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-71" target="rYVZWEPrfZzp95ZC9z8C-77">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-116" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-106" target="rYVZWEPrfZzp95ZC9z8C-115">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-106" value="Verification" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="922.5" y="810" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-110" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-71">
|
||||||
|
<mxGeometry width="50" height="50" relative="1" as="geometry">
|
||||||
|
<mxPoint x="1350" y="620" as="sourcePoint" />
|
||||||
|
<mxPoint x="1000" y="600" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1398" y="600" />
|
||||||
|
<mxPoint x="845" y="600" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-114" value="update" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-110">
|
||||||
|
<mxGeometry x="0.5022" y="-2" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-113" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73">
|
||||||
|
<mxGeometry width="50" height="50" relative="1" as="geometry">
|
||||||
|
<mxPoint x="1180" y="670" as="sourcePoint" />
|
||||||
|
<mxPoint x="1188" y="600" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-118" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-117">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-121" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-118">
|
||||||
|
<mxGeometry x="0.0133" y="-3" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-122" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-119">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-123" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-122">
|
||||||
|
<mxGeometry x="-0.2333" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-115" value="use triangle optimization" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="955" y="920" width="90" height="100" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-124" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-117" target="rYVZWEPrfZzp95ZC9z8C-119">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1198" y="1095" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-117" value="Triangle Optimization" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="1120" y="945" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-127" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-119" target="rYVZWEPrfZzp95ZC9z8C-126">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-119" value="Create Bipartite Matching Graph" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="931.25" y="1070" width="137.5" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-120" value="" style="endArrow=none;dashed=1;html=1;dashPattern=1 3;strokeWidth=2;rounded=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" target="rYVZWEPrfZzp95ZC9z8C-71">
|
||||||
|
<mxGeometry width="50" height="50" relative="1" as="geometry">
|
||||||
|
<mxPoint x="930" y="1095" as="sourcePoint" />
|
||||||
|
<mxPoint x="845" y="730" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="845" y="1095" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-125" value="using" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-120">
|
||||||
|
<mxGeometry x="-0.2193" y="1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-185" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-126" target="rYVZWEPrfZzp95ZC9z8C-182">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-126" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;Related SETS&lt;/div&gt;&lt;div&gt;(R,S)&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="955" y="1160" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-128" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="965" y="1420" width="70" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-138" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-136" target="rYVZWEPrfZzp95ZC9z8C-1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-136" value="RELATED SET&amp;nbsp;&lt;div&gt;SEARCH (target)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="199.25" y="-175" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-143" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-139" target="rYVZWEPrfZzp95ZC9z8C-142">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-144" value="x = 1" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-143">
|
||||||
|
<mxGeometry x="-0.24" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-139" value="&lt;i&gt;R&lt;/i&gt; = {R1, R2, R3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-115" y="-70" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-141" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-140" target="rYVZWEPrfZzp95ZC9z8C-139">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-140" value="RELATED SET&amp;nbsp;&lt;div&gt;DISCOVERY (general)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-110" y="-175" width="150" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-146" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-142" target="rYVZWEPrfZzp95ZC9z8C-145">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-142" value="Take SET Rx" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-87.5" y="30" width="105" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-160" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-145" target="rYVZWEPrfZzp95ZC9z8C-3">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-145" value="&lt;i&gt;Rx&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-115" y="120" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-149" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-142">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-191" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-149">
|
||||||
|
<mxGeometry x="-0.4171" y="2" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-148" value="x &lt;= R.length" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-260" y="20" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-150" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-250" y="120" width="70" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-151" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-150">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-192" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-151">
|
||||||
|
<mxGeometry x="-0.1457" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-162" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-152" target="rYVZWEPrfZzp95ZC9z8C-161">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-152" value="Tokenize S" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-471" y="240" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-163" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-161" target="rYVZWEPrfZzp95ZC9z8C-39">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="-179" y="530" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-165" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-161" target="rYVZWEPrfZzp95ZC9z8C-164">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-161" value="S Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-476" y="340" width="165" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-164" value="Signature Generation S&lt;div&gt;(weighted)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-256" y="670" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-166" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-164" target="rYVZWEPrfZzp95ZC9z8C-63">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-171" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-169" target="rYVZWEPrfZzp95ZC9z8C-67">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-169" value="R Signature" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="550" y="340" width="165" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-173" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=1;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-172" target="rYVZWEPrfZzp95ZC9z8C-47">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-172" value="relatedness&amp;nbsp;&lt;div&gt;threshold &lt;span class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.6944em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0379em;&quot; class=&quot;mord mathnormal&quot;&gt;δ&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="257.5" y="404" width="200" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-187" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-186">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-190" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-187">
|
||||||
|
<mxGeometry x="-0.184" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-182" value="DISCOVERY&lt;div&gt;Mode&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="945" y="1270" width="110" height="80" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-183" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-128">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-184" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" vertex="1" connectable="0" parent="rYVZWEPrfZzp95ZC9z8C-183">
|
||||||
|
<mxGeometry x="-0.1257" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-188" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-186" target="rYVZWEPrfZzp95ZC9z8C-148">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="-760" y="50" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="-759" y="1310" />
|
||||||
|
<mxPoint x="-759" y="50" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-186" value="Increment x" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="715" y="1285" width="137.5" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
</root>
|
||||||
|
</mxGraphModel>
|
||||||
|
</diagram>
|
||||||
|
</mxfile>
|
||||||
BIN
docs/SilkMoth.png
Normal file
|
Size: 250 KiB
494
docs/SilkMoth_v2.drawio
Normal file
@@ -0,0 +1,494 @@
|
|||||||
|
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36" version="24.8.6">
|
||||||
|
<diagram name="Page-1" id="a6IaXev5Jbf4Zx6BKyVR">
|
||||||
|
<mxGraphModel dx="3785" dy="2313" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" background="#ffffff" math="0" shadow="0">
|
||||||
|
<root>
|
||||||
|
<mxCell id="0" />
|
||||||
|
<mxCell id="1" parent="0" />
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-120" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-1" target="rYVZWEPrfZzp95ZC9z8C-3">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="280" y="150" />
|
||||||
|
<mxPoint x="111" y="150" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-1" value="&lt;i&gt;R&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="190.75" y="-80" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-122" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-3">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="110.5" y="340" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-3" value="Tokenize R" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="33" y="240" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-131" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-140" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="33.40000000000009" y="-354.9999999999998" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-132" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-6" target="rYVZWEPrfZzp95ZC9z8C-136" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="202.29999999999973" y="-354.9999999999998" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-6" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="104" y="-375" width="40" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-123" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-36">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="105.5" y="420" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-36" value="R Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="23" y="343" width="165" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-102" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-66" target="W6bMp2RoBO1kHS_2JlRQ-100">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-66" value="Start" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.start_2;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="90" y="-610" width="60" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-67" value="Candidate Selection" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1002" y="99" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-71" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;Candidates&lt;div&gt;&lt;br&gt;&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1395" y="-45" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-103" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73" target="rYVZWEPrfZzp95ZC9z8C-87" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-73" value="Check Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1705" y="-40" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-100" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-73" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-105" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-100" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.2471" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-107" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-77" target="rYVZWEPrfZzp95ZC9z8C-106" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-108" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-107" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.3013" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-77" value="Refinement" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1555" y="-55" width="80" height="80" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-109" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-106" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1993" y="125" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-87" value="NN Filter" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1915" y="-40" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-99" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-71" target="rYVZWEPrfZzp95ZC9z8C-77" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-116" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-106" target="rYVZWEPrfZzp95ZC9z8C-115" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-106" value="Verification" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1517.5" y="100" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-110" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-87" target="rYVZWEPrfZzp95ZC9z8C-71" edge="1">
|
||||||
|
<mxGeometry width="50" height="50" relative="1" as="geometry">
|
||||||
|
<mxPoint x="1945" y="-90" as="sourcePoint" />
|
||||||
|
<mxPoint x="1595" y="-110" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1993" y="-110" />
|
||||||
|
<mxPoint x="1440" y="-110" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-114" value="update" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-110" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="0.5022" y="-2" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-113" value="" style="endArrow=none;dashed=1;html=1;rounded=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-73" edge="1">
|
||||||
|
<mxGeometry width="50" height="50" relative="1" as="geometry">
|
||||||
|
<mxPoint x="1775" y="-40" as="sourcePoint" />
|
||||||
|
<mxPoint x="1783" y="-110" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-118" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-117" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-121" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-118" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="0.0133" y="-3" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-122" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-115" target="rYVZWEPrfZzp95ZC9z8C-119" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-123" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-122" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.2333" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-115" value="use triangle optimization" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1550" y="210" width="90" height="100" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-124" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-117" target="rYVZWEPrfZzp95ZC9z8C-119" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1793" y="385" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-117" value="Triangle Optimization" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1715" y="235" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-63" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-119" target="W6bMp2RoBO1kHS_2JlRQ-64">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="1595" y="460" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-119" value="Create Bipartite Matching Graph" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1526.25" y="360" width="137.5" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-120" value="" style="endArrow=none;dashed=1;html=1;dashPattern=1 3;strokeWidth=2;rounded=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" target="rYVZWEPrfZzp95ZC9z8C-71" edge="1">
|
||||||
|
<mxGeometry width="50" height="50" relative="1" as="geometry">
|
||||||
|
<mxPoint x="1525" y="385" as="sourcePoint" />
|
||||||
|
<mxPoint x="1440" y="20" as="targetPoint" />
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1440" y="385" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-125" value="using" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-120" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.2193" y="1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-185" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-126" target="rYVZWEPrfZzp95ZC9z8C-182" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-126" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;Related SETS&lt;/div&gt;&lt;div&gt;(R,S)&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1550" y="600" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-128" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1560" y="860" width="70" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-83" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-136" target="W6bMp2RoBO1kHS_2JlRQ-82">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-136" value="RELATED SET&amp;nbsp;&lt;div&gt;SEARCH (target)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="203.25" y="-380" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-143" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-139" target="rYVZWEPrfZzp95ZC9z8C-142" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-144" value="x = 1" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-143" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.24" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-139" value="&lt;i&gt;R&lt;/i&gt; = {R1, R2, R3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="-121" y="-180" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-85" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-140" target="W6bMp2RoBO1kHS_2JlRQ-81">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-140" value="RELATED SET&amp;nbsp;&lt;div&gt;DISCOVERY (general)&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="-106" y="-380" width="150" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-146" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-142" target="rYVZWEPrfZzp95ZC9z8C-145" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-142" value="Take SET Rx" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="-93.5" y="-80" width="105" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-119" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-145" target="rYVZWEPrfZzp95ZC9z8C-3">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-145" value="&lt;i&gt;Rx&lt;/i&gt; = {r1, r2, r3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="-121" y="10" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-149" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-142" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-191" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-149" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.4171" y="2" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-148" value="x &lt;= R.length" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="-266" y="-90" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-150" value="END" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.terminator;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="-256" y="10" width="70" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-151" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-148" target="rYVZWEPrfZzp95ZC9z8C-150" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-192" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-151" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.1457" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-136" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.25;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-169" target="rYVZWEPrfZzp95ZC9z8C-67">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-169" value="R Signature" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="694.08" y="230" width="165" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-187" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-186" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-190" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-187" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.184" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-182" value="DISCOVERY&lt;div&gt;Mode&lt;/div&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1540" y="710" width="110" height="80" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-183" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;labelBackgroundColor=none;fontColor=default;" parent="1" source="rYVZWEPrfZzp95ZC9z8C-182" target="rYVZWEPrfZzp95ZC9z8C-128" edge="1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-184" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];labelBackgroundColor=none;" parent="rYVZWEPrfZzp95ZC9z8C-183" vertex="1" connectable="0">
|
||||||
|
<mxGeometry x="-0.1257" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-155" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-186" target="rYVZWEPrfZzp95ZC9z8C-148">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="-610" y="750" />
|
||||||
|
<mxPoint x="-610" y="-60" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="rYVZWEPrfZzp95ZC9z8C-186" value="Increment x" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" parent="1" vertex="1">
|
||||||
|
<mxGeometry x="1310" y="725" width="137.5" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-139" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-23">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="270.75000000000045" y="530" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-141" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-139">
|
||||||
|
<mxGeometry x="-0.2758" y="-1" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-152" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-23" target="rYVZWEPrfZzp95ZC9z8C-169">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="271" y="250" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-153" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-152">
|
||||||
|
<mxGeometry x="-0.873" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-23" value="&lt;font style=&quot;font-size: 8px;&quot;&gt;alpha = 0?&lt;/font&gt;" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="241.25" y="418" width="59" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-124" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-25">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="240" y="448" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-25" value="Weighted Signature Generation R" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="23.5" y="423" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-31" value="Sim-thresh Signature Scheme" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="195.75" y="534" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-143" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-41" target="W6bMp2RoBO1kHS_2JlRQ-54">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-144" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-143">
|
||||||
|
<mxGeometry x="0.1529" y="2" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-149" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-41" target="W6bMp2RoBO1kHS_2JlRQ-146">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="550" y="345" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-150" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-149">
|
||||||
|
<mxGeometry x="-0.8299" y="2" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-41" value="&lt;font style=&quot;font-size: 9px;&quot;&gt;Optimization?&lt;/font&gt;" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.decision;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="510.00000000000006" y="520" width="80" height="80" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-147" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-47">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="771.5799999999999" y="370" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-47" value="Dichotomy Signature Scheme" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="694.08" y="440" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-148" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-53" target="W6bMp2RoBO1kHS_2JlRQ-146">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="910" y="345" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-53" value="Skyline Signature Scheme" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="860" y="535" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-59" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-54" target="W6bMp2RoBO1kHS_2JlRQ-53">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-145" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-54" target="W6bMp2RoBO1kHS_2JlRQ-47">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-54" value="OR" style="rhombus;whiteSpace=wrap;html=1;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="745.33" y="534.5" width="52.5" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-64" value="&amp;nbsp;&lt;font style=&quot;font-size: 9px;&quot;&gt;relatedness&amp;nbsp;&lt;/font&gt;&lt;div&gt;&lt;font style=&quot;font-size: 9px;&quot;&gt;≥ δ&lt;/font&gt;&lt;/div&gt;" style="rhombus;whiteSpace=wrap;html=1;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="1555" y="450" width="80" height="80" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-65" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-64" target="rYVZWEPrfZzp95ZC9z8C-126">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-67" value="Yes" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-65">
|
||||||
|
<mxGeometry x="-0.1543" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-68" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-64" target="rYVZWEPrfZzp95ZC9z8C-128">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1689" y="490" />
|
||||||
|
<mxPoint x="1689" y="880" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-69" value="No" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="W6bMp2RoBO1kHS_2JlRQ-68">
|
||||||
|
<mxGeometry x="0.4329" y="-3" relative="1" as="geometry">
|
||||||
|
<mxPoint as="offset" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-114" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-78">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="490" y="-160" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-78" value="&lt;i&gt;S&lt;/i&gt; = {S1, S2, S3, ...}" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="39" y="-180" width="160" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-90" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-81" target="W6bMp2RoBO1kHS_2JlRQ-78">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-81" value="AND" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-51" y="-290" width="40" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-89" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-82" target="W6bMp2RoBO1kHS_2JlRQ-78">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-82" value="AND" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="260.75" y="-290" width="40" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-93" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.563;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-82" target="rYVZWEPrfZzp95ZC9z8C-1">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-94" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.563;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-81" target="rYVZWEPrfZzp95ZC9z8C-139">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-103" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-95" target="rYVZWEPrfZzp95ZC9z8C-6">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="-50" y="-420" />
|
||||||
|
<mxPoint x="124" y="-420" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-95" value="Jaccard&lt;div&gt;(whitespace words)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-130.5" y="-490" width="180" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-104" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-96" target="rYVZWEPrfZzp95ZC9z8C-6">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="306" y="-420" />
|
||||||
|
<mxPoint x="124" y="-420" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-96" value="Edit Similarity&lt;div&gt;(q-gram)&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="221" y="-490" width="170" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-97" value="similarity&amp;nbsp;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot;&gt;threshold&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot; class=&quot;katex&quot;&gt;&lt;span style=&quot;height: 0.4306em;&quot; class=&quot;strut&quot;&gt;&lt;/span&gt;&lt;span style=&quot;margin-right: 0.0037em;&quot; class=&quot;mord mathnormal&quot;&gt;α&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span style=&quot;background-color: transparent; color: light-dark(rgb(0, 0, 0), rgb(255, 255, 255));&quot; class=&quot;katex&quot;&gt;&lt;span style=&quot;margin-right: 0.0037em;&quot; class=&quot;mord mathnormal&quot;&gt;baseline = 0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="-277" y="-490" width="190" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-98" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-100" target="W6bMp2RoBO1kHS_2JlRQ-96">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-99" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-100" target="W6bMp2RoBO1kHS_2JlRQ-95">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-100" value="OR" style="rhombus;whiteSpace=wrap;html=1;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="100" y="-490" width="40" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-108" value="Inverted Index Creation" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="500" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-137" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-109">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="1000.0000000000005" y="130" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-109" value="Inverted Index" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="532.5" y="100" width="90" height="60" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-115" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-111">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<mxPoint x="568.5" y="-100" as="targetPoint" />
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-111" value="Tokenize S" style="rounded=1;whiteSpace=wrap;html=1;absoluteArcSize=1;arcSize=14;strokeWidth=2;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="491" y="-189" width="155" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-112" value="S Tokens" style="shape=parallelogram;html=1;strokeWidth=2;perimeter=parallelogramPerimeter;whiteSpace=wrap;rounded=1;arcSize=12;size=0.23;direction=west;labelBackgroundColor=none;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="490" y="-99" width="165" height="40" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-117" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.462;entryY=-0.033;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-112" target="W6bMp2RoBO1kHS_2JlRQ-108">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-118" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-108" target="W6bMp2RoBO1kHS_2JlRQ-109">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-138" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-31" target="W6bMp2RoBO1kHS_2JlRQ-41">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-146" value="" style="rhombus;whiteSpace=wrap;html=1;" vertex="1" parent="1">
|
||||||
|
<mxGeometry x="745.33" y="320" width="50" height="50" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-151" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.54;entryY=-0.088;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="W6bMp2RoBO1kHS_2JlRQ-146" target="rYVZWEPrfZzp95ZC9z8C-169">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="W6bMp2RoBO1kHS_2JlRQ-154" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="rYVZWEPrfZzp95ZC9z8C-67" target="rYVZWEPrfZzp95ZC9z8C-71">
|
||||||
|
<mxGeometry relative="1" as="geometry">
|
||||||
|
<Array as="points">
|
||||||
|
<mxPoint x="1090" y="-15" />
|
||||||
|
</Array>
|
||||||
|
</mxGeometry>
|
||||||
|
</mxCell>
|
||||||
|
</root>
|
||||||
|
</mxGraphModel>
|
||||||
|
</diagram>
|
||||||
|
</mxfile>
|
||||||
BIN
docs/SilkMoth_v2.png
Normal file
|
After Width: | Height: | Size: 302 KiB |
BIN
docs/figures/Pipeline.png
Normal file
|
After Width: | Height: | Size: 230 KiB |
99
docs/plan.tex
Normal file
@@ -0,0 +1,99 @@
|
|||||||
|
\documentclass[a4paper]{article}
|
||||||
|
\usepackage{graphicx} % Required for inserting images
|
||||||
|
\usepackage{pgfgantt}
|
||||||
|
\usepackage{hyperref}
|
||||||
|
|
||||||
|
\title{Implementation Plan - Student Project SilkMoth}
|
||||||
|
\date{April 2025}
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
|
||||||
|
\maketitle
|
||||||
|
|
||||||
|
Figure~\ref{fig:plan} shows a more detailed version of our initial project plan. Some tasks may take longer or finish earlier than the plan assumes, and we will adjust the plan according to our resources. We aim to parallelize the implementation tasks whenever possible. The project is split into three phases:
|
||||||
|
|
||||||
|
\begin{enumerate}
|
||||||
|
\item \textbf{(17.04 - 15.05)} - Core Pipeline
|
||||||
|
\begin{itemize}
|
||||||
|
\item Get a common understanding of the system
|
||||||
|
\item Implement the main components without major optimization
|
||||||
|
\item Prepare a small data set to test correctness and larger data sets for the evaluation phase
|
||||||
|
\item Goal: Runnable code for at least the base case (single search pass, similarity threshold $\alpha = 0$, similarity function $\phi = \texttt{Jac}$)
|
||||||
|
\end{itemize}
|
||||||
|
\item \textbf{(16.05 - 12.06)} - Extended Framework
|
||||||
|
\begin{itemize}
|
||||||
|
\item Improve the core pipeline
|
||||||
|
\item Refinement and optimization
|
||||||
|
\item Support for discovery mode, $\alpha \neq 0$, $\phi = \texttt{Eds}$, and $\phi = \texttt{NEds}$
|
||||||
|
\item Goal: Most features should be finalized and ready for expert review
|
||||||
|
\end{itemize}
|
||||||
|
\item \textbf{(13.06 - 24.07)} - Evaluation
|
||||||
|
\begin{itemize}
|
||||||
|
\item Improve the system based on the review feedback and finalize the remaining functionality
|
||||||
|
\item Implement the applications to conduct experiments
|
||||||
|
\item Visualize experiment results
|
||||||
|
\item Write report/documentation
|
||||||
|
\item Consider bonus improvements, e.g., additional data sets like GitTables\footnote{\url{https://gittables.github.io/}} or additional similarity functions like Hamming similarity\footnote{\url{https://en.wikipedia.org/wiki/Hamming_distance}}
|
||||||
|
\item Goal: Presentation and submission of the final system
|
||||||
|
\end{itemize}
|
||||||
|
\end{enumerate}
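
As a brief reminder for the base case of phase~1, and assuming the standard definition (the names $x$ and $y$ for two token sets are illustrative and not taken from the plan itself), the Jaccard similarity can be written as
\[
  \texttt{Jac}(x, y) = \frac{|x \cap y|}{|x \cup y|}.
\]
For instance, with whitespace tokenization, the strings ``77 Mass Ave Boston MA'' and ``77 Boston MA'' yield token sets that share $3$ tokens and whose union contains $5$ tokens, so $\texttt{Jac} = 3/5 = 0.6$.
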
|
||||||
|
|
||||||
|
|
||||||
|
\begin{figure}[b!]
|
||||||
|
\begin{ganttchart}[
|
||||||
|
vgrid, hgrid,
|
||||||
|
x unit=0.5cm,
|
||||||
|
y unit title=0.75cm,
|
||||||
|
y unit chart=0.5cm,
|
||||||
|
title height=1,
|
||||||
|
milestone left shift=.1,
|
||||||
|
milestone right shift=-.1,
|
||||||
|
group left shift=0,
|
||||||
|
group right shift=0,
|
||||||
|
group peaks tip position=0,
|
||||||
|
group peaks height=0.2,
|
||||||
|
title label font=\small,
|
||||||
|
bar label font=\small,
|
||||||
|
group label font=\small\bfseries,
|
||||||
|
milestone label font=\small\itshape,
|
||||||
|
]{1}{14}
|
||||||
|
\gantttitle[]{Project Plan [weeks]}{14} \\
|
||||||
|
\gantttitlelist{1,...,14}{1} \\
|
||||||
|
|
||||||
|
\ganttgroup{Milestone 1: Core Pipeline}{1}{4} \\
|
||||||
|
\ganttbar{Understand SilkMoth}{1}{1} \\
|
||||||
|
\ganttbar{System design of core pipeline}{2}{2} \\
|
||||||
|
\ganttbar{Data collection/preparation}{2}{4} \\
|
||||||
|
\ganttbar{Tokenizer}{3}{4} \\
|
||||||
|
\ganttbar{Inverted Index}{3}{4} \\
|
||||||
|
\ganttbar{Signature Generator}{3}{4} \\
|
||||||
|
\ganttbar{Maximum Matching Verification}{3}{4} \\
|
||||||
|
\ganttmilestone{Milestone 1 done}{4} \\
|
||||||
|
|
||||||
|
\ganttgroup{Milestone 2: Extended Framework}{5}{8} \\
|
||||||
|
\ganttbar{Discovery Mode}{5}{6} \\
|
||||||
|
\ganttbar{Check Filter}{5}{6} \\
|
||||||
|
\ganttbar{Nearest Neighbor Filter}{6}{7} \\
|
||||||
|
\ganttbar{Triangle Optimization}{6}{7} \\
|
||||||
|
\ganttbar{Support for $\alpha \neq 0$}{6}{8}\\
|
||||||
|
\ganttbar{Edit Similarity}{7}{8}\\
|
||||||
|
\ganttbar{Prepare for Experiments}{7}{8}\\
|
||||||
|
\ganttbar{Prepare for expert review}{8}{8} \\
|
||||||
|
\ganttmilestone{Milestone 2 done}{8} \\
|
||||||
|
|
||||||
|
\ganttgroup{Milestone 3: Evaluation}{9}{14} \\
|
||||||
|
\ganttbar{Improve system using feedback}{9}{9} \\
|
||||||
|
\ganttbar{Experiments: Inclusion Dependency}{9}{12} \\
|
||||||
|
\ganttbar{Experiments: String Matching}{9}{12} \\
|
||||||
|
\ganttbar{Experiments: Schema Matching}{9}{12} \\
|
||||||
|
\ganttbar{(Bonus)}{11}{12} \\
|
||||||
|
\ganttbar[bar/.append style={fill=gray, solid}]{Finalize Visualization and Documentation}{12}{14} \\
|
||||||
|
\ganttbar[bar/.append style={fill=gray, solid}]{Prepare presentation}{13}{14} \\
|
||||||
|
\ganttmilestone{Milestone 3 done}{14} \\
|
||||||
|
\ganttmilestone{Project done}{14}
|
||||||
|
\end{ganttchart}
|
||||||
|
\caption{Implementation plan. Week 1 starts on 17.04.2025.}
|
||||||
|
\label{fig:plan}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
|
\end{document}
|
||||||
8
docu/README.md
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
### Generating Documentation Page
|
||||||
|
|
||||||
|
To generate the [documentation page](https://berscjak.github.io/) from the source code with MkDocs, install the dependencies and start a local preview server by running the following from the repository root:
|
||||||
|
|
||||||
|
```
|
||||||
|
pip install mkdocs "mkdocstrings[python]" mkdocs-awesome-pages-plugin
|
||||||
|
mkdocs serve
|
||||||
|
```
|
||||||
823
docu/demo_example.ipynb
Normal file
@@ -0,0 +1,823 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "c9f89a47",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## SilkMoth Demo"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "2ca15800",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Related Set Discovery Task under Set-Containment Using Jaccard Similarity"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "ea6ce5fb",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Import all required modules:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 24,
|
||||||
|
"id": "bdd1b92c",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import sys\n",
|
||||||
|
"sys.path.append(\"src\")\n",
|
||||||
|
"\n",
|
||||||
|
"from silkmoth.tokenizer import Tokenizer\n",
|
||||||
|
"from silkmoth.inverted_index import InvertedIndex\n",
|
||||||
|
"from silkmoth.signature_generator import SignatureGenerator\n",
|
||||||
|
"from silkmoth.candidate_selector import CandidateSelector\n",
|
||||||
|
"from silkmoth.verifier import Verifier\n",
|
||||||
|
"from silkmoth.silkmoth_engine import SilkMothEngine\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"from silkmoth.utils import jaccard_similarity, contain, edit_similarity, similar, SigType\n",
|
||||||
|
"\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"from IPython.display import display, Markdown\n",
|
||||||
|
"\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import pandas as pd"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "bf6bf1f5",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Define the example dataset from the SilkMoth paper (reference set **R** and source sets **S**):\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 25,
|
||||||
|
"id": "598a4bbf",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"**Reference set (R):**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- R[0]: “77 Mass Ave Boston MA”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- R[1]: “5th St 02115 Seattle WA”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- R[2]: “77 5th St Chicago IL”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"**Source sets (S):**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# Location Dataset\n",
"reference_set = [\n",
"    '77 Mass Ave Boston MA',\n",
"    '5th St 02115 Seattle WA',\n",
"    '77 5th St Chicago IL'\n",
"]\n",
"\n",
"# Address Dataset\n",
"source_sets = [\n",
"    ['Mass Ave St Boston 02115','77 Mass 5th St Boston','77 Mass Ave 5th 02115'],\n",
"    ['77 Boston MA','77 5th St Boston 02115','77 Mass Ave 02115 Seattle'],\n",
"    ['77 Mass Ave 5th Boston MA','Mass Ave Chicago IL','77 Mass Ave St'],\n",
"    ['77 Mass Ave MA','5th St 02115 Seattle WA','77 5th St Boston Seattle']\n",
"]\n",
"\n",
"# Relatedness threshold δ, similarity threshold α, q-gram size q\n",
"δ = 0.7\n",
"α = 0.0\n",
"q = 3\n",
"\n",
"display(Markdown(\"**Reference set (R):**\"))\n",
"for i, r in enumerate(reference_set):\n",
"    display(Markdown(f\"- R[{i}]: “{r}”\"))\n",
"display(Markdown(\"**Source sets (S):**\"))\n",
"for j, S in enumerate(source_sets):\n",
"    display(Markdown(f\"- S[{j}]: “{' | '.join(S)}”\"))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "a50b350a",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 1. Tokenization\n",
|
||||||
|
"Tokenize each element of R and of each source set S into whitespace-separated tokens, the token type used with Jaccard similarity.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 26,
|
||||||
|
"id": "55e7b5d0",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"**Tokenized Reference set (R):**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of R[0]: {'Ave', 'MA', '77', 'Boston', 'Mass'}"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of R[1]: {'5th', 'Seattle', 'St', 'WA', '02115'}"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of R[2]: {'77', '5th', 'IL', 'St', 'Chicago'}"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"**Tokenized Source sets (S):**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of S[0]: [{'Ave', 'Boston', 'St', 'Mass', '02115'}, {'77', 'Boston', '5th', 'St', 'Mass'}, {'Ave', '77', '5th', 'Mass', '02115'}]"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of S[1]: [{'Boston', 'MA', '77'}, {'77', 'Boston', '5th', 'St', '02115'}, {'Ave', '77', 'Seattle', 'Mass', '02115'}]"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of S[2]: [{'Ave', 'MA', '77', 'Boston', '5th', 'Mass'}, {'IL', 'Ave', 'Mass', 'Chicago'}, {'St', 'Ave', 'Mass', '77'}]"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Tokens of S[3]: [{'Ave', 'Mass', '77', 'MA'}, {'5th', 'Seattle', 'St', 'WA', '02115'}, {'77', 'Boston', '5th', 'Seattle', 'St'}]"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"tokenizer = Tokenizer(jaccard_similarity, q)\n",
"tokenized_R = tokenizer.tokenize(reference_set)\n",
"tokenized_S = [tokenizer.tokenize(S) for S in source_sets]\n",
"\n",
"display(Markdown(\"**Tokenized Reference set (R):**\"))\n",
"for i, toks in enumerate(tokenized_R):\n",
"    display(Markdown(f\"- Tokens of R[{i}]: {toks}\"))\n",
"\n",
"display(Markdown(\"**Tokenized Source sets (S):**\"))\n",
"for i, toks in enumerate(tokenized_S):\n",
"    display(Markdown(f\"- Tokens of S[{i}]: {toks}\"))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "e17b807b",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 2. Build Inverted Index\n",
|
||||||
|
"Builds an inverted index on the tokenized source sets and shows an example lookup."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 27,
|
||||||
|
"id": "22c7d1d6",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Index built over 4 source sets."
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Example: token “Mass” appears in [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"index = InvertedIndex(tokenized_S)\n",
|
||||||
|
"display(Markdown(f\"- Index built over {len(source_sets)} source sets.\"))\n",
|
||||||
|
"display(Markdown(f\"- Example: token “Mass” appears in {index.get_indexes('Mass')}\"))\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "cc17daac",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 3. Signature Generation"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "1c48bac2",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Generates the weighted signature for R given δ, α (here α=0), using Jaccard Similarity."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 28,
|
||||||
|
"id": "a36be65c",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Selected signature tokens: **['Chicago', 'WA', 'IL', '5th']**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"sig_gen = SignatureGenerator()\n",
"signature = sig_gen.get_signature(\n",
"    tokenized_R, index,\n",
"    delta=δ, alpha=α,\n",
"    sig_type=SigType.WEIGHTED,\n",
"    sim_fun=jaccard_similarity,\n",
"    q=q\n",
")\n",
"display(Markdown(f\"- Selected signature tokens: **{signature}**\"))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "938be3e2",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 4. Initial Candidate Selection\n",
|
||||||
|
"\n",
|
||||||
|
"Looks up each signature token in the inverted index to form the candidate set.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 29,
|
||||||
|
"id": "58017e27",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Candidate set indices: **[0, 1, 2, 3]**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" - S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" - S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" - S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"cand_sel = CandidateSelector(\n",
"    similarity_func=jaccard_similarity,\n",
"    sim_metric=contain,\n",
"    related_thresh=δ,\n",
"    sim_thresh=α,\n",
"    q=q\n",
")\n",
"\n",
"initial_cands = cand_sel.get_candidates(signature, index, len(tokenized_R))\n",
"display(Markdown(f\"- Candidate set indices: **{sorted(initial_cands)}**\"))\n",
"for j in sorted(initial_cands):\n",
"    display(Markdown(f\" - S[{j}]: “{' | '.join(source_sets[j])}”\"))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "d633e5f9",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 5. Check Filter\n",
|
||||||
|
"Prunes candidates by ensuring each matched element passes the local similarity bound.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 30,
|
||||||
|
"id": "9a2bfdeb",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"**Surviving after check filter:** **[0, 1, 3]**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"S[0] matched:"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" → Best sim: **0.429** | Matched elements: **1**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"S[1] matched:"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" → Best sim: **0.429** | Matched elements: **1**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"S[3] matched:"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" • R[1] “5th St 02115 Seattle WA” → sim = 1.000"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" • R[2] “77 5th St Chicago IL” → sim = 0.429"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" → Best sim: **1.000** | Matched elements: **2**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"filtered_cands, match_map = cand_sel.check_filter(\n",
"    tokenized_R, set(signature), initial_cands, index\n",
")\n",
"display(Markdown(f\"**Surviving after check filter:** **{sorted(filtered_cands)}**\"))\n",
"for j in sorted(filtered_cands):\n",
"    display(Markdown(f\"S[{j}] matched:\"))\n",
"    for r_idx, sim in match_map[j].items():\n",
"        sim_text = f\"{sim:.3f}\"\n",
"        display(Markdown(f\" • R[{r_idx}] “{reference_set[r_idx]}” → sim = {sim_text}\"))\n",
"\n",
"    matches = match_map.get(j, {})\n",
"    if matches:\n",
"        best_sim = max(matches.values())\n",
"        num_matches = len(matches)\n",
"        display(Markdown(f\" → Best sim: **{best_sim:.3f}** | Matched elements: **{num_matches}**\"))\n",
"    else:\n",
"        display(Markdown(f\"No elements passed similarity checks.\"))\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "cc37bb7f",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 6. Nearest‑Neighbor Filter\n",
|
||||||
|
"\n",
|
||||||
|
"Further prunes via nearest‑neighbor upper bounds on total matching score.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 31,
|
||||||
|
"id": "aa9b7a63",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"- Surviving after NN filter: **[3]**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"nn_filtered = cand_sel.nn_filter(\n",
"    tokenized_R, set(signature), filtered_cands,\n",
"    index, threshold=δ, match_map=match_map\n",
")\n",
"display(Markdown(f\"- Surviving after NN filter: **{sorted(nn_filtered)}**\"))\n",
"for j in nn_filtered:\n",
"    display(Markdown(f\" - S[{j}]: “{' | '.join(source_sets[j])}”\"))\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "8638f83a",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### 7. Verification\n",
|
||||||
|
"\n",
|
||||||
|
"Runs maximum-weight bipartite matching on the remaining candidates and outputs the final related sets.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 32,
|
||||||
|
"id": "ebdf20fe",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"Final related sets (score ≥ 0.7):"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
" • S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle” → **0.743**"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"verifier = Verifier(δ, contain, jaccard_similarity, sim_thresh=α, reduction=False)\n",
"results = verifier.get_related_sets(tokenized_R, nn_filtered, index)\n",
"\n",
"if results:\n",
"    display(Markdown(f\"Final related sets (score ≥ {δ}):\"))\n",
"    for j, score in results:\n",
"        display(Markdown(f\" • S[{j}]: “{' | '.join(source_sets[j])}” → **{score:.3f}**\"))\n",
"else:\n",
"    display(Markdown(\"- No sets passed verification.\"))\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "silkmoth_env",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.11.13"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
155
docu/experiments/README.md
Normal file
@@ -0,0 +1,155 @@
|
|||||||
|
### 🧪 Running the Experiments
|
||||||
|
|
||||||
|
This project includes multiple experiments to evaluate the performance and accuracy of our Python implementation of **SilkMoth**.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 📊 1. Experiment Types
|
||||||
|
|
||||||
|
You can replicate and customize the following types of experiments using different configurations (e.g., filters, signature strategies, reduction techniques):
|
||||||
|
|
||||||
|
- **String Matching (DBLP Publication Titles)**
|
||||||
|
- **Schema Matching (WebTables)**
|
||||||
|
- **Inclusion Dependency Discovery (WebTable Columns)**
|
||||||
|
|
||||||
|
Each experiment is described in detail in the original SilkMoth paper.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 📦 2. WebSchema Inclusion Dependency Setup
|
||||||
|
|
||||||
|
To run the **WebSchema + Inclusion Dependency** experiments:
|
||||||
|
|
||||||
|
1. Download the pre-extracted dataset from
|
||||||
|
[📥 this link](https://tubcloud.tu-berlin.de/s/D4ngEfdn3cJ3pxF).
|
||||||
|
2. Place the `.json` files in the `data/webtables/` directory
|
||||||
|
*(create the folder if it does not exist)*.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 🚀 3. Running the Experiments
|
||||||
|
|
||||||
|
To execute the core experiments from the paper:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python run.py
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 📈 4. Results Overview
|
||||||
|
|
||||||
|
We compared our results with those presented in the original SilkMoth paper.
|
||||||
|
Although exact reproduction is not possible due to language differences (Python vs C++) and dataset variations, overall **performance trends align well**.
|
||||||
|
|
||||||
|
All the results can be found in the folder `results`.
|
||||||
|
|
||||||
|
In each pair below, the **left** diagram is from the paper and the **right** one is ours.
|
||||||
|
|
||||||
|
> 💡 *Recent performance enhancements leverage `scipy`’s C-accelerated matching, replacing the original `networkx`-based approach.
|
||||||
|
> Unless otherwise specified, the diagrams shown are generated using the `networkx` implementation.*
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Inclusion Dependency
|
||||||
|
|
||||||
|
> **Goal**: Check if each reference set is contained within source sets.
|
||||||
|
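
As a minimal sketch of what "contained" means here, assume the containment score is the maximum-weight one-to-one matching between the elements of R and S, divided by |R|, with Jaccard similarity between elements. The helper names (`jaccard`, `matching_score`, `contain`) are illustrative and are not the project's API, and the brute-force matching is only practical for tiny inputs. The data is the reference set and S[3] from the demo notebook.

```python
from itertools import permutations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_score(R: list, S: list) -> float:
    """Best total similarity over one-to-one pairings of R and S elements.

    Brute force over permutations: suitable for tiny examples only.
    """
    small, large = (R, S) if len(R) <= len(S) else (S, R)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(jaccard(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best


def contain(R: list, S: list) -> float:
    """Set containment (assumed definition): matching score divided by |R|."""
    return matching_score(R, S) / len(R) if R else 0.0


R = [set(e.split()) for e in [
    "77 Mass Ave Boston MA",
    "5th St 02115 Seattle WA",
    "77 5th St Chicago IL",
]]
S3 = [set(e.split()) for e in [
    "77 Mass Ave MA",
    "5th St 02115 Seattle WA",
    "77 5th St Boston Seattle",
]]

print(round(contain(R, S3), 3))  # 0.743
```

With δ = 0.7 this pair counts as related, which agrees with the score the demo notebook reports for S[3].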
|
||||||
|
**Filter Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/inclusion_dep_filter.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_filter_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Signature Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/inclusion_dep_sig.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_sig_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Reduction Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/inclusion_dep_red.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_reduction_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Scalability**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/inclusion_dep_scal.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_scalability_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Schema Matching (WebTables)
|
||||||
|
|
||||||
|
> **Goal**: Detect pairs of related sets within a single collection of sets.
|
||||||
|
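
A rough sketch of this discovery mode, assuming the set similarity is `M / (|R| + |S| - M)`, where `M` is the maximum matching score between the elements of the two sets. The schemas below are made up for illustration, the function names are hypothetical, and the brute-force matching stands in for the optimized pipeline.

```python
from itertools import combinations, permutations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_score(R: list, S: list) -> float:
    """Maximum total similarity over one-to-one pairings (brute force)."""
    small, large = (R, S) if len(R) <= len(S) else (S, R)
    return max(
        sum(jaccard(small[i], large[j]) for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )


def similar(R: list, S: list) -> float:
    """Set similarity (assumed definition): M / (|R| + |S| - M)."""
    m = matching_score(R, S)
    denom = len(R) + len(S) - m
    return m / denom if denom else 0.0


def related_pairs(collection: list, delta: float) -> list:
    """All index pairs (i, j) with i < j whose similarity reaches delta."""
    pairs = []
    for (i, A), (j, B) in combinations(enumerate(collection), 2):
        score = similar(A, B)
        if score >= delta:
            pairs.append((i, j, round(score, 3)))
    return pairs


# Hypothetical schemas: each attribute name is split into a set of tokens.
schemas = [
    [{"first", "name"}, {"last", "name"}, {"zip"}],
    [{"first", "name"}, {"last", "name"}, {"zip", "code"}],
    [{"price"}, {"currency"}],
]

print(related_pairs(schemas, delta=0.7))  # [(0, 1, 0.714)]
```

Comparing every pair this way grows quadratically with the number of sets, which is the cost the signature and filter stages are meant to avoid.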
|
||||||
|
**Filter Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/schema_matching_filter.png" alt="Original Result" width="45%" />
  <img src="results/schema_matching/schema_matching_filter_experiment_α=0.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Signature Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/schema_matching_sig.png" alt="Original Result" width="45%" />
  <img src="results/schema_matching/schema_matching_sig_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Scalability**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/schema_matching_scal.png" alt="Original Result" width="45%" />
  <img src="results/schema_matching/schema_matching_scalability_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 String Matching (DBLP Publication Titles)
|
||||||
|
> **Goal**: Detect related titles within the dataset using the extended SilkMoth pipeline
> based on **edit similarity** and **q-gram** tokenization.
> The SciPy-based matching was used for these runs.
|
||||||
|
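
To make the two ingredients concrete, here is a small sketch of q-gram tokenization and an edit similarity. The normalization `1 - LD / max(|a|, |b|)` is an assumption chosen for illustration and may differ from the one the project uses; the function names are hypothetical.

```python
def qgrams(text: str, q: int = 3) -> set:
    """All contiguous substrings of length q (the whole string if it is shorter)."""
    if len(text) <= q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - LD(a, b) / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(sorted(qgrams("silkmoth", 3)))  # ['ilk', 'kmo', 'lkm', 'mot', 'oth', 'sil']
print(round(edit_similarity("SilkMoth", "SilkMouth"), 3))  # 0.889
```

The q-grams serve as index tokens, so two titles that differ by a small typo still share most of their tokens and can be found as candidates.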
|
||||||
|
**Filter Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/string_matching_filter.png" alt="Original Result" width="45%" />
  <img src="results/string_matching/10k-set-size/string_matching_filter_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Signature Comparison**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/string_matching_sig.png" alt="Original Result" width="45%" />
  <img src="results/string_matching/10k-set-size/string_matching_sig_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
|
||||||
|
**Scalability**
|
||||||
|
<p align="center">
  <img src="silkmoth_results/string_matching_scal.png" alt="Original Result" width="45%" />
  <img src="results/string_matching/string_matching_scalability_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Additional: Inclusion Dependency with SilkMoth Filters vs. Brute Force
|
||||||
|
|
||||||
|
> The experiments above compare SilkMoth configurations with one another. But how does the pipeline compare to a
> brute-force approach that skips it entirely? The graph below shows the filter runs alongside
> brute-force bipartite matching without any optimization pipeline. The results show a
> substantial runtime improvement when using SilkMoth.
|
||||||
|
|
||||||
|
|
||||||
|
<img src="results/inclusion_dependency/inclusion_dependency_filter_combined_raw_experiment_α=0.5.png" alt="WebTables Result" />
|
||||||
|
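
One way to see where the savings can come from: brute force has to verify every source set, while the pipeline first looks up signature tokens in an inverted index and only keeps sets that share at least one of them. The sketch below uses the source sets from the demo notebook; `build_inverted_index` and `candidates` are illustrative names, not the project's classes.

```python
from collections import defaultdict

source_sets = [
    ["Mass Ave St Boston 02115", "77 Mass 5th St Boston", "77 Mass Ave 5th 02115"],
    ["77 Boston MA", "77 5th St Boston 02115", "77 Mass Ave 02115 Seattle"],
    ["77 Mass Ave 5th Boston MA", "Mass Ave Chicago IL", "77 Mass Ave St"],
    ["77 Mass Ave MA", "5th St 02115 Seattle WA", "77 5th St Boston Seattle"],
]


def build_inverted_index(sets):
    """Map each token to the (set index, element index) pairs containing it."""
    index = defaultdict(list)
    for set_idx, elements in enumerate(sets):
        for elem_idx, element in enumerate(elements):
            for token in set(element.split()):
                index[token].append((set_idx, elem_idx))
    return index


def candidates(index, signature):
    """Indices of source sets that share at least one signature token."""
    return sorted({set_idx for token in signature
                   for set_idx, _ in index.get(token, [])})


index = build_inverted_index(source_sets)

# Brute force would verify every source set.
brute_force = list(range(len(source_sets)))

print(index["Mass"])
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]
print(candidates(index, ["Chicago", "WA", "IL", "5th"]))  # [0, 1, 2, 3]
print(candidates(index, ["Chicago", "IL"]))               # [2]
```

On this tiny example the notebook's signature still touches all four sets, so the pruning there comes from the later check and nearest-neighbor filters; a signature made only of rare tokens narrows the search immediately.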
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Additional: Schema Matching with GitHub WebTables
|
||||||
|
|
||||||
|
> Similar to Schema Matching, this experiment uses a GitHub WebTable as a fixed reference set and matches it against other sets. The goal is to evaluate SilkMoth’s performance across different domains.
|
||||||
|
**Left:** Matching with one reference set.
|
||||||
|
**Right:** Matching with WebTable Corpus and GitHub WebTable datasets.
|
||||||
|
The results show no significant difference, indicating consistent behavior across varying datasets.
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.5.png" alt="WebTables Result" width="45%" />
|
||||||
|
<img src="results/schema_matching/github_webtable_schema_matching_experiment_α=0.5.png" alt="GitHub Table Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
64
docu/experiments/results/plot.py
Normal file
@@ -0,0 +1,64 @@
|
|||||||
|
from experiments.utils import plot_elapsed_times
import csv

labels = []
elapsed_times = []


def read_csv_add_data(filename, labels, elapsed_times):
    """Collect elapsed times per label for rows with similarity threshold 0.5."""
    with open(filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        times = []
        current_label = None
        for row in reader:
            sim_thresh = float(row[0])
            label = row[4]
            elapsed = float(row[5])

            if sim_thresh == 0.5:
                if current_label != label:
                    # New label group started
                    if times:
                        # Save times of previous label if not empty
                        elapsed_times.append(times)
                    times = [elapsed]
                    current_label = label
                else:
                    times.append(elapsed)

                # When 4 times collected, append and reset
                if len(times) == 4:
                    elapsed_times.append(times)
                    times = []
                    current_label = None

                if label not in labels:
                    labels.append(label)

        # In case last label times were not appended
        if times:
            elapsed_times.append(times)


# Read first CSV
read_csv_add_data('inclusion_dependency/raw_matching_experiment_results.csv', labels, elapsed_times)

# Read second CSV
read_csv_add_data('inclusion_dependency/inclusion_dependency_filter_experiment_results.csv', labels, elapsed_times)

print("Labels:", labels)
print("Elapsed Times:", elapsed_times)

# Then plot
file_name_prefix = "inclusion_dependency_filter_combined_raw"
folder_path = ""

_ = plot_elapsed_times(
    related_thresholds=[0.7, 0.75, 0.8, 0.85],
    elapsed_times_list=elapsed_times,
    fig_text=f"{file_name_prefix} (α = 0.5)",
    legend_labels=labels,
    file_name=f"{folder_path}{file_name_prefix}_experiment_α=0.5.png"
)
|
||||||
|
|
||||||
|
BIN
docu/experiments/silkmoth_results/inclusion_dep_filter.png
Normal file
|
After Width: | Height: | Size: 37 KiB |
BIN
docu/experiments/silkmoth_results/inclusion_dep_red.png
Normal file
|
After Width: | Height: | Size: 30 KiB |
BIN
docu/experiments/silkmoth_results/inclusion_dep_scal.png
Normal file
|
After Width: | Height: | Size: 53 KiB |
BIN
docu/experiments/silkmoth_results/inclusion_dep_sig.png
Normal file
|
After Width: | Height: | Size: 47 KiB |
BIN
docu/experiments/silkmoth_results/schema_matching_filter.png
Normal file
|
After Width: | Height: | Size: 42 KiB |
BIN
docu/experiments/silkmoth_results/schema_matching_scal.png
Normal file
|
After Width: | Height: | Size: 48 KiB |
BIN
docu/experiments/silkmoth_results/schema_matching_sig.png
Normal file
|
After Width: | Height: | Size: 42 KiB |
BIN
docu/experiments/silkmoth_results/string_matching_filter.png
Normal file
|
After Width: | Height: | Size: 44 KiB |
BIN
docu/experiments/silkmoth_results/string_matching_scal.png
Normal file
|
After Width: | Height: | Size: 51 KiB |
BIN
docu/experiments/silkmoth_results/string_matching_sig.png
Normal file
|
After Width: | Height: | Size: 53 KiB |
BIN
docu/figures/InvertedIndex.png
Normal file
|
After Width: | Height: | Size: 62 KiB |
BIN
docu/figures/Pipeline.png
Normal file
|
After Width: | Height: | Size: 230 KiB |
151
docu/index.md
Normal file
@@ -0,0 +1,151 @@
|
|||||||
|
# 🦋 LSDIPro SS2025
|
||||||
|
|
||||||
|
## 📄 [SilkMoth: An Efficient Method for Finding Related Sets](https://doi.org/10.14778/3115404.3115413)
|
||||||
|
|
||||||
|
A project inspired by the SilkMoth paper, exploring efficient techniques for related set discovery.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 👥 Team Members
|
||||||
|
- **Andreas Wilms**
|
||||||
|
- **Sarra Daknou**
|
||||||
|
- **Amina Iqbal**
|
||||||
|
- **Jakob Berschneider**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Experiments & Results
|
||||||
|
➡️ [**See Experiments**](experiments/README.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🧪 Interactive Demo
|
||||||
|
|
||||||
|
Follow our **step-by-step Jupyter Notebook demo** for a hands-on understanding of SilkMoth.
|
||||||
|
|
||||||
|
📓 [**Open demo_example.ipynb**](demo_example.ipynb)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
|
||||||
|
- [1. Large Scale Data Integration Project (LSDIPro)](#1-large-scale-data-integration-project-lsdipro)
|
||||||
|
- [2. What is SilkMoth? 🐛](#2-what-is-silkmoth)
|
||||||
|
- [3. The Problem 🧩](#3-the-problem)
|
||||||
|
- [4. SilkMoth’s Solution 🚀](#4-silkmoths-solution)
|
||||||
|
- [5. Core Pipeline Steps 🔁](#5-core-pipeline-steps)
|
||||||
|
- [5.1 Tokenization](#51-tokenization)
|
||||||
|
- [5.2 Inverted Index Construction](#52-inverted-index-construction)
|
||||||
|
- [5.3 Signature Generation](#53-signature-generation)
|
||||||
|
- [5.4 Candidate Selection](#54-candidate-selection)
|
||||||
|
- [5.5 Refinement Filters](#55-refinement-filters)
|
||||||
|
- [5.6 Verification via Maximum Matching](#56-verification-via-maximum-matching)
|
||||||
|
- [6. Modes of Operation 🧪](#6-modes-of-operation-)
|
||||||
|
- [7. Supported Similarity Functions 📐](#7-supported-similarity-functions-)
|
||||||
|
- [8. Installing from Source](#8-installing-from-source)
|
||||||
|
- [9. Experiment Results](#9-experiment-results)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Large Scale Data Integration Project (LSDIPro)
|
||||||
|
|
||||||
|
As part of the university project LSDIPro, our team implemented the SilkMoth paper in Python.
|
||||||
|
The course focuses on large-scale data integration, where student groups reproduce and extend research prototypes.
|
||||||
|
The project emphasizes scalable algorithm design, evaluation, and handling heterogeneous data at scale.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. What is SilkMoth?
|
||||||
|
|
||||||
|
**SilkMoth** is a system designed to efficiently discover related sets in large collections of data, even when the elements within those sets are only approximately similar.
|
||||||
|
This is especially important in **data integration**, **data cleaning**, and **information retrieval**, where messy or inconsistent data is common.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. The Problem
|
||||||
|
|
||||||
|
Determining whether two sets are related, for example, whether two database columns should be joined, often involves comparing their elements using **similarity functions** (not just exact matches).
|
||||||
|
A powerful approach models this as a **bipartite graph** and finds the **maximum matching score** between elements. However, this method is **computationally expensive** (`O(n³)` per pair), making it impractical for large datasets.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. SilkMoth’s Solution
|
||||||
|
|
||||||
|
SilkMoth tackles this with a three-step approach:
|
||||||
|
|
||||||
|
1. **Signature Generation**: Creates compact signatures for each set, ensuring related sets share signature parts.
|
||||||
|
2. **Pruning**: Filters out unrelated sets early, reducing candidates.
|
||||||
|
3. **Verification**: Applies the costly matching metric only on remaining candidates, matching brute-force accuracy but faster.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Core Pipeline Steps
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
*Figure 1. SILKMOTH pipeline framework. Source: Deng et al., "SILKMOTH: An Efficient Method for Finding Related Sets with Maximum Matching Constraints", VLDB 2017. Licensed under CC BY-NC-ND 4.0.*
|
||||||
|
|
||||||
|
### [5.1 Tokenization](pages/tokenizer.md)
|
||||||
|
|
||||||
|
Each element in every set is tokenized based on the selected similarity function:
|
||||||
|
- **Jaccard Similarity**: Elements are split into whitespace-delimited tokens.
|
||||||
|
- **Edit Similarity**: Elements are split into overlapping `q`-grams (e.g., 3-grams).
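
The two tokenization schemes above can be sketched as follows. This is a minimal illustration, not the code in `silkmoth.tokenizer`: the function names `whitespace_tokens` and `qgrams` are hypothetical, and the real tokenizer may differ in details such as padding or return types.

```python
def whitespace_tokens(element: str) -> set[str]:
    """Split an element into whitespace-delimited tokens (Jaccard case)."""
    return set(element.split())


def qgrams(element: str, q: int = 3) -> set[str]:
    """Return the overlapping q-grams of an element (edit-similarity case)."""
    if q < 1:
        raise ValueError("q must be at least 1")
    # Strings shorter than q yield no q-grams in this sketch (no padding).
    return {element[i:i + q] for i in range(len(element) - q + 1)}


print(sorted(whitespace_tokens("77 Mass Ave Boston MA")))
print(sorted(qgrams("Boston", q=3)))
```

Returning sets is a choice made for this sketch; it discards duplicate tokens within one element.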
|
||||||
|
|
||||||
|
### [5.2 Inverted Index Construction](pages/inverted_index.md)
|
||||||
|
|
||||||
|
An **inverted index** is built over the source sets `S` to map each token to a list of `(set, element)` pairs in which it occurs.
|
||||||
|
This allows fast lookup of candidate sets sharing tokens with a query.
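
A minimal sketch of such an index, assuming each set is already tokenized into a list of elements and each element is a collection of tokens. The function name `build_inverted_index` and the use of integer positions as identifiers are assumptions made for illustration; the project's `silkmoth.inverted_index` module may be organized differently.

```python
from collections import defaultdict


def build_inverted_index(tokenized_sets):
    """Map each token to the (set index, element index) pairs containing it.

    tokenized_sets: list of sets; each set is a list of elements; each
    element is a collection of tokens.
    """
    index = defaultdict(list)
    for set_idx, elements in enumerate(tokenized_sets):
        for elem_idx, tokens in enumerate(elements):
            for token in set(tokens):  # record each token once per element
                index[token].append((set_idx, elem_idx))
    return index


sets = [
    [{"77", "mass", "ave"}, {"boston", "ma"}],
    [{"boston", "ma", "usa"}],
]
index = build_inverted_index(sets)
print(index["boston"])
```

Because sets and elements are visited in order, each posting list comes out sorted by `(set index, element index)`.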
|
||||||
|
|
||||||
|
### [5.3 Signature Generation](pages/signature_generator.md)
|
||||||
|
|
||||||
|
A **signature** is a subset of tokens selected from each set such that:
|
||||||
|
- Any related set must share at least one signature token.
|
||||||
|
- Signature size is minimized to reduce candidate space.
|
||||||
|
|
||||||
|
Signature selection heuristics (e.g., cost/value greedy ranking) approximate the optimal valid signature, which is NP-complete to compute exactly.
|
||||||
|
|
||||||
|
### [5.4 Candidate Selection](pages/candidate_selector.md)
|
||||||
|
|
||||||
|
For each set `R`, retrieve from the inverted index all sets `S` sharing at least one token with `R`’s signature. These become **candidate sets** for further evaluation.
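
The lookup can be sketched as below, assuming the index maps tokens to `(set index, element index)` pairs. The function name `select_candidates` and the toy index are hypothetical and only illustrate the idea.

```python
def select_candidates(signature, inverted_index):
    """Return indices of all sets containing at least one signature token."""
    candidates = set()
    for token in signature:
        for set_idx, _elem_idx in inverted_index.get(token, []):
            candidates.add(set_idx)
    return candidates


# Toy index: token -> list of (set index, element index) pairs.
index = {
    "boston": [(0, 0), (2, 1)],
    "ave": [(0, 0), (1, 0)],
    "chicago": [(3, 0)],
}
print(sorted(select_candidates({"boston", "ave"}, index)))
```

Set 3 is never touched because none of its tokens appear in the signature, which is where the savings over comparing against every set come from.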
|
||||||
|
|
||||||
|
### [5.5 Refinement Filters](pages/candidate_selector.md)
|
||||||
|
|
||||||
|
Two filters reduce false positives among candidates:
|
||||||
|
- **Check Filter**: Uses an upper bound on similarity to eliminate sets below threshold.
|
||||||
|
- **Nearest Neighbor Filter**: Bounds the maximum matching score from above using the nearest neighbor similarity of each element in `R`.
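
A minimal sketch of the nearest neighbor idea: summing each reference element's best similarity in the candidate gives a value that no matching can exceed, so candidates whose sum is already too low can be dropped. The function names are hypothetical, and normalizing the threshold by `|R|` is an assumption made for this example rather than something taken from the project's code.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two strings over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def nn_upper_bound(reference, candidate, sim):
    """Sum of each reference element's best similarity in the candidate.

    A matching pairs every reference element with at most one candidate
    element, so its score cannot exceed this sum.
    """
    if not candidate:
        return 0.0
    return sum(max(sim(r, s) for s in candidate) for r in reference)


def passes_nn_filter(reference, candidate, sim, delta):
    """Keep the candidate only if the bound can still reach delta * |R|."""
    return nn_upper_bound(reference, candidate, sim) >= delta * len(reference)


R = ["77 Mass Ave", "Boston MA"]
S = ["77 Mass Ave", "Chicago IL"]
print(nn_upper_bound(R, S, jaccard))
print(passes_nn_filter(R, S, jaccard, delta=0.7))
```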
|
||||||
|
|
||||||
|
### [5.6 Verification via Maximum Matching](pages/verifier.md)
|
||||||
|
|
||||||
|
Compute **maximum weighted bipartite matching** between elements of `R` and `S` for remaining candidates using the similarity function as edge weights.
|
||||||
|
Sets meeting or exceeding threshold `δ` are considered **related**.
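
The verification step can be illustrated with a brute-force matcher for very small sets. This is only a sketch: it enumerates all one-to-one pairings, which is exponential, whereas a real implementation would use a polynomial-time solver such as `scipy.optimize.linear_sum_assignment`. The function names and the normalization of the score by `|R|` are assumptions made for this example.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two strings over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def max_matching_score(reference, source, sim):
    """Best total similarity over all one-to-one pairings (brute force)."""
    small, large = sorted((list(reference), list(source)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        # sim is assumed symmetric, so argument order does not matter here.
        best = max(best, sum(sim(a, b) for a, b in zip(small, perm)))
    return best


def is_related(reference, source, sim, delta):
    """Assumed relatedness test: matching score normalized by |R|."""
    if not reference:
        return False
    return max_matching_score(reference, source, sim) / len(reference) >= delta


R = ["77 Mass Ave", "Boston MA"]
S = ["Boston MA", "77 Mass Ave", "Chicago IL"]
print(max_matching_score(R, S, jaccard))
print(is_related(R, S, jaccard, delta=0.7))
```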
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Modes of Operation 🧪
|
||||||
|
|
||||||
|
- **Discovery Mode**: Compare all pairs of sets to find all related pairs.
|
||||||
|
*Use case:* Finding related columns in databases.
|
||||||
|
|
||||||
|
- **Search Mode**: Given a reference set, find all related sets.
|
||||||
|
*Use case:* Schema matching or entity deduplication.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Supported Similarity Functions 📐
|
||||||
|
|
||||||
|
- **Jaccard Similarity**
|
||||||
|
- **Edit Similarity** (Levenshtein-based)
|
||||||
|
- Optional minimum similarity threshold `α` on element comparisons.
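
The following sketch shows one way these functions can be written. It is independent of `silkmoth.utils`: the names `jaccard`, `edit_sim`, and `apply_alpha` are hypothetical, the edit similarity uses the normalization `1 - 2·LD / (|x| + |y| + LD)` as an assumption, and treating scores below `α` as 0 is likewise an assumed reading of the threshold.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen for this sketch
    return len(x & y) / len(x | y)


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_sim(a: str, b: str) -> float:
    """Levenshtein-based similarity in [0, 1] (assumed normalization)."""
    if not a and not b:
        return 1.0
    ld = levenshtein(a, b)
    return 1 - 2 * ld / (len(a) + len(b) + ld)


def apply_alpha(score: float, alpha: float) -> float:
    """Treat element similarities below alpha as 0."""
    return score if score >= alpha else 0.0


print(jaccard({"boston", "ma"}, {"boston", "usa"}))
print(edit_sim("kitten", "sitting"))
print(apply_alpha(0.4, alpha=0.5))
```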
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Installing from Source
|
||||||
|
|
||||||
|
1. Run `pip install src/` to install the package.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## 9. Experiment Results
|
||||||
|
|
||||||
|
[📊 See Experiments and Results](experiments/README.md)
|
||||||
4
docu/pages/candidate_selector.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.candidate_selector
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
4
docu/pages/inverted_index.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.inverted_index
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
4
docu/pages/signature_generator.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.signature_generator
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
4
docu/pages/silkmoth_engine.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.silkmoth_engine
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
4
docu/pages/tokenizer.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.tokenizer
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
4
docu/pages/utils.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.utils
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
4
docu/pages/verifier.md
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
::: silkmoth.verifier
|
||||||
|
rendering:
|
||||||
|
show_signature: true
|
||||||
|
show_source: true
|
||||||
20
docu/write_modules.py
Normal file
@@ -0,0 +1,20 @@
|
|||||||
|
import glob, os
|
||||||
|
|
||||||
|
MODULES = glob.glob("src/silkmoth/*.py")
|
||||||
|
OUT_DIR = "docu/pages"
|
||||||
|
|
||||||
|
os.makedirs(OUT_DIR, exist_ok=True)
|
||||||
|
|
||||||
|
for path in MODULES:
|
||||||
|
name = os.path.splitext(os.path.basename(path))[0]
|
||||||
|
if name == "__init__":
|
||||||
|
continue
|
||||||
|
|
||||||
|
doc_path = os.path.join(OUT_DIR, f"{name}.md")
|
||||||
|
with open(doc_path, "w") as f:
|
||||||
|
f.write("::: silkmoth." + name + "\n")
|
||||||
|
f.write(" rendering:\n")
|
||||||
|
f.write(" show_signature: true\n")
|
||||||
|
f.write(" show_source: true\n")
|
||||||
|
|
||||||
|
|
||||||
155
experiments/README.md
Normal file
@@ -0,0 +1,155 @@
|
|||||||
|
### 🧪 Running the Experiments
|
||||||
|
|
||||||
|
This project includes multiple experiments to evaluate the performance and accuracy of our Python implementation of **SilkMoth**.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 📊 1. Experiment Types
|
||||||
|
|
||||||
|
You can replicate and customize the following types of experiments using different configurations (e.g., filters, signature strategies, reduction techniques):
|
||||||
|
|
||||||
|
- **String Matching (DBLP Publication Titles)**
|
||||||
|
- **Schema Matching (WebTables)**
|
||||||
|
- **Inclusion Dependency Discovery (WebTable Columns)**
|
||||||
|
|
||||||
|
Exact descriptions can be found in the official paper.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 📦 2. WebSchema Inclusion Dependency Setup
|
||||||
|
|
||||||
|
To run the **WebSchema + Inclusion Dependency** experiments:
|
||||||
|
|
||||||
|
1. Download the pre-extracted dataset from
|
||||||
|
[📥 this link](https://tubcloud.tu-berlin.de/s/D4ngEfdn3cJ3pxF).
|
||||||
|
2. Place the `.json` files in the `data/webtables/` directory
|
||||||
|
*(create the folder if it does not exist)*.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 🚀 3. Running the Experiments
|
||||||
|
|
||||||
|
To execute the core experiments from the paper:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python run.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### 📈 4. Results Overview
|
||||||
|
|
||||||
|
We compared our results with those presented in the original SilkMoth paper.
|
||||||
|
Although exact reproduction is not possible due to language differences (Python vs C++) and dataset variations, overall **performance trends align well**.
|
||||||
|
|
||||||
|
All the results can be found in the folder `results`.
|
||||||
|
|
||||||
|
The **left** diagrams are from the paper and the **right** are ours.
|
||||||
|
|
||||||
|
> 💡 *Recent performance enhancements leverage `scipy`’s C-accelerated matching, replacing the original `networkx`-based approach.
|
||||||
|
> Unless otherwise specified, the diagrams shown are generated using the `networkx` implementation.*
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Inclusion Dependency
|
||||||
|
|
||||||
|
> **Goal**: Check if each reference set is contained within source sets.
|
||||||
|
|
||||||
|
**Filter Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/inclusion_dep_filter.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/inclusion_dependency/inclusion_dependency_filter_experiment_α=0.5.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Signature Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/inclusion_dep_sig.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/inclusion_dependency/inclusion_dependency_sig_experiment_α=0.5.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Reduction Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/inclusion_dep_red.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/inclusion_dependency/inclusion_dependency_reduction_experiment_α=0.0.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Scalability**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/inclusion_dep_scal.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/inclusion_dependency/inclusion_dependency_scalability_experiment_α=0.5.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Schema Matching (WebTables)
|
||||||
|
|
||||||
|
> **Goal**: Detect related set pairs within a single source set.
|
||||||
|
|
||||||
|
**Filter Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/schema_matching_filter.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Signature Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/schema_matching_sig.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/schema_matching/schema_matching_sig_experiment_α=0.0.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Scalability**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/schema_matching_scal.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/schema_matching/schema_matching_scalability_experiment_α=0.0.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 String Matching (DBLP Publication Titles)
|
||||||
|
> **Goal**: Detect related titles within the dataset using the extended SilkMoth pipeline
|
||||||
|
based on **edit similarity** and **q-gram** tokenization.
|
||||||
|
> SciPy was used here.
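
For the q-gram tokenization, the experiment script derives `q` from the similarity threshold `α` as `floor(α / (1 - α))`. The sketch below reproduces that calculation; the function name `q_for_alpha` and the range check are additions made for illustration and are not part of the project code.

```python
from math import floor


def q_for_alpha(alpha: float) -> int:
    """Largest q-gram size used for a given similarity threshold alpha."""
    if not 0 <= alpha < 1:
        # Guard added for this sketch: alpha = 1 would divide by zero.
        raise ValueError("alpha must be in [0, 1)")
    return floor(alpha / (1 - alpha))


for alpha in (0.5, 0.75, 0.8):
    print(alpha, q_for_alpha(alpha))
```

Note that any `α` below 0.5 yields `q = 0` under this formula, so such thresholds would need separate handling.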
|
||||||
|
|
||||||
|
**Filter Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/string_matching_filter.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/string_matching/string_matching_filter_experiment_α=0.8.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Signature Comparison**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/string_matching_sig.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/string_matching/10k-set-size/string_matching_sig_experiment_α=0.8.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
|
||||||
|
**Scalability**
|
||||||
|
<p align="center">
|
||||||
|
<img src="silkmoth_results/string_matching_scal.png" alt="Original Result" width="45%" />
|
||||||
|
<img src="results/string_matching/string_matching_scalability_experiment_α=0.8.png" alt="Our Result" width="45%" />
|
||||||
|
</p>
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Additional: Inclusion Dependency SilkMoth Filter compared with no SilkMoth
|
||||||
|
|
||||||
|
> In this analysis, we focus exclusively on SilkMoth. But how does it compare to a
|
||||||
|
> brute-force approach that skips the SilkMoth pipeline entirely? The graph below
|
||||||
|
> shows the Filter run alongside the brute-force bipartite matching method without any
|
||||||
|
> optimization pipeline. The results clearly demonstrate a dramatic improvement
|
||||||
|
> in runtime efficiency when using SilkMoth.
|
||||||
|
|
||||||
|
|
||||||
|
<img src="results/inclusion_dependency/inclusion_dependency_filter_combined_raw_experiment_α=0.5.png" alt="WebTables Result" />
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔍 Additional: Schema Matching with GitHub WebTables
|
||||||
|
|
||||||
|
> Similar to Schema Matching, this experiment uses a GitHub WebTable as a fixed reference set and matches it against other sets. The goal is to evaluate SilkMoth’s performance across different domains.
|
||||||
|
**Left:** Matching with one reference set.
|
||||||
|
**Right:** Matching with WebTable Corpus and GitHub WebTable datasets.
|
||||||
|
The results show no significant difference, indicating consistent behavior across varying datasets.
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.5.png" alt="WebTables Result" width="45%" />
|
||||||
|
<img src="results/schema_matching/github_webtable_schema_matching_experiment_α=0.5.png" alt="GitHub Table Result" width="45%" />
|
||||||
|
</p>
|
||||||
0
experiments/data/__init__.py
Normal file
132466
experiments/data/dblp/DBLP_100k.csv
Normal file
0
experiments/data/webtables/__init__.py
Normal file
174
experiments/data_loader.py
Normal file
@@ -0,0 +1,174 @@
|
|||||||
|
import random
|
||||||
|
import os
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from utils import *
|
||||||
|
|
||||||
|
|
||||||
|
class DataLoader:
|
||||||
|
def __init__(self, data_path):
|
||||||
|
self.data_path = data_path
|
||||||
|
self.files = os.listdir(data_path)
|
||||||
|
|
||||||
|
def load_webtable_columns_randomized(self, reference_set_amount: int, source_set_amount: int) -> tuple[list, list]:
|
||||||
|
"""
|
||||||
|
Get randomized reference sets and source sets of webtable columns.
|
||||||
|
Reference sets are subsets of the source sets.
|
||||||
|
Only columns with 4 or more different elements are considered.
|
||||||
|
Only columns whose values are all non-numeric and non-empty are considered.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
reference_set_amount (int): Number of reference sets to return.
|
||||||
|
source_set_amount (int): Number of source sets to return.
|
||||||
|
Returns:
|
||||||
|
tuple: A tuple containing a list of reference sets and a list of source sets.
|
||||||
|
"""
|
||||||
|
# Basic validation of input parameters
|
||||||
|
if reference_set_amount < 1 or source_set_amount < 2:
|
||||||
|
raise ValueError("reference_set_amount must be at least 1 and source_set_amount must be at least 2")
|
||||||
|
if reference_set_amount >= source_set_amount:
|
||||||
|
raise ValueError("reference_set_amount must be smaller than source_set_amount")
|
||||||
|
if reference_set_amount > len(self.files):
|
||||||
|
raise ValueError("reference_set_amount must not exceed the number of files in data_path")
|
||||||
|
if source_set_amount > len(self.files):
|
||||||
|
raise ValueError("source_set_amount must not exceed the number of files in data_path")
|
||||||
|
if len(self.files) == 0:
|
||||||
|
raise ValueError("data_path does not contain any files")
|
||||||
|
|
||||||
|
|
||||||
|
# Randomly select the files from which the source sets are drawn
|
||||||
|
source_set_nums = random.sample(range(len(self.files)), source_set_amount)
|
||||||
|
|
||||||
|
# Pick source_set_amount of columns which have at least 4 different elements
|
||||||
|
source_sets = []
|
||||||
|
while len(source_sets) < source_set_amount:
|
||||||
|
# Pick a random number from the source_set_nums
|
||||||
|
source_set_num = random.choice(source_set_nums)
|
||||||
|
file_path = os.path.join(self.data_path, self.files[source_set_num])
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as file:
|
||||||
|
json_data = json.load(file)
|
||||||
|
if "relation" in json_data and isinstance(json_data["relation"], list):
|
||||||
|
# pick random column
|
||||||
|
col = random.randint(0, len(json_data["relation"]) - 1)
|
||||||
|
col = json_data["relation"][col]
|
||||||
|
|
||||||
|
# Check if the column has at least 4 different elements and contains no numeric values
|
||||||
|
if len(set(col)) >= 4:
|
||||||
|
if all(not is_convertible_to_number(value) and len(value) > 0 for value in col):
|
||||||
|
# Add the column to the source sets
|
||||||
|
source_sets.append(col)
|
||||||
|
print(f"Source set number {len(source_sets)} loaded")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
raise ValueError(f"Error loading JSON file: {e}")
|
||||||
|
|
||||||
|
# Randomly select reference sets from the source sets
|
||||||
|
reference_sets = random.sample(source_sets, reference_set_amount)
|
||||||
|
return reference_sets, source_sets
|
||||||
|
|
||||||
|
def load_webtable_reference_sets_element_restriction(self, source_set: list, element_restriction: int) -> list:
|
||||||
|
"""
|
||||||
|
Get 1000 reference sets, sampled with replacement from the given source sets, that satisfy an element restriction.
|
||||||
|
Restriction is the minimal number of elements allowed in the reference set.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
source_set (list): The list of source sets (columns) to sample reference sets from.
|
||||||
|
element_restriction (int): The minimum number of elements a column must have to be used as a reference set.
|
||||||
|
Returns:
|
||||||
|
list: A list of reference sets.
|
||||||
|
"""
|
||||||
|
if element_restriction < 1:
|
||||||
|
raise ValueError("element_restriction must be at least 1")
|
||||||
|
|
||||||
|
reference_sets = []
|
||||||
|
|
||||||
|
while len(reference_sets) < 1000:
|
||||||
|
# Randomly select a column from the source set
|
||||||
|
col = random.choice(source_set)
|
||||||
|
|
||||||
|
# Check if the column has at least element_restriction elements (duplicates are counted)
|
||||||
|
if len(col) >= element_restriction:
|
||||||
|
reference_sets.append(col)
|
||||||
|
print(f"Reference set number {len(reference_sets)} loaded")
|
||||||
|
|
||||||
|
return reference_sets
|
||||||
|
|
||||||
|
def load_webtable_schemas_randomized(self, set_amount: int) -> list:
|
||||||
|
if set_amount < 2:
|
||||||
|
raise ValueError("set_amount must be at least 2")
|
||||||
|
# Random sequence of table numbers
|
||||||
|
table_nums = random.sample(range(len(self.files)), len(self.files))
|
||||||
|
|
||||||
|
schema_sets = []
|
||||||
|
|
||||||
|
i = 0
|
||||||
|
while len(schema_sets) < set_amount and i < len(table_nums):
|
||||||
|
try:
|
||||||
|
# Load the schema for the current table number
|
||||||
|
schema = self.load_single_webtable_schema(table_nums[i])
|
||||||
|
schema_sets.append(schema)
|
||||||
|
print(f"Schema set number {len(schema_sets)} loaded")
|
||||||
|
i += 1
|
||||||
|
except ValueError as e:
|
||||||
|
print(f"Skipping table number {table_nums[i]} due to error: {e}")
|
||||||
|
i += 1
|
||||||
|
|
||||||
|
return schema_sets
|
||||||
|
|
||||||
|
def load_single_webtable_schema(self, reference_set_num: int) -> list:
|
||||||
|
# Load the webtable schema for the given reference set number
|
||||||
|
if reference_set_num < 0 or reference_set_num >= len(self.files):
|
||||||
|
raise IndexError("reference_set_num is out of range")
|
||||||
|
|
||||||
|
# Get the file at the specified position
|
||||||
|
file_path = os.path.join(self.data_path, self.files[reference_set_num])
|
||||||
|
|
||||||
|
# Load and return the JSON content
|
||||||
|
try:
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as file:
|
||||||
|
json_data = json.load(file)
|
||||||
|
if "relation" in json_data and isinstance(json_data["relation"], list):
|
||||||
|
schema = [relation[0] for relation in json_data["relation"]]
|
||||||
|
if len(schema) == 0:
|
||||||
|
raise ValueError("Schema is empty")
|
||||||
|
|
||||||
|
if all(not is_convertible_to_number(col) for col in schema):
|
||||||
|
# remove "" empty strings from the schema
|
||||||
|
schema = [col for col in schema if len(col) > 0]
|
||||||
|
if len(schema) == 0:
|
||||||
|
raise ValueError("Schema contains only empty strings")
|
||||||
|
return schema
|
||||||
|
else:
|
||||||
|
raise ValueError("Schema contains numeric values or is empty")
|
||||||
|
else:
|
||||||
|
raise ValueError("JSON does not contain a valid 'relation' key or it is not a list")
|
||||||
|
except Exception as e:
|
||||||
|
raise ValueError(f"Error loading JSON file: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def load_dblp_titles(self, data_path: str) -> list:
|
||||||
|
"""
|
||||||
|
Load DBLP paper titles from a CSV file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
data_path (str): Path to CSV file containing a column 'title'.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
list: A list of title strings.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if not os.path.exists(data_path):
|
||||||
|
raise FileNotFoundError(f"DBLP CSV file not found: {data_path}")
|
||||||
|
|
||||||
|
df = pd.read_csv(data_path)
|
||||||
|
if "title" not in df.columns:
|
||||||
|
raise ValueError("CSV must contain a 'title' column")
|
||||||
|
|
||||||
|
titles = df["title"].dropna().tolist()
|
||||||
|
return titles
|
||||||
|
|
||||||
|
|
||||||
469
experiments/experiments.py
Normal file
@@ -0,0 +1,469 @@
|
|||||||
|
import time
|
||||||
|
from math import floor
|
||||||
|
|
||||||
|
from silkmoth.silkmoth_engine import SilkMothEngine
|
||||||
|
from silkmoth.utils import SigType, edit_similarity, contain, jaccard_similarity
|
||||||
|
from silkmoth.verifier import Verifier
|
||||||
|
from silkmoth.tokenizer import Tokenizer
|
||||||
|
from utils import *
|
||||||
|
|
||||||
|
|
||||||
|
def run_experiment_filter_schemes(related_thresholds, similarity_thresholds, labels, source_sets, reference_sets,
|
||||||
|
sim_metric, sim_func, is_search, file_name_prefix, folder_path):
|
||||||
|
"""
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
related_thresholds : list[float]
|
||||||
|
Thresholds for determining relatedness between sets.
|
||||||
|
similarity_thresholds : list[float]
|
||||||
|
Minimum similarity thresholds (α) applied to element comparisons.
|
||||||
|
labels : list[str]
|
||||||
|
Labels indicating the type of setting applied (e.g., "NO FILTER", "CHECK FILTER", "WEIGHTED").
|
||||||
|
source_sets : list[]
|
||||||
|
The sets to be compared against the reference sets or against each other.
|
||||||
|
reference_sets : list[]
|
||||||
|
The sets used as the reference for comparison.
|
||||||
|
sim_metric : callable
|
||||||
|
The metric function used to evaluate similarity between sets.
|
||||||
|
sim_func : callable
|
||||||
|
The function used to calculate similarity scores between elements (e.g., Jaccard or edit similarity).
|
||||||
|
is_search : bool
|
||||||
|
Flag indicating whether to perform a search operation or discovery.
|
||||||
|
file_name_prefix : str
|
||||||
|
Prefix for naming output files generated during the experiment.
|
||||||
|
folder_path: str
|
||||||
|
Path to the folder where results will be saved.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Calculate index time and RAM usage for the SilkMothEngine
|
||||||
|
in_index_time_start = time.time()
|
||||||
|
initial_ram = measure_ram_usage()
|
||||||
|
|
||||||
|
# Initialize and run the SilkMothEngine
|
||||||
|
silk_moth_engine = SilkMothEngine(
|
||||||
|
related_thresh=0,
|
||||||
|
source_sets=source_sets,
|
||||||
|
sim_metric=sim_metric,
|
||||||
|
sim_func=sim_func,
|
||||||
|
sim_thresh=0,
|
||||||
|
is_check_filter=False,
|
||||||
|
is_nn_filter=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
in_index_time_end = time.time()
|
||||||
|
final_ram = measure_ram_usage()
|
||||||
|
|
||||||
|
in_index_elapsed_time = in_index_time_end - in_index_time_start
|
||||||
|
in_index_ram_usage = final_ram - initial_ram
|
||||||
|
|
||||||
|
print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")
|
||||||
|
|
||||||
|
for sim_thresh in similarity_thresholds:
|
||||||
|
|
||||||
|
# Check if the similarity function is edit similarity
|
||||||
|
if sim_func == edit_similarity:
|
||||||
|
# calc the maximum possible q-gram size based on sim_thresh
|
||||||
|
upper_bound_q = sim_thresh/(1 - sim_thresh)
|
||||||
|
q = floor(upper_bound_q)
|
||||||
|
|
||||||
|
print(f"Using q = {q} for edit similarity with sim_thresh = {sim_thresh}")
|
||||||
|
print(f"Rebuilding Inverted Index with q = {q}...")
|
||||||
|
silk_moth_engine.set_q(q)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
elapsed_times_final = []
|
||||||
|
silk_moth_engine.set_alpha(sim_thresh)
|
||||||
|
for label in labels:
|
||||||
|
|
||||||
|
elapsed_times = []
|
||||||
|
for idx, related_thresh in enumerate(related_thresholds):
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"\nRunning SilkMoth {file_name_prefix} with α = {sim_thresh}, θ = {related_thresh}, label = {label}")
|
||||||
|
|
||||||
|
# checks for filter runs
|
||||||
|
if label == "CHECK FILTER":
|
||||||
|
silk_moth_engine.is_check_filter = True
|
||||||
|
silk_moth_engine.is_nn_filter = False
|
||||||
|
elif label == "NN FILTER":
|
||||||
|
silk_moth_engine.is_check_filter = False
|
||||||
|
silk_moth_engine.is_nn_filter = True
|
||||||
|
else: # NO FILTER
|
||||||
|
silk_moth_engine.is_check_filter = False
|
||||||
|
silk_moth_engine.is_nn_filter = False
|
||||||
|
|
||||||
|
# checks for signature scheme runs
|
||||||
|
if label == SigType.WEIGHTED:
|
||||||
|
silk_moth_engine.set_signature_type(SigType.WEIGHTED)
|
||||||
|
elif label == SigType.SKYLINE:
|
||||||
|
silk_moth_engine.set_signature_type(SigType.SKYLINE)
|
||||||
|
elif label == SigType.DICHOTOMY:
|
||||||
|
silk_moth_engine.set_signature_type(SigType.DICHOTOMY)
|
||||||
|
|
||||||
|
silk_moth_engine.set_related_threshold(related_thresh)
|
||||||
|
# Measure the time taken to search for related sets
|
||||||
|
time_start = time.time()
|
||||||
|
|
||||||
|
# Used for search to see how many candidates were found and how many were removed
|
||||||
|
candidates_amount = 0
|
||||||
|
candidates_after = 0
|
||||||
|
related_sets_found = 0
|
||||||
|
if is_search:
|
||||||
|
for ref_id, ref_set in enumerate(reference_sets):
|
||||||
|
related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
|
||||||
|
ref_set)
|
||||||
|
candidates_amount += candidates_amount_temp
|
||||||
|
candidates_after += candidates_removed_temp
|
||||||
|
related_sets_found += len(related_sets_temp)
|
||||||
|
else:
|
||||||
|
# If not searching, we are discovering sets
|
||||||
|
silk_moth_engine.discover_sets(source_sets)
|
||||||
|
|
||||||
|
time_end = time.time()
|
||||||
|
elapsed_time = time_end - time_start
|
||||||
|
|
||||||
|
elapsed_times.append(elapsed_time)
|
||||||
|
|
||||||
|
# Create a new data dictionary for each iteration
|
||||||
|
if is_search:
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": sim_thresh,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"reference_set_amount": len(reference_sets),
|
||||||
|
"source_set_amount": len(source_sets),
|
||||||
|
"label": label,
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"inverted_index_time": round(in_index_elapsed_time, 3),
|
||||||
|
"inverted_index_ram_usage": round(in_index_ram_usage, 3),
|
||||||
|
"candidates_amount": candidates_amount,
|
||||||
|
"candidates_amount_after_filtering": candidates_after,
|
||||||
|
"related_sets_found": related_sets_found,
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": sim_thresh,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"source_set_amount": len(source_sets),
|
||||||
|
"label": label,
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"inverted_index_time": round(in_index_elapsed_time, 3),
|
||||||
|
"inverted_index_ram_usage": round(in_index_ram_usage, 3),
|
||||||
|
}
|
||||||
|
# Save results to a CSV file
|
||||||
|
save_experiment_results_to_csv(
|
||||||
|
results=data_overall,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
|
||||||
|
)
|
||||||
|
|
||||||
|
elapsed_times_final.append(elapsed_times)
|
||||||
|
_ = plot_elapsed_times(
|
||||||
|
related_thresholds=related_thresholds,
|
||||||
|
elapsed_times_list=elapsed_times_final,
|
||||||
|
fig_text=f"{file_name_prefix} (α = {sim_thresh})",
|
||||||
|
legend_labels=labels,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_α={sim_thresh}.png"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def run_reduction_experiment(related_thresholds, similarity_threshold, labels, source_sets, reference_sets,
|
||||||
|
sim_metric, sim_func, is_search, file_name_prefix, folder_path):
|
||||||
|
"""
|
||||||
|
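Time SilkMoth with and without reduction for each related threshold,
save the results to CSV, and plot the elapsed times.

Parameters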
|
||||||
|
----------
|
||||||
|
related_thresholds : list[float]
|
||||||
|
Thresholds for determining relatedness between sets.
|
||||||
|
similarity_threshold : float
|
||||||
|
Threshold (α) for measuring similarity between set elements.
|
||||||
|
labels : list[str]
|
||||||
|
Labels selecting the setting applied in each run; this function recognizes "REDUCTION" and "NO REDUCTION".
|
||||||
|
source_sets : list[]
|
||||||
|
The sets to be compared against the reference sets or against themselves.
|
||||||
|
reference_sets : list[]
|
||||||
|
The sets used as the reference for comparison.
|
||||||
|
sim_metric : callable
|
||||||
|
The metric function used to evaluate similarity between sets.
|
||||||
|
sim_func : callable
|
||||||
|
The function used to calculate similarity scores.
|
||||||
|
is_search : bool
|
||||||
|
Flag indicating whether to perform a search operation or discovery.
|
||||||
|
file_name_prefix : str
|
||||||
|
Prefix for naming output files generated during the experiment.
|
||||||
|
folder_path : str
|
||||||
|
Path to the folder where results will be saved.
|
||||||
|
"""
|
||||||
|
in_index_time_start = time.time()
|
||||||
|
initial_ram = measure_ram_usage()
|
||||||
|
|
||||||
|
# Initialize and run the SilkMothEngine
|
||||||
|
silk_moth_engine = SilkMothEngine(
|
||||||
|
related_thresh=0,
|
||||||
|
source_sets=source_sets,
|
||||||
|
sim_metric=sim_metric,
|
||||||
|
sim_func=sim_func,
|
||||||
|
sim_thresh=similarity_threshold,
|
||||||
|
is_check_filter=False,
|
||||||
|
is_nn_filter=False,
|
||||||
|
)
|
||||||
|
# use dichotomy signature scheme for this experiment
|
||||||
|
silk_moth_engine.set_signature_type(SigType.DICHOTOMY)
|
||||||
|
|
||||||
|
in_index_time_end = time.time()
|
||||||
|
final_ram = measure_ram_usage()
|
||||||
|
|
||||||
|
in_index_elapsed_time = in_index_time_end - in_index_time_start
|
||||||
|
in_index_ram_usage = final_ram - initial_ram
|
||||||
|
|
||||||
|
print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")
|
||||||
|
|
||||||
|
elapsed_times_final = []
|
||||||
|
for label in labels:
|
||||||
|
|
||||||
|
if label == "REDUCTION":
|
||||||
|
silk_moth_engine.set_reduction(True)
|
||||||
|
elif label == "NO REDUCTION":
|
||||||
|
silk_moth_engine.set_reduction(False)
|
||||||
|
|
||||||
|
elapsed_times = []
|
||||||
|
for idx, related_thresh in enumerate(related_thresholds):
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"\nRunning SilkMoth {file_name_prefix} with α = {similarity_threshold}, θ = {related_thresh}, label = {label}")
|
||||||
|
|
||||||
|
silk_moth_engine.set_related_threshold(related_thresh)
|
||||||
|
# Measure the time taken to search for related sets
|
||||||
|
time_start = time.time()
|
||||||
|
|
||||||
|
# Used for search to see how many candidates were found and how many were removed
|
||||||
|
candidates_amount = 0
|
||||||
|
candidates_after = 0
|
||||||
|
if is_search:
|
||||||
|
for ref_id, ref_set in enumerate(reference_sets):
|
||||||
|
related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
|
||||||
|
ref_set)
|
||||||
|
candidates_amount += candidates_amount_temp
|
||||||
|
candidates_after += candidates_removed_temp
|
||||||
|
else:
|
||||||
|
# If not searching, we are discovering sets
|
||||||
|
silk_moth_engine.discover_sets(source_sets)
|
||||||
|
|
||||||
|
time_end = time.time()
|
||||||
|
elapsed_time = time_end - time_start
|
||||||
|
|
||||||
|
elapsed_times.append(elapsed_time)
|
||||||
|
|
||||||
|
# Create a new data dictionary for each iteration
|
||||||
|
if is_search:
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": similarity_threshold,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"reference_set_amount": len(reference_sets),
|
||||||
|
"source_set_amount": len(source_sets),
|
||||||
|
"label": label,
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"inverted_index_time": round(in_index_elapsed_time, 3),
|
||||||
|
"inverted_index_ram_usage": round(in_index_ram_usage, 3),
|
||||||
|
"candidates_amount": candidates_amount,
|
||||||
|
"candidates_amount_after_filtering": candidates_after,
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": similarity_threshold,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"source_set_amount": len(source_sets),
|
||||||
|
"label": label,
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"inverted_index_time": round(in_index_elapsed_time, 3),
|
||||||
|
"inverted_index_ram_usage": round(in_index_ram_usage, 3),
|
||||||
|
}
|
||||||
|
|
||||||
|
# Save results to a CSV file
|
||||||
|
save_experiment_results_to_csv(
|
||||||
|
results=data_overall,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
elapsed_times_final.append(elapsed_times)
|
||||||
|
_ = plot_elapsed_times(
|
||||||
|
related_thresholds=related_thresholds,
|
||||||
|
elapsed_times_list=elapsed_times_final,
|
||||||
|
fig_text=f"{file_name_prefix} (α = {similarity_threshold})",
|
||||||
|
legend_labels=labels,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_α={similarity_threshold}.png"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def run_scalability_experiment(related_thresholds, similarity_threshold, set_sizes, source_sets, reference_sets,
|
||||||
|
sim_metric, sim_func, is_search, file_name_prefix, folder_path):
|
||||||
|
"""
|
||||||
|
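Time SilkMoth on growing prefixes of source_sets for each related threshold,
save the results to CSV, and plot the elapsed times.

Parameters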
|
||||||
|
----------
|
||||||
|
related_thresholds : list[float]
|
||||||
|
Thresholds for determining relatedness between sets.
|
||||||
|
similarity_threshold : float
|
||||||
|
Threshold (α) for measuring similarity between set elements.
|
||||||
|
set_sizes : list[int]
|
||||||
|
Numbers of source sets to use; each run takes the first `size` entries of source_sets.
|
||||||
|
source_sets : list[]
|
||||||
|
The sets to be compared against the reference sets or against themselves.
|
||||||
|
reference_sets : list[]
|
||||||
|
The sets used as the reference for comparison.
|
||||||
|
sim_metric : callable
|
||||||
|
The metric function used to evaluate similarity between sets.
|
||||||
|
sim_func : callable
|
||||||
|
The function used to calculate similarity scores.
|
||||||
|
is_search : bool
|
||||||
|
Flag indicating whether to perform a search operation or discovery.
|
||||||
|
file_name_prefix : str
|
||||||
|
Prefix for naming output files generated during the experiment.
|
||||||
|
folder_path : str
|
||||||
|
Path to the folder where results will be saved.
|
||||||
|
"""
|
||||||
|
elapsed_times_final = []
|
||||||
|
for idx, related_thresh in enumerate(related_thresholds):
|
||||||
|
elapsed_times = []
|
||||||
|
for size in set_sizes:
|
||||||
|
in_index_time_start = time.time()
|
||||||
|
initial_ram = measure_ram_usage()
|
||||||
|
|
||||||
|
# Initialize and run the SilkMothEngine
|
||||||
|
silk_moth_engine = SilkMothEngine(
|
||||||
|
related_thresh=0,
|
||||||
|
source_sets=source_sets[:size],
|
||||||
|
sim_metric=sim_metric,
|
||||||
|
sim_func=sim_func,
|
||||||
|
sim_thresh=similarity_threshold,
|
||||||
|
is_check_filter=True,
|
||||||
|
is_nn_filter=True,
|
||||||
|
)
|
||||||
|
in_index_time_end = time.time()
|
||||||
|
final_ram = measure_ram_usage()
|
||||||
|
|
||||||
|
in_index_elapsed_time = in_index_time_end - in_index_time_start
|
||||||
|
in_index_ram_usage = final_ram - initial_ram
|
||||||
|
|
||||||
|
print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")
|
||||||
|
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"\nRunning SilkMoth {file_name_prefix} with α = {similarity_threshold}, θ = {related_thresh}, set_size = {size}")
|
||||||
|
|
||||||
|
silk_moth_engine.set_related_threshold(related_thresh)
|
||||||
|
# Measure the time taken to search for related sets
|
||||||
|
time_start = time.time()
|
||||||
|
|
||||||
|
if sim_func == edit_similarity:
|
||||||
|
# Compute the maximum possible q-gram size from the similarity threshold
|
||||||
|
upper_bound_q = similarity_threshold / (1 - similarity_threshold)
|
||||||
|
q = floor(upper_bound_q)
|
||||||
|
|
||||||
|
print(f"Using q = {q} for edit similarity with sim_thresh = {similarity_threshold}")
|
||||||
|
print(f"Rebuilding Inverted Index with q = {q}...")
|
||||||
|
silk_moth_engine.set_q(q)
|
||||||
|
|
||||||
|
# Used for search to see how many candidates were found and how many were removed
|
||||||
|
candidates_amount = 0
|
||||||
|
candidates_after = 0
|
||||||
|
if is_search:
|
||||||
|
for ref_id, ref_set in enumerate(reference_sets):
|
||||||
|
related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
|
||||||
|
ref_set)
|
||||||
|
candidates_amount += candidates_amount_temp
|
||||||
|
candidates_after += candidates_removed_temp
|
||||||
|
else:
|
||||||
|
# If not searching, we are discovering sets
|
||||||
|
silk_moth_engine.discover_sets(source_sets[:size])
|
||||||
|
|
||||||
|
time_end = time.time()
|
||||||
|
elapsed_time = time_end - time_start
|
||||||
|
|
||||||
|
elapsed_times.append(elapsed_time)
|
||||||
|
|
||||||
|
# Create a new data dictionary for each iteration
|
||||||
|
if is_search:
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": similarity_threshold,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"reference_set_amount": len(reference_sets),
|
||||||
|
"source_set_amount": len(source_sets[:size]),
|
||||||
|
"set_size": size,
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"inverted_index_time": round(in_index_elapsed_time, 3),
|
||||||
|
"inverted_index_ram_usage": round(in_index_ram_usage, 3),
|
||||||
|
"candidates_amount": candidates_amount,
|
||||||
|
"candidates_amount_after_filtering": candidates_after,
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": similarity_threshold,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"source_set_amount": len(source_sets[:size]),
|
||||||
|
"set_size": size,
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"inverted_index_time": round(in_index_elapsed_time, 3),
|
||||||
|
"inverted_index_ram_usage": round(in_index_ram_usage, 3),
|
||||||
|
}
|
||||||
|
|
||||||
|
# Save results to a CSV file
|
||||||
|
save_experiment_results_to_csv(
|
||||||
|
results=data_overall,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
|
||||||
|
)
|
||||||
|
del silk_moth_engine
|
||||||
|
|
||||||
|
elapsed_times_final.append(elapsed_times)
|
||||||
|
|
||||||
|
# Create legend labels from the related thresholds and scale set sizes to units of 100k for the x-axis
|
||||||
|
adjusted_legend_labels = [f"θ = {rt}" for rt in related_thresholds]
|
||||||
|
adjusted_set_sizes = [size / 100_000 for size in set_sizes]
|
||||||
|
_ = plot_elapsed_times(
|
||||||
|
related_thresholds=adjusted_set_sizes,
|
||||||
|
elapsed_times_list=elapsed_times_final,
|
||||||
|
fig_text=f"{file_name_prefix} (α = {similarity_threshold})",
|
||||||
|
legend_labels=adjusted_legend_labels,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_α={similarity_threshold}.png",
|
||||||
|
xlabel="Number of Sets (in 100ks)",
|
||||||
|
)
|
||||||
|
|
||||||
|
def run_matching_without_silkmoth_inc_dep(source_sets, reference_sets, related_thresholds, similarity_threshold, sim_metric, sim_fun, file_name_prefix, folder_path):
|
||||||
|
|
||||||
|
tokenizer = Tokenizer(sim_func=sim_fun)
|
||||||
|
|
||||||
|
for related_thresh in related_thresholds:
|
||||||
|
verifier = Verifier(sim_thresh=similarity_threshold, related_thresh=related_thresh,
|
||||||
|
sim_metric=sim_metric, sim_func=sim_fun, reduction=False)
|
||||||
|
related_sets = []
|
||||||
|
time_start = time.time()
|
||||||
|
for ref in reference_sets:
|
||||||
|
for source in source_sets:
|
||||||
|
if len(ref) > len(source):
|
||||||
|
continue
|
||||||
|
relatedness = verifier.get_relatedness(tokenizer.tokenize(ref), tokenizer.tokenize(source))
|
||||||
|
if relatedness >= related_thresh:
|
||||||
|
related_sets.append((source, relatedness))
|
||||||
|
|
||||||
|
time_end = time.time()
|
||||||
|
elapsed_time = time_end - time_start
|
||||||
|
|
||||||
|
data_overall = {
|
||||||
|
"similarity_threshold": similarity_threshold,
|
||||||
|
"related_threshold": related_thresh,
|
||||||
|
"source_set_amount": len(source_sets),
|
||||||
|
"reference_set_amount": len(reference_sets),
|
||||||
|
"label": "RAW MATCH",
|
||||||
|
"elapsed_time": round(elapsed_time, 3),
|
||||||
|
"matches_found": len(related_sets)
|
||||||
|
}
|
||||||
|
|
||||||
|
# Save results to a CSV file
|
||||||
|
save_experiment_results_to_csv(
|
||||||
|
results=data_overall,
|
||||||
|
file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@@ -0,0 +1,49 @@
|
|||||||
|
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering,related_sets_found
|
||||||
|
0.0,0.7,1000,500000,NO FILTER,1036.548,49.107,7727.559,3006749,3006749,986715
|
||||||
|
0.0,0.75,1000,500000,NO FILTER,871.225,49.107,7727.559,2673348,2673348,964206
|
||||||
|
0.0,0.8,1000,500000,NO FILTER,695.528,49.107,7727.559,2273416,2273416,934002
|
||||||
|
0.0,0.85,1000,500000,NO FILTER,548.878,49.107,7727.559,1907985,1907985,879744
|
||||||
|
0.0,0.7,1000,500000,CHECK FILTER,980.124,49.107,7727.559,3006749,2852034,986715
|
||||||
|
0.0,0.75,1000,500000,CHECK FILTER,789.947,49.107,7727.559,2673348,2531660,964206
|
||||||
|
0.0,0.8,1000,500000,CHECK FILTER,590.707,49.107,7727.559,2273416,2107346,934002
|
||||||
|
0.0,0.85,1000,500000,CHECK FILTER,427.982,49.107,7727.559,1907985,1728877,879744
|
||||||
|
0.0,0.7,1000,500000,NN FILTER,533.776,49.107,7727.559,3006749,2547,2535
|
||||||
|
0.0,0.75,1000,500000,NN FILTER,448.358,49.107,7727.559,2673348,2394,2382
|
||||||
|
0.0,0.8,1000,500000,NN FILTER,359.112,49.107,7727.559,2273416,1077,1077
|
||||||
|
0.0,0.85,1000,500000,NN FILTER,268.529,49.107,7727.559,1907985,1037,1037
|
||||||
|
0.25,0.7,1000,500000,NO FILTER,1038.225,49.107,7727.559,3006749,3006749,984756
|
||||||
|
0.25,0.75,1000,500000,NO FILTER,866.06,49.107,7727.559,2673348,2673348,963792
|
||||||
|
0.25,0.8,1000,500000,NO FILTER,693.589,49.107,7727.559,2273416,2273416,933799
|
||||||
|
0.25,0.85,1000,500000,NO FILTER,545.784,49.107,7727.559,1907985,1907985,878482
|
||||||
|
0.25,0.7,1000,500000,CHECK FILTER,975.103,49.107,7727.559,3006749,2852028,984756
|
||||||
|
0.25,0.75,1000,500000,CHECK FILTER,787.87,49.107,7727.559,2673348,2531660,963792
|
||||||
|
0.25,0.8,1000,500000,CHECK FILTER,589.608,49.107,7727.559,2273416,2107346,933799
|
||||||
|
0.25,0.85,1000,500000,CHECK FILTER,426.222,49.107,7727.559,1907985,1728877,878482
|
||||||
|
0.25,0.7,1000,500000,NN FILTER,573.448,49.107,7727.559,3006749,2544,2532
|
||||||
|
0.25,0.75,1000,500000,NN FILTER,483.1,49.107,7727.559,2673348,2394,2382
|
||||||
|
0.25,0.8,1000,500000,NN FILTER,385.999,49.107,7727.559,2273416,1077,1077
|
||||||
|
0.25,0.85,1000,500000,NN FILTER,288.687,49.107,7727.559,1907985,1037,1037
|
||||||
|
0.5,0.7,1000,500000,NO FILTER,1031.681,49.107,7727.559,3006749,3006749,975892
|
||||||
|
0.5,0.75,1000,500000,NO FILTER,867.694,49.107,7727.559,2673348,2673348,951793
|
||||||
|
0.5,0.8,1000,500000,NO FILTER,693.398,49.107,7727.559,2273416,2273416,931599
|
||||||
|
0.5,0.85,1000,500000,NO FILTER,546.702,49.107,7727.559,1907985,1907985,875833
|
||||||
|
0.5,0.7,1000,500000,CHECK FILTER,971.71,49.107,7727.559,3006749,2848668,975892
|
||||||
|
0.5,0.75,1000,500000,CHECK FILTER,783.145,49.107,7727.559,2673348,2529966,951793
|
||||||
|
0.5,0.8,1000,500000,CHECK FILTER,585.346,49.107,7727.559,2273416,2106355,931599
|
||||||
|
0.5,0.85,1000,500000,CHECK FILTER,424.629,49.107,7727.559,1907985,1728640,875833
|
||||||
|
0.5,0.7,1000,500000,NN FILTER,573.046,49.107,7727.559,3006749,2544,2532
|
||||||
|
0.5,0.75,1000,500000,NN FILTER,482.035,49.107,7727.559,2673348,2394,2382
|
||||||
|
0.5,0.8,1000,500000,NN FILTER,385.754,49.107,7727.559,2273416,1077,1077
|
||||||
|
0.5,0.85,1000,500000,NN FILTER,288.24,49.107,7727.559,1907985,1037,1037
|
||||||
|
0.75,0.7,1000,500000,NO FILTER,1032.605,49.107,7727.559,3006749,3006749,973885
|
||||||
|
0.75,0.75,1000,500000,NO FILTER,866.218,49.107,7727.559,2673348,2673348,949627
|
||||||
|
0.75,0.8,1000,500000,NO FILTER,693.19,49.107,7727.559,2273416,2273416,929232
|
||||||
|
0.75,0.85,1000,500000,NO FILTER,548.07,49.107,7727.559,1907985,1907985,875163
|
||||||
|
0.75,0.7,1000,500000,CHECK FILTER,960.003,49.107,7727.559,3006749,2838145,973885
|
||||||
|
0.75,0.75,1000,500000,CHECK FILTER,773.8,49.107,7727.559,2673348,2519134,949627
|
||||||
|
0.75,0.8,1000,500000,CHECK FILTER,577.671,49.107,7727.559,2273416,2100303,929232
|
||||||
|
0.75,0.85,1000,500000,CHECK FILTER,417.292,49.107,7727.559,1907985,1725354,875163
|
||||||
|
0.75,0.7,1000,500000,NN FILTER,544.018,49.107,7727.559,3006749,2544,2532
|
||||||
|
0.75,0.75,1000,500000,NN FILTER,463.915,49.107,7727.559,2673348,2394,2382
|
||||||
|
0.75,0.8,1000,500000,NN FILTER,378.184,49.107,7727.559,2273416,1077,1077
|
||||||
|
0.75,0.85,1000,500000,NN FILTER,285.8,49.107,7727.559,1907985,1040,1040
|
||||||
|
@@ -0,0 +1,49 @@
|
|||||||
|
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering,related_sets_found
|
||||||
|
0.0,0.7,200,500000,NO FILTER,6753.593,49.277,7720.887,622080,622080,233513
|
||||||
|
0.0,0.75,200,500000,NO FILTER,6812.967,49.277,7720.887,575078,575078,223644
|
||||||
|
0.0,0.8,200,500000,NO FILTER,4953.635,49.277,7720.887,479650,479650,221376
|
||||||
|
0.0,0.85,200,500000,NO FILTER,4212.413,49.277,7720.887,423078,423078,196944
|
||||||
|
0.0,0.7,200,500000,CHECK FILTER,3835.233,49.277,7720.887,622080,589307,233513
|
||||||
|
0.0,0.75,200,500000,CHECK FILTER,3348.061,49.277,7720.887,575078,549687,223644
|
||||||
|
0.0,0.8,200,500000,CHECK FILTER,2414.995,49.277,7720.887,479650,438680,221376
|
||||||
|
0.0,0.85,200,500000,CHECK FILTER,1874.261,49.277,7720.887,423078,393028,196944
|
||||||
|
0.0,0.7,200,500000,NN FILTER,126.601,49.277,7720.887,622080,615,603
|
||||||
|
0.0,0.75,200,500000,NN FILTER,108.886,49.277,7720.887,575078,332,320
|
||||||
|
0.0,0.8,200,500000,NN FILTER,80.436,49.277,7720.887,479650,1,1
|
||||||
|
0.0,0.85,200,500000,NN FILTER,59.824,49.277,7720.887,423078,1,1
|
||||||
|
0.25,0.7,200,500000,NO FILTER,2191.216,49.277,7720.887,622080,622080,232290
|
||||||
|
0.25,0.75,200,500000,NO FILTER,1915.087,49.277,7720.887,575078,575078,223444
|
||||||
|
0.25,0.8,200,500000,NO FILTER,1544.113,49.277,7720.887,479650,479650,221284
|
||||||
|
0.25,0.85,200,500000,NO FILTER,1354.29,49.277,7720.887,423078,423078,196116
|
||||||
|
0.25,0.7,200,500000,CHECK FILTER,1809.643,49.277,7720.887,622080,589307,232290
|
||||||
|
0.25,0.75,200,500000,CHECK FILTER,1548.963,49.277,7720.887,575078,549687,223444
|
||||||
|
0.25,0.8,200,500000,CHECK FILTER,1277.618,49.277,7720.887,479650,438680,221284
|
||||||
|
0.25,0.85,200,500000,CHECK FILTER,1111.088,49.277,7720.887,423078,393028,196116
|
||||||
|
0.25,0.7,200,500000,NN FILTER,131.183,49.277,7720.887,622080,615,603
|
||||||
|
0.25,0.75,200,500000,NN FILTER,114.192,49.277,7720.887,575078,332,320
|
||||||
|
0.25,0.8,200,500000,NN FILTER,84.253,49.277,7720.887,479650,1,1
|
||||||
|
0.25,0.85,200,500000,NN FILTER,62.864,49.277,7720.887,423078,1,1
|
||||||
|
0.5,0.7,200,500000,NO FILTER,1682.409,49.277,7720.887,622080,622080,230903
|
||||||
|
0.5,0.75,200,500000,NO FILTER,1491.797,49.277,7720.887,575078,575078,222613
|
||||||
|
0.5,0.8,200,500000,NO FILTER,1250.727,49.277,7720.887,479650,479650,219875
|
||||||
|
0.5,0.85,200,500000,NO FILTER,1083.762,49.277,7720.887,423078,423078,195759
|
||||||
|
0.5,0.7,200,500000,CHECK FILTER,1436.208,49.277,7720.887,622080,588701,230903
|
||||||
|
0.5,0.75,200,500000,CHECK FILTER,1250.22,49.277,7720.887,575078,549178,222613
|
||||||
|
0.5,0.8,200,500000,CHECK FILTER,1023.904,49.277,7720.887,479650,438258,219875
|
||||||
|
0.5,0.85,200,500000,CHECK FILTER,893.938,49.277,7720.887,423078,392937,195759
|
||||||
|
0.5,0.7,200,500000,NN FILTER,129.51,49.277,7720.887,622080,615,603
|
||||||
|
0.5,0.75,200,500000,NN FILTER,112.158,49.277,7720.887,575078,332,320
|
||||||
|
0.5,0.8,200,500000,NN FILTER,83.434,49.277,7720.887,479650,1,1
|
||||||
|
0.5,0.85,200,500000,NN FILTER,62.648,49.277,7720.887,423078,1,1
|
||||||
|
0.75,0.7,200,500000,NO FILTER,1447.675,49.277,7720.887,622080,622080,230497
|
||||||
|
0.75,0.75,200,500000,NO FILTER,1270.052,49.277,7720.887,575078,575078,222063
|
||||||
|
0.75,0.8,200,500000,NO FILTER,1039.89,49.277,7720.887,479650,479650,219411
|
||||||
|
0.75,0.85,200,500000,NO FILTER,879.273,49.277,7720.887,423078,423078,195601
|
||||||
|
0.75,0.7,200,500000,CHECK FILTER,1193.541,49.277,7720.887,622080,586297,230497
|
||||||
|
0.75,0.75,200,500000,CHECK FILTER,1023.672,49.277,7720.887,575078,546701,222063
|
||||||
|
0.75,0.8,200,500000,CHECK FILTER,825.541,49.277,7720.887,479650,436782,219411
|
||||||
|
0.75,0.85,200,500000,CHECK FILTER,704.52,49.277,7720.887,423078,391809,195601
|
||||||
|
0.75,0.7,200,500000,NN FILTER,120.522,49.277,7720.887,622080,615,603
|
||||||
|
0.75,0.75,200,500000,NN FILTER,107.657,49.277,7720.887,575078,332,320
|
||||||
|
0.75,0.8,200,500000,NN FILTER,78.897,49.277,7720.887,479650,1,1
|
||||||
|
0.75,0.85,200,500000,NN FILTER,57.66,49.277,7720.887,423078,1,1
|
||||||
|
@@ -0,0 +1,2 @@
|
|||||||
|
experiment name,elem/set,tokens/elem
|
||||||
|
Inclusion Dependency,17.81003,25.41035090901026
|
||||||
|
@@ -0,0 +1,17 @@
|
|||||||
|
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||||
|
0.0,0.7,200,500000,REDUCTION,6283.871,45.782,7700.914,622080,622080
|
||||||
|
0.0,0.75,200,500000,REDUCTION,5651.069,45.782,7700.914,575078,575078
|
||||||
|
0.0,0.8,200,500000,REDUCTION,4170.768,45.782,7700.914,479650,479650
|
||||||
|
0.0,0.85,200,500000,REDUCTION,3514.723,45.782,7700.914,423078,423078
|
||||||
|
0.0,0.7,200,500000,NO REDUCTION,6771.001,45.782,7700.914,622080,622080
|
||||||
|
0.0,0.75,200,500000,NO REDUCTION,6117.305,45.782,7700.914,575078,575078
|
||||||
|
0.0,0.8,200,500000,NO REDUCTION,4573.585,45.782,7700.914,479650,479650
|
||||||
|
0.0,0.85,200,500000,NO REDUCTION,3894.681,45.782,7700.914,423078,423078
|
||||||
|
0.0,0.7,200,500000,REDUCTION,6142.242,49.376,7721.383,622080,622080
|
||||||
|
0.0,0.75,200,500000,REDUCTION,5495.346,49.376,7721.383,575078,575078
|
||||||
|
0.0,0.8,200,500000,REDUCTION,4061.815,49.376,7721.383,479650,479650
|
||||||
|
0.0,0.85,200,500000,REDUCTION,3429.474,49.376,7721.383,423078,423078
|
||||||
|
0.0,0.7,200,500000,NO REDUCTION,6622.959,49.376,7721.383,622080,622080
|
||||||
|
0.0,0.75,200,500000,NO REDUCTION,5960.971,49.376,7721.383,575078,575078
|
||||||
|
0.0,0.8,200,500000,NO REDUCTION,4489.11,49.376,7721.383,479650,479650
|
||||||
|
0.0,0.85,200,500000,NO REDUCTION,3794.505,49.376,7721.383,423078,423078
|
||||||
|
@@ -0,0 +1,21 @@
|
|||||||
|
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||||
|
0.5,0.7,200,100000,100000,69.222,11.405,1554.535,134576,46830
|
||||||
|
0.5,0.7,200,200000,200000,134.718,23.409,1659.543,254379,93573
|
||||||
|
0.5,0.7,200,300000,300000,206.136,32.782,1791.512,373007,139377
|
||||||
|
0.5,0.7,200,400000,400000,275.559,51.827,2040.961,499998,186205
|
||||||
|
0.5,0.7,200,500000,500000,353.944,51.169,2027.262,622080,233091
|
||||||
|
0.5,0.75,200,100000,100000,64.988,5.539,0.254,124611,45115
|
||||||
|
0.5,0.75,200,200000,200000,126.721,24.159,192.152,236137,90048
|
||||||
|
0.5,0.75,200,300000,300000,193.126,32.91,2217.562,347108,134199
|
||||||
|
0.5,0.75,200,400000,400000,259.254,50.945,1535.723,462815,179223
|
||||||
|
0.5,0.75,200,500000,500000,328.0,59.734,2526.176,575078,224315
|
||||||
|
0.5,0.8,200,100000,100000,59.984,5.544,0.77,104812,44549
|
||||||
|
0.5,0.8,200,200000,200000,123.595,23.419,-229.445,202489,88907
|
||||||
|
0.5,0.8,200,300000,300000,183.55,37.277,2302.273,300462,132525
|
||||||
|
0.5,0.8,200,400000,400000,239.431,45.86,1268.406,386895,176985
|
||||||
|
0.5,0.8,200,500000,500000,311.525,58.657,2716.348,479650,221057
|
||||||
|
0.5,0.85,200,100000,100000,56.371,9.486,-151.641,87451,39657
|
||||||
|
0.5,0.85,200,200000,200000,108.674,23.698,-889.457,171938,79056
|
||||||
|
0.5,0.85,200,300000,300000,164.616,33.799,2748.523,251392,117969
|
||||||
|
0.5,0.85,200,400000,400000,220.908,45.263,805.023,331901,157572
|
||||||
|
0.5,0.85,200,500000,500000,281.56,65.197,3474.547,423078,197145
|
||||||
|
@@ -0,0 +1,49 @@
|
|||||||
|
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||||
|
0.0,0.7,200,500000,SigType.WEIGHTED,6915.71,47.599,7701.59,622080,622080
|
||||||
|
0.0,0.75,200,500000,SigType.WEIGHTED,6230.769,47.599,7701.59,575078,575078
|
||||||
|
0.0,0.8,200,500000,SigType.WEIGHTED,4633.178,47.599,7701.59,479650,479650
|
||||||
|
0.0,0.85,200,500000,SigType.WEIGHTED,3948.011,47.599,7701.59,423078,423078
|
||||||
|
0.0,0.7,200,500000,SigType.SKYLINE,6839.554,47.599,7701.59,622080,622080
|
||||||
|
0.0,0.75,200,500000,SigType.SKYLINE,6156.19,47.599,7701.59,575078,575078
|
||||||
|
0.0,0.8,200,500000,SigType.SKYLINE,4601.987,47.599,7701.59,479650,479650
|
||||||
|
0.0,0.85,200,500000,SigType.SKYLINE,3921.286,47.599,7701.59,423078,423078
|
||||||
|
0.0,0.7,200,500000,SigType.DICHOTOMY,6824.442,47.599,7701.59,622080,622080
|
||||||
|
0.0,0.75,200,500000,SigType.DICHOTOMY,6158.089,47.599,7701.59,575078,575078
|
||||||
|
0.0,0.8,200,500000,SigType.DICHOTOMY,4601.877,47.599,7701.59,479650,479650
|
||||||
|
0.0,0.85,200,500000,SigType.DICHOTOMY,3923.695,47.599,7701.59,423078,423078
|
||||||
|
0.25,0.7,200,500000,SigType.WEIGHTED,1990.666,47.599,7701.59,622080,622080
|
||||||
|
0.25,0.75,200,500000,SigType.WEIGHTED,1722.451,47.599,7701.59,575078,575078
|
||||||
|
0.25,0.8,200,500000,SigType.WEIGHTED,1438.235,47.599,7701.59,479650,479650
|
||||||
|
0.25,0.85,200,500000,SigType.WEIGHTED,1264.852,47.599,7701.59,423078,423078
|
||||||
|
0.25,0.7,200,500000,SigType.SKYLINE,1989.546,47.599,7701.59,622080,622080
|
||||||
|
0.25,0.75,200,500000,SigType.SKYLINE,1719.169,47.599,7701.59,575078,575078
|
||||||
|
0.25,0.8,200,500000,SigType.SKYLINE,1440.077,47.599,7701.59,479650,479650
|
||||||
|
0.25,0.85,200,500000,SigType.SKYLINE,1267.701,47.599,7701.59,423078,423078
|
||||||
|
0.25,0.7,200,500000,SigType.DICHOTOMY,2046.949,47.599,7701.59,622270,622270
|
||||||
|
0.25,0.75,200,500000,SigType.DICHOTOMY,1966.499,47.599,7701.59,575268,575268
|
||||||
|
0.25,0.8,200,500000,SigType.DICHOTOMY,1485.458,47.599,7701.59,479650,479650
|
||||||
|
0.25,0.85,200,500000,SigType.DICHOTOMY,1436.847,47.599,7701.59,423078,423078
|
||||||
|
0.5,0.7,200,500000,SigType.WEIGHTED,1767.439,47.599,7701.59,622080,622080
|
||||||
|
0.5,0.75,200,500000,SigType.WEIGHTED,1565.259,47.599,7701.59,575078,575078
|
||||||
|
0.5,0.8,200,500000,SigType.WEIGHTED,1160.579,47.599,7701.59,479650,479650
|
||||||
|
0.5,0.85,200,500000,SigType.WEIGHTED,1014.452,47.599,7701.59,423078,423078
|
||||||
|
0.5,0.7,200,500000,SigType.SKYLINE,1589.081,47.599,7701.59,622054,622054
|
||||||
|
0.5,0.75,200,500000,SigType.SKYLINE,1393.117,47.599,7701.59,575050,575050
|
||||||
|
0.5,0.8,200,500000,SigType.SKYLINE,1154.931,47.599,7701.59,479622,479622
|
||||||
|
0.5,0.85,200,500000,SigType.SKYLINE,1025.061,47.599,7701.59,423078,423078
|
||||||
|
0.5,0.7,200,500000,SigType.DICHOTOMY,2777.528,47.599,7701.59,936785,936785
|
||||||
|
0.5,0.75,200,500000,SigType.DICHOTOMY,2340.389,47.599,7701.59,888736,888736
|
||||||
|
0.5,0.8,200,500000,SigType.DICHOTOMY,1678.145,47.599,7701.59,673929,673929
|
||||||
|
0.5,0.85,200,500000,SigType.DICHOTOMY,1374.518,47.599,7701.59,517483,517483
|
||||||
|
0.75,0.7,200,500000,SigType.WEIGHTED,1354.402,47.599,7701.59,622080,622080
|
||||||
|
0.75,0.75,200,500000,SigType.WEIGHTED,1187.603,47.599,7701.59,575078,575078
|
||||||
|
0.75,0.8,200,500000,SigType.WEIGHTED,971.469,47.599,7701.59,479650,479650
|
||||||
|
0.75,0.85,200,500000,SigType.WEIGHTED,822.075,47.599,7701.59,423078,423078
|
||||||
|
0.75,0.7,200,500000,SigType.SKYLINE,1303.676,47.599,7701.59,594466,594466
|
||||||
|
0.75,0.75,200,500000,SigType.SKYLINE,1152.405,47.599,7701.59,560020,560020
|
||||||
|
0.75,0.8,200,500000,SigType.SKYLINE,932.283,47.599,7701.59,467458,467458
|
||||||
|
0.75,0.85,200,500000,SigType.SKYLINE,816.709,47.599,7701.59,420962,420962
|
||||||
|
0.75,0.7,200,500000,SigType.DICHOTOMY,5710.524,47.599,7701.59,2410732,2410732
|
||||||
|
0.75,0.75,200,500000,SigType.DICHOTOMY,5072.603,47.599,7701.59,2145096,2145096
|
||||||
|
0.75,0.8,200,500000,SigType.DICHOTOMY,4403.341,47.599,7701.59,1739362,1739362
|
||||||
|
0.75,0.85,200,500000,SigType.DICHOTOMY,2735.424,47.599,7701.59,1078937,1078937
|
||||||
|
@@ -0,0 +1,5 @@
similarity_threshold,related_threshold,source_set_amount,reference_set_amount,label,elapsed_time,matches_found
0.5,0.7,500000,200,RAW MATCH,6945.364,230903
0.5,0.75,500000,200,RAW MATCH,6965.759,222613
0.5,0.8,500000,200,RAW MATCH,6974.576,219875
0.5,0.85,500000,200,RAW MATCH,7011.368,195759
@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
0.0,0.7,60000,60000,NO FILTER,3321.166,2.336,115.465,3055067,3055067
0.0,0.75,60000,60000,NO FILTER,1997.976,2.336,115.465,2321584,2321584
0.0,0.8,60000,60000,NO FILTER,1226.647,2.336,115.465,1265300,1265300
0.0,0.85,60000,60000,NO FILTER,530.302,2.336,115.465,642202,642202
0.0,0.7,60000,60000,CHECK FILTER,3766.567,2.336,115.465,3055067,2464704
0.0,0.75,60000,60000,CHECK FILTER,2241.664,2.336,115.465,2321584,1780582
0.0,0.8,60000,60000,CHECK FILTER,1371.372,2.336,115.465,1265300,936432
0.0,0.85,60000,60000,CHECK FILTER,2052.574,2.336,115.465,642202,523745
0.0,0.7,60000,60000,NN FILTER,1752.545,2.336,115.465,3055067,0
0.0,0.75,60000,60000,NN FILTER,1410.607,2.336,115.465,2321584,0
0.0,0.8,60000,60000,NN FILTER,817.098,2.336,115.465,1265300,0
0.0,0.85,60000,60000,NN FILTER,450.277,2.336,115.465,642202,0
0.25,0.7,60000,60000,NO FILTER,4295.794,2.336,115.465,3055067,3055067
0.25,0.75,60000,60000,NO FILTER,1973.377,2.336,115.465,2321584,2321584
0.25,0.8,60000,60000,NO FILTER,1212.983,2.336,115.465,1265300,1265300
0.25,0.85,60000,60000,NO FILTER,522.616,2.336,115.465,642202,642202
0.25,0.7,60000,60000,CHECK FILTER,3200.851,2.336,115.465,3055067,2455726
0.25,0.75,60000,60000,CHECK FILTER,1889.267,2.336,115.465,2321584,1770634
0.25,0.8,60000,60000,CHECK FILTER,1147.932,2.336,115.465,1265300,928712
0.25,0.85,60000,60000,CHECK FILTER,498.44,2.336,115.465,642202,522759
0.25,0.7,60000,60000,NN FILTER,122.104,2.336,115.465,3055067,0
0.25,0.75,60000,60000,NN FILTER,88.259,2.336,115.465,2321584,0
0.25,0.8,60000,60000,NN FILTER,49.714,2.336,115.465,1265300,0
0.25,0.85,60000,60000,NN FILTER,23.838,2.336,115.465,642202,0
0.5,0.7,60000,60000,NO FILTER,3272.056,2.336,115.465,3055067,3055067
0.5,0.75,60000,60000,NO FILTER,1961.328,2.336,115.465,2321584,2321584
0.5,0.8,60000,60000,NO FILTER,1200.994,2.336,115.465,1265300,1265300
0.5,0.85,60000,60000,NO FILTER,511.108,2.336,115.465,642202,642202
0.5,0.7,60000,60000,CHECK FILTER,3183.991,2.336,115.465,3055067,2437997
0.5,0.75,60000,60000,CHECK FILTER,1875.468,2.336,115.465,2321584,1756738
0.5,0.8,60000,60000,CHECK FILTER,1137.157,2.336,115.465,1265300,918967
0.5,0.85,60000,60000,CHECK FILTER,488.508,2.336,115.465,642202,517859
0.5,0.7,60000,60000,NN FILTER,120.567,2.336,115.465,3055067,0
0.5,0.75,60000,60000,NN FILTER,87.173,2.336,115.465,2321584,0
0.5,0.8,60000,60000,NN FILTER,49.292,2.336,115.465,1265300,0
0.5,0.85,60000,60000,NN FILTER,23.97,2.336,115.465,642202,0
0.75,0.7,60000,60000,NO FILTER,3085.617,2.336,115.465,3055067,3055067
0.75,0.75,60000,60000,NO FILTER,1788.559,2.336,115.465,2321584,2321584
0.75,0.8,60000,60000,NO FILTER,1046.714,2.336,115.465,1265300,1265300
0.75,0.85,60000,60000,NO FILTER,481.793,2.336,115.465,642202,642202
0.75,0.7,60000,60000,CHECK FILTER,2991.745,2.336,115.465,3055067,2428269
0.75,0.75,60000,60000,CHECK FILTER,1699.433,2.336,115.465,2321584,1750589
0.75,0.8,60000,60000,CHECK FILTER,983.657,2.336,115.465,1265300,916628
0.75,0.85,60000,60000,CHECK FILTER,458.081,2.336,115.465,642202,516012
0.75,0.7,60000,60000,NN FILTER,119.557,2.336,115.465,3055067,0
0.75,0.75,60000,60000,NN FILTER,86.338,2.336,115.465,2321584,0
0.75,0.8,60000,60000,NN FILTER,48.63,2.336,115.465,1265300,0
0.75,0.85,60000,60000,NN FILTER,23.63,2.336,115.465,642202,0
[Image previews omitted: 4 files, 198 KiB, 164 KiB, 171 KiB, 173 KiB]
@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.0,0.7,60000,NO FILTER,5210.037,1.383,95.605
0.0,0.75,60000,NO FILTER,4654.41,1.383,95.605
0.0,0.8,60000,NO FILTER,3891.372,1.383,95.605
0.0,0.85,60000,NO FILTER,3561.118,1.383,95.605
0.0,0.7,60000,CHECK FILTER,5374.941,1.383,95.605
0.0,0.75,60000,CHECK FILTER,4772.542,1.383,95.605
0.0,0.8,60000,CHECK FILTER,4004.38,1.383,95.605
0.0,0.85,60000,CHECK FILTER,3653.843,1.383,95.605
0.0,0.7,60000,NN FILTER,3889.903,1.383,95.605
0.0,0.75,60000,NN FILTER,3739.136,1.383,95.605
0.0,0.8,60000,NN FILTER,3609.17,1.383,95.605
0.0,0.85,60000,NN FILTER,3517.33,1.383,95.605
0.25,0.7,60000,NO FILTER,5157.674,1.383,95.605
0.25,0.75,60000,NO FILTER,4621.14,1.383,95.605
0.25,0.8,60000,NO FILTER,3905.856,1.383,95.605
0.25,0.85,60000,NO FILTER,3598.239,1.383,95.605
0.25,0.7,60000,CHECK FILTER,5331.451,1.383,95.605
0.25,0.75,60000,CHECK FILTER,4769.428,1.383,95.605
0.25,0.8,60000,CHECK FILTER,4042.779,1.383,95.605
0.25,0.85,60000,CHECK FILTER,3709.669,1.383,95.605
0.25,0.7,60000,NN FILTER,3910.54,1.383,95.605
0.25,0.75,60000,NN FILTER,3760.587,1.383,95.605
0.25,0.8,60000,NN FILTER,3644.443,1.383,95.605
0.25,0.85,60000,NN FILTER,3558.579,1.383,95.605
0.5,0.7,60000,NO FILTER,5143.478,1.383,95.605
0.5,0.75,60000,NO FILTER,4670.328,1.383,95.605
0.5,0.8,60000,NO FILTER,3917.002,1.383,95.605
0.5,0.85,60000,NO FILTER,3556.487,1.383,95.605
0.5,0.7,60000,CHECK FILTER,5279.287,1.383,95.605
0.5,0.75,60000,CHECK FILTER,4749.58,1.383,95.605
0.5,0.8,60000,CHECK FILTER,4009.224,1.383,95.605
0.5,0.85,60000,CHECK FILTER,3659.874,1.383,95.605
0.5,0.7,60000,NN FILTER,3897.174,1.383,95.605
0.5,0.75,60000,NN FILTER,3771.733,1.383,95.605
0.5,0.8,60000,NN FILTER,3657.094,1.383,95.605
0.5,0.85,60000,NN FILTER,3553.523,1.383,95.605
0.75,0.7,60000,NO FILTER,5107.903,1.383,95.605
0.75,0.75,60000,NO FILTER,4582.298,1.383,95.605
0.75,0.8,60000,NO FILTER,3889.505,1.383,95.605
0.75,0.85,60000,NO FILTER,3559.531,1.383,95.605
0.75,0.7,60000,CHECK FILTER,5254.747,1.383,95.605
0.75,0.75,60000,CHECK FILTER,4722.922,1.383,95.605
0.75,0.8,60000,CHECK FILTER,3977.968,1.383,95.605
0.75,0.85,60000,CHECK FILTER,3635.288,1.383,95.605
0.75,0.7,60000,NN FILTER,3874.915,1.383,95.605
0.75,0.75,60000,NN FILTER,3786.562,1.383,95.605
0.75,0.8,60000,NN FILTER,3901.219,1.383,95.605
0.75,0.85,60000,NN FILTER,3541.992,1.383,95.605
[Image previews omitted: 4 files, 193 KiB, 193 KiB, 189 KiB, 188 KiB]
@@ -0,0 +1,2 @@
experiment name,elem/set,tokens/elem
Schema Matching,4.839676,7.059130404597332
@@ -0,0 +1,21 @@
similarity_threshold,related_threshold,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.0,0.7,12000,12000,162.511,1.149,10.633
0.0,0.7,24000,24000,629.266,0.912,-14.359
0.0,0.7,36000,36000,1448.696,1.047,-3.805
0.0,0.7,48000,48000,2589.084,0.36,8.324
0.0,0.7,60000,60000,4018.602,1.276,30.07
0.0,0.75,12000,12000,156.237,0.079,0.0
0.0,0.75,24000,24000,601.804,0.166,0.0
0.0,0.75,36000,36000,1391.051,0.258,14.434
0.0,0.75,48000,48000,2485.407,1.142,23.73
0.0,0.75,60000,60000,3865.861,1.259,20.078
0.0,0.8,12000,12000,150.844,0.075,0.0
0.0,0.8,24000,24000,579.687,0.169,0.0
0.0,0.8,36000,36000,1337.54,0.259,6.953
0.0,0.8,48000,48000,2393.576,0.365,29.129
0.0,0.8,60000,60000,3731.672,1.298,29.992
0.0,0.85,12000,12000,146.417,0.077,0.0
0.0,0.85,24000,24000,565.317,0.903,-2.0
0.0,0.85,36000,36000,1303.856,1.025,7.91
0.0,0.85,48000,48000,2328.478,1.158,11.004
0.0,0.85,60000,60000,3636.522,1.285,28.184
[Image preview omitted: 1 file, 248 KiB]
@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.0,0.7,60000,SigType.WEIGHTED,5355.864,1.44,96.559
0.0,0.75,60000,SigType.WEIGHTED,4770.741,1.44,96.559
0.0,0.8,60000,SigType.WEIGHTED,4016.552,1.44,96.559
0.0,0.85,60000,SigType.WEIGHTED,3652.589,1.44,96.559
0.0,0.7,60000,SigType.SKYLINE,5320.789,1.44,96.559
0.0,0.75,60000,SigType.SKYLINE,4754.873,1.44,96.559
0.0,0.8,60000,SigType.SKYLINE,3993.905,1.44,96.559
0.0,0.85,60000,SigType.SKYLINE,3637.896,1.44,96.559
0.0,0.7,60000,SigType.DICHOTOMY,5314.17,1.44,96.559
0.0,0.75,60000,SigType.DICHOTOMY,4747.451,1.44,96.559
0.0,0.8,60000,SigType.DICHOTOMY,3987.966,1.44,96.559
0.0,0.85,60000,SigType.DICHOTOMY,3639.406,1.44,96.559
0.25,0.7,60000,SigType.WEIGHTED,5286.204,1.44,96.559
0.25,0.75,60000,SigType.WEIGHTED,4740.2,1.44,96.559
0.25,0.8,60000,SigType.WEIGHTED,3988.353,1.44,96.559
0.25,0.85,60000,SigType.WEIGHTED,3621.661,1.44,96.559
0.25,0.7,60000,SigType.SKYLINE,5272.151,1.44,96.559
0.25,0.75,60000,SigType.SKYLINE,4793.404,1.44,96.559
0.25,0.8,60000,SigType.SKYLINE,4270.868,1.44,96.559
0.25,0.85,60000,SigType.SKYLINE,3897.66,1.44,96.559
0.25,0.7,60000,SigType.DICHOTOMY,5280.093,1.44,96.559
0.25,0.75,60000,SigType.DICHOTOMY,4728.997,1.44,96.559
0.25,0.8,60000,SigType.DICHOTOMY,3971.004,1.44,96.559
0.25,0.85,60000,SigType.DICHOTOMY,3612.607,1.44,96.559
0.5,0.7,60000,SigType.WEIGHTED,5191.199,1.44,96.559
0.5,0.75,60000,SigType.WEIGHTED,4656.862,1.44,96.559
0.5,0.8,60000,SigType.WEIGHTED,3920.386,1.44,96.559
0.5,0.85,60000,SigType.WEIGHTED,3580.435,1.44,96.559
0.5,0.7,60000,SigType.SKYLINE,5180.493,1.44,96.559
0.5,0.75,60000,SigType.SKYLINE,4622.431,1.44,96.559
0.5,0.8,60000,SigType.SKYLINE,3871.093,1.44,96.559
0.5,0.85,60000,SigType.SKYLINE,3525.577,1.44,96.559
0.5,0.7,60000,SigType.DICHOTOMY,5112.984,1.44,96.559
0.5,0.75,60000,SigType.DICHOTOMY,4605.999,1.44,96.559
0.5,0.8,60000,SigType.DICHOTOMY,3876.706,1.44,96.559
0.5,0.85,60000,SigType.DICHOTOMY,3526.946,1.44,96.559
0.75,0.7,60000,SigType.WEIGHTED,5031.754,1.44,96.559
0.75,0.75,60000,SigType.WEIGHTED,4539.266,1.44,96.559
0.75,0.8,60000,SigType.WEIGHTED,3854.313,1.44,96.559
0.75,0.85,60000,SigType.WEIGHTED,3529.814,1.44,96.559
0.75,0.7,60000,SigType.SKYLINE,5037.338,1.44,96.559
0.75,0.75,60000,SigType.SKYLINE,4546.784,1.44,96.559
0.75,0.8,60000,SigType.SKYLINE,3843.47,1.44,96.559
0.75,0.85,60000,SigType.SKYLINE,3524.44,1.44,96.559
0.75,0.7,60000,SigType.DICHOTOMY,5252.169,1.44,96.559
0.75,0.75,60000,SigType.DICHOTOMY,4699.463,1.44,96.559
0.75,0.8,60000,SigType.DICHOTOMY,3928.414,1.44,96.559
0.75,0.85,60000,SigType.DICHOTOMY,3565.332,1.44,96.559
[Image previews omitted: 4 files, 207 KiB, 211 KiB, 219 KiB, 210 KiB]
@@ -0,0 +1,13 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.8,0.7,10000,NO FILTER,3180.351,0.686,63.961
0.8,0.75,10000,NO FILTER,2729.108,0.686,63.961
0.8,0.8,10000,NO FILTER,2185.09,0.686,63.961
0.8,0.85,10000,NO FILTER,1542.041,0.686,63.961
0.8,0.7,10000,CHECK FILTER,2329.334,0.686,63.961
0.8,0.75,10000,CHECK FILTER,2012.022,0.686,63.961
0.8,0.8,10000,CHECK FILTER,1609.739,0.686,63.961
0.8,0.85,10000,CHECK FILTER,1140.994,0.686,63.961
0.8,0.7,10000,NN FILTER,448.129,0.686,63.961
0.8,0.75,10000,NN FILTER,388.975,0.686,63.961
0.8,0.8,10000,NN FILTER,315.568,0.686,63.961
0.8,0.85,10000,NN FILTER,232.207,0.686,63.961
[Image preview omitted: 1 file, 159 KiB]
@@ -0,0 +1,13 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.8,0.7,10000,SigType.WEIGHTED,3215.981,0.686,64.16
0.8,0.75,10000,SigType.WEIGHTED,2754.485,0.686,64.16
0.8,0.8,10000,SigType.WEIGHTED,2201.524,0.686,64.16
0.8,0.85,10000,SigType.WEIGHTED,1558.372,0.686,64.16
0.8,0.7,10000,SigType.SKYLINE,3200.56,0.686,64.16
0.8,0.75,10000,SigType.SKYLINE,2757.303,0.686,64.16
0.8,0.8,10000,SigType.SKYLINE,55.38,0.686,64.16
0.8,0.85,10000,SigType.SKYLINE,20.134,0.686,64.16
0.8,0.7,10000,SigType.DICHOTOMY,3151.663,0.686,64.16
0.8,0.75,10000,SigType.DICHOTOMY,2613.546,0.686,64.16
0.8,0.8,10000,SigType.DICHOTOMY,52.873,0.686,64.16
0.8,0.85,10000,SigType.DICHOTOMY,19.331,0.686,64.16
[Image preview omitted: 1 file, 199 KiB]
@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.7,0.7,5000,NO FILTER,3145.41,0.391,28.309
0.7,0.75,5000,NO FILTER,2687.395,0.391,28.309
0.7,0.8,5000,NO FILTER,2244.686,0.391,28.309
0.7,0.85,5000,NO FILTER,1650.297,0.391,28.309
0.7,0.7,5000,CHECK FILTER,4118.279,0.391,28.309
0.7,0.75,5000,CHECK FILTER,3601.918,0.391,28.309
0.7,0.8,5000,CHECK FILTER,2874.443,0.391,28.309
0.7,0.85,5000,CHECK FILTER,2044.612,0.391,28.309
0.7,0.7,5000,NN FILTER,630.678,0.391,28.309
0.7,0.75,5000,NN FILTER,562.722,0.391,28.309
0.7,0.8,5000,NN FILTER,483.175,0.391,28.309
0.7,0.85,5000,NN FILTER,394.221,0.391,28.309
0.75,0.7,5000,NO FILTER,2189.373,0.391,28.309
0.75,0.75,5000,NO FILTER,1891.061,0.391,28.309
0.75,0.8,5000,NO FILTER,1516.5,0.391,28.309
0.75,0.85,5000,NO FILTER,1073.123,0.391,28.309
0.75,0.7,5000,CHECK FILTER,2222.872,0.391,28.309
0.75,0.75,5000,CHECK FILTER,1913.937,0.391,28.309
0.75,0.8,5000,CHECK FILTER,1542.112,0.391,28.309
0.75,0.85,5000,CHECK FILTER,1086.385,0.391,28.309
0.75,0.7,5000,NN FILTER,304.748,0.391,28.309
0.75,0.75,5000,NN FILTER,265.773,0.391,28.309
0.75,0.8,5000,NN FILTER,217.404,0.391,28.309
0.75,0.85,5000,NN FILTER,162.876,0.391,28.309
0.8,0.7,5000,NO FILTER,858.698,0.391,28.309
0.8,0.75,5000,NO FILTER,745.085,0.391,28.309
0.8,0.8,5000,NO FILTER,596.28,0.391,28.309
0.8,0.85,5000,NO FILTER,421.34,0.391,28.309
0.8,0.7,5000,CHECK FILTER,636.886,0.391,28.309
0.8,0.75,5000,CHECK FILTER,550.521,0.391,28.309
0.8,0.8,5000,CHECK FILTER,443.218,0.391,28.309
0.8,0.85,5000,CHECK FILTER,313.208,0.391,28.309
0.8,0.7,5000,NN FILTER,120.012,0.391,28.309
0.8,0.75,5000,NN FILTER,103.497,0.391,28.309
0.8,0.8,5000,NN FILTER,85.033,0.391,28.309
0.8,0.85,5000,NN FILTER,62.035,0.391,28.309
0.85,0.7,5000,NO FILTER,446.251,0.391,28.309
0.85,0.75,5000,NO FILTER,386.611,0.391,28.309
0.85,0.8,5000,NO FILTER,309.98,0.391,28.309
0.85,0.85,5000,NO FILTER,217.511,0.391,28.309
0.85,0.7,5000,CHECK FILTER,364.622,0.391,28.309
0.85,0.75,5000,CHECK FILTER,323.038,0.391,28.309
0.85,0.8,5000,CHECK FILTER,263.697,0.391,28.309
0.85,0.85,5000,CHECK FILTER,184.893,0.391,28.309
0.85,0.7,5000,NN FILTER,72.101,0.391,28.309
0.85,0.75,5000,NN FILTER,62.971,0.391,28.309
0.85,0.8,5000,NN FILTER,51.582,0.391,28.309
0.85,0.85,5000,NN FILTER,35.586,0.391,28.309