## SilkMoth Demo

### Related Set Discovery task under Set‑Containment using Jaccard Similarity

Import all required modules:

In [24]:
import sys
sys.path.append("src")

from silkmoth.tokenizer import Tokenizer
from silkmoth.inverted_index import InvertedIndex
from silkmoth.signature_generator import SignatureGenerator
from silkmoth.candidate_selector import CandidateSelector
from silkmoth.verifier import Verifier
from silkmoth.silkmoth_engine import SilkMothEngine


from silkmoth.utils import jaccard_similarity, contain, edit_similarity, similar, SigType

import matplotlib.pyplot as plt
from IPython.display import display, Markdown

import numpy as np
import pandas as pd

Define the example dataset from the SilkMoth paper: a reference set **R** and four source sets **S**.


In [25]:
# Location Dataset
reference_set = [
 '77 Mass Ave Boston MA',
 '5th St 02115 Seattle WA',
 '77 5th St Chicago IL'
]

# Address Dataset
source_sets = [
 ['Mass Ave St Boston 02115','77 Mass 5th St Boston','77 Mass Ave 5th 02115'],
 ['77 Boston MA','77 5th St Boston 02115','77 Mass Ave 02115 Seattle'],
 ['77 Mass Ave 5th Boston MA','Mass Ave Chicago IL','77 Mass Ave St'],
 ['77 Mass Ave MA','5th St 02115 Seattle WA','77 5th St Boston Seattle']
]

# relatedness threshold δ, element similarity threshold α, q-gram length q
δ = 0.7
α = 0.0
q = 3

display(Markdown("**Reference set (R):**"))
for i, r in enumerate(reference_set):
 display(Markdown(f"- R[{i}]: “{r}”"))
display(Markdown("**Source sets (S):**"))
for j, S in enumerate(source_sets):
 display(Markdown(f"- S[{j}]: “{' | '.join(S)}”"))

**Reference set (R):**

- R[0]: “77 Mass Ave Boston MA”

- R[1]: “5th St 02115 Seattle WA”

- R[2]: “77 5th St Chicago IL”

**Source sets (S):**

- S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”

- S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”

- S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”

- S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”

### 1. Tokenization
Tokenize each element of R and of every source set S. With Jaccard similarity, each element is split on whitespace into a set of tokens.


In [26]:
tokenizer = Tokenizer(jaccard_similarity, q)
tokenized_R = tokenizer.tokenize(reference_set)
tokenized_S = [tokenizer.tokenize(S) for S in source_sets]

display(Markdown("**Tokenized Reference set (R):**"))
for i, toks in enumerate(tokenized_R):
 display(Markdown(f"- Tokens of R[{i}]: {toks}"))

display(Markdown("**Tokenized Source sets (S):**"))
for i, toks in enumerate(tokenized_S):
 display(Markdown(f"- Tokens of S[{i}]: {toks}"))

**Tokenized Reference set (R):**

- Tokens of R[0]: {'Ave', 'MA', '77', 'Boston', 'Mass'}

- Tokens of R[1]: {'5th', 'Seattle', 'St', 'WA', '02115'}

- Tokens of R[2]: {'77', '5th', 'IL', 'St', 'Chicago'}

**Tokenized Source sets (S):**

- Tokens of S[0]: [{'Ave', 'Boston', 'St', 'Mass', '02115'}, {'77', 'Boston', '5th', 'St', 'Mass'}, {'Ave', '77', '5th', 'Mass', '02115'}]

- Tokens of S[1]: [{'Boston', 'MA', '77'}, {'77', 'Boston', '5th', 'St', '02115'}, {'Ave', '77', 'Seattle', 'Mass', '02115'}]

- Tokens of S[2]: [{'Ave', 'MA', '77', 'Boston', '5th', 'Mass'}, {'IL', 'Ave', 'Mass', 'Chicago'}, {'St', 'Ave', 'Mass', '77'}]

- Tokens of S[3]: [{'Ave', 'Mass', '77', 'MA'}, {'5th', 'Seattle', 'St', 'WA', '02115'}, {'77', 'Boston', '5th', 'Seattle', 'St'}]
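
To make the tokenization step concrete, here is a minimal sketch of whitespace tokenization and Jaccard similarity in plain Python. The helper names `whitespace_tokens` and `jaccard` are illustrative assumptions and are not part of the `silkmoth` package, whose `Tokenizer` may differ in detail.

```python
def whitespace_tokens(element):
    """Split an element on whitespace and return its distinct tokens."""
    return set(element.split())


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# R[2] and the second element of S[0] from the example above
r2 = whitespace_tokens("77 5th St Chicago IL")
s01 = whitespace_tokens("77 Mass 5th St Boston")

print(sorted(r2))                  # ['5th', '77', 'Chicago', 'IL', 'St']
print(round(jaccard(r2, s01), 3))  # 0.429 (3 shared tokens out of 7 distinct)
```

The value 0.429 is the same similarity that the check filter later reports for R[2].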

### 2. Build Inverted Index
Builds an inverted index on the tokenized source sets and shows an example lookup. Each entry of the lookup result is a pair (source set index, element index).

In [27]:
index = InvertedIndex(tokenized_S)
display(Markdown(f"- Index built over {len(source_sets)} source sets."))
display(Markdown(f"- Example: token “Mass” appears in {index.get_indexes('Mass')}"))


- Index built over 4 source sets.

- Example: token “Mass” appears in [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]
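
The lookup result can be reproduced with a small dictionary-based sketch. The function `build_inverted_index` is a hypothetical helper written for illustration; it is not the `InvertedIndex` class used above and only mirrors the shape of its output.

```python
from collections import defaultdict


def build_inverted_index(tokenized_sets):
    """Map each token to the (set index, element index) pairs that contain it."""
    index = defaultdict(list)
    for set_idx, elements in enumerate(tokenized_sets):
        for elem_idx, tokens in enumerate(elements):
            for token in tokens:
                index[token].append((set_idx, elem_idx))
    # Sort postings so lookups are deterministic
    return {token: sorted(postings) for token, postings in index.items()}


source_sets = [
    ['Mass Ave St Boston 02115', '77 Mass 5th St Boston', '77 Mass Ave 5th 02115'],
    ['77 Boston MA', '77 5th St Boston 02115', '77 Mass Ave 02115 Seattle'],
    ['77 Mass Ave 5th Boston MA', 'Mass Ave Chicago IL', '77 Mass Ave St'],
    ['77 Mass Ave MA', '5th St 02115 Seattle WA', '77 5th St Boston Seattle'],
]
tokenized = [[set(e.split()) for e in S] for S in source_sets]
index = build_inverted_index(tokenized)

print(index["Mass"])
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0)]
```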

### 3. Signature Generation

Generates the weighted signature for R under Jaccard similarity, given the relatedness threshold δ and the similarity threshold α (here α = 0).

In [28]:
sig_gen = SignatureGenerator()
signature = sig_gen.get_signature(
 tokenized_R, index,
 delta=δ, alpha=α,
 sig_type=SigType.WEIGHTED,
 sim_fun=jaccard_similarity,
 q=q
)
display(Markdown(f"- Selected signature tokens: **{signature}**"))

- Selected signature tokens: **['Chicago', 'WA', 'IL', '5th']**
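
The following sketch checks that the selected tokens form a valid signature. It assumes the validity condition of the weighted scheme as described in the SilkMoth paper for Jaccard similarity with α = 0: the sum over all reference elements of (|r| − |k|) / |r| must stay below δ · |R|, where k is the set of signature tokens contained in r. The helper `signature_bound` is hypothetical and does not reproduce how `SignatureGenerator` selects the tokens.

```python
def signature_bound(tokenized_R, signature):
    """Sum of (|r| - |k|) / |r| over all reference elements r (assumed condition)."""
    sig = set(signature)
    return sum((len(r) - len(r & sig)) / len(r) for r in tokenized_R)


reference_set = [
    '77 Mass Ave Boston MA',
    '5th St 02115 Seattle WA',
    '77 5th St Chicago IL',
]
tokenized_R = [set(r.split()) for r in reference_set]
signature = ['Chicago', 'WA', 'IL', '5th']
delta = 0.7

bound = signature_bound(tokenized_R, signature)  # 5/5 + 3/5 + 2/5
theta = delta * len(tokenized_R)                 # 0.7 * 3

print(round(bound, 3), round(theta, 3), bound < theta)  # 2.0 2.1 True
```

Under this assumption the four tokens are sufficient: R[0] contributes no signature token, R[1] contributes two and R[2] contributes three.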

### 4. Initial Candidate Selection

Looks up each signature token in the inverted index; every source set that contains at least one signature token becomes a candidate.


In [29]:
cand_sel = CandidateSelector(
 similarity_func=jaccard_similarity,
 sim_metric=contain,
 related_thresh=δ,
 sim_thresh=α,
 q=q
)

initial_cands = cand_sel.get_candidates(signature, index, len(tokenized_R))
display(Markdown(f"- Candidate set indices: **{sorted(initial_cands)}**"))
for j in sorted(initial_cands):
 display(Markdown(f" - S[{j}]: “{' | '.join(source_sets[j])}”"))

- Candidate set indices: **[0, 1, 2, 3]**

 - S[0]: “Mass Ave St Boston 02115 | 77 Mass 5th St Boston | 77 Mass Ave 5th 02115”

 - S[1]: “77 Boston MA | 77 5th St Boston 02115 | 77 Mass Ave 02115 Seattle”

 - S[2]: “77 Mass Ave 5th Boston MA | Mass Ave Chicago IL | 77 Mass Ave St”

 - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”
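
Candidate selection amounts to a union over posting lists. The sketch below uses a hand-written excerpt of the inverted index restricted to the four signature tokens; `select_candidates` is an illustrative helper, not the `CandidateSelector.get_candidates` method.

```python
def select_candidates(signature, index):
    """Return indices of source sets with an element containing a signature token."""
    candidates = set()
    for token in signature:
        for set_idx, _elem_idx in index.get(token, []):
            candidates.add(set_idx)
    return candidates


# Postings for the signature tokens, derived by hand from the source sets above
index = {
    "Chicago": [(2, 1)],
    "WA": [(3, 1)],
    "IL": [(2, 1)],
    "5th": [(0, 1), (0, 2), (1, 1), (2, 0), (3, 1), (3, 2)],
}

print(sorted(select_candidates(["Chicago", "WA", "IL", "5th"], index)))  # [0, 1, 2, 3]
```

All four source sets become candidates because each of them contains the token “5th”.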

### 5. Check Filter
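Prunes candidates by comparing each reference element with the candidate elements that share one of its signature tokens. A candidate survives only if at least one such pair reaches the reference element's local similarity bound. The matched reference elements and their best similarities are returned in `match_map`.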


In [30]:
filtered_cands, match_map = cand_sel.check_filter(
 tokenized_R, set(signature), initial_cands, index
)
display(Markdown(f"**Surviving after check filter:** **{sorted(filtered_cands)}**"))
for j in sorted(filtered_cands):
    display(Markdown(f"S[{j}] matched:"))
    # reference elements that passed the check for this candidate, with their similarity
    matches = match_map.get(j, {})
    for r_idx, sim in matches.items():
        display(Markdown(f" • R[{r_idx}] “{reference_set[r_idx]}” → sim = {sim:.3f}"))

    if matches:
        best_sim = max(matches.values())
        display(Markdown(f" → Best sim: **{best_sim:.3f}** | Matched elements: **{len(matches)}**"))
    else:
        display(Markdown("No elements passed similarity checks."))


**Surviving after check filter:** **[0, 1, 3]**

S[0] matched:

 • R[2] “77 5th St Chicago IL” → sim = 0.429

 → Best sim: **0.429** | Matched elements: **1**

S[1] matched:

 • R[2] “77 5th St Chicago IL” → sim = 0.429

 → Best sim: **0.429** | Matched elements: **1**

S[3] matched:

 • R[1] “5th St 02115 Seattle WA” → sim = 1.000

 • R[2] “77 5th St Chicago IL” → sim = 0.429

 → Best sim: **1.000** | Matched elements: **2**
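
The output above can be reproduced with a short sketch of the check filter. It assumes that the local bound of a reference element r is (|r| − |k|) / |r|, with k the signature tokens in r, and that only candidate elements sharing a token from k are compared. Both are assumptions based on the SilkMoth paper; the function `check_filter` below is hypothetical and is not the library method of the same name.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def check_filter(tokenized_R, signature, candidate_sets):
    """candidate_sets maps a source set index to its list of token sets."""
    sig = set(signature)
    survivors, match_map = set(), {}
    for j, elements in candidate_sets.items():
        matches = {}
        for i, r in enumerate(tokenized_R):
            k = r & sig
            if not k:
                continue  # r holds no signature token, so there is nothing to check
            bound = (len(r) - len(k)) / len(r)  # assumed local similarity bound
            sims = [jaccard(r, s) for s in elements if s & k]
            best = max(sims, default=0.0)
            if best >= bound:
                matches[i] = best
        if matches:
            survivors.add(j)
            match_map[j] = matches
    return survivors, match_map


reference_set = ['77 Mass Ave Boston MA', '5th St 02115 Seattle WA', '77 5th St Chicago IL']
source_sets = [
    ['Mass Ave St Boston 02115', '77 Mass 5th St Boston', '77 Mass Ave 5th 02115'],
    ['77 Boston MA', '77 5th St Boston 02115', '77 Mass Ave 02115 Seattle'],
    ['77 Mass Ave 5th Boston MA', 'Mass Ave Chicago IL', '77 Mass Ave St'],
    ['77 Mass Ave MA', '5th St 02115 Seattle WA', '77 5th St Boston Seattle'],
]
tokenized_R = [set(r.split()) for r in reference_set]
candidates = {j: [set(e.split()) for e in S] for j, S in enumerate(source_sets)}

survivors, match_map = check_filter(tokenized_R, ['Chicago', 'WA', 'IL', '5th'], candidates)
print(sorted(survivors))  # [0, 1, 3]
print({i: round(s, 3) for i, s in match_map[3].items()})  # {1: 1.0, 2: 0.429}
```

S[2] is dropped in this sketch because its best similarity to R[2] is 2/7 ≈ 0.286, below the assumed bound of 0.4.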

### 6. Nearest‑Neighbor Filter

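Further prunes candidates using a nearest‑neighbor upper bound on the total matching score: no matching can score higher than the sum of each reference element's best similarity to any element of the candidate.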


In [31]:
nn_filtered = cand_sel.nn_filter(
 tokenized_R, set(signature), filtered_cands,
 index, threshold=δ, match_map=match_map
)
display(Markdown(f"- Surviving after NN filter: **{sorted(nn_filtered)}**"))
for j in sorted(nn_filtered):
    display(Markdown(f" - S[{j}]: “{' | '.join(source_sets[j])}”"))


- Surviving after NN filter: **[3]**

 - S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle”
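
A simplified sketch of the nearest-neighbor bound follows. It sums, for every reference element, the best Jaccard similarity to any element of the candidate and compares that sum with δ · |R|. The helper `nn_upper_bound` is hypothetical; the library's `nn_filter` additionally reuses the check filter results through `match_map`, which this sketch leaves out.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def nn_upper_bound(tokenized_R, elements):
    """Sum of each reference element's nearest-neighbor similarity in the candidate."""
    return sum(max(jaccard(r, s) for s in elements) for r in tokenized_R)


reference_set = ['77 Mass Ave Boston MA', '5th St 02115 Seattle WA', '77 5th St Chicago IL']
source_sets = {
    0: ['Mass Ave St Boston 02115', '77 Mass 5th St Boston', '77 Mass Ave 5th 02115'],
    1: ['77 Boston MA', '77 5th St Boston 02115', '77 Mass Ave 02115 Seattle'],
    3: ['77 Mass Ave MA', '5th St 02115 Seattle WA', '77 5th St Boston Seattle'],
}
tokenized_R = [set(r.split()) for r in reference_set]
delta = 0.7
theta = delta * len(tokenized_R)  # a candidate needs a total score of at least 2.1

bounds = {
    j: nn_upper_bound(tokenized_R, [set(e.split()) for e in S])
    for j, S in source_sets.items()
}
survivors = [j for j, b in bounds.items() if b >= theta]

print({j: round(b, 3) for j, b in bounds.items()})  # {0: 1.107, 1: 1.457, 3: 2.229}
print(survivors)                                     # [3]
```

Only S[3] has an upper bound of at least 2.1, which agrees with the notebook output.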

### 7. Verification

Runs maximum‑weight bipartite matching between R and each remaining candidate and outputs the sets whose relatedness score reaches δ.


In [32]:
verifier = Verifier(δ, contain, jaccard_similarity, sim_thresh=α, reduction=False)
results = verifier.get_related_sets(tokenized_R, nn_filtered, index)

if results:
    display(Markdown(f"Final related sets (score ≥ {δ}):"))
    for j, score in results:
        display(Markdown(f" • S[{j}]: “{' | '.join(source_sets[j])}” → **{score:.3f}**"))
else:
    display(Markdown("- No sets passed verification."))


Final related sets (score ≥ 0.7):

 • S[3]: “77 Mass Ave MA | 5th St 02115 Seattle WA | 77 5th St Boston Seattle” → **0.743**
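
The final score can be checked by hand with a small sketch. It assumes that relatedness under set-containment is the maximum matching score divided by |R|, and it replaces a proper matching algorithm with brute-force enumeration, which is only practical for tiny sets like this one. The helper `max_matching_score` is hypothetical and unrelated to the `Verifier` implementation.

```python
from itertools import permutations


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def max_matching_score(sim):
    """Maximum-weight bipartite matching by brute force over a small matrix."""
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows > n_cols:  # transpose so that rows are the smaller side
        sim = [list(col) for col in zip(*sim)]
        n_rows, n_cols = n_cols, n_rows
    return max(
        sum(sim[i][j] for i, j in enumerate(cols))
        for cols in permutations(range(n_cols), n_rows)
    )


reference_set = ['77 Mass Ave Boston MA', '5th St 02115 Seattle WA', '77 5th St Chicago IL']
s3 = ['77 Mass Ave MA', '5th St 02115 Seattle WA', '77 5th St Boston Seattle']
tokenized_R = [set(r.split()) for r in reference_set]
tokenized_S3 = [set(e.split()) for e in s3]

# Pairwise Jaccard similarities: rows are elements of R, columns are elements of S[3]
sim = [[jaccard(r, s) for s in tokenized_S3] for r in tokenized_R]
score = max_matching_score(sim) / len(tokenized_R)

print(round(score, 3))  # 0.743
```

The best matching pairs R[0] with “77 Mass Ave MA” (0.8), R[1] with its identical element (1.0) and R[2] with “77 5th St Boston Seattle” (0.429), giving (0.8 + 1.0 + 0.429) / 3 ≈ 0.743.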