init
Update README.md
155
docu/experiments/README.md
Normal file
@@ -0,0 +1,155 @@
### 🧪 Running the Experiments

This project includes multiple experiments to evaluate the performance and accuracy of our Python implementation of **SilkMoth**.

---

#### 📊 1. Experiment Types

You can replicate and customize the following types of experiments using different configurations (e.g., filters, signature strategies, reduction techniques):

- **String Matching (DBLP Publication Titles)**
- **Schema Matching (WebTables)**
- **Inclusion Dependency Discovery (WebTable Columns)**

Detailed descriptions of each experiment can be found in the SilkMoth paper.

---

#### 📦 2. WebSchema Inclusion Dependency Setup

To run the **WebSchema + Inclusion Dependency** experiments:

1. Download the pre-extracted dataset from
   [📥 this link](https://tubcloud.tu-berlin.de/s/D4ngEfdn3cJ3pxF).
2. Place the `.json` files in the `data/webtables/` directory
   *(create the folder if it does not exist)*.
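
Before launching the experiments, it can help to confirm that the dataset landed in the right place. The following is a minimal sketch and not part of the repository: the helper name `list_webtable_files` is hypothetical, and only the `data/webtables/` path and the `.json` extension come from the steps above.

```python
from pathlib import Path


def list_webtable_files(base_dir="."):
    """Create data/webtables/ under base_dir if needed and return its .json file names.

    Hypothetical helper for illustration; the repository may handle this differently.
    """
    target = Path(base_dir) / "data" / "webtables"
    # Mirrors the "create the folder if it does not exist" step.
    target.mkdir(parents=True, exist_ok=True)
    # Only .json files are relevant; anything else in the folder is ignored.
    return sorted(p.name for p in target.glob("*.json"))
```

Calling `list_webtable_files()` from the project root returns an empty list until the downloaded files have been copied into place.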

---

#### 🚀 3. Running the Experiments

To execute the core experiments from the paper:

```bash
python run.py
```

### 📈 4. Results Overview

We compared our results with those presented in the original SilkMoth paper.
Although exact reproduction is not possible due to language differences (Python vs. C++) and dataset variations, the overall **performance trends align well**.

All results can be found in the `results` folder.

In each pair of diagrams, the **left** one is from the paper and the **right** one is ours.

> 💡 *Recent performance enhancements leverage `scipy`’s C-accelerated matching, replacing the original `networkx`-based approach.
> Unless otherwise specified, the diagrams shown are generated using the `networkx` implementation.*

---

### 🔍 Inclusion Dependency

> **Goal**: Check whether each reference set is contained within the source sets.
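
To make the goal concrete, the sketch below scores how well a reference set is contained in a source set. Each reference element is paired with at most one source element, the best total pairing similarity (the maximum bipartite matching score) is divided by the size of the reference set, and the result is compared against a threshold. This is an illustration under assumptions, not the project's implementation: elements are strings tokenized on whitespace and compared with Jaccard similarity, containment is assumed to be normalized by the reference set size, the matching is found by brute force over permutations (only viable for tiny sets), and the names `jaccard`, `matching_score`, and `is_contained` are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two strings, tokenized on whitespace."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def matching_score(reference, source):
    """Maximum total similarity over one-to-one pairings (brute force)."""
    small, large = sorted((reference, source), key=len)
    best = 0.0
    # Try every way of assigning distinct elements of the larger set
    # to the elements of the smaller one and keep the best total.
    for perm in permutations(large, len(small)):
        best = max(best, sum(jaccard(x, y) for x, y in zip(small, perm)))
    return best


def is_contained(reference, source, threshold):
    """True if the matching score, normalized by |reference|, reaches threshold."""
    if not reference:
        return False
    return matching_score(reference, source) / len(reference) >= threshold
```

For example, with `reference = ["new york", "berlin"]` and `source = ["berlin", "new york city", "paris"]`, the best pairing matches `"berlin"` exactly and `"new york"` with `"new york city"` (Jaccard 2/3), so the normalized score is about 0.83.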

**Filter Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_filter.png" alt="Paper Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_filter_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>

**Signature Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_sig.png" alt="Paper Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_sig_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>

**Reduction Comparison**
<p align="center">
<img src="silkmoth_results/inclusion_dep_red.png" alt="Paper Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_reduction_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>

**Scalability**
<p align="center">
<img src="silkmoth_results/inclusion_dep_scal.png" alt="Paper Result" width="45%" />
<img src="results/inclusion_dependency/inclusion_dependency_scalability_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>

---

### 🔍 Schema Matching (WebTables)

> **Goal**: Detect related set pairs within a single source set.

**Filter Comparison**
<p align="center">
<img src="silkmoth_results/schema_matching_filter.png" alt="Paper Result" width="45%" />
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.png" alt="Our Result" width="45%" />
</p>

**Signature Comparison**
<p align="center">
<img src="silkmoth_results/schema_matching_sig.png" alt="Paper Result" width="45%" />
<img src="results/schema_matching/schema_matching_sig_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>

**Scalability**
<p align="center">
<img src="silkmoth_results/schema_matching_scal.png" alt="Paper Result" width="45%" />
<img src="results/schema_matching/schema_matching_scalability_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>

---

### 🔍 String Matching (DBLP Publication Titles)

> **Goal**: Detect related titles within the dataset using the extended SilkMoth pipeline
> based on **edit similarity** and **q-gram** tokenization.
> SciPy was used here.
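
The two building blocks named in the goal can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, `q = 3` is an arbitrary default, and edit similarity is assumed here to be one minus the Levenshtein distance divided by the longer string's length. The paper and the repository may use a different normalization.

```python
def qgrams(text, q=3):
    """Return the overlapping q-grams of text (the whole string if it is shorter than q)."""
    if len(text) <= q:
        return [text] if text else []
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 - distance / max length; 1.0 for identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Shorter q-grams make the tokenization more tolerant of typos but produce more tokens per title, so the choice of `q` trades recall against index size.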

**Filter Comparison**
<p align="center">
<img src="silkmoth_results/string_matching_filter.png" alt="Paper Result" width="45%" />
<img src="results/string_matching/10k-set-size/string_matching_filter_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>

**Signature Comparison**
<p align="center">
<img src="silkmoth_results/string_matching_sig.png" alt="Paper Result" width="45%" />
<img src="results/string_matching/10k-set-size/string_matching_sig_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>

**Scalability**
<p align="center">
<img src="silkmoth_results/string_matching_scal.png" alt="Paper Result" width="45%" />
<img src="results/string_matching/string_matching_scalability_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>

---

### 🔍 Additional: Inclusion Dependency with SilkMoth Filter vs. No SilkMoth

> In this analysis, we focus exclusively on SilkMoth. But how does it compare to a
> brute-force approach that skips the SilkMoth pipeline entirely? The graph below
> shows the filter experiment alongside the brute-force bipartite matching method
> without any optimization pipeline. The results show a substantially lower runtime
> when using SilkMoth.

<img src="results/inclusion_dependency/inclusion_dependency_filter_combined_raw_experiment_α=0.5.png" alt="WebTables Result" />
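
The gap between the two approaches comes from how many exact matching computations are avoided. The sketch below illustrates the idea with a deliberately simple stand-in for SilkMoth's pipeline: an inverted index over tokens is used to skip source sets that share no token with the reference set, since such sets score zero under a token-overlap similarity and cannot pass a positive threshold. The actual signature schemes and filters are more selective than this. All names are hypothetical, and `verify` stands for whatever exact matching routine is in use.

```python
from collections import defaultdict


def tokens(s):
    """All whitespace-separated tokens occurring in any element of the set s."""
    return {tok for element in s for tok in element.split()}


def brute_force(reference, sources, verify):
    """Baseline: run the expensive verification on every source set."""
    return [i for i, s in enumerate(sources) if verify(reference, s)]


def with_token_filter(reference, sources, verify):
    """Verify only source sets that share at least one token with the reference."""
    index = defaultdict(set)  # token -> ids of the source sets containing it
    for i, s in enumerate(sources):
        for tok in tokens(s):
            index[tok].add(i)
    # Candidate ids are collected from the index entries of the reference tokens.
    candidates = set()
    for tok in tokens(reference):
        candidates |= index.get(tok, set())
    return [i for i in sorted(candidates) if verify(reference, sources[i])]
```

With a `verify` callback that counts its invocations, the filtered variant calls it only for the candidate sets, while both variants return the same ids as long as `verify` rejects every set that shares no token with the reference.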

---

### 🔍 Additional: Schema Matching with GitHub WebTables

> Similar to Schema Matching, this experiment uses a GitHub WebTable as a fixed reference set and matches it against other sets. The goal is to evaluate SilkMoth’s performance across different domains.

- **Left:** Matching with one reference set.
- **Right:** Matching with WebTable Corpus and GitHub WebTable datasets.

The results show no significant difference, indicating consistent behavior across varying datasets.

<p align="center">
<img src="results/schema_matching/schema_matching_filter_experiment_α=0.5.png" alt="WebTables Result" width="45%" />
<img src="results/schema_matching/github_webtable_schema_matching_experiment_α=0.5.png" alt="GitHub Table Result" width="45%" />
</p>
64
docu/experiments/results/plot.py
Normal file
@@ -0,0 +1,64 @@
import csv

from experiments.utils import plot_elapsed_times

labels = []
elapsed_times = []


def read_csv_add_data(filename, labels, elapsed_times):
    """Append legend labels and per-label runtime series from one results CSV.

    Only rows with similarity threshold 0.5 (column 0) are used; the label is
    read from column 4 and the elapsed time from column 5.
    """
    with open(filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        times = []
        current_label = None
        for row in reader:
            sim_thresh = float(row[0])
            label = row[4]
            elapsed = float(row[5])

            if sim_thresh == 0.5:
                if current_label != label:
                    # New label group started
                    if times:
                        # Save times of previous label if not empty
                        elapsed_times.append(times)
                    times = [elapsed]
                    current_label = label
                else:
                    times.append(elapsed)

                # When 4 times are collected (one per related threshold), append and reset
                if len(times) == 4:
                    elapsed_times.append(times)
                    times = []
                    current_label = None

                if label not in labels:
                    labels.append(label)

        # In case the last label's times were not appended
        if times:
            elapsed_times.append(times)


# Read first CSV
read_csv_add_data('inclusion_dependency/raw_matching_experiment_results.csv', labels, elapsed_times)

# Read second CSV
read_csv_add_data('inclusion_dependency/inclusion_dependency_filter_experiment_results.csv', labels, elapsed_times)

print("Labels:", labels)
print("Elapsed Times:", elapsed_times)

# Then plot
file_name_prefix = "inclusion_dependency_filter_combined_raw"
folder_path = ""

_ = plot_elapsed_times(
    related_thresholds=[0.7, 0.75, 0.8, 0.85],
    elapsed_times_list=elapsed_times,
    fig_text=f"{file_name_prefix} (α = 0.5)",
    legend_labels=labels,
    file_name=f"{folder_path}{file_name_prefix}_experiment_α=0.5.png"
)
BIN
docu/experiments/silkmoth_results/inclusion_dep_filter.png
Normal file
Size: 37 KiB
BIN
docu/experiments/silkmoth_results/inclusion_dep_red.png
Normal file
Size: 30 KiB
BIN
docu/experiments/silkmoth_results/inclusion_dep_scal.png
Normal file
Size: 53 KiB
BIN
docu/experiments/silkmoth_results/inclusion_dep_sig.png
Normal file
Size: 47 KiB
BIN
docu/experiments/silkmoth_results/schema_matching_filter.png
Normal file
Size: 42 KiB
BIN
docu/experiments/silkmoth_results/schema_matching_scal.png
Normal file
Size: 48 KiB
BIN
docu/experiments/silkmoth_results/schema_matching_sig.png
Normal file
Size: 42 KiB
BIN
docu/experiments/silkmoth_results/string_matching_filter.png
Normal file
Size: 44 KiB
BIN
docu/experiments/silkmoth_results/string_matching_scal.png
Normal file
Size: 51 KiB
BIN
docu/experiments/silkmoth_results/string_matching_sig.png
Normal file
Size: 53 KiB