init
155
experiments/README.md
Normal file
@@ -0,0 +1,155 @@
### 🧪 Running the Experiments

This project includes multiple experiments to evaluate the performance and accuracy of our Python implementation of **SilkMoth**.

---

#### 📊 1. Experiment Types

You can replicate and customize the following types of experiments using different configurations (e.g., filters, signature strategies, reduction techniques):

- **String Matching (DBLP Publication Titles)**
- **Schema Matching (WebTables)**
- **Inclusion Dependency Discovery (WebTable Columns)**

Exact descriptions can be found in the official paper.

---

#### 📦 2. WebSchema Inclusion Dependency Setup

To run the **WebSchema + Inclusion Dependency** experiments:

1. Download the pre-extracted dataset from
   [📥 this link](https://tubcloud.tu-berlin.de/s/D4ngEfdn3cJ3pxF).
2. Place the `.json` files in the `data/webtables/` directory
   *(create the folder if it does not exist)*.
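
Before launching the WebTables experiments, it can help to confirm that the files landed in the right place. The sketch below is not part of the repository: `check_webtables_dir` is a hypothetical helper, and it only assumes what `data_loader.py` relies on, namely that each JSON file holds a list under the `relation` key.

```python
import json
import os
import tempfile


def check_webtables_dir(data_dir):
    """Return (usable, skipped) counts for the .json files in data_dir.

    Hypothetical helper: a file counts as usable when it parses as JSON
    and has a list under the "relation" key.
    """
    usable, skipped = 0, 0
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".json"):
            continue
        try:
            with open(os.path.join(data_dir, name), "r", encoding="utf-8") as fh:
                table = json.load(fh)
        except (OSError, ValueError):
            # ValueError covers both invalid JSON and undecodable bytes.
            skipped += 1
            continue
        if isinstance(table, dict) and isinstance(table.get("relation"), list):
            usable += 1
        else:
            skipped += 1
    return usable, skipped


# Self-contained demo on a throwaway directory instead of data/webtables/.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "good.json"), "w", encoding="utf-8") as fh:
        json.dump({"relation": [["city", "Berlin", "Paris"]]}, fh)
    with open(os.path.join(demo_dir, "bad.json"), "w", encoding="utf-8") as fh:
        fh.write("{not valid json")
    demo_counts = check_webtables_dir(demo_dir)

print(demo_counts)
```

Pointing the same function at `data/webtables/` gives a quick count of how many downloaded tables are usable.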

---

#### 🚀 3. Running the Experiments

To execute the core experiments from the paper:

```bash
python run.py
```

### 📈 4. Results Overview

We compared our results with those presented in the original SilkMoth paper.
Although exact reproduction is not possible due to language differences (Python vs. C++) and dataset variations, the overall **performance trends align well**.

All results can be found in the `results` folder.

The **left** diagrams are from the paper and the **right** ones are ours.

> 💡 *Recent performance enhancements leverage `scipy`’s C-accelerated matching, replacing the original `networkx`-based approach.
> Unless otherwise specified, the diagrams shown are generated using the `networkx` implementation.*

---

### 🔍 Inclusion Dependency

> **Goal**: Check whether each reference set is contained in any of the source sets.
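
To make the goal concrete: as we read the SilkMoth paper, a reference set counts as contained in a source set when the best one-to-one matching between their elements, divided by the size of the reference set, reaches the relatedness threshold. The sketch below illustrates that idea by brute force on tiny sets. It is not the repository's implementation: `containment` and `exact_match` are hypothetical names, the element similarity is simplified to exact string equality, and returning 0.0 for a reference larger than the source mirrors the skip in the brute-force baseline rather than the general definition.

```python
from itertools import permutations


def exact_match(a, b):
    """Toy element similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


def containment(reference, source, elem_sim=exact_match):
    """Best one-to-one matching score between the sets, divided by |reference|.

    Brute force over all injective assignments, so only usable for tiny sets.
    """
    if not reference or len(reference) > len(source):
        return 0.0
    best = 0.0
    for chosen in permutations(source, len(reference)):
        score = sum(elem_sim(r, s) for r, s in zip(reference, chosen))
        best = max(best, score)
    return best / len(reference)


reference = ["Berlin", "Paris", "Rome"]
sources = {
    "capitals": ["Berlin", "Paris", "Rome", "Madrid"],
    "mixed": ["Berlin", "Oslo", "Rome", "Lima"],
}
delta = 0.7  # relatedness threshold
related = [name for name, src in sources.items() if containment(reference, src) >= delta]
print(related)
```

Here `capitals` scores 3/3 and passes, while `mixed` scores 2/3 and falls below the threshold.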

**Filter Comparison**
<p align="center">
  <img src="silkmoth_results/inclusion_dep_filter.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_filter_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>

**Signature Comparison**
<p align="center">
  <img src="silkmoth_results/inclusion_dep_sig.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_sig_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>

**Reduction Comparison**
<p align="center">
  <img src="silkmoth_results/inclusion_dep_red.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_reduction_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>

**Scalability**
<p align="center">
  <img src="silkmoth_results/inclusion_dep_scal.png" alt="Original Result" width="45%" />
  <img src="results/inclusion_dependency/inclusion_dependency_scalability_experiment_α=0.5.png" alt="Our Result" width="45%" />
</p>

---

### 🔍 Schema Matching (WebTables)

> **Goal**: Detect related set pairs within a single source set.
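
Discovery mode compares the sets of one collection against each other instead of against a separate reference list. The sketch below only shows that pairing pattern, using a plain Jaccard score over attribute names. It is a deliberately simplified stand-in (no signatures, filters, or bipartite matching), and `discover_pairs` is a hypothetical name.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def discover_pairs(schemas, threshold):
    """Return (i, j, score) for every unordered pair scoring at least threshold."""
    hits = []
    for (i, left), (j, right) in combinations(enumerate(schemas), 2):
        score = jaccard(left, right)
        if score >= threshold:
            hits.append((i, j, score))
    return hits


schemas = [
    ["name", "city", "country"],
    ["name", "city", "zip"],
    ["title", "year"],
]
pairs = discover_pairs(schemas, threshold=0.5)
print(pairs)
```

Only the first two schemas share enough attributes (2 of 4 distinct names) to be reported.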

**Filter Comparison**
<p align="center">
  <img src="silkmoth_results/schema_matching_filter.png" alt="Original Result" width="45%" />
  <img src="results/schema_matching/schema_matching_filter_experiment_α=0.png" alt="Our Result" width="45%" />
</p>

**Signature Comparison**
<p align="center">
  <img src="silkmoth_results/schema_matching_sig.png" alt="Original Result" width="45%" />
  <img src="results/schema_matching/schema_matching_sig_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>

**Scalability**
<p align="center">
  <img src="silkmoth_results/schema_matching_scal.png" alt="Original Result" width="45%" />
  <img src="results/schema_matching/schema_matching_scalability_experiment_α=0.0.png" alt="Our Result" width="45%" />
</p>

---

### 🔍 String Matching (DBLP Publication Titles)

> **Goal:** Detect related titles within the dataset using the extended SilkMoth pipeline
> based on **edit similarity** and **q-gram** tokenization.
> SciPy was used here.
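
For edit similarity, titles are tokenized into q-grams, and `experiments.py` derives q from the similarity threshold α as floor(α / (1 − α)). A minimal sketch of both steps follows. `max_q` and `qgrams` are hypothetical helpers, the repository's `Tokenizer` may pad or deduplicate differently, and the range check on α is our addition.

```python
from math import floor


def max_q(alpha):
    """q-gram size for a similarity threshold alpha: floor(alpha / (1 - alpha))."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be in [0, 1)")
    return floor(alpha / (1 - alpha))


def qgrams(text, q):
    """All overlapping substrings of length q, in order (no padding)."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return [text[i:i + q] for i in range(len(text) - q + 1)]


q = max_q(0.75)
tokens = qgrams("silkmoth", q)
print(q, tokens)
```

Note that thresholds below 0.5 yield q = 0 under this formula, so such settings need separate handling before tokenizing.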

**Filter Comparison**
<p align="center">
  <img src="silkmoth_results/string_matching_filter.png" alt="Original Result" width="45%" />
  <img src="results/string_matching/string_matching_filter_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>

**Signature Comparison**
<p align="center">
  <img src="silkmoth_results/string_matching_sig.png" alt="Original Result" width="45%" />
  <img src="results/string_matching/10k-set-size/string_matching_sig_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>

**Scalability**
<p align="center">
  <img src="silkmoth_results/string_matching_scal.png" alt="Original Result" width="45%" />
  <img src="results/string_matching/string_matching_scalability_experiment_α=0.8.png" alt="Our Result" width="45%" />
</p>

---

### 🔍 Additional: Inclusion Dependency SilkMoth Filter compared with no SilkMoth

> In this analysis, we focus exclusively on SilkMoth. But how does it compare to a
> brute-force approach that skips the SilkMoth pipeline entirely? The graph below
> shows the Filter run alongside the brute-force bipartite matching method without any
> optimization pipeline. The results clearly demonstrate a dramatic improvement
> in runtime efficiency when using SilkMoth.

<img src="results/inclusion_dependency/inclusion_dependency_filter_combined_raw_experiment_α=0.5.png" alt="WebTables Result" />
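
The brute-force baseline scores every reference set against every source set and skips only pairs where the reference is larger than the source. The sketch below mirrors that loop shape from `run_matching_without_silkmoth_inc_dep`. `toy_relatedness` is a hypothetical stand-in for the verifier's matching score, and the comparison counter is added purely for illustration.

```python
def toy_relatedness(reference, source):
    """Share of reference elements that also occur in source (stand-in metric)."""
    if not reference:
        return 0.0
    source_items = set(source)
    return sum(1 for item in reference if item in source_items) / len(reference)


def brute_force_search(reference_sets, source_sets, related_thresh):
    """Compare all pairs; return (matches, number_of_comparisons)."""
    matches, comparisons = [], 0
    for ref in reference_sets:
        for source in source_sets:
            if len(ref) > len(source):
                continue  # a larger reference cannot be fully contained
            comparisons += 1
            score = toy_relatedness(ref, source)
            if score >= related_thresh:
                matches.append((ref, source, score))
    return matches, comparisons


refs = [["a", "b"], ["x", "y", "z"]]
sources = [["a", "b", "c"], ["x", "y"], ["x", "y", "z", "w"]]
matches, comparisons = brute_force_search(refs, sources, related_thresh=0.7)
print(len(matches), comparisons)
```

The number of comparisons grows with the product of the two collection sizes, which is the cost the SilkMoth signature and filter stages are meant to avoid.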

---

### 🔍 Additional: Schema Matching with GitHub WebTables

> Similar to Schema Matching, this experiment uses a GitHub WebTable as a fixed reference set and matches it against other sets. The goal is to evaluate SilkMoth’s performance across different domains.

- **Left:** Matching with one reference set.
- **Right:** Matching with WebTable Corpus and GitHub WebTable datasets.

The results show no significant difference, indicating consistent behavior across varying datasets.

<p align="center">
  <img src="results/schema_matching/schema_matching_filter_experiment_α=0.5.png" alt="WebTables Result" width="45%" />
  <img src="results/schema_matching/github_webtable_schema_matching_experiment_α=0.5.png" alt="GitHub Table Result" width="45%" />
</p>
0
experiments/data/__init__.py
Normal file
132466
experiments/data/dblp/DBLP_100k.csv
Normal file
0
experiments/data/webtables/__init__.py
Normal file
174
experiments/data_loader.py
Normal file
@@ -0,0 +1,174 @@
import json
import os
import random

import pandas as pd

from utils import *


class DataLoader:
    def __init__(self, data_path):
        self.data_path = data_path
        self.files = os.listdir(data_path)

    def load_webtable_columns_randomized(self, reference_set_amount: int, source_set_amount: int) -> tuple[list, list]:
        """
        Get randomized reference sets and source sets of webtable columns.
        Reference sets are subsets of the source sets.
        Only columns with 4 or more different elements are considered.
        Only columns with non-numeric, non-empty values are considered.

        Args:
            reference_set_amount (int): Number of reference sets to return.
            source_set_amount (int): Number of source sets to return.
        Returns:
            tuple: A tuple containing a list of reference sets and a list of source sets.
        """
        # Basic validation of input parameters
        if len(self.files) == 0:
            raise ValueError("data_path does not contain any files")
        if reference_set_amount < 1 or source_set_amount < 2:
            raise ValueError("reference_set_amount must be at least 1 and source_set_amount must be at least 2")
        if reference_set_amount >= source_set_amount:
            raise ValueError("reference_set_amount must be smaller than source_set_amount")
        if reference_set_amount > len(self.files):
            raise ValueError("reference_set_amount must not exceed the number of files in data_path")
        if source_set_amount > len(self.files):
            raise ValueError("source_set_amount must not exceed the number of files in data_path")

        # Randomly select the files that the source sets are drawn from
        source_set_nums = random.sample(range(len(self.files)), source_set_amount)

        # Pick source_set_amount columns which have at least 4 different elements
        source_sets = []
        while len(source_sets) < source_set_amount:
            # Pick a random file index from source_set_nums
            source_set_num = random.choice(source_set_nums)
            file_path = os.path.join(self.data_path, self.files[source_set_num])

            try:
                with open(file_path, 'r', encoding='utf-8') as file:
                    json_data = json.load(file)
                    if "relation" in json_data and isinstance(json_data["relation"], list):
                        # Pick a random column
                        col = random.randint(0, len(json_data["relation"]) - 1)
                        col = json_data["relation"][col]

                        # Check if the column has at least 4 different elements and contains no numeric values
                        if len(set(col)) >= 4:
                            if all(not is_convertible_to_number(value) and len(value) > 0 for value in col):
                                # Add the column to the source sets
                                source_sets.append(col)
                                print(f"Source set number {len(source_sets)} loaded")

            except Exception as e:
                raise ValueError(f"Error loading JSON file: {e}")

        # Randomly select reference sets from the source sets
        reference_sets = random.sample(source_sets, reference_set_amount)
        return reference_sets, source_sets

    def load_webtable_reference_sets_element_restriction(self, source_set: list, element_restriction: int) -> list:
        """
        Get reference sets of webtable columns with a specific element restriction.
        The restriction is the minimum number of elements a column must have to be used as a reference set.

        Args:
            source_set (list): The source sets (columns) to draw reference sets from.
            element_restriction (int): Minimum number of elements per reference set.
        Returns:
            list: A list of 1000 reference sets (drawn with replacement).
        """
        if element_restriction < 1:
            raise ValueError("element_restriction must be at least 1")
        # Without at least one eligible column the sampling loop below would never terminate
        if not any(len(col) >= element_restriction for col in source_set):
            raise ValueError("no column in source_set has at least element_restriction elements")

        reference_sets = []

        while len(reference_sets) < 1000:
            # Randomly select a column from the source set
            col = random.choice(source_set)

            # Check if the column has at least element_restriction elements
            if len(col) >= element_restriction:
                reference_sets.append(col)
                print(f"Reference set number {len(reference_sets)} loaded")

        return reference_sets

    def load_webtable_schemas_randomized(self, set_amount: int) -> list:
        if set_amount < 2:
            raise ValueError("set_amount must be at least 2")
        # Random sequence of table numbers
        table_nums = random.sample(range(len(self.files)), len(self.files))

        schema_sets = []

        i = 0
        while len(schema_sets) < set_amount and i < len(table_nums):
            try:
                # Load the schema for the current table number
                schema = self.load_single_webtable_schema(table_nums[i])
                schema_sets.append(schema)
                print(f"Schema set number {len(schema_sets)} loaded")
                i += 1
            except ValueError as e:
                print(f"Skipping table number {table_nums[i]} due to error: {e}")
                i += 1

        return schema_sets

    def load_single_webtable_schema(self, reference_set_num: int) -> list:
        # Load the webtable schema for the given reference set number
        if reference_set_num < 0 or reference_set_num >= len(self.files):
            raise IndexError("reference_set_num is out of range")

        # Get the file at the specified position
        file_path = os.path.join(self.data_path, self.files[reference_set_num])

        # Load and return the JSON content
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                json_data = json.load(file)
                if "relation" in json_data and isinstance(json_data["relation"], list):
                    schema = [relation[0] for relation in json_data["relation"]]
                    if len(schema) == 0:
                        raise ValueError("Schema is empty")

                    if all(not is_convertible_to_number(col) for col in schema):
                        # remove "" empty strings from the schema
                        schema = [col for col in schema if len(col) > 0]
                        if len(schema) == 0:
                            raise ValueError("Schema contains only empty strings")
                        return schema
                    else:
                        raise ValueError("Schema contains numeric values or is empty")
                else:
                    raise ValueError("JSON does not contain a valid 'relation' key or it is not a list")
        except Exception as e:
            raise ValueError(f"Error loading JSON file: {e}")

    def load_dblp_titles(self, data_path: str) -> list:
        """
        Load DBLP paper titles from a CSV file.

        Args:
            data_path (str): Path to CSV file containing a column 'title'.

        Returns:
            list: A list of title strings.
        """

        if not os.path.exists(data_path):
            raise FileNotFoundError(f"DBLP CSV file not found: {data_path}")

        df = pd.read_csv(data_path)
        if "title" not in df.columns:
            raise ValueError("CSV must contain a 'title' column")

        titles = df["title"].dropna().tolist()
        return titles

469
experiments/experiments.py
Normal file
@@ -0,0 +1,469 @@
import time
from math import floor

from silkmoth.silkmoth_engine import SilkMothEngine
from silkmoth.utils import SigType, edit_similarity, contain, jaccard_similarity
from silkmoth.verifier import Verifier
from silkmoth.tokenizer import Tokenizer
from src.silkmoth.silkmoth_engine import SilkMothEngine
from src.silkmoth.utils import SigType, edit_similarity
from utils import *


def run_experiment_filter_schemes(related_thresholds, similarity_thresholds, labels, source_sets, reference_sets,
                                  sim_metric, sim_func, is_search, file_name_prefix, folder_path):
    """
    Parameters
    ----------
    related_thresholds : list[float]
        Thresholds for determining relatedness between sets.
    similarity_thresholds : list[float]
        Thresholds for measuring similarity between sets.
    labels : list[str]
        Labels indicating the type of setting applied (e.g., "NO FILTER", "CHECK FILTER", "WEIGHTED").
    source_sets : list
        The sets to be compared against the reference sets or against themselves.
    reference_sets : list
        The sets used as the reference for comparison.
    sim_metric : callable
        The metric function used to evaluate similarity between sets.
    sim_func : callable
        The function used to calculate similarity scores.
    is_search : bool
        Flag indicating whether to perform a search operation or discovery.
    file_name_prefix : str
        Prefix for naming output files generated during the experiment.
    folder_path : str
        Path to the folder where results will be saved.
    """

    # Calculate index time and RAM usage for the SilkMothEngine
    in_index_time_start = time.time()
    initial_ram = measure_ram_usage()

    # Initialize and run the SilkMothEngine
    silk_moth_engine = SilkMothEngine(
        related_thresh=0,
        source_sets=source_sets,
        sim_metric=sim_metric,
        sim_func=sim_func,
        sim_thresh=0,
        is_check_filter=False,
        is_nn_filter=False,
    )

    in_index_time_end = time.time()
    final_ram = measure_ram_usage()

    in_index_elapsed_time = in_index_time_end - in_index_time_start
    in_index_ram_usage = final_ram - initial_ram

    print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")

    for sim_thresh in similarity_thresholds:

        # Check if the similarity function is edit similarity
        if sim_func == edit_similarity:
            # calc the maximum possible q-gram size based on sim_thresh
            upper_bound_q = sim_thresh / (1 - sim_thresh)
            q = floor(upper_bound_q)

            print(f"Using q = {q} for edit similarity with sim_thresh = {sim_thresh}")
            print(f"Rebuilding Inverted Index with q = {q}...")
            silk_moth_engine.set_q(q)

        elapsed_times_final = []
        silk_moth_engine.set_alpha(sim_thresh)
        for label in labels:

            elapsed_times = []
            for idx, related_thresh in enumerate(related_thresholds):

                print(
                    f"\nRunning SilkMoth {file_name_prefix} with α = {sim_thresh}, θ = {related_thresh}, label = {label}")

                # checks for filter runs
                if label == "CHECK FILTER":
                    silk_moth_engine.is_check_filter = True
                    silk_moth_engine.is_nn_filter = False
                elif label == "NN FILTER":
                    silk_moth_engine.is_check_filter = False
                    silk_moth_engine.is_nn_filter = True
                else:  # NO FILTER
                    silk_moth_engine.is_check_filter = False
                    silk_moth_engine.is_nn_filter = False

                # checks for signature scheme runs
                if label == SigType.WEIGHTED:
                    silk_moth_engine.set_signature_type(SigType.WEIGHTED)
                elif label == SigType.SKYLINE:
                    silk_moth_engine.set_signature_type(SigType.SKYLINE)
                elif label == SigType.DICHOTOMY:
                    silk_moth_engine.set_signature_type(SigType.DICHOTOMY)

                silk_moth_engine.set_related_threshold(related_thresh)
                # Measure the time taken to search for related sets
                time_start = time.time()

                # Used for search to see how many candidates were found and how many were removed
                candidates_amount = 0
                candidates_after = 0
                related_sets_found = 0
                if is_search:
                    for ref_id, ref_set in enumerate(reference_sets):
                        related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
                            ref_set)
                        candidates_amount += candidates_amount_temp
                        candidates_after += candidates_removed_temp
                        related_sets_found += len(related_sets_temp)
                else:
                    # If not searching, we are discovering sets
                    silk_moth_engine.discover_sets(source_sets)

                time_end = time.time()
                elapsed_time = time_end - time_start

                elapsed_times.append(elapsed_time)

                # Create a new data dictionary for each iteration
                if is_search:
                    data_overall = {
                        "similarity_threshold": sim_thresh,
                        "related_threshold": related_thresh,
                        "reference_set_amount": len(reference_sets),
                        "source_set_amount": len(source_sets),
                        "label": label,
                        "elapsed_time": round(elapsed_time, 3),
                        "inverted_index_time": round(in_index_elapsed_time, 3),
                        "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                        "candidates_amount": candidates_amount,
                        "candidates_amount_after_filtering": candidates_after,
                        "related_sets_found": related_sets_found,
                    }
                else:
                    data_overall = {
                        "similarity_threshold": sim_thresh,
                        "related_threshold": related_thresh,
                        "source_set_amount": len(source_sets),
                        "label": label,
                        "elapsed_time": round(elapsed_time, 3),
                        "inverted_index_time": round(in_index_elapsed_time, 3),
                        "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                    }
                # Save results to a CSV file
                save_experiment_results_to_csv(
                    results=data_overall,
                    file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
                )

            elapsed_times_final.append(elapsed_times)
        _ = plot_elapsed_times(
            related_thresholds=related_thresholds,
            elapsed_times_list=elapsed_times_final,
            fig_text=f"{file_name_prefix} (α = {sim_thresh})",
            legend_labels=labels,
            file_name=f"{folder_path}{file_name_prefix}_experiment_α={sim_thresh}.png"
        )


def run_reduction_experiment(related_thresholds, similarity_threshold, labels, source_sets, reference_sets,
                             sim_metric, sim_func, is_search, file_name_prefix, folder_path):
    """
    Parameters
    ----------
    related_thresholds : list[float]
        Thresholds for determining relatedness between sets.
    similarity_threshold : float
        Threshold for measuring similarity between sets.
    labels : list[str]
        Labels indicating the setting applied ("REDUCTION" or "NO REDUCTION").
    source_sets : list
        The sets to be compared against the reference sets or against themselves.
    reference_sets : list
        The sets used as the reference for comparison.
    sim_metric : callable
        The metric function used to evaluate similarity between sets.
    sim_func : callable
        The function used to calculate similarity scores.
    is_search : bool
        Flag indicating whether to perform a search operation or discovery.
    file_name_prefix : str
        Prefix for naming output files generated during the experiment.
    folder_path : str
        Path to the folder where results will be saved.
    """
    in_index_time_start = time.time()
    initial_ram = measure_ram_usage()

    # Initialize and run the SilkMothEngine
    silk_moth_engine = SilkMothEngine(
        related_thresh=0,
        source_sets=source_sets,
        sim_metric=sim_metric,
        sim_func=sim_func,
        sim_thresh=similarity_threshold,
        is_check_filter=False,
        is_nn_filter=False,
    )
    # use dichotomy signature scheme for this experiment
    silk_moth_engine.set_signature_type(SigType.DICHOTOMY)

    in_index_time_end = time.time()
    final_ram = measure_ram_usage()

    in_index_elapsed_time = in_index_time_end - in_index_time_start
    in_index_ram_usage = final_ram - initial_ram

    print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")

    elapsed_times_final = []
    for label in labels:

        if label == "REDUCTION":
            silk_moth_engine.set_reduction(True)
        elif label == "NO REDUCTION":
            silk_moth_engine.set_reduction(False)

        elapsed_times = []
        for idx, related_thresh in enumerate(related_thresholds):

            print(
                f"\nRunning SilkMoth {file_name_prefix} with α = {similarity_threshold}, θ = {related_thresh}, label = {label}")

            silk_moth_engine.set_related_threshold(related_thresh)
            # Measure the time taken to search for related sets
            time_start = time.time()

            # Used for search to see how many candidates were found and how many were removed
            candidates_amount = 0
            candidates_after = 0
            if is_search:
                for ref_id, ref_set in enumerate(reference_sets):
                    related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
                        ref_set)
                    candidates_amount += candidates_amount_temp
                    candidates_after += candidates_removed_temp
            else:
                # If not searching, we are discovering sets
                silk_moth_engine.discover_sets(source_sets)

            time_end = time.time()
            elapsed_time = time_end - time_start

            elapsed_times.append(elapsed_time)

            # Create a new data dictionary for each iteration
            if is_search:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "reference_set_amount": len(reference_sets),
                    "source_set_amount": len(source_sets),
                    "label": label,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                    "candidates_amount": candidates_amount,
                    "candidates_amount_after_filtering": candidates_after,
                }
            else:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "source_set_amount": len(source_sets),
                    "label": label,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                }

            # Save results to a CSV file
            save_experiment_results_to_csv(
                results=data_overall,
                file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
            )

        elapsed_times_final.append(elapsed_times)
    _ = plot_elapsed_times(
        related_thresholds=related_thresholds,
        elapsed_times_list=elapsed_times_final,
        fig_text=f"{file_name_prefix} (α = {similarity_threshold})",
        legend_labels=labels,
        file_name=f"{folder_path}{file_name_prefix}_experiment_α={similarity_threshold}.png"
    )


def run_scalability_experiment(related_thresholds, similarity_threshold, set_sizes, source_sets, reference_sets,
                               sim_metric, sim_func, is_search, file_name_prefix, folder_path):
    """
    Parameters
    ----------
    related_thresholds : list[float]
        Thresholds for determining relatedness between sets.
    similarity_threshold : float
        Threshold for measuring similarity between sets.
    set_sizes : list[int]
        Numbers of source sets to use; each run takes the first `size` entries of source_sets.
    source_sets : list
        The sets to be compared against the reference sets or against themselves.
    reference_sets : list
        The sets used as the reference for comparison.
    sim_metric : callable
        The metric function used to evaluate similarity between sets.
    sim_func : callable
        The function used to calculate similarity scores.
    is_search : bool
        Flag indicating whether to perform a search operation or discovery.
    file_name_prefix : str
        Prefix for naming output files generated during the experiment.
    folder_path : str
        Path to the folder where results will be saved.
    """
    elapsed_times_final = []
    for idx, related_thresh in enumerate(related_thresholds):
        elapsed_times = []
        for size in set_sizes:
            in_index_time_start = time.time()
            initial_ram = measure_ram_usage()

            # Initialize and run the SilkMothEngine
            silk_moth_engine = SilkMothEngine(
                related_thresh=0,
                source_sets=source_sets[:size],
                sim_metric=sim_metric,
                sim_func=sim_func,
                sim_thresh=similarity_threshold,
                is_check_filter=True,
                is_nn_filter=True,
            )
            in_index_time_end = time.time()
            final_ram = measure_ram_usage()

            in_index_elapsed_time = in_index_time_end - in_index_time_start
            in_index_ram_usage = final_ram - initial_ram

            print(f"Inverted Index created in {in_index_elapsed_time:.2f} seconds.")

            print(
                f"\nRunning SilkMoth {file_name_prefix} with α = {similarity_threshold}, θ = {related_thresh}, set_size = {size}")

            silk_moth_engine.set_related_threshold(related_thresh)
            # Measure the time taken to search for related sets
            # (for edit similarity this includes rebuilding the index with the chosen q)
            time_start = time.time()

            if sim_func == edit_similarity:
                # calc the maximum possible q-gram size based on sim_thresh
                upper_bound_q = similarity_threshold / (1 - similarity_threshold)
                q = floor(upper_bound_q)

                print(f"Using q = {q} for edit similarity with sim_thresh = {similarity_threshold}")
                print(f"Rebuilding Inverted Index with q = {q}...")
                silk_moth_engine.set_q(q)

            # Used for search to see how many candidates were found and how many were removed
            candidates_amount = 0
            candidates_after = 0
            if is_search:
                for ref_id, ref_set in enumerate(reference_sets):
                    related_sets_temp, candidates_amount_temp, candidates_removed_temp = silk_moth_engine.search_sets(
                        ref_set)
                    candidates_amount += candidates_amount_temp
                    candidates_after += candidates_removed_temp
            else:
                # If not searching, we are discovering sets
                silk_moth_engine.discover_sets(source_sets[:size])

            time_end = time.time()
            elapsed_time = time_end - time_start

            elapsed_times.append(elapsed_time)

            # Create a new data dictionary for each iteration
            if is_search:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "reference_set_amount": len(reference_sets),
                    "source_set_amount": len(source_sets[:size]),
                    "set_size": size,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                    "candidates_amount": candidates_amount,
                    "candidates_amount_after_filtering": candidates_after,
                }
            else:
                data_overall = {
                    "similarity_threshold": similarity_threshold,
                    "related_threshold": related_thresh,
                    "source_set_amount": len(source_sets[:size]),
                    "set_size": size,
                    "elapsed_time": round(elapsed_time, 3),
                    "inverted_index_time": round(in_index_elapsed_time, 3),
                    "inverted_index_ram_usage": round(in_index_ram_usage, 3),
                }

            # Save results to a CSV file
            save_experiment_results_to_csv(
                results=data_overall,
                file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
            )
            del silk_moth_engine

        elapsed_times_final.append(elapsed_times)

    # One legend entry per relatedness threshold; the x-axis shows set sizes in units of 100k
    adjusted_legend_labels = [f"θ = {rt}" for rt in related_thresholds]
    adjusted_set_sizes = [size / 100_000 for size in set_sizes]
    _ = plot_elapsed_times(
        related_thresholds=adjusted_set_sizes,
        elapsed_times_list=elapsed_times_final,
        fig_text=f"{file_name_prefix} (α = {similarity_threshold})",
        legend_labels=adjusted_legend_labels,
        file_name=f"{folder_path}{file_name_prefix}_experiment_α={similarity_threshold}.png",
        xlabel="Number of Sets (in 100ks)",
    )


def run_matching_without_silkmoth_inc_dep(source_sets, reference_sets, related_thresholds, similarity_threshold, sim_metric, sim_fun, file_name_prefix, folder_path):

    tokenizer = Tokenizer(sim_func=sim_fun)

    for related_thresh in related_thresholds:
        verifier = Verifier(sim_thresh=similarity_threshold, related_thresh=related_thresh,
                            sim_metric=sim_metric, sim_func=sim_fun, reduction=False)
        related_sets = []
        time_start = time.time()
        for ref in reference_sets:
            for source in source_sets:
                if len(ref) > len(source):
                    continue
                relatedness = verifier.get_relatedness(tokenizer.tokenize(ref), tokenizer.tokenize(source))
                if relatedness >= related_thresh:
                    related_sets.append((source, relatedness))

        time_end = time.time()
        elapsed_time = time_end - time_start

        data_overall = {
            "similarity_threshold": similarity_threshold,
            "related_threshold": related_thresh,
            "source_set_amount": len(source_sets),
            "reference_set_amount": len(reference_sets),
            "label": "RAW MATCH",
            "elapsed_time": round(elapsed_time, 3),
            "matches_found": len(related_sets)
        }

        # Save results to a CSV file
        save_experiment_results_to_csv(
            results=data_overall,
            file_name=f"{folder_path}{file_name_prefix}_experiment_results.csv"
        )
|
||||
|
||||
|
||||
|
||||
|
||||
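The brute-force baseline above delegates the actual comparison to the repository's `Tokenizer` and `Verifier`, whose internals are not shown in this diff. As a rough, self-contained illustration of what a containment-style relatedness check of this kind might compute, the sketch below assumes whitespace tokenization, Jaccard similarity between elements, and an exhaustive search for the best one-to-one matching. The names `jaccard` and `containment` are hypothetical and not part of the project; the exhaustive search is only practical for very small sets.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(reference, source, alpha=0.0):
    """Best one-to-one matching score between reference and source, divided by |reference|.

    Assumption: element similarities below `alpha` contribute nothing to the score.
    """
    if not reference or len(reference) > len(source):
        return 0.0  # mirrors the size check in the baseline loop
    ref_tokens = [elem.split() for elem in reference]
    src_tokens = [elem.split() for elem in source]
    best = 0.0
    # Try every injective assignment of reference elements to source elements.
    for perm in permutations(range(len(src_tokens)), len(ref_tokens)):
        score = 0.0
        for i, j in enumerate(perm):
            sim = jaccard(ref_tokens[i], src_tokens[j])
            if sim >= alpha:
                score += sim
        best = max(best, score)
    return best / len(reference)


if __name__ == "__main__":
    ref = ["a b", "c d"]
    src = ["c d", "a b x", "z"]
    print(containment(ref, src))             # both elements contribute
    print(containment(ref, src, alpha=0.7))  # the partial match is cut off
```

A production implementation would replace the permutation search with a polynomial-time assignment solver; the exhaustive version is used here only to keep the sketch dependency-free.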
@@ -0,0 +1,49 @@
|
||||
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering,related_sets_found
|
||||
0.0,0.7,1000,500000,NO FILTER,1036.548,49.107,7727.559,3006749,3006749,986715
|
||||
0.0,0.75,1000,500000,NO FILTER,871.225,49.107,7727.559,2673348,2673348,964206
|
||||
0.0,0.8,1000,500000,NO FILTER,695.528,49.107,7727.559,2273416,2273416,934002
|
||||
0.0,0.85,1000,500000,NO FILTER,548.878,49.107,7727.559,1907985,1907985,879744
|
||||
0.0,0.7,1000,500000,CHECK FILTER,980.124,49.107,7727.559,3006749,2852034,986715
|
||||
0.0,0.75,1000,500000,CHECK FILTER,789.947,49.107,7727.559,2673348,2531660,964206
|
||||
0.0,0.8,1000,500000,CHECK FILTER,590.707,49.107,7727.559,2273416,2107346,934002
|
||||
0.0,0.85,1000,500000,CHECK FILTER,427.982,49.107,7727.559,1907985,1728877,879744
|
||||
0.0,0.7,1000,500000,NN FILTER,533.776,49.107,7727.559,3006749,2547,2535
|
||||
0.0,0.75,1000,500000,NN FILTER,448.358,49.107,7727.559,2673348,2394,2382
|
||||
0.0,0.8,1000,500000,NN FILTER,359.112,49.107,7727.559,2273416,1077,1077
|
||||
0.0,0.85,1000,500000,NN FILTER,268.529,49.107,7727.559,1907985,1037,1037
|
||||
0.25,0.7,1000,500000,NO FILTER,1038.225,49.107,7727.559,3006749,3006749,984756
|
||||
0.25,0.75,1000,500000,NO FILTER,866.06,49.107,7727.559,2673348,2673348,963792
|
||||
0.25,0.8,1000,500000,NO FILTER,693.589,49.107,7727.559,2273416,2273416,933799
|
||||
0.25,0.85,1000,500000,NO FILTER,545.784,49.107,7727.559,1907985,1907985,878482
|
||||
0.25,0.7,1000,500000,CHECK FILTER,975.103,49.107,7727.559,3006749,2852028,984756
|
||||
0.25,0.75,1000,500000,CHECK FILTER,787.87,49.107,7727.559,2673348,2531660,963792
|
||||
0.25,0.8,1000,500000,CHECK FILTER,589.608,49.107,7727.559,2273416,2107346,933799
|
||||
0.25,0.85,1000,500000,CHECK FILTER,426.222,49.107,7727.559,1907985,1728877,878482
|
||||
0.25,0.7,1000,500000,NN FILTER,573.448,49.107,7727.559,3006749,2544,2532
|
||||
0.25,0.75,1000,500000,NN FILTER,483.1,49.107,7727.559,2673348,2394,2382
|
||||
0.25,0.8,1000,500000,NN FILTER,385.999,49.107,7727.559,2273416,1077,1077
|
||||
0.25,0.85,1000,500000,NN FILTER,288.687,49.107,7727.559,1907985,1037,1037
|
||||
0.5,0.7,1000,500000,NO FILTER,1031.681,49.107,7727.559,3006749,3006749,975892
|
||||
0.5,0.75,1000,500000,NO FILTER,867.694,49.107,7727.559,2673348,2673348,951793
|
||||
0.5,0.8,1000,500000,NO FILTER,693.398,49.107,7727.559,2273416,2273416,931599
|
||||
0.5,0.85,1000,500000,NO FILTER,546.702,49.107,7727.559,1907985,1907985,875833
|
||||
0.5,0.7,1000,500000,CHECK FILTER,971.71,49.107,7727.559,3006749,2848668,975892
|
||||
0.5,0.75,1000,500000,CHECK FILTER,783.145,49.107,7727.559,2673348,2529966,951793
|
||||
0.5,0.8,1000,500000,CHECK FILTER,585.346,49.107,7727.559,2273416,2106355,931599
|
||||
0.5,0.85,1000,500000,CHECK FILTER,424.629,49.107,7727.559,1907985,1728640,875833
|
||||
0.5,0.7,1000,500000,NN FILTER,573.046,49.107,7727.559,3006749,2544,2532
|
||||
0.5,0.75,1000,500000,NN FILTER,482.035,49.107,7727.559,2673348,2394,2382
|
||||
0.5,0.8,1000,500000,NN FILTER,385.754,49.107,7727.559,2273416,1077,1077
|
||||
0.5,0.85,1000,500000,NN FILTER,288.24,49.107,7727.559,1907985,1037,1037
|
||||
0.75,0.7,1000,500000,NO FILTER,1032.605,49.107,7727.559,3006749,3006749,973885
|
||||
0.75,0.75,1000,500000,NO FILTER,866.218,49.107,7727.559,2673348,2673348,949627
|
||||
0.75,0.8,1000,500000,NO FILTER,693.19,49.107,7727.559,2273416,2273416,929232
|
||||
0.75,0.85,1000,500000,NO FILTER,548.07,49.107,7727.559,1907985,1907985,875163
|
||||
0.75,0.7,1000,500000,CHECK FILTER,960.003,49.107,7727.559,3006749,2838145,973885
|
||||
0.75,0.75,1000,500000,CHECK FILTER,773.8,49.107,7727.559,2673348,2519134,949627
|
||||
0.75,0.8,1000,500000,CHECK FILTER,577.671,49.107,7727.559,2273416,2100303,929232
|
||||
0.75,0.85,1000,500000,CHECK FILTER,417.292,49.107,7727.559,1907985,1725354,875163
|
||||
0.75,0.7,1000,500000,NN FILTER,544.018,49.107,7727.559,3006749,2544,2532
|
||||
0.75,0.75,1000,500000,NN FILTER,463.915,49.107,7727.559,2673348,2394,2382
|
||||
0.75,0.8,1000,500000,NN FILTER,378.184,49.107,7727.559,2273416,1077,1077
|
||||
0.75,0.85,1000,500000,NN FILTER,285.8,49.107,7727.559,1907985,1040,1040
|
||||
|
|
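To read the filter columns in the table above, it can help to express `candidates_amount_after_filtering` as a fraction of `candidates_amount`. The following is a minimal sketch that does this for three rows copied from the table (similarity threshold 0.0, related threshold 0.7); the function name `pruned_fraction_by_label` and the reduced column set are choices made for this example only.

```python
import csv
import io

# Three rows copied from the results above, reduced to the relevant columns.
SAMPLE_CSV = """label,candidates_amount,candidates_amount_after_filtering
NO FILTER,3006749,3006749
CHECK FILTER,3006749,2852034
NN FILTER,3006749,2547
"""


def pruned_fraction_by_label(csv_text):
    """Return {label: fraction of candidates removed by the filter}."""
    fractions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        before = int(row["candidates_amount"])
        after = int(row["candidates_amount_after_filtering"])
        fractions[row["label"]] = 1 - after / before if before else 0.0
    return fractions


if __name__ == "__main__":
    for label, fraction in pruned_fraction_by_label(SAMPLE_CSV).items():
        print(f"{label}: {fraction:.2%} of candidates pruned")
```

The same function should work on a full results file if its text is passed in, provided the two candidate columns are present.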
@@ -0,0 +1,49 @@
|
||||
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering,related_sets_found
|
||||
0.0,0.7,200,500000,NO FILTER,6753.593,49.277,7720.887,622080,622080,233513
|
||||
0.0,0.75,200,500000,NO FILTER,6812.967,49.277,7720.887,575078,575078,223644
|
||||
0.0,0.8,200,500000,NO FILTER,4953.635,49.277,7720.887,479650,479650,221376
|
||||
0.0,0.85,200,500000,NO FILTER,4212.413,49.277,7720.887,423078,423078,196944
|
||||
0.0,0.7,200,500000,CHECK FILTER,3835.233,49.277,7720.887,622080,589307,233513
|
||||
0.0,0.75,200,500000,CHECK FILTER,3348.061,49.277,7720.887,575078,549687,223644
|
||||
0.0,0.8,200,500000,CHECK FILTER,2414.995,49.277,7720.887,479650,438680,221376
|
||||
0.0,0.85,200,500000,CHECK FILTER,1874.261,49.277,7720.887,423078,393028,196944
|
||||
0.0,0.7,200,500000,NN FILTER,126.601,49.277,7720.887,622080,615,603
|
||||
0.0,0.75,200,500000,NN FILTER,108.886,49.277,7720.887,575078,332,320
|
||||
0.0,0.8,200,500000,NN FILTER,80.436,49.277,7720.887,479650,1,1
|
||||
0.0,0.85,200,500000,NN FILTER,59.824,49.277,7720.887,423078,1,1
|
||||
0.25,0.7,200,500000,NO FILTER,2191.216,49.277,7720.887,622080,622080,232290
|
||||
0.25,0.75,200,500000,NO FILTER,1915.087,49.277,7720.887,575078,575078,223444
|
||||
0.25,0.8,200,500000,NO FILTER,1544.113,49.277,7720.887,479650,479650,221284
|
||||
0.25,0.85,200,500000,NO FILTER,1354.29,49.277,7720.887,423078,423078,196116
|
||||
0.25,0.7,200,500000,CHECK FILTER,1809.643,49.277,7720.887,622080,589307,232290
|
||||
0.25,0.75,200,500000,CHECK FILTER,1548.963,49.277,7720.887,575078,549687,223444
|
||||
0.25,0.8,200,500000,CHECK FILTER,1277.618,49.277,7720.887,479650,438680,221284
|
||||
0.25,0.85,200,500000,CHECK FILTER,1111.088,49.277,7720.887,423078,393028,196116
|
||||
0.25,0.7,200,500000,NN FILTER,131.183,49.277,7720.887,622080,615,603
|
||||
0.25,0.75,200,500000,NN FILTER,114.192,49.277,7720.887,575078,332,320
|
||||
0.25,0.8,200,500000,NN FILTER,84.253,49.277,7720.887,479650,1,1
|
||||
0.25,0.85,200,500000,NN FILTER,62.864,49.277,7720.887,423078,1,1
|
||||
0.5,0.7,200,500000,NO FILTER,1682.409,49.277,7720.887,622080,622080,230903
|
||||
0.5,0.75,200,500000,NO FILTER,1491.797,49.277,7720.887,575078,575078,222613
|
||||
0.5,0.8,200,500000,NO FILTER,1250.727,49.277,7720.887,479650,479650,219875
|
||||
0.5,0.85,200,500000,NO FILTER,1083.762,49.277,7720.887,423078,423078,195759
|
||||
0.5,0.7,200,500000,CHECK FILTER,1436.208,49.277,7720.887,622080,588701,230903
|
||||
0.5,0.75,200,500000,CHECK FILTER,1250.22,49.277,7720.887,575078,549178,222613
|
||||
0.5,0.8,200,500000,CHECK FILTER,1023.904,49.277,7720.887,479650,438258,219875
|
||||
0.5,0.85,200,500000,CHECK FILTER,893.938,49.277,7720.887,423078,392937,195759
|
||||
0.5,0.7,200,500000,NN FILTER,129.51,49.277,7720.887,622080,615,603
|
||||
0.5,0.75,200,500000,NN FILTER,112.158,49.277,7720.887,575078,332,320
|
||||
0.5,0.8,200,500000,NN FILTER,83.434,49.277,7720.887,479650,1,1
|
||||
0.5,0.85,200,500000,NN FILTER,62.648,49.277,7720.887,423078,1,1
|
||||
0.75,0.7,200,500000,NO FILTER,1447.675,49.277,7720.887,622080,622080,230497
|
||||
0.75,0.75,200,500000,NO FILTER,1270.052,49.277,7720.887,575078,575078,222063
|
||||
0.75,0.8,200,500000,NO FILTER,1039.89,49.277,7720.887,479650,479650,219411
|
||||
0.75,0.85,200,500000,NO FILTER,879.273,49.277,7720.887,423078,423078,195601
|
||||
0.75,0.7,200,500000,CHECK FILTER,1193.541,49.277,7720.887,622080,586297,230497
|
||||
0.75,0.75,200,500000,CHECK FILTER,1023.672,49.277,7720.887,575078,546701,222063
|
||||
0.75,0.8,200,500000,CHECK FILTER,825.541,49.277,7720.887,479650,436782,219411
|
||||
0.75,0.85,200,500000,CHECK FILTER,704.52,49.277,7720.887,423078,391809,195601
|
||||
0.75,0.7,200,500000,NN FILTER,120.522,49.277,7720.887,622080,615,603
|
||||
0.75,0.75,200,500000,NN FILTER,107.657,49.277,7720.887,575078,332,320
|
||||
0.75,0.8,200,500000,NN FILTER,78.897,49.277,7720.887,479650,1,1
|
||||
0.75,0.85,200,500000,NN FILTER,57.66,49.277,7720.887,423078,1,1
|
||||
|
|
@@ -0,0 +1,2 @@
experiment name,elem/set,tokens/elem
Inclusion Dependency,17.81003,25.41035090901026
|
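The script that produced the `elem/set` and `tokens/elem` columns is not part of this excerpt. Assuming they denote the mean number of elements per set and the mean number of whitespace-separated tokens per element (both are assumptions about the column semantics), a minimal sketch of such a computation could look like this; `set_statistics` is a hypothetical helper name.

```python
def set_statistics(sets):
    """Return (mean elements per set, mean tokens per element).

    Assumes each set is a list of strings and tokens are whitespace-separated.
    """
    total_sets = len(sets)
    total_elems = sum(len(s) for s in sets)
    total_tokens = sum(len(elem.split()) for s in sets for elem in s)
    elems_per_set = total_elems / total_sets if total_sets else 0.0
    tokens_per_elem = total_tokens / total_elems if total_elems else 0.0
    return elems_per_set, tokens_per_elem


if __name__ == "__main__":
    toy_sets = [["a b", "c"], ["d e f"]]
    print(set_statistics(toy_sets))
```

If the project tokenizes with q-grams for some experiments, the token count per element would differ from this whitespace-based sketch.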
@@ -0,0 +1,17 @@
|
||||
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||
0.0,0.7,200,500000,REDUCTION,6283.871,45.782,7700.914,622080,622080
|
||||
0.0,0.75,200,500000,REDUCTION,5651.069,45.782,7700.914,575078,575078
|
||||
0.0,0.8,200,500000,REDUCTION,4170.768,45.782,7700.914,479650,479650
|
||||
0.0,0.85,200,500000,REDUCTION,3514.723,45.782,7700.914,423078,423078
|
||||
0.0,0.7,200,500000,NO REDUCTION,6771.001,45.782,7700.914,622080,622080
|
||||
0.0,0.75,200,500000,NO REDUCTION,6117.305,45.782,7700.914,575078,575078
|
||||
0.0,0.8,200,500000,NO REDUCTION,4573.585,45.782,7700.914,479650,479650
|
||||
0.0,0.85,200,500000,NO REDUCTION,3894.681,45.782,7700.914,423078,423078
|
||||
0.0,0.7,200,500000,REDUCTION,6142.242,49.376,7721.383,622080,622080
|
||||
0.0,0.75,200,500000,REDUCTION,5495.346,49.376,7721.383,575078,575078
|
||||
0.0,0.8,200,500000,REDUCTION,4061.815,49.376,7721.383,479650,479650
|
||||
0.0,0.85,200,500000,REDUCTION,3429.474,49.376,7721.383,423078,423078
|
||||
0.0,0.7,200,500000,NO REDUCTION,6622.959,49.376,7721.383,622080,622080
|
||||
0.0,0.75,200,500000,NO REDUCTION,5960.971,49.376,7721.383,575078,575078
|
||||
0.0,0.8,200,500000,NO REDUCTION,4489.11,49.376,7721.383,479650,479650
|
||||
0.0,0.85,200,500000,NO REDUCTION,3794.505,49.376,7721.383,423078,423078
|
||||
|
|
@@ -0,0 +1,21 @@
|
||||
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||
0.5,0.7,200,100000,100000,69.222,11.405,1554.535,134576,46830
|
||||
0.5,0.7,200,200000,200000,134.718,23.409,1659.543,254379,93573
|
||||
0.5,0.7,200,300000,300000,206.136,32.782,1791.512,373007,139377
|
||||
0.5,0.7,200,400000,400000,275.559,51.827,2040.961,499998,186205
|
||||
0.5,0.7,200,500000,500000,353.944,51.169,2027.262,622080,233091
|
||||
0.5,0.75,200,100000,100000,64.988,5.539,0.254,124611,45115
|
||||
0.5,0.75,200,200000,200000,126.721,24.159,192.152,236137,90048
|
||||
0.5,0.75,200,300000,300000,193.126,32.91,2217.562,347108,134199
|
||||
0.5,0.75,200,400000,400000,259.254,50.945,1535.723,462815,179223
|
||||
0.5,0.75,200,500000,500000,328.0,59.734,2526.176,575078,224315
|
||||
0.5,0.8,200,100000,100000,59.984,5.544,0.77,104812,44549
|
||||
0.5,0.8,200,200000,200000,123.595,23.419,-229.445,202489,88907
|
||||
0.5,0.8,200,300000,300000,183.55,37.277,2302.273,300462,132525
|
||||
0.5,0.8,200,400000,400000,239.431,45.86,1268.406,386895,176985
|
||||
0.5,0.8,200,500000,500000,311.525,58.657,2716.348,479650,221057
|
||||
0.5,0.85,200,100000,100000,56.371,9.486,-151.641,87451,39657
|
||||
0.5,0.85,200,200000,200000,108.674,23.698,-889.457,171938,79056
|
||||
0.5,0.85,200,300000,300000,164.616,33.799,2748.523,251392,117969
|
||||
0.5,0.85,200,400000,400000,220.908,45.263,805.023,331901,157572
|
||||
0.5,0.85,200,500000,500000,281.56,65.197,3474.547,423078,197145
|
||||
|
|
@@ -0,0 +1,49 @@
|
||||
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||
0.0,0.7,200,500000,SigType.WEIGHTED,6915.71,47.599,7701.59,622080,622080
|
||||
0.0,0.75,200,500000,SigType.WEIGHTED,6230.769,47.599,7701.59,575078,575078
|
||||
0.0,0.8,200,500000,SigType.WEIGHTED,4633.178,47.599,7701.59,479650,479650
|
||||
0.0,0.85,200,500000,SigType.WEIGHTED,3948.011,47.599,7701.59,423078,423078
|
||||
0.0,0.7,200,500000,SigType.SKYLINE,6839.554,47.599,7701.59,622080,622080
|
||||
0.0,0.75,200,500000,SigType.SKYLINE,6156.19,47.599,7701.59,575078,575078
|
||||
0.0,0.8,200,500000,SigType.SKYLINE,4601.987,47.599,7701.59,479650,479650
|
||||
0.0,0.85,200,500000,SigType.SKYLINE,3921.286,47.599,7701.59,423078,423078
|
||||
0.0,0.7,200,500000,SigType.DICHOTOMY,6824.442,47.599,7701.59,622080,622080
|
||||
0.0,0.75,200,500000,SigType.DICHOTOMY,6158.089,47.599,7701.59,575078,575078
|
||||
0.0,0.8,200,500000,SigType.DICHOTOMY,4601.877,47.599,7701.59,479650,479650
|
||||
0.0,0.85,200,500000,SigType.DICHOTOMY,3923.695,47.599,7701.59,423078,423078
|
||||
0.25,0.7,200,500000,SigType.WEIGHTED,1990.666,47.599,7701.59,622080,622080
|
||||
0.25,0.75,200,500000,SigType.WEIGHTED,1722.451,47.599,7701.59,575078,575078
|
||||
0.25,0.8,200,500000,SigType.WEIGHTED,1438.235,47.599,7701.59,479650,479650
|
||||
0.25,0.85,200,500000,SigType.WEIGHTED,1264.852,47.599,7701.59,423078,423078
|
||||
0.25,0.7,200,500000,SigType.SKYLINE,1989.546,47.599,7701.59,622080,622080
|
||||
0.25,0.75,200,500000,SigType.SKYLINE,1719.169,47.599,7701.59,575078,575078
|
||||
0.25,0.8,200,500000,SigType.SKYLINE,1440.077,47.599,7701.59,479650,479650
|
||||
0.25,0.85,200,500000,SigType.SKYLINE,1267.701,47.599,7701.59,423078,423078
|
||||
0.25,0.7,200,500000,SigType.DICHOTOMY,2046.949,47.599,7701.59,622270,622270
|
||||
0.25,0.75,200,500000,SigType.DICHOTOMY,1966.499,47.599,7701.59,575268,575268
|
||||
0.25,0.8,200,500000,SigType.DICHOTOMY,1485.458,47.599,7701.59,479650,479650
|
||||
0.25,0.85,200,500000,SigType.DICHOTOMY,1436.847,47.599,7701.59,423078,423078
|
||||
0.5,0.7,200,500000,SigType.WEIGHTED,1767.439,47.599,7701.59,622080,622080
|
||||
0.5,0.75,200,500000,SigType.WEIGHTED,1565.259,47.599,7701.59,575078,575078
|
||||
0.5,0.8,200,500000,SigType.WEIGHTED,1160.579,47.599,7701.59,479650,479650
|
||||
0.5,0.85,200,500000,SigType.WEIGHTED,1014.452,47.599,7701.59,423078,423078
|
||||
0.5,0.7,200,500000,SigType.SKYLINE,1589.081,47.599,7701.59,622054,622054
|
||||
0.5,0.75,200,500000,SigType.SKYLINE,1393.117,47.599,7701.59,575050,575050
|
||||
0.5,0.8,200,500000,SigType.SKYLINE,1154.931,47.599,7701.59,479622,479622
|
||||
0.5,0.85,200,500000,SigType.SKYLINE,1025.061,47.599,7701.59,423078,423078
|
||||
0.5,0.7,200,500000,SigType.DICHOTOMY,2777.528,47.599,7701.59,936785,936785
|
||||
0.5,0.75,200,500000,SigType.DICHOTOMY,2340.389,47.599,7701.59,888736,888736
|
||||
0.5,0.8,200,500000,SigType.DICHOTOMY,1678.145,47.599,7701.59,673929,673929
|
||||
0.5,0.85,200,500000,SigType.DICHOTOMY,1374.518,47.599,7701.59,517483,517483
|
||||
0.75,0.7,200,500000,SigType.WEIGHTED,1354.402,47.599,7701.59,622080,622080
|
||||
0.75,0.75,200,500000,SigType.WEIGHTED,1187.603,47.599,7701.59,575078,575078
|
||||
0.75,0.8,200,500000,SigType.WEIGHTED,971.469,47.599,7701.59,479650,479650
|
||||
0.75,0.85,200,500000,SigType.WEIGHTED,822.075,47.599,7701.59,423078,423078
|
||||
0.75,0.7,200,500000,SigType.SKYLINE,1303.676,47.599,7701.59,594466,594466
|
||||
0.75,0.75,200,500000,SigType.SKYLINE,1152.405,47.599,7701.59,560020,560020
|
||||
0.75,0.8,200,500000,SigType.SKYLINE,932.283,47.599,7701.59,467458,467458
|
||||
0.75,0.85,200,500000,SigType.SKYLINE,816.709,47.599,7701.59,420962,420962
|
||||
0.75,0.7,200,500000,SigType.DICHOTOMY,5710.524,47.599,7701.59,2410732,2410732
|
||||
0.75,0.75,200,500000,SigType.DICHOTOMY,5072.603,47.599,7701.59,2145096,2145096
|
||||
0.75,0.8,200,500000,SigType.DICHOTOMY,4403.341,47.599,7701.59,1739362,1739362
|
||||
0.75,0.85,200,500000,SigType.DICHOTOMY,2735.424,47.599,7701.59,1078937,1078937
|
||||
|
|
@@ -0,0 +1,5 @@
similarity_threshold,related_threshold,source_set_amount,reference_set_amount,label,elapsed_time,matches_found
0.5,0.7,500000,200,RAW MATCH,6945.364,230903
0.5,0.75,500000,200,RAW MATCH,6965.759,222613
0.5,0.8,500000,200,RAW MATCH,6974.576,219875
0.5,0.85,500000,200,RAW MATCH,7011.368,195759
|
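One way to put the RAW MATCH timings in context is to divide them by the elapsed times of a SilkMoth run with the same thresholds. The sketch below pairs the four RAW MATCH rows with the `NO FILTER` rows at similarity threshold 0.5 from the 200-reference-set results earlier in this diff; treating those two files as directly comparable runs is an assumption. It also checks that both runs report the same number of related sets before computing a ratio. The names `RAW_MATCH`, `NO_FILTER`, and `speedups` are chosen for this example.

```python
# (elapsed seconds, related sets found), keyed by related_threshold.
# Values are copied from the CSV rows at similarity_threshold = 0.5.
RAW_MATCH = {
    0.7: (6945.364, 230903),
    0.75: (6965.759, 222613),
    0.8: (6974.576, 219875),
    0.85: (7011.368, 195759),
}
NO_FILTER = {
    0.7: (1682.409, 230903),
    0.75: (1491.797, 222613),
    0.8: (1250.727, 219875),
    0.85: (1083.762, 195759),
}


def speedups(baseline, candidate):
    """Return {threshold: baseline_time / candidate_time}, requiring equal result counts."""
    result = {}
    for delta, (base_time, base_count) in baseline.items():
        cand_time, cand_count = candidate[delta]
        if base_count != cand_count:
            raise ValueError(f"result counts differ at threshold {delta}")
        result[delta] = base_time / cand_time
    return result


if __name__ == "__main__":
    for delta, factor in speedups(RAW_MATCH, NO_FILTER).items():
        print(f"related_threshold={delta}: {factor:.2f}x faster than RAW MATCH")
```

The count check matters because a ratio is only meaningful when both runs return the same result set.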
@@ -0,0 +1,49 @@
|
||||
similarity_threshold,related_threshold,reference_set_amount,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage,candidates_amount,candidates_amount_after_filtering
|
||||
0.0,0.7,60000,60000,NO FILTER,3321.166,2.336,115.465,3055067,3055067
|
||||
0.0,0.75,60000,60000,NO FILTER,1997.976,2.336,115.465,2321584,2321584
|
||||
0.0,0.8,60000,60000,NO FILTER,1226.647,2.336,115.465,1265300,1265300
|
||||
0.0,0.85,60000,60000,NO FILTER,530.302,2.336,115.465,642202,642202
|
||||
0.0,0.7,60000,60000,CHECK FILTER,3766.567,2.336,115.465,3055067,2464704
|
||||
0.0,0.75,60000,60000,CHECK FILTER,2241.664,2.336,115.465,2321584,1780582
|
||||
0.0,0.8,60000,60000,CHECK FILTER,1371.372,2.336,115.465,1265300,936432
|
||||
0.0,0.85,60000,60000,CHECK FILTER,2052.574,2.336,115.465,642202,523745
|
||||
0.0,0.7,60000,60000,NN FILTER,1752.545,2.336,115.465,3055067,0
|
||||
0.0,0.75,60000,60000,NN FILTER,1410.607,2.336,115.465,2321584,0
|
||||
0.0,0.8,60000,60000,NN FILTER,817.098,2.336,115.465,1265300,0
|
||||
0.0,0.85,60000,60000,NN FILTER,450.277,2.336,115.465,642202,0
|
||||
0.25,0.7,60000,60000,NO FILTER,4295.794,2.336,115.465,3055067,3055067
|
||||
0.25,0.75,60000,60000,NO FILTER,1973.377,2.336,115.465,2321584,2321584
|
||||
0.25,0.8,60000,60000,NO FILTER,1212.983,2.336,115.465,1265300,1265300
|
||||
0.25,0.85,60000,60000,NO FILTER,522.616,2.336,115.465,642202,642202
|
||||
0.25,0.7,60000,60000,CHECK FILTER,3200.851,2.336,115.465,3055067,2455726
|
||||
0.25,0.75,60000,60000,CHECK FILTER,1889.267,2.336,115.465,2321584,1770634
|
||||
0.25,0.8,60000,60000,CHECK FILTER,1147.932,2.336,115.465,1265300,928712
|
||||
0.25,0.85,60000,60000,CHECK FILTER,498.44,2.336,115.465,642202,522759
|
||||
0.25,0.7,60000,60000,NN FILTER,122.104,2.336,115.465,3055067,0
|
||||
0.25,0.75,60000,60000,NN FILTER,88.259,2.336,115.465,2321584,0
|
||||
0.25,0.8,60000,60000,NN FILTER,49.714,2.336,115.465,1265300,0
|
||||
0.25,0.85,60000,60000,NN FILTER,23.838,2.336,115.465,642202,0
|
||||
0.5,0.7,60000,60000,NO FILTER,3272.056,2.336,115.465,3055067,3055067
|
||||
0.5,0.75,60000,60000,NO FILTER,1961.328,2.336,115.465,2321584,2321584
|
||||
0.5,0.8,60000,60000,NO FILTER,1200.994,2.336,115.465,1265300,1265300
|
||||
0.5,0.85,60000,60000,NO FILTER,511.108,2.336,115.465,642202,642202
|
||||
0.5,0.7,60000,60000,CHECK FILTER,3183.991,2.336,115.465,3055067,2437997
|
||||
0.5,0.75,60000,60000,CHECK FILTER,1875.468,2.336,115.465,2321584,1756738
|
||||
0.5,0.8,60000,60000,CHECK FILTER,1137.157,2.336,115.465,1265300,918967
|
||||
0.5,0.85,60000,60000,CHECK FILTER,488.508,2.336,115.465,642202,517859
|
||||
0.5,0.7,60000,60000,NN FILTER,120.567,2.336,115.465,3055067,0
|
||||
0.5,0.75,60000,60000,NN FILTER,87.173,2.336,115.465,2321584,0
|
||||
0.5,0.8,60000,60000,NN FILTER,49.292,2.336,115.465,1265300,0
|
||||
0.5,0.85,60000,60000,NN FILTER,23.97,2.336,115.465,642202,0
|
||||
0.75,0.7,60000,60000,NO FILTER,3085.617,2.336,115.465,3055067,3055067
|
||||
0.75,0.75,60000,60000,NO FILTER,1788.559,2.336,115.465,2321584,2321584
|
||||
0.75,0.8,60000,60000,NO FILTER,1046.714,2.336,115.465,1265300,1265300
|
||||
0.75,0.85,60000,60000,NO FILTER,481.793,2.336,115.465,642202,642202
|
||||
0.75,0.7,60000,60000,CHECK FILTER,2991.745,2.336,115.465,3055067,2428269
|
||||
0.75,0.75,60000,60000,CHECK FILTER,1699.433,2.336,115.465,2321584,1750589
|
||||
0.75,0.8,60000,60000,CHECK FILTER,983.657,2.336,115.465,1265300,916628
|
||||
0.75,0.85,60000,60000,CHECK FILTER,458.081,2.336,115.465,642202,516012
|
||||
0.75,0.7,60000,60000,NN FILTER,119.557,2.336,115.465,3055067,0
|
||||
0.75,0.75,60000,60000,NN FILTER,86.338,2.336,115.465,2321584,0
|
||||
0.75,0.8,60000,60000,NN FILTER,48.63,2.336,115.465,1265300,0
|
||||
0.75,0.85,60000,60000,NN FILTER,23.63,2.336,115.465,642202,0
|
||||
|
|
@@ -0,0 +1,49 @@
|
||||
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
|
||||
0.0,0.7,60000,NO FILTER,5210.037,1.383,95.605
|
||||
0.0,0.75,60000,NO FILTER,4654.41,1.383,95.605
|
||||
0.0,0.8,60000,NO FILTER,3891.372,1.383,95.605
|
||||
0.0,0.85,60000,NO FILTER,3561.118,1.383,95.605
|
||||
0.0,0.7,60000,CHECK FILTER,5374.941,1.383,95.605
|
||||
0.0,0.75,60000,CHECK FILTER,4772.542,1.383,95.605
|
||||
0.0,0.8,60000,CHECK FILTER,4004.38,1.383,95.605
|
||||
0.0,0.85,60000,CHECK FILTER,3653.843,1.383,95.605
|
||||
0.0,0.7,60000,NN FILTER,3889.903,1.383,95.605
|
||||
0.0,0.75,60000,NN FILTER,3739.136,1.383,95.605
|
||||
0.0,0.8,60000,NN FILTER,3609.17,1.383,95.605
|
||||
0.0,0.85,60000,NN FILTER,3517.33,1.383,95.605
|
||||
0.25,0.7,60000,NO FILTER,5157.674,1.383,95.605
|
||||
0.25,0.75,60000,NO FILTER,4621.14,1.383,95.605
|
||||
0.25,0.8,60000,NO FILTER,3905.856,1.383,95.605
|
||||
0.25,0.85,60000,NO FILTER,3598.239,1.383,95.605
|
||||
0.25,0.7,60000,CHECK FILTER,5331.451,1.383,95.605
|
||||
0.25,0.75,60000,CHECK FILTER,4769.428,1.383,95.605
|
||||
0.25,0.8,60000,CHECK FILTER,4042.779,1.383,95.605
|
||||
0.25,0.85,60000,CHECK FILTER,3709.669,1.383,95.605
|
||||
0.25,0.7,60000,NN FILTER,3910.54,1.383,95.605
|
||||
0.25,0.75,60000,NN FILTER,3760.587,1.383,95.605
|
||||
0.25,0.8,60000,NN FILTER,3644.443,1.383,95.605
|
||||
0.25,0.85,60000,NN FILTER,3558.579,1.383,95.605
|
||||
0.5,0.7,60000,NO FILTER,5143.478,1.383,95.605
|
||||
0.5,0.75,60000,NO FILTER,4670.328,1.383,95.605
|
||||
0.5,0.8,60000,NO FILTER,3917.002,1.383,95.605
|
||||
0.5,0.85,60000,NO FILTER,3556.487,1.383,95.605
|
||||
0.5,0.7,60000,CHECK FILTER,5279.287,1.383,95.605
|
||||
0.5,0.75,60000,CHECK FILTER,4749.58,1.383,95.605
|
||||
0.5,0.8,60000,CHECK FILTER,4009.224,1.383,95.605
|
||||
0.5,0.85,60000,CHECK FILTER,3659.874,1.383,95.605
|
||||
0.5,0.7,60000,NN FILTER,3897.174,1.383,95.605
|
||||
0.5,0.75,60000,NN FILTER,3771.733,1.383,95.605
|
||||
0.5,0.8,60000,NN FILTER,3657.094,1.383,95.605
|
||||
0.5,0.85,60000,NN FILTER,3553.523,1.383,95.605
|
||||
0.75,0.7,60000,NO FILTER,5107.903,1.383,95.605
|
||||
0.75,0.75,60000,NO FILTER,4582.298,1.383,95.605
|
||||
0.75,0.8,60000,NO FILTER,3889.505,1.383,95.605
|
||||
0.75,0.85,60000,NO FILTER,3559.531,1.383,95.605
|
||||
0.75,0.7,60000,CHECK FILTER,5254.747,1.383,95.605
|
||||
0.75,0.75,60000,CHECK FILTER,4722.922,1.383,95.605
|
||||
0.75,0.8,60000,CHECK FILTER,3977.968,1.383,95.605
|
||||
0.75,0.85,60000,CHECK FILTER,3635.288,1.383,95.605
|
||||
0.75,0.7,60000,NN FILTER,3874.915,1.383,95.605
|
||||
0.75,0.75,60000,NN FILTER,3786.562,1.383,95.605
|
||||
0.75,0.8,60000,NN FILTER,3901.219,1.383,95.605
|
||||
0.75,0.85,60000,NN FILTER,3541.992,1.383,95.605
|
||||
|
|
@@ -0,0 +1,2 @@
experiment name,elem/set,tokens/elem
Schema Matching,4.839676,7.059130404597332
|
@@ -0,0 +1,21 @@
|
||||
similarity_threshold,related_threshold,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage
|
||||
0.0,0.7,12000,12000,162.511,1.149,10.633
|
||||
0.0,0.7,24000,24000,629.266,0.912,-14.359
|
||||
0.0,0.7,36000,36000,1448.696,1.047,-3.805
|
||||
0.0,0.7,48000,48000,2589.084,0.36,8.324
|
||||
0.0,0.7,60000,60000,4018.602,1.276,30.07
|
||||
0.0,0.75,12000,12000,156.237,0.079,0.0
|
||||
0.0,0.75,24000,24000,601.804,0.166,0.0
|
||||
0.0,0.75,36000,36000,1391.051,0.258,14.434
|
||||
0.0,0.75,48000,48000,2485.407,1.142,23.73
|
||||
0.0,0.75,60000,60000,3865.861,1.259,20.078
|
||||
0.0,0.8,12000,12000,150.844,0.075,0.0
|
||||
0.0,0.8,24000,24000,579.687,0.169,0.0
|
||||
0.0,0.8,36000,36000,1337.54,0.259,6.953
|
||||
0.0,0.8,48000,48000,2393.576,0.365,29.129
|
||||
0.0,0.8,60000,60000,3731.672,1.298,29.992
|
||||
0.0,0.85,12000,12000,146.417,0.077,0.0
|
||||
0.0,0.85,24000,24000,565.317,0.903,-2.0
|
||||
0.0,0.85,36000,36000,1303.856,1.025,7.91
|
||||
0.0,0.85,48000,48000,2328.478,1.158,11.004
|
||||
0.0,0.85,60000,60000,3636.522,1.285,28.184
|
||||
|
|
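To gauge how elapsed time grows with `set_size` in the table above, one can estimate a local growth exponent between consecutive measurements on a log-log scale. The sketch below does this for the five rows with related threshold 0.7; an exponent near 1 would indicate roughly linear growth, and an exponent near 2 would be consistent with quadratic growth. The names `POINTS` and `local_exponents` are chosen for this example.

```python
import math

# (set_size, elapsed_time in seconds) copied from the rows with related_threshold = 0.7.
POINTS = [
    (12000, 162.511),
    (24000, 629.266),
    (36000, 1448.696),
    (48000, 2589.084),
    (60000, 4018.602),
]


def local_exponents(points):
    """Return the log-log slope between each pair of consecutive (size, time) points."""
    exponents = []
    for (n1, t1), (n2, t2) in zip(points, points[1:]):
        exponents.append(math.log(t2 / t1) / math.log(n2 / n1))
    return exponents


if __name__ == "__main__":
    for exponent in local_exponents(POINTS):
        print(f"local growth exponent: {exponent:.2f}")
```

This is a descriptive check on a handful of points, not a fitted complexity model; timings from a single run can carry noticeable noise.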
@@ -0,0 +1,49 @@
|
||||
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
|
||||
0.0,0.7,60000,SigType.WEIGHTED,5355.864,1.44,96.559
|
||||
0.0,0.75,60000,SigType.WEIGHTED,4770.741,1.44,96.559
|
||||
0.0,0.8,60000,SigType.WEIGHTED,4016.552,1.44,96.559
|
||||
0.0,0.85,60000,SigType.WEIGHTED,3652.589,1.44,96.559
|
||||
0.0,0.7,60000,SigType.SKYLINE,5320.789,1.44,96.559
|
||||
0.0,0.75,60000,SigType.SKYLINE,4754.873,1.44,96.559
|
||||
0.0,0.8,60000,SigType.SKYLINE,3993.905,1.44,96.559
|
||||
0.0,0.85,60000,SigType.SKYLINE,3637.896,1.44,96.559
|
||||
0.0,0.7,60000,SigType.DICHOTOMY,5314.17,1.44,96.559
|
||||
0.0,0.75,60000,SigType.DICHOTOMY,4747.451,1.44,96.559
|
||||
0.0,0.8,60000,SigType.DICHOTOMY,3987.966,1.44,96.559
|
||||
0.0,0.85,60000,SigType.DICHOTOMY,3639.406,1.44,96.559
|
||||
0.25,0.7,60000,SigType.WEIGHTED,5286.204,1.44,96.559
|
||||
0.25,0.75,60000,SigType.WEIGHTED,4740.2,1.44,96.559
|
||||
0.25,0.8,60000,SigType.WEIGHTED,3988.353,1.44,96.559
|
||||
0.25,0.85,60000,SigType.WEIGHTED,3621.661,1.44,96.559
|
||||
0.25,0.7,60000,SigType.SKYLINE,5272.151,1.44,96.559
|
||||
0.25,0.75,60000,SigType.SKYLINE,4793.404,1.44,96.559
|
||||
0.25,0.8,60000,SigType.SKYLINE,4270.868,1.44,96.559
|
||||
0.25,0.85,60000,SigType.SKYLINE,3897.66,1.44,96.559
|
||||
0.25,0.7,60000,SigType.DICHOTOMY,5280.093,1.44,96.559
|
||||
0.25,0.75,60000,SigType.DICHOTOMY,4728.997,1.44,96.559
|
||||
0.25,0.8,60000,SigType.DICHOTOMY,3971.004,1.44,96.559
|
||||
0.25,0.85,60000,SigType.DICHOTOMY,3612.607,1.44,96.559
|
||||
0.5,0.7,60000,SigType.WEIGHTED,5191.199,1.44,96.559
|
||||
0.5,0.75,60000,SigType.WEIGHTED,4656.862,1.44,96.559
|
||||
0.5,0.8,60000,SigType.WEIGHTED,3920.386,1.44,96.559
|
||||
0.5,0.85,60000,SigType.WEIGHTED,3580.435,1.44,96.559
|
||||
0.5,0.7,60000,SigType.SKYLINE,5180.493,1.44,96.559
|
||||
0.5,0.75,60000,SigType.SKYLINE,4622.431,1.44,96.559
|
||||
0.5,0.8,60000,SigType.SKYLINE,3871.093,1.44,96.559
|
||||
0.5,0.85,60000,SigType.SKYLINE,3525.577,1.44,96.559
|
||||
0.5,0.7,60000,SigType.DICHOTOMY,5112.984,1.44,96.559
|
||||
0.5,0.75,60000,SigType.DICHOTOMY,4605.999,1.44,96.559
|
||||
0.5,0.8,60000,SigType.DICHOTOMY,3876.706,1.44,96.559
|
||||
0.5,0.85,60000,SigType.DICHOTOMY,3526.946,1.44,96.559
|
||||
0.75,0.7,60000,SigType.WEIGHTED,5031.754,1.44,96.559
|
||||
0.75,0.75,60000,SigType.WEIGHTED,4539.266,1.44,96.559
|
||||
0.75,0.8,60000,SigType.WEIGHTED,3854.313,1.44,96.559
|
||||
0.75,0.85,60000,SigType.WEIGHTED,3529.814,1.44,96.559
|
||||
0.75,0.7,60000,SigType.SKYLINE,5037.338,1.44,96.559
|
||||
0.75,0.75,60000,SigType.SKYLINE,4546.784,1.44,96.559
|
||||
0.75,0.8,60000,SigType.SKYLINE,3843.47,1.44,96.559
|
||||
0.75,0.85,60000,SigType.SKYLINE,3524.44,1.44,96.559
|
||||
0.75,0.7,60000,SigType.DICHOTOMY,5252.169,1.44,96.559
|
||||
0.75,0.75,60000,SigType.DICHOTOMY,4699.463,1.44,96.559
|
||||
0.75,0.8,60000,SigType.DICHOTOMY,3928.414,1.44,96.559
|
||||
0.75,0.85,60000,SigType.DICHOTOMY,3565.332,1.44,96.559
|
||||
|
|
@@ -0,0 +1,13 @@
|
||||
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
|
||||
0.8,0.7,10000,NO FILTER,3180.351,0.686,63.961
|
||||
0.8,0.75,10000,NO FILTER,2729.108,0.686,63.961
|
||||
0.8,0.8,10000,NO FILTER,2185.09,0.686,63.961
|
||||
0.8,0.85,10000,NO FILTER,1542.041,0.686,63.961
|
||||
0.8,0.7,10000,CHECK FILTER,2329.334,0.686,63.961
|
||||
0.8,0.75,10000,CHECK FILTER,2012.022,0.686,63.961
|
||||
0.8,0.8,10000,CHECK FILTER,1609.739,0.686,63.961
|
||||
0.8,0.85,10000,CHECK FILTER,1140.994,0.686,63.961
|
||||
0.8,0.7,10000,NN FILTER,448.129,0.686,63.961
|
||||
0.8,0.75,10000,NN FILTER,388.975,0.686,63.961
|
||||
0.8,0.8,10000,NN FILTER,315.568,0.686,63.961
|
||||
0.8,0.85,10000,NN FILTER,232.207,0.686,63.961
|
||||
|
|
@@ -0,0 +1,13 @@
|
||||
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
|
||||
0.8,0.7,10000,SigType.WEIGHTED,3215.981,0.686,64.16
|
||||
0.8,0.75,10000,SigType.WEIGHTED,2754.485,0.686,64.16
|
||||
0.8,0.8,10000,SigType.WEIGHTED,2201.524,0.686,64.16
|
||||
0.8,0.85,10000,SigType.WEIGHTED,1558.372,0.686,64.16
|
||||
0.8,0.7,10000,SigType.SKYLINE,3200.56,0.686,64.16
|
||||
0.8,0.75,10000,SigType.SKYLINE,2757.303,0.686,64.16
|
||||
0.8,0.8,10000,SigType.SKYLINE,55.38,0.686,64.16
|
||||
0.8,0.85,10000,SigType.SKYLINE,20.134,0.686,64.16
|
||||
0.8,0.7,10000,SigType.DICHOTOMY,3151.663,0.686,64.16
|
||||
0.8,0.75,10000,SigType.DICHOTOMY,2613.546,0.686,64.16
|
||||
0.8,0.8,10000,SigType.DICHOTOMY,52.873,0.686,64.16
|
||||
0.8,0.85,10000,SigType.DICHOTOMY,19.331,0.686,64.16
|
||||
|
|
@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.7,0.7,5000,NO FILTER,3145.41,0.391,28.309
0.7,0.75,5000,NO FILTER,2687.395,0.391,28.309
0.7,0.8,5000,NO FILTER,2244.686,0.391,28.309
0.7,0.85,5000,NO FILTER,1650.297,0.391,28.309
0.7,0.7,5000,CHECK FILTER,4118.279,0.391,28.309
0.7,0.75,5000,CHECK FILTER,3601.918,0.391,28.309
0.7,0.8,5000,CHECK FILTER,2874.443,0.391,28.309
0.7,0.85,5000,CHECK FILTER,2044.612,0.391,28.309
0.7,0.7,5000,NN FILTER,630.678,0.391,28.309
0.7,0.75,5000,NN FILTER,562.722,0.391,28.309
0.7,0.8,5000,NN FILTER,483.175,0.391,28.309
0.7,0.85,5000,NN FILTER,394.221,0.391,28.309
0.75,0.7,5000,NO FILTER,2189.373,0.391,28.309
0.75,0.75,5000,NO FILTER,1891.061,0.391,28.309
0.75,0.8,5000,NO FILTER,1516.5,0.391,28.309
0.75,0.85,5000,NO FILTER,1073.123,0.391,28.309
0.75,0.7,5000,CHECK FILTER,2222.872,0.391,28.309
0.75,0.75,5000,CHECK FILTER,1913.937,0.391,28.309
0.75,0.8,5000,CHECK FILTER,1542.112,0.391,28.309
0.75,0.85,5000,CHECK FILTER,1086.385,0.391,28.309
0.75,0.7,5000,NN FILTER,304.748,0.391,28.309
0.75,0.75,5000,NN FILTER,265.773,0.391,28.309
0.75,0.8,5000,NN FILTER,217.404,0.391,28.309
0.75,0.85,5000,NN FILTER,162.876,0.391,28.309
0.8,0.7,5000,NO FILTER,858.698,0.391,28.309
0.8,0.75,5000,NO FILTER,745.085,0.391,28.309
0.8,0.8,5000,NO FILTER,596.28,0.391,28.309
0.8,0.85,5000,NO FILTER,421.34,0.391,28.309
0.8,0.7,5000,CHECK FILTER,636.886,0.391,28.309
0.8,0.75,5000,CHECK FILTER,550.521,0.391,28.309
0.8,0.8,5000,CHECK FILTER,443.218,0.391,28.309
0.8,0.85,5000,CHECK FILTER,313.208,0.391,28.309
0.8,0.7,5000,NN FILTER,120.012,0.391,28.309
0.8,0.75,5000,NN FILTER,103.497,0.391,28.309
0.8,0.8,5000,NN FILTER,85.033,0.391,28.309
0.8,0.85,5000,NN FILTER,62.035,0.391,28.309
0.85,0.7,5000,NO FILTER,446.251,0.391,28.309
0.85,0.75,5000,NO FILTER,386.611,0.391,28.309
0.85,0.8,5000,NO FILTER,309.98,0.391,28.309
0.85,0.85,5000,NO FILTER,217.511,0.391,28.309
0.85,0.7,5000,CHECK FILTER,364.622,0.391,28.309
0.85,0.75,5000,CHECK FILTER,323.038,0.391,28.309
0.85,0.8,5000,CHECK FILTER,263.697,0.391,28.309
0.85,0.85,5000,CHECK FILTER,184.893,0.391,28.309
0.85,0.7,5000,NN FILTER,72.101,0.391,28.309
0.85,0.75,5000,NN FILTER,62.971,0.391,28.309
0.85,0.8,5000,NN FILTER,51.582,0.391,28.309
0.85,0.85,5000,NN FILTER,35.586,0.391,28.309
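The filter results above are easier to compare as speedups relative to the `NO FILTER` baseline. The following is a minimal sketch, not part of the committed code: the helper name `speedups` and the inline `SAMPLE` string are assumptions made for illustration, and the three data rows are copied from the CSV above (similarity 0.7, relatedness 0.7).

```python
import csv
import io

# Three rows copied from the filter results above; SAMPLE is a hypothetical
# stand-in for reading the real CSV file from disk.
SAMPLE = """similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.7,0.7,5000,NO FILTER,3145.41,0.391,28.309
0.7,0.7,5000,CHECK FILTER,4118.279,0.391,28.309
0.7,0.7,5000,NN FILTER,630.678,0.391,28.309
"""


def speedups(csv_text, baseline="NO FILTER"):
    """Return baseline_time / time for every non-baseline row.

    Keys are (similarity_threshold, related_threshold, label) as strings.
    A value above 1 means the configuration was faster than the baseline.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Index baseline timings by threshold pair.
    base = {}
    for row in rows:
        if row["label"] == baseline:
            key = (row["similarity_threshold"], row["related_threshold"])
            base[key] = float(row["elapsed_time"])

    result = {}
    for row in rows:
        key = (row["similarity_threshold"], row["related_threshold"])
        if row["label"] != baseline and key in base:
            result[(key[0], key[1], row["label"])] = base[key] / float(row["elapsed_time"])
    return result


if __name__ == "__main__":
    for key, value in speedups(SAMPLE).items():
        print(key, round(value, 2))
```

For the sampled rows the ratio for `CHECK FILTER` is below 1, because the check filter alone was slower than no filter at these thresholds, while `NN FILTER` comes out several times faster.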
[4 binary images added: 163 KiB, 151 KiB, 166 KiB, 154 KiB]
@@ -0,0 +1,2 @@
experiment name,elem/set,tokens/elem
String Matching,9.847735909042738,6.878140889891579
@@ -0,0 +1,24 @@
similarity_threshold,related_threshold,source_set_amount,set_size,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.8,0.7,2500,2500,20.354,0.269,9.504
0.8,0.7,5000,5000,113.087,0.29,11.926
0.8,0.7,10000,10000,454.182,0.516,28.629
0.8,0.7,20000,20000,1579.457,1.158,55.445
0.8,0.7,40000,40000,9424.586,2.476,204.078
0.8,0.75,2500,2500,17.996,0.076,0.0
0.8,0.75,5000,5000,99.042,0.311,-0.984
0.8,0.75,10000,10000,397.007,0.666,23.434
0.8,0.75,20000,20000,1373.817,1.13,67.41
0.8,0.75,20000,20000,1360.908,1.176,133.656
0.8,0.8,20000,20000,1108.245,1.121,60.602
0.8,0.85,20000,20000,826.474,0.952,49.543
0.8,0.75,40000,40000,8340.751,2.489,166.387
0.8,0.8,2500,2500,14.693,0.076,0.0
0.8,0.8,5000,5000,81.646,0.299,-4.938
0.8,0.8,10000,10000,324.125,0.526,20.617
0.8,0.8,20000,20000,1114.55,1.36,68.348
0.8,0.8,40000,40000,6704.746,2.626,212.395
0.8,0.85,2500,2500,11.171,0.169,-86.672
0.8,0.85,5000,5000,59.849,0.17,0.0
0.8,0.85,10000,10000,237.911,0.67,35.836
0.8,0.85,20000,20000,825.885,1.155,59.43
0.8,0.85,40000,40000,4904.373,2.558,198.414
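One way to read the scalability rows above is to estimate an empirical growth exponent from a log-log slope. The sketch below is illustrative only: `POINTS`, `scaling_exponent`, and `pairwise_exponents` are hypothetical names, and the data points are the five rows for similarity 0.8 and relatedness 0.7 copied from the CSV above.

```python
import math

# (source_set_amount, elapsed_time in seconds) for similarity 0.8,
# relatedness 0.7, copied from the scalability results above.
POINTS = [
    (2500, 20.354),
    (5000, 113.087),
    (10000, 454.182),
    (20000, 1579.457),
    (40000, 9424.586),
]


def scaling_exponent(points):
    """Slope of log(time) against log(n) between the first and last point."""
    (n0, t0), (n1, t1) = points[0], points[-1]
    return math.log(t1 / t0) / math.log(n1 / n0)


def pairwise_exponents(points):
    """Same slope, computed between each pair of neighbouring points."""
    return [
        math.log(t1 / t0) / math.log(n1 / n0)
        for (n0, t0), (n1, t1) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    print("overall exponent:", round(scaling_exponent(POINTS), 2))
    print("per-step exponents:", [round(e, 2) for e in pairwise_exponents(POINTS)])
```

Between the smallest and largest input the exponent is a little above 2, which is what one would expect from an all-pairs style workload; the per-step values vary, so a single run per size should be read as a rough indication rather than a fitted complexity.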
[binary image added, 221 KiB]
@@ -0,0 +1,49 @@
similarity_threshold,related_threshold,source_set_amount,label,elapsed_time,inverted_index_time,inverted_index_ram_usage
0.7,0.7,5000,SigType.WEIGHTED,3142.821,0.377,28.309
0.7,0.75,5000,SigType.WEIGHTED,2682.225,0.377,28.309
0.7,0.8,5000,SigType.WEIGHTED,2239.236,0.377,28.309
0.7,0.85,5000,SigType.WEIGHTED,1645.65,0.377,28.309
0.7,0.7,5000,SigType.SKYLINE,2961.813,0.377,28.309
0.7,0.75,5000,SigType.SKYLINE,2370.479,0.377,28.309
0.7,0.8,5000,SigType.SKYLINE,1638.268,0.377,28.309
0.7,0.85,5000,SigType.SKYLINE,873.634,0.377,28.309
0.7,0.7,5000,SigType.DICHOTOMY,2971.812,0.377,28.309
0.7,0.75,5000,SigType.DICHOTOMY,2330.936,0.377,28.309
0.7,0.8,5000,SigType.DICHOTOMY,1601.018,0.377,28.309
0.7,0.85,5000,SigType.DICHOTOMY,850.349,0.377,28.309
0.75,0.7,5000,SigType.WEIGHTED,2191.563,0.377,28.309
0.75,0.75,5000,SigType.WEIGHTED,1893.747,0.377,28.309
0.75,0.8,5000,SigType.WEIGHTED,1521.498,0.377,28.309
0.75,0.85,5000,SigType.WEIGHTED,1067.3,0.377,28.309
0.75,0.7,5000,SigType.SKYLINE,2194.143,0.377,28.309
0.75,0.75,5000,SigType.SKYLINE,1885.513,0.377,28.309
0.75,0.8,5000,SigType.SKYLINE,243.915,0.377,28.309
0.75,0.85,5000,SigType.SKYLINE,76.608,0.377,28.309
0.75,0.7,5000,SigType.DICHOTOMY,2193.586,0.377,28.309
0.75,0.75,5000,SigType.DICHOTOMY,1891.043,0.377,28.309
0.75,0.8,5000,SigType.DICHOTOMY,243.111,0.377,28.309
0.75,0.85,5000,SigType.DICHOTOMY,76.276,0.377,28.309
0.8,0.7,5000,SigType.WEIGHTED,863.808,0.377,28.309
0.8,0.75,5000,SigType.WEIGHTED,741.957,0.377,28.309
0.8,0.8,5000,SigType.WEIGHTED,593.837,0.377,28.309
0.8,0.85,5000,SigType.WEIGHTED,417.66,0.377,28.309
0.8,0.7,5000,SigType.SKYLINE,856.288,0.377,28.309
0.8,0.75,5000,SigType.SKYLINE,740.398,0.377,28.309
0.8,0.8,5000,SigType.SKYLINE,16.179,0.377,28.309
0.8,0.85,5000,SigType.SKYLINE,6.176,0.377,28.309
0.8,0.7,5000,SigType.DICHOTOMY,874.258,0.377,28.309
0.8,0.75,5000,SigType.DICHOTOMY,745.18,0.377,28.309
0.8,0.8,5000,SigType.DICHOTOMY,15.214,0.377,28.309
0.8,0.85,5000,SigType.DICHOTOMY,5.962,0.377,28.309
0.85,0.7,5000,SigType.WEIGHTED,428.452,0.377,28.309
0.85,0.75,5000,SigType.WEIGHTED,360.578,0.377,28.309
0.85,0.8,5000,SigType.WEIGHTED,290.172,0.377,28.309
0.85,0.85,5000,SigType.WEIGHTED,200.998,0.377,28.309
0.85,0.7,5000,SigType.SKYLINE,411.051,0.377,28.309
0.85,0.75,5000,SigType.SKYLINE,359.539,0.377,28.309
0.85,0.8,5000,SigType.SKYLINE,285.937,0.377,28.309
0.85,0.85,5000,SigType.SKYLINE,3.169,0.377,28.309
0.85,0.7,5000,SigType.DICHOTOMY,416.254,0.377,28.309
0.85,0.75,5000,SigType.DICHOTOMY,383.443,0.377,28.309
0.85,0.8,5000,SigType.DICHOTOMY,288.102,0.377,28.309
0.85,0.85,5000,SigType.DICHOTOMY,3.283,0.377,28.309
[4 binary images added: 213 KiB, 174 KiB, 182 KiB, 198 KiB]
176
experiments/run.py
Normal file
@@ -0,0 +1,176 @@
# Python
import multiprocessing
from experiments import run_experiment_filter_schemes, run_reduction_experiment, run_scalability_experiment, run_matching_without_silkmoth_inc_dep
import os
from data_loader import DataLoader
from utils import load_sets_from_files
from src.silkmoth.utils import jaccard_similarity, contain, similar, SigType, edit_similarity


def run_experiment_multi(experiment_method, *args):
    experiment_method(*args)


if __name__ == "__main__":
    data_loader = DataLoader("/")

    # Labels for Filter Experiments
    labels_filter = ["NO FILTER", "CHECK FILTER", "NN FILTER"]

    # Labels for Signature Scheme
    labels_sig_schemes = [SigType.WEIGHTED, SigType.SKYLINE, SigType.DICHOTOMY]

    # Labels for Reduction
    labels_reduction = ["REDUCTION", "NO REDUCTION"]

    # Load the datasets for Experiments
    data_path = os.path.join(os.path.dirname(__file__), "data", "dblp", "DBLP_100k.csv")
    source_string_matching = data_loader.load_dblp_titles(data_path)
    source_string_matching = [title.split() for title in source_string_matching]

    try:
        folder_path = os.path.join(os.path.dirname(__file__), "../experiments/data/webtables")
        folder_path = os.path.normpath(folder_path)
        reference_sets_in_dep, source_sets_in_dep = load_sets_from_files(
            folder_path=folder_path,
            reference_file="reference_sets_inclusion_dependency.json",
            source_file="source_sets_inclusion_dependency.json"
        )

        reference_sets_schema_matching, source_sets_schema_matching = load_sets_from_files(
            folder_path=folder_path,
            reference_file="webtable_schemas_sets_500k.json",
            source_file="webtable_schemas_sets_500k.json"
        )
        del reference_sets_schema_matching

        _, github_source_sets_schema_matching = load_sets_from_files(
            folder_path=folder_path,
            reference_file="github_webtable_schemas_sets_500k.json",
            source_file="github_webtable_schemas_sets_500k.json"
        )

    except FileNotFoundError:
        print("Datasets not found. Skipping Experiments.")
        reference_sets_in_dep, source_sets_in_dep = [], []
        source_sets_schema_matching = []
        github_source_sets_schema_matching = []

    # Experiment configuration
    experiment_config = {
        "filter_runs": False,
        "signature_scheme_runs": False,
        "reduction_runs": False,
        "scalability_runs": False,
        "schema_github_webtable_runs": False,
        "inc_dep_without_silkmoth": True
    }

    # Define experiments to run
    experiments = []

    if experiment_config["filter_runs"]:
        # Filter runs
        # String Matching Experiment
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.7, 0.75, 0.8, 0.85],
            labels_filter, source_string_matching[:10_000], None, similar, edit_similarity, False,
            "string_matching_filter", "results/string_matching/"
        ))

        # Schema Matching Experiment
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.0, 0.25, 0.5, 0.75],
            labels_filter, source_sets_schema_matching[:60_000], None, similar, jaccard_similarity, False,
            "schema_matching_filter", "results/schema_matching/"
        ))

        # Inclusion Dependency Experiment
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.0, 0.25, 0.5, 0.75],
            labels_filter, source_sets_in_dep, reference_sets_in_dep[:200], contain, jaccard_similarity, True,
            "inclusion_dependency_filter", "results/inclusion_dependency/"
        ))

    if experiment_config["signature_scheme_runs"]:
        # Signature Scheme Runs
        # String Matching Experiment
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.7, 0.75, 0.8, 0.85],
            labels_sig_schemes, source_string_matching[:10_000], None, similar, edit_similarity, False,
            "string_matching_sig", "results/string_matching/"
        ))

        # Schema Matching Experiment
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.0, 0.25, 0.5, 0.75],
            labels_sig_schemes, source_sets_schema_matching[:60_000], None, similar, jaccard_similarity, False,
            "schema_matching_sig", "results/schema_matching/"
        ))

        # Inclusion Dependency Experiment
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.0, 0.25, 0.5, 0.75],
            labels_sig_schemes, source_sets_in_dep, reference_sets_in_dep[:200], contain, jaccard_similarity, True,
            "inclusion_dependency_sig", "results/inclusion_dependency/"
        ))

    if experiment_config["reduction_runs"]:
        # Reduction Runs
        experiments.append((
            run_reduction_experiment, [0.7, 0.75, 0.8, 0.85], 0.0,
            labels_reduction, source_sets_in_dep, reference_sets_in_dep[:200], contain, jaccard_similarity, True,
            "inclusion_dependency_reduction", "results/inclusion_dependency/"
        ))

    if experiment_config["scalability_runs"]:
        # Scalability Runs
        # String Matching
        experiments.append((
            run_scalability_experiment, [0.7, 0.75, 0.8, 0.85], 0.7, [1_000, 10_000, 100_000],
            source_string_matching[:100_000], None, similar, edit_similarity, False,
            "string_matching_scalability", "results/string_matching/"
        ))

        # Inclusion Dependency
        experiments.append((
            run_scalability_experiment, [0.7, 0.75, 0.8, 0.85], 0.5, [100_000, 200_000, 300_000, 400_000, 500_000],
            source_sets_in_dep, reference_sets_in_dep[:200], contain, jaccard_similarity, True,
            "inclusion_dependency_scalability", "results/inclusion_dependency/"
        ))

        # Schema Matching
        experiments.append((
            run_scalability_experiment, [0.7, 0.75, 0.8, 0.85], 0.0, [12_000, 24_000, 36_000, 48_000, 60_000],
            source_sets_schema_matching[:60_000], None, similar, jaccard_similarity, False,
            "schema_matching_scalability", "results/schema_matching/"
        ))

    if experiment_config["schema_github_webtable_runs"]:
        # Schema Matching with GitHub Webtable Schemas
        experiments.append((
            run_experiment_filter_schemes, [0.7, 0.75, 0.8, 0.85], [0.0, 0.25, 0.5, 0.75],
            labels_filter, source_sets_schema_matching[:10_000], github_source_sets_schema_matching[:10_000], similar, jaccard_similarity, True,
            "github_webtable_schema_matching", "results/schema_matching/"
        ))

    if experiment_config["inc_dep_without_silkmoth"]:
        experiments.append((
            run_matching_without_silkmoth_inc_dep, source_sets_in_dep[:500_000], reference_sets_in_dep[:200], [0.7, 0.75, 0.8, 0.85], 0.5, contain, jaccard_similarity,
            "raw_matching", "results/inclusion_dependency/"
        ))

    # Create and start processes for each experiment
    processes = []
    for experiment in experiments:
        method, *args = experiment
        process = multiprocessing.Process(target=run_experiment_multi, args=(method, *args))
        processes.append(process)
        process.start()

    # Wait for all processes to complete
    for process in processes:
        process.join()
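`run.py` stores each experiment as a tuple of a callable followed by its positional arguments and gates each group behind a flag in `experiment_config`. When debugging a single experiment it can be convenient to exercise the same convention sequentially, without `multiprocessing`. The sketch below assumes nothing about the real experiment functions: `build_experiments`, `run_all`, and `fake_experiment` are hypothetical names introduced only for illustration.

```python
def fake_experiment(name, repeats):
    """Hypothetical stand-in for a real experiment function."""
    return f"{name}:{repeats}"


def build_experiments(config, registry):
    """Select (callable, *args) tuples whose flag is enabled in config.

    registry is a list of (flag_name, experiment_tuple) pairs.
    """
    return [entry for flag, entry in registry if config.get(flag, False)]


def run_all(experiments):
    """Run experiments one after another and collect their return values."""
    results = []
    for method, *args in experiments:  # same unpacking as in run.py
        results.append(method(*args))
    return results


if __name__ == "__main__":
    config = {"filter_runs": True, "scalability_runs": False}
    registry = [
        ("filter_runs", (fake_experiment, "filter", 2)),
        ("scalability_runs", (fake_experiment, "scalability", 3)),
    ]
    print(run_all(build_experiments(config, registry)))
```

Running sequentially trades wall-clock time for simpler tracebacks and lower peak memory, since each `multiprocessing.Process` otherwise holds its own copy of the loaded datasets.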
BIN experiments/silkmoth_results/inclusion_dep_filter.png (new file, 37 KiB)
BIN experiments/silkmoth_results/inclusion_dep_red.png (new file, 30 KiB)
BIN experiments/silkmoth_results/inclusion_dep_scal.png (new file, 53 KiB)
BIN experiments/silkmoth_results/inclusion_dep_sig.png (new file, 47 KiB)
BIN experiments/silkmoth_results/schema_matching_filter.png (new file, 42 KiB)
BIN experiments/silkmoth_results/schema_matching_scal.png (new file, 48 KiB)
BIN experiments/silkmoth_results/schema_matching_sig.png (new file, 42 KiB)
BIN experiments/silkmoth_results/string_matching_filter.png (new file, 44 KiB)
BIN experiments/silkmoth_results/string_matching_scal.png (new file, 51 KiB)
BIN experiments/silkmoth_results/string_matching_sig.png (new file, 53 KiB)
132
experiments/utils.py
Normal file
@@ -0,0 +1,132 @@
from collections import defaultdict

import matplotlib.pyplot as plt
import json
import os
import pandas as pd
import psutil
from src.silkmoth.utils import jaccard_similarity
from src.silkmoth.tokenizer import Tokenizer

def is_convertible_to_number(value):
    try:
        float(value)
        return True
    except (ValueError, TypeError):  # TypeError covers None and other non-numeric types
        return False

def save_sets_to_files(reference_sets, source_sets, reference_file="reference_sets.json", source_file="source_sets.json"):
    """
    Saves reference sets and source sets to their respective JSON files.

    Args:
        reference_sets (list): The reference sets to save.
        source_sets (list): The source sets to save.
        reference_file (str): The file name for saving reference sets.
        source_file (str): The file name for saving source sets.
    """
    with open(reference_file, 'w', encoding='utf-8') as ref_file:
        json.dump(reference_sets, ref_file, ensure_ascii=False, indent=4)

    with open(source_file, 'w', encoding='utf-8') as src_file:
        json.dump(source_sets, src_file, ensure_ascii=False, indent=4)

def load_sets_from_files(folder_path: str, reference_file: str = "reference_sets.json", source_file: str = "source_sets.json") -> tuple[list, list]:
    source_path = os.path.join(folder_path, source_file)
    reference_path = os.path.join(folder_path, reference_file)

    # Check if the files exist
    if not os.path.exists(source_path) or not os.path.exists(reference_path):
        raise FileNotFoundError("One or both of the required files do not exist in the specified folder.")

    # Load the reference sets
    with open(reference_path, 'r', encoding='utf-8') as ref_file:
        reference_sets = json.load(ref_file)
    # Load the source sets
    with open(source_path, 'r', encoding='utf-8') as src_file:
        source_sets = json.load(src_file)

    return reference_sets, source_sets

def measure_ram_usage():
    process = psutil.Process()
    return process.memory_info().rss / (1024 ** 2)  # resident set size of this process in MiB


def plot_elapsed_times(related_thresholds, elapsed_times_list, fig_text, file_name, xlabel=r'$\theta$', ylabel='Time (s)', title=None, legend_labels=None):
    """
    Utility function to plot elapsed times against related thresholds for multiple settings.

    Args:
        related_thresholds (list): Related thresholds (x-axis values).
        elapsed_times_list (list of lists): List of elapsed times (y-axis values) for different settings.
        fig_text (str): Text to display on the figure.
        file_name (str): Name of the file to save the plot.
        xlabel (str): Label for the x-axis.
        ylabel (str): Label for the y-axis.
        title (str): Title of the plot (optional).
        legend_labels (list): List of labels for the legend (optional).
    """
    fig = plt.figure(figsize=(8, 6))

    # Plot each elapsed_times list with a different color and label
    for i, elapsed_times in enumerate(elapsed_times_list):
        label = legend_labels[i] if legend_labels and i < len(legend_labels) else f"Setting {i + 1}"
        plt.plot(related_thresholds, elapsed_times, marker='o', label=label)

    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)

    plt.xticks(related_thresholds)

    if title:
        plt.title(title, fontsize=16)

    plt.grid(True)
    if legend_labels:
        plt.legend(fontsize=12)
    plt.tight_layout()

    # Add figure text
    plt.figtext(0.1, 0.01, fig_text, ha='left', fontsize=10)

    # Save the figure, then close it so repeated calls do not accumulate open figures
    plt.savefig(file_name, bbox_inches='tight', dpi=300)
    plt.close(fig)

def save_experiment_results_to_csv(results, file_name):
    """
    Appends experiment results to a CSV file.

    Args:
        results (dict): One row of results, mapping column names to values.
        file_name (str): Name of the CSV file to save the results.
    """
    df = pd.DataFrame([results])

    # Append to the file if it exists, otherwise create a new file
    df.to_csv(file_name, mode='a', header=not os.path.exists(file_name), index=False)

def calculate_set_ratios(source_set, sim_func):
    tokenizer = Tokenizer(sim_func)

    total_elements = 0
    total_tokens = 0

    for s in source_set:
        total_elements += len(s)
        for element in s:
            total_tokens += len(tokenizer.tokenize(element))

    return total_elements / len(source_set), total_tokens / total_elements

def experiment_set_ratio_calc(source_set, sim_func, folder, experiment_name):
    elem_set, tokens_elem = calculate_set_ratios(source_set, sim_func)
    data = {
        "experiment name": experiment_name,
        "elem/set": elem_set,
        "tokens/elem": tokens_elem,
    }
    save_experiment_results_to_csv(data, folder)
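The two ratios computed by `calculate_set_ratios` (average elements per set, average tokens per element) can be checked on a toy input. The sketch below is a standalone illustration under two assumptions: it uses plain whitespace splitting in place of the project's `Tokenizer`, whose actual tokenization may differ, and it returns zeros for empty input, which the function above does not do. The name `set_ratios` is hypothetical.

```python
def set_ratios(source_sets, tokenize=str.split):
    """Return (elements per set, tokens per element) for a list of sets.

    tokenize is an assumed stand-in for the project's Tokenizer; empty
    inputs yield (0.0, 0.0) instead of raising ZeroDivisionError.
    """
    if not source_sets:
        return 0.0, 0.0

    total_elements = sum(len(s) for s in source_sets)
    if total_elements == 0:
        return 0.0, 0.0

    total_tokens = sum(len(tokenize(element)) for s in source_sets for element in s)
    return total_elements / len(source_sets), total_tokens / total_elements


if __name__ == "__main__":
    toy = [
        ["deep learning", "graph matching"],  # 2 elements, 2 + 2 tokens
        ["set similarity search"],            # 1 element, 3 tokens
    ]
    elem_per_set, tokens_per_elem = set_ratios(toy)
    print(elem_per_set, tokens_per_elem)  # 3 elements / 2 sets, 7 tokens / 3 elements
```

The empty-input guard matters in this project because `run.py` falls back to empty lists when the WebTable files are missing.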