Colleagues, good day!
We are writing to ask for your help in verifying the results we have obtained.
We use our own method for deduplication, clustering, and data matching. It produces a numerical similarity score between text fragments (including rows of data tables) without any model training. Based on this score, we decide whether two records match, and perform deduplication and clustering accordingly.
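To make the workflow above concrete, here is a minimal sketch of how a similarity score can drive matching, clustering, and deduplication. It does not show our method: the `similarity` function is a stand-in built on Python's standard `difflib.SequenceMatcher`, and the names `cluster`, `deduplicate`, and the 0.8 threshold are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]. NOT the KnoDL method, illustration only."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(records, threshold=0.8):
    """Group records linked by a pairwise similarity >= threshold (single-link)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; merge the two clusters when the pair matches.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


def deduplicate(records, threshold=0.8):
    """Keep the first record of each cluster."""
    return [group[0] for group in cluster(records, threshold)]


records = [
    "John Smith, 12 Main St",
    "john smith, 12 main st",
    "Alice Brown, 5 Oak Ave",
]
print(cluster(records))
print(deduplicate(records))
```

The all-pairs comparison is quadratic in the number of records, so a sketch like this is only practical for small samples; the threshold is the single decision parameter and would need to be chosen per dataset.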
This is a direct-action algorithm: it is relatively fast and resource-efficient, and it requires no task-specific configuration, which makes it versatile. It can be used to quickly assess previously unexplored data, or in environments where data formats change rapidly while the underlying content does not and where retraining models is too costly. It can also serve as a foundation for personalized desktop data processing systems running on consumer-grade computers.
We would like to evaluate the quality of this algorithm quantitatively, but we have not been able to find a widely accepted method for such an assessment. We also lack well-annotated datasets for evaluating matching quality.
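For concreteness, the kind of measurement we have in mind could be framed as pairwise precision, recall, and F1 against hand-labeled matching pairs. The sketch below is one possible framing and an assumption on our part, not an established benchmark for our method; the function name `pairwise_scores` and the sample record IDs are hypothetical.

```python
def pairwise_scores(predicted_pairs, gold_pairs):
    """Return (precision, recall, F1) of predicted match pairs against labeled ones.

    Each pair is two record IDs; order within a pair does not matter.
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical IDs: the matcher found two pairs, annotators labeled four.
predicted = [(1, 2), (3, 4)]
gold = [(2, 1), (3, 4), (5, 6), (7, 8)]
print(pairwise_scores(predicted, gold))
```

A measurement like this depends entirely on the quality of the labeled pairs, which is why the lack of annotated datasets is the main obstacle for us.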
If you are willing and able to contribute to this work, please get in touch.
Sincerely, The KnoDL Team