Introduction
Duplicate records cause numerous problems for business and on top of that, it wastes a lot of efforts. For example, if a client wants to find similar records depending on the few columns to eliminate duplicates which can be benefits to reduce processing, quick decision..
Background
Generally, fixing duplicate records is a manual process that is both tedious and costly. Unless all the details are identical, it is hard to say whether records are duplicate or not. Typically, most potential duplicates are false positives.
Database queries for duplicates will not help to find spelling mistakes, typos, changes few values or rephrasing.
This is the case when we need artificial intelligence (AI) to steps in. We can create & train machine learning algorithm using matching score to find duplicate records. Once trained, AI will predict whether or not records are duplicate or not.
AI Model can be build/trained based on customer requirement, here I will focus on Python, Amazon Elastic Search and Azure Search, we shall look at various options to:
- Match Query
- Exact search
- Percentage base score
Before you take any decision, please respond to the below questions:
- Do you have Machine Learning (Python) skill resources?
- Do you have the required infrastructure?
- Do you have post production support team that could manage and fix if required?
- Do you have skilled tester that can validate machine learning test result?
If any answer from the above question is No, don’t worry. There are powerful AWS and Azure services available which you can leverage and achieve the similar functionality. Below are the links for the services:
Match Query
The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
Here, I am sharing the Test Result I performed test using Python, Azure Search and Elastic, this will help you to take a decision which one you should choose.
To perform testing, we have rewritten the original text and added more complexity using online tools and perform testing against the original text.
| Python | Azure Search | Elastic Search |
Total Test Performed | 45 | 45 | 45 |
Top Result | 36 | 38 | 44 |
Performance in sec | 10 | 3-4 | 2-3 |
In the result, you can see Elastic search result is more powerful, it is able to search almost all text and also Performance is better than the other two.
Now let's focus on Elastic Search to know how we can perform a different search.
1. This AWS Elastic Search query will help to find the exact match similar to SQL like query.
SQL Query:
Select ColumnName from Table where Field like ‘%Search Text%’)
Below is Elasticsearch like query:
GET /_search
{
"query": {
"query_string" : {
"default_field" : "column name",
"query" : "search query"
}
}
}
Executed using Browser:
http://localhost:9200/idea/_search?q="Text Query"
2. Find percentage base score rather than Relevance Score: Elastic search is capable of returning result based on the threshold defined.
GET /_search
{
"query":{
"multi_match" :{
"query":"Search Query",
"fields":[
"ColumnName"
],
"fuzziness":"AUTO",
"minimum_should_match":"80%"
}
}
}
Comparison Python Model Build using Gensim Library vs ElasticSearch
Test SR# | Python Custom Result | Elasticsearch Result |
Test 1 | 80% | Result with > 0.80% Match |
Test 2 | 84% | Result with > 0.70% Match |
Test 3 | 88% | Result with > 0.80% Match |
Test 4 | 81% | Result with > 0.80% Match |
Test 5 | 0 | Not Found |
Test 6 | 84% | Result with > 0.80% Match |
Test 7 | 0 | Not Found |
Test 8 | 81% | Result with > 0.75% Match |
Test 9 | 0 | Not Found |
Test 10 | 0 | Not Found |
Test 11 | 0 | Not Found |
Test 12 | 0 | Not Found |
Test 13 | 81% | Result with > 0.79% Match |
Test 14 | 84% | Result with > 0.80% Match |
Test 15 | 88% | Result with > 0.80% Match |
Test 16 | 96% | Result with > 0.80% Match |
Test 17 | 96% | Result with > 0.80% Match |
Test 18 | 91% | Result with > 0.76% Match |
Test 19 | 92% | Result with > 0.80% Match |
Test 20 | 90% | Result with > 0.80% Match |
Test 21 | 89% | Result with > 0.80% Match |
Test 22 | 97% | Result with > 0.80% Match |
Test 23 | 0 | Not Found |
Test 24 | 95% | Result with > 0.80% Match |
Test 25 | 83% | Result with > 0.80% Match |
Conclusion
It is very difficult to say which result is better when comparing python with Elasticsearch. My recommendation is to use Elasticsearch as it is High-Quality recommended and proven system solution in the market since many years.
History
- 23rd April, 2019: Initial version