Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Hosted-services / AWS

Find Duplicate Text

5.00/5 (2 votes)
23 Apr 2019CPOL4 min read 5.9K  
Recommended best practices for finding and removing Duplicate Text/Documents

Introduction

Duplicate records cause numerous problems for business and on top of that, it wastes a lot of efforts. For example, if a client wants to find similar records depending on the few columns to eliminate duplicates which can be benefits to reduce processing, quick decision..

Background

Generally, fixing duplicate records is a manual process that is both tedious and costly. Unless all the details are identical, it is hard to say whether records are duplicate or not. Typically, most potential duplicates are false positives.

Database queries for duplicates will not help to find spelling mistakes, typos, changes few values or rephrasing.

This is the case when we need artificial intelligence (AI) to steps in. We can create & train machine learning algorithm using matching score to find duplicate records. Once trained, AI will predict whether or not records are duplicate or not.

AI Model can be build/trained based on customer requirement, here I will focus on Python, Amazon Elastic Search and Azure Search, we shall look at various options to:

  • Match Query
  • Exact search
  • Percentage base score

Before you take any decision, please respond to the below questions:

  • Do you have Machine Learning (Python) skill resources?
  • Do you have the required infrastructure?
  • Do you have post production support team that could manage and fix if required?
  • Do you have skilled tester that can validate machine learning test result?

If any answer from the above question is No, don’t worry. There are powerful AWS and Azure services available which you can leverage and achieve the similar functionality. Below are the links for the services:

Match Query

The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.

Here, I am sharing the Test Result I performed test using Python, Azure Search and Elastic, this will help you to take a decision which one you should choose.

To perform testing, we have rewritten the original text and added more complexity using online tools and perform testing against the original text.

  Python Azure Search Elastic Search
Total Test Performed 45 45 45
Top Result 36 38 44
Performance in sec 10 3-4 2-3

In the result, you can see Elastic search result is more powerful, it is able to search almost all text and also Performance is better than the other two.

Now let's focus on Elastic Search to know how we can perform a different search.

1. This AWS Elastic Search query will help to find the exact match similar to SQL like query.

SQL Query:

SQL
Select ColumnName from Table where Field like ‘%Search Text%’)

Below is Elasticsearch like query:

SQL
GET /_search
{
   "query": {
      "query_string" : {
      "default_field" : "column name",
      "query" : "search query"
     }
   }
}

Executed using Browser:

http://localhost:9200/idea/_search?q="Text Query"

2. Find percentage base score rather than Relevance Score: Elastic search is capable of returning result based on the threshold defined.

SQL
GET /_search
{
   "query":{
      "multi_match" :{
      "query":"Search Query",
      "fields":[
      "ColumnName"
     ],
    "fuzziness":"AUTO",
    "minimum_should_match":"80%"
   }
  }
}

Comparison Python Model Build using Gensim Library vs ElasticSearch

Test SR# Python Custom Result Elasticsearch Result
Test 1 80% Result with > 0.80% Match
Test 2 84% Result with > 0.70% Match
Test 3 88% Result with > 0.80% Match
Test 4 81% Result with > 0.80% Match
Test 5 0 Not Found
Test 6 84% Result with > 0.80% Match
Test 7 0 Not Found
Test 8 81% Result with > 0.75% Match
Test 9 0 Not Found
Test 10 0 Not Found
Test 11 0 Not Found
Test 12 0 Not Found
Test 13 81% Result with > 0.79% Match
Test 14 84% Result with > 0.80% Match
Test 15 88% Result with > 0.80% Match
Test 16 96% Result with > 0.80% Match
Test 17 96% Result with > 0.80% Match
Test 18 91% Result with > 0.76% Match
Test 19 92% Result with > 0.80% Match
Test 20 90% Result with > 0.80% Match
Test 21 89% Result with > 0.80% Match
Test 22 97% Result with > 0.80% Match
Test 23 0 Not Found
Test 24 95% Result with > 0.80% Match
Test 25 83% Result with > 0.80% Match

Conclusion

It is very difficult to say which result is better when comparing python with Elasticsearch. My recommendation is to use Elasticsearch as it is High-Quality recommended and proven system solution in the market since many years.

History

  • 23rd April, 2019: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)