Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / SQL

Full text search with Oracle Text

0.00/5 (No votes)
8 Nov 2005CPOL4 min read 1   400  
How to do full text search with Oracle Text

Sample Image - ora_fulltext.gif

Introduction

Have you ever had a problem like this: I need all documents on my hard drive containing the words Oracle and SQL Server but not MS Access. Do you have Oracle 10g and about 2 hours time? Yes! Then I will show you an easy way to index your documents using Oracle Text. As I am a fan of low tech solutions, everything you need is a text editor and Oracle. We create this solution in three steps:

  • Create a table, index, and Stored Procedure using SQL*Plus
  • Import the documents with VBScript
  • Create a user interface (GUI) for searching the documents

The Oracle part

To start with the Oracle objects, open SQL*Plus. First, we create a new table my_doc with file_name, upload_date, and content as main fields:

SQL
CREATE TABLE my_doc (
   id NUMBER,
   file_name VARCHAR2(255),
   upload_date VARCHAR2(10),
   filesize VARCHAR2(20),
   mime VARCHAR2(50),
   content BFILE,
 CONSTRAINT doc_pk PRIMARY KEY (id)
);

The field content is defined as type BFILE. This means the content of the document is not stored in Oracle. This is perfectly OK, as we only need the index. The next thing we will do is: define a directory that will be used as the root folder for the documents that we want to index. Grant read access for this directory to user ctxsys.

SQL
CREATE OR REPLACE DIRECTORY documents AS 'C:\_work\datastore\';
GRANT READ ON DIRECTORY documents TO ctxsys;

Next, we create an index of type CTXSYS.CONTEXT:

SQL
CREATE INDEX my_doc_idx ON my_doc(content) INDEXTYPE IS CTXSYS.CONTEXT 
    PARAMETERS ('SYNC ( ON COMMIT)');

Finally, we create the Stored Procedure put_file:

SQL
CREATE OR REPLACE PROCEDURE put_file
 (
  p_file_name  IN  my_doc.file_name%TYPE,
  p_upload_date IN my_doc.upload_date%TYPE,
  p_filesize IN my_doc.filesize%TYPE,
  p_mime IN my_doc.mime%TYPE
) AS
   index_name varchar2(20) := 'my_doc_idx';
 BEGIN
   INSERT INTO my_doc (id, file_name, upload_date,
   filesize, mime, content)
   VALUES (my_doc_seq.NEXTVAL, p_file_name, p_upload_date,
   p_filesize, p_mime,
   BFILENAME('DOKUMENTE',p_file_name));
   COMMIT;

That's it - we are finished creating the Oracle objects. Now we can test the solution with SQL*Plus:

SQL
exec put_file('test.doc', '10.10.2005', '100', 'doc');

When you execute the above statement, be sure that the file test.doc exists in the root folder of the indexed directory. You should also check for errors in the table ctx_user_index_errors using the following statement:

SQL
SELECT err_index_name, err_timestamp,err_text FROM 
       ctx_user_index_errors ORDER BY err_timestamp;

When you have no errors in this table, continue with the VBScript to insert more documents into the table and the index. To check the contents of the index table, you can use the script index_stats.sql that is also in the zipped source files.

Importing the data

Next, we will create a simple VBScript file to import all the documents in a certain folder. First, we open the database connection and get access to the root directory. We insert each file from the directory into the table my_doc using the procedure put_file. In the end, we close the database connection and cleanup all used objects. Here is the complete source of the VBScript file:

VBScript
Dim strFolder: strFolder = "C:\_work\datastore\" ' index documents from this directory
' open database connection
Dim strCon: strCon = _
    "Driver={Microsoft ODBC for Oracle}; CONNECTSTRING=ORCL; uid=system;pwd=man;" 
Dim oCon: Set oCon = WScript.CreateObject("ADODB.Connection")
oCon.Open strCon
' insert record for each file
Dim oFs: Set oFs = WScript.CreateObject("Scripting.FileSystemObject")
Dim oFolder: Set oFolder = oFs.GetFolder(strFolder)
For Each oFile In oFolder.Files
 WScript.Echo oFile.Name
 Dim strCmd:
 strCmd = "put_file('" & _
   oFile.Name & "','" & _
   Date() & "','" & _ 
   oFile.Size & "','" & _ 
   oFs.GetExtensionName(oFile) & "')"
 oCon.Execute(strCmd)
Next
' cleanup
oCon.Close
Set oFolder = Nothing
Set oFs = Nothing
Set oCon = Nothing

That is all we need to import the data. I made the code as simple as possible. You should walk down the directory structure to import an entire tree of documents. You should also probably check for errors - reading the table ctx_user_index_errors.

The user interface

As I promised in the introduction, we only need Notepad, even for the user interface. So we use an HTA as the user interface for searching in documents. To do this, we create a file called search.hta. An HTA uses the GUI-objects from Internet Explorer, but does not need a web server (like IIS) to be displayed. First, we define the head-section:

HTML
<html>
<head>
<title>Indexed files</title>
<HTA:APPLICATION 
     ID="objTest" 
     APPLICATIONNAME="HTA Test"
     SCROLL="yes"
     SINGLEINSTANCE="yes"
>
</head>

Please visit the Microsoft website http://www.microsoft.com/technet/scriptcenter/hubs/htas.mspx as the starting point for more information about HTA. Next, we need a script section to perform the search and display the results within the HTA page.

VBScript
<SCRIPT LANGUAGE="VBScript">
Sub StartQuery
 Dim strCon: strCon = "Driver={Microsoft ODBC for Oracle}; " & _ 
                      "CONNECTSTRING=ORCL; uid=system;pwd=man;"
 Dim oCon: Set oCon = CreateObject("ADODB.Connection")
 Dim oRs: Set oRs = CreateObject("ADODB.Recordset")
 oCon.Open strCon
 Dim strSql, strRes
 strSql = "SELECT SCORE(1), file_name, filesize FROM my_doc WHERE CONTAINS(content," & _
          Search.Value & ", 1) > 0 ORDER BY SCORE(1) DESC"
 strRes = "<table border=1 cellpadding=10 cellspacing=0><tr><th>found" & _ 
          "</th><th>title</th><th>size</th><th>action</th></tr>"
 Set oRs = oCon.Execute(strSql)
 While Not oRs.EOF
  strRes = strRes & "<tr><td>" & _
    oRs.Fields(0).Value & "</td><td>" & _
    oRs.Fields(1).Value & "</td><td>" & _
    oRs.Fields(2).Value & "</td><td><a href='datastore/" & _
    oRs.Fields(1).Value & "' target=_blank>Open in new windows</a>" & _
    "</td></tr>"
  oRs.MoveNext
 Wend
 oCon.Close
 strRes = strRes & "</table>"
 DataArea.InnerHtml = strRes
 Set oCon = Nothing
End Sub
</SCRIPT>

Here are some points of interest in the above code:

VBScript
strSql = "SELECT SCORE(1), file_name, filesize FROM my_doc " & _
    WHERE CONTAINS(content," & Search.Value & ", 1) > 0 " & _
    ORDER BY SCORE(1) DESC"

This is a query performing a full text search on the content of the files. For example: the where condition CONTAINS(content, 'Perl', 1) selects only documents that contain the string Perl and gets the number of hits in the document. The variable Search.Value is the value of a text box defined later in the body section of the HTA.

VBScript
DataArea.InnerHtml = strRes

When we are finished formatting the results using HTML tables, we assign the table to 'DataArea.InnerHtml'. The object DataArea is also defined in the body section of the HTA. Here, we see the body section (I removed some formatting code like tables and pictures) to keep things easy:

HTML
<body>
<input type="text" name="Search" size="60">
<input type="button" value="Start Query" 
      name="startQuery" onClick="StartQuery"><p> 
<p>
<span id = "DataArea"></span>
</body>

In the body section, we define a text area named Search. The content of Search.Value acts as the input for the Select statement. Next, we define a button with caption Start Query with an event handler that calls the subroutine StartQuery as defined in the script section. Lastly, we need a DataArea. We assign the formatted output of the query to 'DataArea.InnerHtml'.

Now we are done - you can start the search interface with a double click from Explorer. See some examples of valid searches in the GUI. For more information on searching in documents, have a look at: http://oraclesvca2.oracle.com/docs/cd/B13789_01/text.101/b10729/toc.htm.

Points of interest

As you can see, it is very simple to create an index for your office documents using Oracle Text. But in this solution, we have only a snapshot of the document contents. What happens when you change a document? The index for this document will no longer be valid. So we should take care of updating the index when a document has changed. I already have an idea of how to solve this problem.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)