Introduction
Have you ever had a problem like this: I need all documents on my hard drive containing the words Oracle and SQL Server but not MS Access. Do you have Oracle 10g and about 2 hours time? Yes! Then I will show you an easy way to index your documents using Oracle Text. As I am a fan of low tech solutions, everything you need is a text editor and Oracle. We create this solution in three steps:
- Create a table, index, and Stored Procedure using SQL*Plus
- Import the documents with VBScript
- Create a user interface (GUI) for searching the documents
The Oracle part
To start with the Oracle objects, open SQL*Plus. First, we create a new table my_doc with file_name
, upload_date
, and content
as main fields:
CREATE TABLE my_doc (
id NUMBER,
file_name VARCHAR2(255),
upload_date VARCHAR2(10),
filesize VARCHAR2(20),
mime VARCHAR2(50),
content BFILE,
CONSTRAINT doc_pk PRIMARY KEY (id)
);
The field content is defined as type BFILE
. This means the content of the document is not stored in Oracle. This is perfectly OK, as we only need the index. The next thing we will do is: define a directory that will be used as the root folder for the documents that we want to index. Grant read access for this directory to user ctxsys.
CREATE OR REPLACE DIRECTORY documents AS 'C:\_work\datastore\';
GRANT READ ON DIRECTORY documents TO ctxsys;
Next, we create an index of type CTXSYS.CONTEXT
:
CREATE INDEX my_doc_idx ON my_doc(content) INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('SYNC ( ON COMMIT)');
Finally, we create the Stored Procedure put_file
:
CREATE OR REPLACE PROCEDURE put_file
(
p_file_name IN my_doc.file_name%TYPE,
p_upload_date IN my_doc.upload_date%TYPE,
p_filesize IN my_doc.filesize%TYPE,
p_mime IN my_doc.mime%TYPE
) AS
index_name varchar2(20) := 'my_doc_idx';
BEGIN
INSERT INTO my_doc (id, file_name, upload_date,
filesize, mime, content)
VALUES (my_doc_seq.NEXTVAL, p_file_name, p_upload_date,
p_filesize, p_mime,
BFILENAME('DOKUMENTE',p_file_name));
COMMIT;
That's it - we are finished creating the Oracle objects. Now we can test the solution with SQL*Plus:
exec put_file('test.doc', '10.10.2005', '100', 'doc');
When you execute the above statement, be sure that the file test.doc exists in the root folder of the indexed directory. You should also check for errors in the table ctx_user_index_errors using the following statement:
SELECT err_index_name, err_timestamp,err_text FROM
ctx_user_index_errors ORDER BY err_timestamp;
When you have no errors in this table, continue with the VBScript to insert more documents into the table and the index. To check the contents of the index table, you can use the script index_stats.sql that is also in the zipped source files.
Importing the data
Next, we will create a simple VBScript file to import all the documents in a certain folder. First, we open the database connection and get access to the root directory. We insert each file from the directory into the table my_doc using the procedure put_file
. In the end, we close the database connection and cleanup all used objects. Here is the complete source of the VBScript file:
Dim strFolder: strFolder = "C:\_work\datastore\"
Dim strCon: strCon = _
"Driver={Microsoft ODBC for Oracle}; CONNECTSTRING=ORCL; uid=system;pwd=man;"
Dim oCon: Set oCon = WScript.CreateObject("ADODB.Connection")
oCon.Open strCon
Dim oFs: Set oFs = WScript.CreateObject("Scripting.FileSystemObject")
Dim oFolder: Set oFolder = oFs.GetFolder(strFolder)
For Each oFile In oFolder.Files
WScript.Echo oFile.Name
Dim strCmd:
strCmd = "put_file('" & _
oFile.Name & "','" & _
Date() & "','" & _
oFile.Size & "','" & _
oFs.GetExtensionName(oFile) & "')"
oCon.Execute(strCmd)
Next
oCon.Close
Set oFolder = Nothing
Set oFs = Nothing
Set oCon = Nothing
That is all we need to import the data. I made the code as simple as possible. You should walk down the directory structure to import an entire tree of documents. You should also probably check for errors - reading the table ctx_user_index_errors.
The user interface
As I promised in the introduction, we only need Notepad, even for the user interface. So we use an HTA as the user interface for searching in documents. To do this, we create a file called search.hta. An HTA uses the GUI-objects from Internet Explorer, but does not need a web server (like IIS) to be displayed. First, we define the head-section:
<html>
<head>
<title>Indexed files</title>
<HTA:APPLICATION
ID="objTest"
APPLICATIONNAME="HTA Test"
SCROLL="yes"
SINGLEINSTANCE="yes"
>
</head>
Please visit the Microsoft website http://www.microsoft.com/technet/scriptcenter/hubs/htas.mspx as the starting point for more information about HTA. Next, we need a script section to perform the search and display the results within the HTA page.
<SCRIPT LANGUAGE="VBScript">
Sub StartQuery
Dim strCon: strCon = "Driver={Microsoft ODBC for Oracle}; " & _
"CONNECTSTRING=ORCL; uid=system;pwd=man;"
Dim oCon: Set oCon = CreateObject("ADODB.Connection")
Dim oRs: Set oRs = CreateObject("ADODB.Recordset")
oCon.Open strCon
Dim strSql, strRes
strSql = "SELECT SCORE(1), file_name, filesize FROM my_doc WHERE CONTAINS(content," & _
Search.Value & ", 1) > 0 ORDER BY SCORE(1) DESC"
strRes = "<table border=1 cellpadding=10 cellspacing=0><tr><th>found" & _
"</th><th>title</th><th>size</th><th>action</th></tr>"
Set oRs = oCon.Execute(strSql)
While Not oRs.EOF
strRes = strRes & "<tr><td>" & _
oRs.Fields(0).Value & "</td><td>" & _
oRs.Fields(1).Value & "</td><td>" & _
oRs.Fields(2).Value & "</td><td><a href='datastore/" & _
oRs.Fields(1).Value & "' target=_blank>Open in new windows</a>" & _
"</td></tr>"
oRs.MoveNext
Wend
oCon.Close
strRes = strRes & "</table>"
DataArea.InnerHtml = strRes
Set oCon = Nothing
End Sub
</SCRIPT>
Here are some points of interest in the above code:
strSql = "SELECT SCORE(1), file_name, filesize FROM my_doc " & _
WHERE CONTAINS(content," & Search.Value & ", 1) > 0 " & _
ORDER BY SCORE(1) DESC"
This is a query performing a full text search on the content of the files. For example: the where
condition CONTAINS(content, 'Perl', 1)
selects only documents that contain the string Perl and gets the number of hits in the document. The variable Search.Value
is the value of a text box defined later in the body section of the HTA.
DataArea.InnerHtml = strRes
When we are finished formatting the results using HTML tables, we assign the table to 'DataArea.InnerHtml
'. The object DataArea
is also defined in the body section of the HTA. Here, we see the body section (I removed some formatting code like tables and pictures) to keep things easy:
<body>
<input type="text" name="Search" size="60">
<input type="button" value="Start Query"
name="startQuery" onClick="StartQuery"><p>
<p>
<span id = "DataArea"></span>
</body>
In the body section, we define a text area named Search
. The content of Search.Value
acts as the input for the Select
statement. Next, we define a button with caption Start Query with an event handler that calls the subroutine StartQuery
as defined in the script section. Lastly, we need a DataArea
. We assign the formatted output of the query to 'DataArea.InnerHtml
'.
Now we are done - you can start the search interface with a double click from Explorer. See some examples of valid searches in the GUI. For more information on searching in documents, have a look at: http://oraclesvca2.oracle.com/docs/cd/B13789_01/text.101/b10729/toc.htm.
Points of interest
As you can see, it is very simple to create an index for your office documents using Oracle Text. But in this solution, we have only a snapshot of the document contents. What happens when you change a document? The index for this document will no longer be valid. So we should take care of updating the index when a document has changed. I already have an idea of how to solve this problem.