Introduction
This is the first in a series of articles on refactoring Lucene.NET to follow .NET best practices and conventions rather than Java's coding styles and limitations. The articles follow the process from beginning to end and will be available at The Code Project.
I am assuming that you are familiar with Visual Studio 2010, so I will not go into detail on how to set up projects, add references, and perform similar tasks.
Articles in this series:
- Aimee.NET - Refactoring Lucene.NET: Setting up the Project
- Aimee.NET - Faster Unit Tests and Refactoring the Documents Folder
Background
I have been investigating the possible use of Lucene.NET to improve the Search experience for The Code Project community. The Lucene library is an amazing piece of design and implementation for indexing and searching content. It is very easy to use, is very fast, and provides excellent search results. It's just not a design and implementation that uses patterns and classes familiar to .NET developers, and therefore does not get much developer support.
Recently, I read a blog post, Lucene.Net needs your help (or it will die), warning that the Lucene.NET project is in danger because there has been very little activity on the project for many months. As I am currently in the process of incorporating Lucene.NET into The Code Project, I have reviewed the source code and think I understand why this is.
- First of all, the code does not follow .NET conventions and best practices.
- The code is an almost direct copy of the Java source.
- As such, it does not use properties; it has getter and setter methods.
- It does not use enums, but rather a pattern that emulates enums using classes.
- The unit tests take way too much time, rendering them useless for TDD.
- The code does not take advantage of the .NET Framework's library of classes that should be leveraged to improve the performance and code quality.
- The inline documentation is not in XML Documentation format.
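To make the points about properties and enums concrete, here is a minimal sketch contrasting the Java-style shape with an idiomatic .NET equivalent. All type and member names (StoreMode, JavaStyleField, FieldStore, DotNetStyleField) are hypothetical illustrations and are not taken from the Lucene.NET source.

```csharp
using System;

// Java-style port: a class that emulates an enum, plus accessor methods.
// Hypothetical names; not actual Lucene.NET types.
public sealed class StoreMode
{
    public static readonly StoreMode Yes = new StoreMode("YES");
    public static readonly StoreMode No = new StoreMode("NO");

    private readonly string name;
    private StoreMode(string name) { this.name = name; }
    public override string ToString() { return name; }
}

public class JavaStyleField
{
    private float boost = 1.0f;
    private StoreMode store = StoreMode.No;

    public float GetBoost() { return boost; }
    public void SetBoost(float value) { boost = value; }
    public StoreMode GetStore() { return store; }
    public void SetStore(StoreMode value) { store = value; }
}

// Idiomatic .NET: a real enum and auto-implemented properties.
public enum FieldStore { No, Yes }

public class DotNetStyleField
{
    public DotNetStyleField() { Boost = 1.0f; }

    public float Boost { get; set; }
    public FieldStore Store { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var oldField = new JavaStyleField();
        oldField.SetBoost(2.0f);
        oldField.SetStore(StoreMode.Yes);
        Console.WriteLine(oldField.GetBoost() + " " + oldField.GetStore());

        // Properties allow object initializers; enums work in switch statements.
        var newField = new DotNetStyleField { Boost = 2.0f, Store = FieldStore.Yes };
        Console.WriteLine(newField.Boost + " " + newField.Store);
    }
}
```

The second form is shorter, and it also works with object initializers, data binding, and switch statements, which is the kind of benefit this refactoring is after.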
For these reasons, I asked my boss, Chris Maunder, for permission to spend some time cleaning up the code for the .NET developer community, hopefully with the help of the community.
The intent is to produce a code base that is better suited for use and improvement by .NET developers. The conversion will include all files in Revision 1032168 of the Lucene.NET SVN Repository.
Project Goals
- Provide the Code Project Community with the best search experience possible.
- Provide a means for the Community to contribute to this improvement with a public project.
- Provide a better .NET version of the Lucene library for use by the .NET community.
Using the Code
The code is being stored on CodePlex. At various points in the refactoring I, and hopefully a team of CodeProject community members, will be releasing stable versions under the Downloads tab. If you want to get the latest work in progress, all checkins are available under the Source Control tab.
Starting the Project
Choose a Name
The first step in this process was to choose a name for the project, as Lucene is copyrighted by Apache.org.
To keep the name in the Code Project family, I chose Aimee.NET. Aimee is one of the members of our Sales and Marketing team. She has been with The Code Project longer than I have. She has filled the shoes of Receptionist, Account Assistant, Ad Operations, and her current role. As such she is known by all here at The Code Project and knows where all the skeletons are hidden.
Create a CodePlex Project
Creating a project on CodePlex is a very simple process. Just go to the CodePlex homepage, and click on the Create Project button. This will take you through the steps of setting up your project. Since I've already done this, I will not explain this step by step.
Populating the Source Control Repository
I obtained a copy of Lucene.NET from the official Lucene.NET SVN Repository. This includes the source for Lucene.NET, Demos, Contributed Extensions, and Unit Tests. The next step was to check this into the CodePlex Source Code Control.
CodePlex uses Team Foundation Server 2010 for its source and project management. At home, I use the TFS client in Visual Studio 2010 for my CodePlex development. At work, we use SVN. Fortunately, CodePlex supports both clients when accessing Source Control, so I don't have to switch between configurations at work.
After several attempts, I managed to get the source into the repository in a manner I was happy with. A little warning: don't do your initial check-in at 3 in the morning. You need to think about what should go where before you do it.
Converting to .NET 4 and VS 2010
Once I had a good starting point, I created a Visual Studio 2010 solution file and added the various Lucene.NET projects to it. These projects are Visual Studio 2005 projects, so VS2010 brings up its Conversion Wizard to update them to the VS2010 format.
To convert each project to .NET 4, I opened the Property Page for each project and set the Target Framework to .NET Framework 4.
I then changed all the dependencies between projects to be Project Dependencies.
Lucene.NET has two external dependencies for the unit tests: SharpZipLib and NUnit (version 2.5.8). I've placed the required DLLs in a Lib directory and added a Solution Folder called Dependencies. I then changed the references to use the DLLs in that directory.
A little fiddling, and the solution compiles. I checked it back in. You can see this on the right.
Getting the Tests to Run
I'm using two test runners for executing unit tests, the NUnit GUI, and the CodeRush test runner. There are differences in how they determine the AppDomain Base Directory when executing the tests. I modified the Test.nunit config file to set the base directory to be the bin\debug directory, which makes it compatible with older versions of NUnit and the CodeRush test runner.
Additionally, the unit tests have some issues with the System Temp Directory in Windows 7, so I made the following changes:
- Modified the TestBackwardsCompatibility.UnZip method to explicitly use the AppDomain.BaseDirectory.
- Modified the _TestUtil.GetTempDir method to use a "temp" directory in the bin/debug directory.
- Changed all occurrences of System.IO.Path.GetTempPath to calls to _TestUtil.GetTempDirName, a new method that just returns "temp".
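The temp directory change can be sketched roughly as follows. This is an illustration under assumptions, not the actual code: TestUtilSketch is a hypothetical stand-in for _TestUtil, and the real methods in the Lucene.NET tests may have different signatures and return types.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the _TestUtil helper described above.
public static class TestUtilSketch
{
    // The fixed directory name, mirroring the new GetTempDirName method.
    public static string GetTempDirName()
    {
        return "temp";
    }

    // Resolve "temp" under the AppDomain base directory (bin\Debug when the
    // test runner is configured that way) instead of the system temp path.
    public static string GetTempDir()
    {
        string baseDir = AppDomain.CurrentDomain.BaseDirectory;
        string tempDir = Path.Combine(baseDir, GetTempDirName());
        Directory.CreateDirectory(tempDir); // does nothing if it already exists
        return tempDir;
    }
}

public static class Program
{
    public static void Main()
    {
        string dir = TestUtilSketch.GetTempDir();
        Console.WriteLine(dir);
        Console.WriteLine(Directory.Exists(dir));
    }
}
```

Keeping the files under the build output directory means the tests no longer depend on the permissions or contents of the per-user system temp folder.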
I fixed a bug in Test_Search_FieldDoc that caused an infinite loop if an exception was thrown during the test.
In addition, Index.TestTransactions.TestTransactions_Renamed was not terminating. Presumably, it is waiting for all the background threads the test created to end. I've added an [Ignore] attribute so the tests will run. I'll get back to this later.
Store.TestDirectory.TestDirectInstantiation is aborting in the constructor for MMapDirectory. I added an [Ignore] attribute, as it stops the test run, and will fix this later. This issue also exists for Store.TestWindowsMMap.TestMmapIndex, which is also set to [Ignore].
Similarly, the Messages.TestNLS.??? tests appear to be failing because my culture is en-CA, not en-US. I'll also look into this later.
The Search.Function tests were running slowly. A quick review showed that the Test setup method was re-creating a large collection that should have been created once in the Fixture setup method.
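The difference between the two setup methods looks roughly like this in NUnit 2.5. The fixture, its data, and its tests are hypothetical and only illustrate the pattern; they are not the Search.Function tests. The sketch needs a reference to nunit.framework.dll, and in NUnit 3 the attribute is named [OneTimeSetUp] instead of [TestFixtureSetUp].

```csharp
using System.Collections.Generic;
using NUnit.Framework;

// Hypothetical fixture showing one-time versus per-test setup.
[TestFixture]
public class SharedDataSketchTests
{
    private static List<int> sharedData; // expensive to build, read-only in tests
    private int cursor;                  // cheap per-test state

    // Runs once for the whole fixture: build the large collection here.
    [TestFixtureSetUp]
    public void BuildSharedData()
    {
        sharedData = new List<int>();
        for (int i = 0; i < 100000; i++)
        {
            sharedData.Add(i);
        }
    }

    // Runs before every test: keep this cheap.
    [SetUp]
    public void ResetPerTestState()
    {
        cursor = 0;
    }

    [Test]
    public void FirstItemIsZero()
    {
        Assert.AreEqual(0, sharedData[cursor]);
    }

    [Test]
    public void CollectionHasExpectedSize()
    {
        Assert.AreEqual(100000, sharedData.Count);
    }
}
```

Sharing the collection is only safe when no test modifies it; anything a test mutates still belongs in the per-test setup.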
Once all the tests ran, I checked everything back in. We are now confident that we have a project that compiles and functions as expected, except for a few minor issues.
Speeding up the Tests
Now that we have a set of unit tests, we can try to make them useful. Currently, the execution time is too long to be usable for TDD-style work, which is really necessary for refactoring. We need to know at each refactoring step that we have not broken any functionality.
Running the unit tests in the NUnit GUI results in the following status.
This is a duration of 45 minutes on a fairly powerful machine. This is clearly unacceptable for TDD-style development or refactoring, so the tests need to be improved and a meaningful subset must be selected to get the total duration to under a minute.
To determine what tests need performance improvements, I saved the test results into an XML file, which you can get here. This file includes execution times for each test and group of tests. I've found that LinqPad is a great tool for analyzing data from almost any source. Entering the following code into LinqPad and executing the code results in two tables. The first is all test cases that take more than 5 seconds to run, sorted by time descending. The second is all test suites that take more than 10 seconds to run, sorted by time descending.
// Load the NUnit result file saved from the test run.
var doc = new XmlDocument();
doc.Load(@"C:\Users\matthew.CODEPROJECT\Documents\Work\Aimee.Net\Article1 - Starting the Project\TestResult.xml");

// Project every executed test case to its suite, name, and duration in seconds.
var testcases = doc.SelectNodes("descendant::test-case");
var testdata = (from XmlNode c in testcases
                where c.Attributes["executed"].Value == "True"
                select new {
                    Suite = c.ParentNode.ParentNode.Attributes["name"].Value,
                    Name = c.Attributes["name"].InnerText,
                    Time = float.Parse(c.Attributes["time"].Value)})
               .OrderByDescending(d => d.Time);

// Table 1: test cases that take 5 seconds or more, slowest first.
testdata.Where(td => td.Time >= 5).Dump();

// Table 2: test suites whose cases total 10 seconds or more, slowest first.
testdata.GroupBy(td => td.Suite)
        .Select(grp => new { Suite = grp.Key, Time = grp.Sum(td => td.Time) })
        .Where(ts => ts.Time >= 10)
        .OrderByDescending(su => su.Time)
        .Dump();
This allows me to look at the test suites and cases with the longest durations. I will tackle some of these and report back in my next article on the changes I make in my quest for a set of tests that will allow a TDD approach to the refactoring.
History
- November 18, 2010 - First release
- December 8, 2010 - Added links to other articles in series