
Blazing Fast Source Code Search in the Cloud

7 Apr 2015
This blog post shows how you can leverage dtSearch to perform fast searches of data safely stored in the Microsoft Azure cloud.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

With dtSearch and the techniques in this article, your data searches become lightning fast: you can search terabytes of data with sub-second response times.

But first, two preliminary notes about this blog post. (1) The blog post describes source code data, but the same approach would apply to other data stored in the Microsoft Azure cloud: HTML, XML, MS Office documents -- even email data. (2) While the data in this blog post resides in the Microsoft Azure cloud, the indexes are on a local PC. A subsequent article will address data and indexes in the cloud.

Here is a workplan of our overall project:

Overall Workplan

In part one of this article we are going to go to the Azure portal and provision the storage account. Naturally, the assumption is that you have signed up for an Azure account. If you have not, it's relatively easy to sign up for a free trial, so you can see if it meets your needs before you commit your money.

Once you provision your storage account, access keys will be automatically generated. These access keys will be copied into our Visual Studio project, because they are the secret keys that give privileged access to your storage account, the place where we’re going to copy the source code to be indexed and later searched.

Part two of this article will show you where we can get the Visual Studio solution with the starter code. This solution will dramatically reduce the amount of work we actually have to do to implement this useful source code searching application. If you install the full edition of the dtSearch Engine, the starter project actually gets installed in your program files folder.

We will be using Visual Studio 2013 with the latest updates. We will also install the latest Azure Storage SDK binaries.

It's in part three where the real work starts. What we want to do here is build the capability to upload your source code into your storage account. There are various utilities that you can download to perform the task of uploading source code to your storage account, but it will be far more convenient if we can build this into our main searching application. Once we finish this retrofit and upgrade, we can then run the application to upload the source code, index it, and then move to part four of our work plan.

Part four will be fast and easy because we will be pretty much done with the difficult work. Part four is about testing and packaging our application. The index files that get generated could be copied to other client computers. That means we can copy the application along with the generated index files to any computer to perform lightning fast source code searches.

Part 1 - Provisioning at the Azure portal

Provisioning the storage account is actually quite simple. At the time of this writing the traditional Azure portal is the place to go. But after the first week of May 2015, Microsoft will release the new portal.

Portal until 5/5/2015: http://manage.windowsazure.com/
New portal after 5/5/2015: http://portal.azure.com/

Once you log into the Azure portal, it's a simple matter of navigating to the STORAGE menu item and clicking NEW.

Provisioning a storage account at the Azure portal

A QUICK CREATE menu item will become visible. Click on that to continue.

At this point you are ready to provide the URL, location, and the replication mode. The URL you come up with needs to be globally unique. As you can see "mysourcecode" was not taken. I chose "East US" for my location, but you can choose from among the world’s data centers. A closer data center means lower latency. You can read about replication options here: http://blogs.msdn.com/b/windowsazurestorage/archive/2013/12/11/introducing-read-access-geo-replicated-storage-ra-grs-for-windows-azure-storage.aspx.
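
Before typing a name into the portal, it can help to check it against the storage account naming rules (3 to 24 characters, lowercase letters and digits only). The following is a minimal sketch; the class and method names are hypothetical, and it cannot tell you whether the name is already taken, which only the portal can confirm.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StorageAccountNames
{
    // Storage account names: 3 to 24 characters, lowercase letters and digits only.
    // \A and \z anchor the whole string, so a trailing newline is rejected too.
    private static readonly Regex Pattern = new Regex(@"\A[a-z0-9]{3,24}\z");

    // Returns true when the name is syntactically valid.
    // Global uniqueness is NOT checked here.
    public static bool IsValid(string name)
    {
        return name != null && Pattern.IsMatch(name);
    }
}
```

For example, StorageAccountNames.IsValid("mysourcecode") passes, while "MySourceCode" (uppercase letters) and "ab" (too short) are rejected.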

When you are done, click CREATE STORAGE ACCOUNT in the lower right corner. It should take less than five minutes to provision your storage account. It took less than a minute for me when I did it.

Creating your storage account

When the portal indicates that your storage account is ONLINE, you are ready to move forward. Click on the small arrow that's pointing right to drill into the details of this newly provisioned storage account.

The provisioned storage account

You are now ready to copy access keys to the clipboard. Click on MANAGE ACCESS KEYS.

Copying the Access Keys

Click the icon inside the red box to copy the PRIMARY ACCESS KEY to your clipboard, and store it in a safe place along with the STORAGE ACCOUNT NAME. Both your STORAGE ACCOUNT NAME and your PRIMARY ACCESS KEY will be different from what you see here.

Copying the Storage Account Name and the Primary Access Key
Storage Account Name: mysourcecode
Primary Access Key: CnQ6dUXdOQ81qSCFJhscuB3PCNM92o4bIuDoKG7mO7tJ1imxa5sMkzKtnghsG11EwKgxRaTW5g6fFKRcXZ8z6g==

Part 2 - Locating the starter project

The starter project that ships with the dtSearch Engine can be found under the program files folder here:

  • C:\Program Files (x86)\dtSearch Developer\examples\cs4\AzureBlobDemo\AzureBlobDemo.sln

The starter project provides an excellent starting point for us to begin our work. Be sure you are using Visual Studio 2013 with all the latest updates installed.

The project should open up seamlessly, but we want to be sure we have the latest Azure Storage binaries installed. We will right-click in Visual Studio's Solution Explorer and select Manage NuGet Packages.

Adding NuGet Packages

In the upper right search box, type in "Azure Storage." As you would expect, this brings up the Windows Azure Storage client library, which we are going to use to read from and write to the Windows Azure Storage account that we provisioned in Part 1.

Updating the project with latest Azure Storage SDK

In Visual Studio Solution Explorer you can expand the references node to validate that we have the storage client libraries installed.

Validating the Azure Storage Client Libraries

Part 3 - Adding the storage account connection information to app.config

Now is a good time to copy the storage account information into your app.config file. The app.config file provides a convenient location that is globally accessible to your application. It will be accessed at run time. It is not appropriate to ask users to continually provide the connection information every time they use the application.

Modifying App.Config

<?xml version="1.0"?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.0"/>
  </startup>
  <appSettings>
    <add key="StorageAccountName" value="mysourcecode"/>
    <add key="AccessKey" value="CnQ6dUXdOQ81qSCFJhscuB3PCNM92o4bIuDoKG7mO7tJ1imxa5sMkzKtnghsG11EwKgxRaTW5g6fFKRcXZ8z6g=="/>
  </appSettings>
</configuration>

Options for encryption

If you would like to encrypt this information, several options are available.
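
One built-in option is .NET protected configuration, which can encrypt the appSettings section of the application's config file in place. The sketch below is only an illustration: the ConfigProtector class name is hypothetical, and it assumes the project already references System.Configuration. Values read through ConfigurationManager.AppSettings are decrypted transparently, so the rest of the code does not change.

```csharp
using System.Configuration;

public static class ConfigProtector
{
    // Encrypts the <appSettings> section of the running application's
    // config file, unless it has already been encrypted.
    public static void EncryptAppSettings()
    {
        Configuration config =
            ConfigurationManager.OpenExeConfiguration(ConfigurationUserLevel.None);
        ConfigurationSection section = config.GetSection("appSettings");

        if (section != null && !section.SectionInformation.IsProtected)
        {
            // The DPAPI provider ties the encrypted data to this machine
            section.SectionInformation.ProtectSection("DataProtectionConfigurationProvider");
            config.Save(ConfigurationSaveMode.Modified);
        }
    }
}
```

Because the DPAPI provider is machine-specific, a config file encrypted this way cannot simply be copied to another computer; each machine would need to encrypt its own copy.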

Adding support to upload source code to your Azure Storage Account

Our next task is to enhance the starter project to enable source code uploads. Adding this capability directly into the application will dramatically improve usability. In this section, we will add a command button and then write some code.

Here's what the application looks like before our changes. This is MainForm.cs.

Before Adding a button to MainForm.cs

We will now add a third button as seen below. The name of the button is cmdAddCode and its caption (the Text property) reads "Add source code to Azure Storage". You will need to move the index and search buttons down a little to make room for this new third button.

From the designer, double-click the Add source code to Azure Storage button to generate its click event handler and jump to the code.

After Adding a button to MainForm.cs

We will now add some code that will provide the ability to upload source code.

Adding Code-Behind

Next, add a reference to System.Configuration: right-click the References node in Solution Explorer and choose Add Reference. Be sure the check box inside the red box is checked before clicking OK.

Adding a reference to System.Configuration

Be sure that the top of MainForm.cs has the following new statements in place.

The necessary using statements
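
The screenshot is not reproduced here, so the list below is inferred from the types used by the code in this section rather than copied from the original image. Treat it as a likely set of using directives; some of them may already be present in the starter project.

```csharp
using System;
using System.Collections.Generic;           // List<T>
using System.Configuration;                 // ConfigurationManager
using System.IO;                            // FileInfo, Directory, Path, FileMode
using System.Threading.Tasks;               // Parallel, ParallelOptions
using System.Windows.Forms;                 // FolderBrowserDialog, MessageBox
using Microsoft.WindowsAzure.Storage;       // CloudStorageAccount
using Microsoft.WindowsAzure.Storage.Blob;  // CloudBlobClient, CloudBlobContainer
```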

Modifying MainForm.cs

private void cmdAddCode_Click(object sender, EventArgs e)
{
    string windowTitle = this.Text;
    try
    {
        string selectedFolder = null;
        FolderBrowserDialog fDialog = new FolderBrowserDialog();
 
        // Continue only if the user picked a folder and clicked OK
        if (fDialog.ShowDialog() != DialogResult.OK)
        {
            return;
        }
        selectedFolder = fDialog.SelectedPath;
 
        string storageAccountName = ConfigurationManager.AppSettings["StorageAccountName"];
        string accessKey = ConfigurationManager.AppSettings["AccessKey"];
        string connString = string.Format("DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
            storageAccountName, accessKey);
 
        // Parse the connection string and create a client
        var storageAccount = CloudStorageAccount.Parse(connString);
        CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
 
        List<FileInfo> filesToUpload = new List<FileInfo>();
        RecursiveFileUpload(selectedFolder, filesToUpload, "*.*");
        var fileUploadParallelism = new ParallelOptions() {MaxDegreeOfParallelism = 4};
 
        // Upload everything into a single container named "code"
        string blobContainerName = "code";
        CloudBlobContainer container = blobClient.GetContainerReference(blobContainerName);
        container.CreateIfNotExists();
 
        Parallel.ForEach(filesToUpload, fileUploadParallelism, currentFileInfo =>
        {
            // Flatten the local path into a single blob name by replacing
            // backslashes with underscores (C:\src\a.cs becomes C:_src_a.cs)
            string cloudFileNamePath = currentFileInfo.FullName.Replace(@"\", "_").TrimStart('/');
            try
            {
                var blobFileToUpload = container.GetBlockBlobReference(cloudFileNamePath);
                ShowTitle("Uploading..." + currentFileInfo.Name);
                // Skip files that were already uploaded in a previous run
                if (!blobFileToUpload.Exists())
                {
                    blobFileToUpload.UploadFromFile(currentFileInfo.FullName, FileMode.Open, null, null, null);
                }
            }
            catch (Exception exception)
            {
                MessageBox.Show("Issue with  blob upload = " + exception.Message.ToString());
            }
 
        }
        );
 
    }
    finally
    {
        this.Text = windowTitle;
    }
}
delegate void StringParameterDelegate(string value);
public void ShowTitle(string value)
{
    if (InvokeRequired)
    {
        // We're not in the UI thread, so we need to call BeginInvoke
        BeginInvoke(new StringParameterDelegate(ShowTitle), new object[] { value });
        return;
    }
    // Must be on the UI thread if we've got this far
    this.Text = value;
}
private List<FileInfo> RecursiveFileUpload(string sourceDir, List<FileInfo> filesToCopy, string search_type)
{
    if (!sourceDir.EndsWith(Path.DirectorySeparatorChar.ToString()))
    {
        sourceDir += Path.DirectorySeparatorChar;
    }

    // Walk the sub-folders first so that the whole tree is collected
    try
    {
        foreach (string sDir in Directory.GetDirectories(sourceDir))
        {
            RecursiveFileUpload(sDir, filesToCopy, search_type);
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show("Issue with RecursiveFileUpload " + ex.Message);
    }

    // Then collect the files in this folder that match the search pattern
    try
    {
        foreach (string sFile in Directory.GetFiles(sourceDir, search_type))
        {
            // Blob names are limited to 1,024 characters, so skip any
            // path that is too long to be used as a blob name
            if (sFile.Length >= 1024)
                continue;

            try
            {
                filesToCopy.Add(new FileInfo(sFile));
            }
            catch (System.IO.IOException ex)
            {
                MessageBox.Show("Skipping " + sFile + " because of " + ex.Message);
            }
        }
    }
    catch (System.Exception ex)
    {
        // Includes UnauthorizedAccessException for protected folders
        MessageBox.Show("Skipping " + sourceDir + " because of " + ex.Message);
    }
    return filesToCopy;
}
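
The upload code turns each local path into a blob name by replacing backslashes with underscores. The helper below is a small, self-contained sketch of that mapping, plus one possible alternative that keeps the folder structure as "/"-separated virtual directories. The BlobNames class and both method names are hypothetical and are not part of the starter project.

```csharp
using System;

public static class BlobNames
{
    // Mirrors the approach used in cmdAddCode_Click:
    // every backslash in the local path becomes an underscore.
    public static string Flatten(string localPath)
    {
        if (localPath == null) throw new ArgumentNullException("localPath");
        return localPath.Replace('\\', '_');
    }

    // Alternative: make the name relative to the chosen root folder and
    // use "/" so that storage tools display a folder-like hierarchy.
    public static string ToVirtualPath(string rootFolder, string localPath)
    {
        if (rootFolder == null) throw new ArgumentNullException("rootFolder");
        if (localPath == null) throw new ArgumentNullException("localPath");

        string root = rootFolder.TrimEnd('\\') + "\\";
        if (!localPath.StartsWith(root, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("localPath is not under rootFolder");

        return localPath.Substring(root.Length).Replace('\\', '/');
    }
}
```

With these definitions, BlobNames.Flatten(@"C:\src\app\Main.cs") yields C:_src_app_Main.cs, and BlobNames.ToVirtualPath(@"C:\src", @"C:\src\app\Main.cs") yields app/Main.cs. The flat naming keeps every blob at the top level of the container, which suits a simple, non-recursive listing of the container; switching to "/" names would require the listing code to handle virtual directories as well.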

The Rewind() method in the BlobDataSource.cs file needs a small update: the loop over container.ListBlobs() should no longer cast each item to CloudBlob.

// Fixes for BlobDataSource.cs
//
public override bool Rewind()
{
    // Check connection interaction success flag.  If an earlier attempt to
    // connect to the storage failed, then method will not be successful.
    if (_isStorageFailed)
        return false; // failure code - no documents to read           
 
    // Setup the connection to Windows Azure Storage
    try
    {
        // Parse the connection string and create a client
        var storageAccount = CloudStorageAccount.Parse(_connectionString);
        _blobClient = storageAccount.CreateCloudBlobClient();
 
        // Create (or re-create) the blob table
        _blobTable = new Dictionary<string, List<string>>();
 
        // Add all files into the blob table using the container name as the key
        foreach (CloudBlobContainer container in _blobClient.ListContainers())
        {
            // Get the BlobTable key: the container name
            string containerName = container.Name;
 
            // Get the BlobTable value: a list of blob URIs
            List<string> blobURIs = new List<string>();
 
            //List blobs and directories in this container
            var blobs = container.ListBlobs();
 
            // FIX: Used to be foreach (CloudBlob blob in container.ListBlobs())
            foreach (var blobItem in blobs)
            {
                blobURIs.Add(blobItem.Uri.ToString());
                //System.Diagnostics.Debug.WriteLine(blobItem.Uri.ToString());
            }
 
            // Add the entry to the BlobTable                          
            _blobTable.Add(containerName, blobURIs);
        }
 
        // Initialize iterators; fail if not successful
        if (!ResetIterators())
        {
            _isStorageFailed = true;
            return false;
        }
 
        // Set success
        _isStorageFailed = false;
        return true;
    }
    catch (Exception ex)
    {
        // Add diagnostic code here if desired
 
        // Set failure
        _isStorageFailed = true;
        return false;
    }
}
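
If blobs are ever stored with "/" in their names, a plain ListBlobs() call returns virtual directory entries as well as blobs. The following sketch shows one way to list only block blobs, at any depth, using the flat-listing option of the storage client library. It assumes the Microsoft.WindowsAzure.Storage package installed earlier; the BlobListing class and method name are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlobListing
{
    // Returns the URI of every block blob in the container,
    // including blobs that sit inside virtual directories.
    public static List<string> ListBlockBlobUris(CloudBlobContainer container)
    {
        return container
            .ListBlobs(prefix: null, useFlatBlobListing: true)
            .OfType<CloudBlockBlob>()            // ignore anything that is not a block blob
            .Select(blob => blob.Uri.ToString())
            .ToList();
    }
}
```

The returned list could be stored in the blob table in place of the list built by the inner foreach loop, under the same container-name key.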

We also make a small modification to the constructor in AskConnectForm.cs.

With this change, the form always builds the connection string from app.config, so the user doesn't have to type it in each time. Ideally, we could write some code to bypass the AskConnectForm form completely, but I'm trying to avoid too many modifications to keep this post straightforward.

public AskConnectForm()
{
    //
    // Required for Windows Form Designer support
    //
    InitializeComponent();
    // Add the code below
    string storageAccountName = ConfigurationManager.AppSettings["StorageAccountName"];
    string accessKey = ConfigurationManager.AppSettings["AccessKey"];
    string connString = string.Format("DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
        storageAccountName, accessKey);
 
    this.ConnectString.Text = connString;
}
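
The same connection string code now appears in both MainForm.cs and AskConnectForm.cs. As an optional refinement, it could be moved into one small helper that also reports a missing setting clearly. This is a sketch only; the StorageConnection class is hypothetical and not part of the starter project.

```csharp
using System;

public static class StorageConnection
{
    // Builds the Azure Storage connection string from the two app.config values.
    // Throws if either value is missing, so a configuration mistake is
    // reported up front instead of as an obscure storage error later.
    public static string Build(string accountName, string accessKey)
    {
        if (string.IsNullOrWhiteSpace(accountName))
            throw new ArgumentException("StorageAccountName is missing from app.config.");
        if (string.IsNullOrWhiteSpace(accessKey))
            throw new ArgumentException("AccessKey is missing from app.config.");

        return string.Format(
            "DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
            accountName.Trim(), accessKey.Trim());
    }
}
```

Both call sites would then reduce to StorageConnection.Build(ConfigurationManager.AppSettings["StorageAccountName"], ConfigurationManager.AppSettings["AccessKey"]).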

Part 4 - Testing

We are now ready to start testing the application that we just updated. One thing that might be of interest is to verify that we correctly updated our storage account with the source code. I ran the application once and uploaded source code to Azure Storage, as seen in the picture below.

You can download the Azure Storage Explorer for free at the following URL:

http://azurestorageexplorer.codeplex.com/

Once you've installed and configured Azure Storage Explorer, you can go and browse the containers for whatever source code you may have previously uploaded. It also allows you to delete the content should you want to do so.

Tools like Storage Explorer

Although we are adding source code, you can add pretty much any file, such as Word documents or PowerPoint presentations. dtSearch will automatically index many different types of documents.

By the way, the previous code performs the uploads in parallel, and the developer can control the level of concurrency depending on network and system resources.

See the code snippet:

var fileUploadParallelism = new ParallelOptions() {MaxDegreeOfParallelism = 4};
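
To see what this setting does without touching Azure, the sketch below runs a batch of simulated "uploads" (a short sleep stands in for the network call) and records how many were in flight at the same time. The class and method names are hypothetical. Note that Parallel.ForEach returns only after every item has been processed.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelismDemo
{
    // Runs itemCount simulated uploads and returns the highest number
    // that were observed running at the same moment.
    public static int MeasurePeakConcurrency(int itemCount, int maxDegreeOfParallelism)
    {
        object gate = new object();
        int running = 0;
        int peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        Parallel.ForEach(Enumerable.Range(0, itemCount), options, item =>
        {
            lock (gate)
            {
                running++;
                if (running > peak) peak = running;
            }

            Thread.Sleep(20); // stand-in for one file upload

            lock (gate)
            {
                running--;
            }
        });

        return peak;
    }
}
```

Calling ParallelismDemo.MeasurePeakConcurrency(20, 4) never reports more than 4, because MaxDegreeOfParallelism is an upper bound; the scheduler may use fewer threads. Raising the limit can help on a fast connection, while lowering it reduces the load on a slow one.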

Click the highlighted button to upload source code to your Azure Storage account.

Adding code

You can repeat this process for each folder that contains source code you wish to upload. All the files in the folder you pick, and in its sub-folders, are uploaded to Windows Azure Storage.

Selecting a folder that contains source code

When the index is created it will need a location to store the index files.

Enter a valid location and then hit the Index button.

Entering information about the index and creating the index

We already entered the necessary code above to populate this dialog box with the appropriate connection string. You can just hit OK on this dialog box.

Entering the connection string and clicking OK

You will click on two buttons in this dialog box. The first button is Index an Azure storage account. The second button is Search.

Performing the indexing operation and then clicking Search

Entering the search term and hitting Search

Our work is complete. You can now get lightning-quick results when you search for keywords across the source code stored in your Azure Storage account.

If you decide to add more source code to the Azure Storage account, you will need to regenerate the indexes.
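
In the demo application, regenerating the index simply means clicking the Index button again. For readers who want to trigger it from code, the sketch below shows roughly what a full rebuild could look like. It assumes the IndexJob members IndexPath, DataSourceToIndex, ActionCreate, ActionAdd, and Execute() behave as described in the dtSearch Engine documentation, so check the names against your installed version. The Reindexer class is hypothetical, and blobDataSource stands for an instance of the starter project's BlobDataSource, created however the starter project creates it.

```csharp
using dtSearch.Engine;

public static class Reindexer
{
    // Rebuilds the index at indexPath from scratch using the given data source.
    // Returns whatever IndexJob.Execute() reports.
    public static bool Rebuild(string indexPath, DataSource blobDataSource)
    {
        IndexJob job = new IndexJob();
        job.IndexPath = indexPath;               // local folder that holds the index files
        job.DataSourceToIndex = blobDataSource;  // supplies the documents to index
        job.ActionCreate = true;                 // start with a fresh index
        job.ActionAdd = true;                    // add everything the data source returns
        return job.Execute();
    }
}
```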

Viewing the search results

Conclusion

You can now search literally terabytes of source code and get instant search results. One of the core advantages here is that you don't have to store all the source code locally on your own laptop or desktop computer. All the source code can be securely stored up in your Azure Storage account, available only to those that have the access keys.


License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
