
Lightning-Fast Troubleshooting for Your Cloud Applications

Jon Rooney

6 Mar 2013

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

Cloud service providers like Microsoft® Windows® Azure, Amazon® Web Services, Heroku®, Rackspace® and Google® App Engine make it fast and easy to deploy and run applications in the cloud. However, running applications in the cloud comes with a potential loss of visibility and control. While logs and metrics can provide some insight into how your application is doing, this data is often in different formats scattered in different places, making it difficult to gain full visibility into your operations.

How can you best troubleshoot, monitor and proactively manage your applications in the cloud?

The Solution

Splunk Storm provides a technology-agnostic approach to monitoring and managing cloud applications. Harness data from every tier of your application, trace transactions across multiple hops, correlate application events with infrastructure or user experience issues, and prevent outages before they impact the business. Splunk Storm delivers the industry-leading Splunk software as a service: it indexes and stores machine data in real time from virtually any source, format, platform, or cloud provider without the need for custom parsers or connectors. Whether your application is written in Ruby, Java, Python, PHP, .NET, Node.js or any other language or framework, you can send data to Splunk Storm for indexing and searching over network streams such as syslog, through a REST API, or via the universal forwarder.
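As a rough illustration of the "network streams such as syslog" option, here is a minimal Python sketch that formats an RFC 3164-style syslog line and writes it to a TCP endpoint. The function names, the hostname "web01", the tag "flowershop", and the endpoint "logs.example.com:514" are all placeholders invented for this example, not values taken from Splunk Storm; substitute whatever host and port your own project is configured to accept.

```python
import socket
from datetime import datetime


def format_syslog(message, hostname="web01", app="flowershop",
                  facility=1, severity=6, now=None):
    """Build an RFC 3164-style line: <PRI>TIMESTAMP HOST TAG: MESSAGE.

    hostname and app are hypothetical defaults for this sketch.
    facility=1 is "user-level", severity=6 is "informational".
    """
    pri = facility * 8 + severity
    now = now or datetime.now()
    # RFC 3164 pads a single-digit day with a space, not a zero.
    timestamp = f"{now.strftime('%b')} {now.day:2d} {now.strftime('%H:%M:%S')}"
    return f"<{pri}>{timestamp} {hostname} {app}: {message}"


def send_syslog(line, host, port):
    """Send one newline-terminated syslog line over TCP.

    host and port are assumptions supplied by the caller; nothing
    here is specific to any one log service.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8") + b"\n")


# Formatting only; no network traffic happens here.
print(format_syslog("checkout failed for order 1234"))

# To actually ship the line, call send_syslog with your own endpoint:
# send_syslog(format_syslog("checkout failed"), "logs.example.com", 514)
```

In a real application you would more likely attach a handler such as Python's logging.handlers.SysLogHandler to your existing logger than hand-format each line; the sketch only shows what travels over the wire.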

Let’s Walk Through the Example

We’ll put some sample Apache access and MySQL logs into Splunk Storm and see how quickly we can troubleshoot issues. First, a little setup:

Sign up for Splunk Storm at https://www.splunkstorm.com/
Create a new project

Now add data to your new project. Although you can also send data to Splunk Storm projects over TCP/UDP (including syslog), via a REST API or via a Splunk Universal Forwarder, we’ll manually upload some sample data for this exercise. First, download the sample data (Apache Web server logs and MySQL database logs from a hypothetical online flower shop) from sampledata.zip.
Once you download and unzip sampledata.zip, you'll see three folders, each containing an Apache "access_combined" log file, and one Mysql folder containing a MySQL log file.
Now, back in the browser, you’ll see the Inputs page.
Click on Files
Click the "Upload" button
Browse to the access_combined.log file in apache1.splunk.com and choose it.
For each of the log files, choose a source type. Specifying a source type tells Storm how to parse your data, and lets you group all the data of a certain type together when searching. When you add your own data to Storm, specify the right source type so that Storm extracts timestamps and breaks your data into events correctly.
For the Apache access logs, choose Apache web access logs and click the "Upload" button.
For the MySQL log file, choose Generic single-line data and click the "Upload" button.
Repeat the upload process until all of the sample data is in your project.
Once the data is added to your project, the "Explore data" button will become enabled. Click it!
Now that you have the data in a Splunk Storm project, see how quickly you can troubleshoot your applications. Let’s say you receive a call from a customer who keeps hitting a server error when he tries to complete a purchase on your company’s online flower shop. He gives you his IP address: 10.2.1.44.
Everything in Splunk Storm is searchable, so you just type "10.2.1.44" into the search bar, hit enter, and you will see all of this customer's traffic to your shop.
You see a lot of 200 response codes, but you're only interested in errors. Filter out the 200 success responses by adding "NOT 200" to the search, which narrows the list down to the events that are not successes.
Notice that each of the events appears on the timeline below the search bar. Double-clicking a bar re-runs the search over a smaller time range, so if you drill down a bit you can get to the one-minute window when one of the errors occurred.
Knowing the time window of one of the errors, you broaden the search to include everything that was happening in the application around the same time. You get a handful of database errors, which are pretty good leads to the root cause of the server error the customer is seeing.
Finding the root cause of issues from production logs saves long hours of trying to reproduce bugs in a development environment, so you can fix issues quickly and keep your users happy.
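The search steps above (find one client's traffic, drop the successes, then look at everything around the time of an error) can be sketched offline in Python against Apache combined-format lines. This is not how Splunk Storm works internally; it is only an illustration of the reasoning. The function names and the four sample log lines are invented for the example and are not taken from sampledata.zip. Note also that the sketch compares the parsed status field, which is stricter than a bare "NOT 200" keyword search, since the keyword search excludes any event containing the term 200 anywhere.

```python
import re
from datetime import datetime, timedelta

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+'
)


def parse_access_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "request": m.group("request"),
        "status": int(m.group("status")),
    }


def errors_for_client(lines, ip):
    """Steps 1 and 2: one client's traffic, minus the 200 responses."""
    events = (parse_access_line(line) for line in lines)
    return [e for e in events if e and e["ip"] == ip and e["status"] != 200]


def events_near(events, when, window=timedelta(minutes=1)):
    """Step 3: every parsed event within `window` of a given moment."""
    return [e for e in events if e and abs(e["time"] - when) <= window]


# Hypothetical sample lines, made up for this sketch.
SAMPLE = [
    '10.2.1.44 - - [06/Mar/2013:10:15:02 -0800] "GET /flowers HTTP/1.1" 200 3120 "-" "Mozilla/5.0"',
    '10.2.1.44 - - [06/Mar/2013:10:15:40 -0800] "POST /cart/checkout HTTP/1.1" 500 512 "-" "Mozilla/5.0"',
    '10.2.1.35 - - [06/Mar/2013:10:15:45 -0800] "GET /flowers HTTP/1.1" 200 3120 "-" "Mozilla/5.0"',
    '10.2.1.35 - - [06/Mar/2013:11:30:00 -0800] "GET /about HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]

errors = errors_for_client(SAMPLE, "10.2.1.44")
all_events = [parse_access_line(line) for line in SAMPLE]
for error in errors:
    nearby = events_near(all_events, error["time"])
    print(error["status"], error["request"], "events nearby:", len(nearby))
```

With real data you would feed the same time window into your database logs as well, which is the step that surfaces the database errors in the walkthrough.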

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
