Introduction
This article
describes the process of downloading and inserting XML file into a database and
automating the task using two different approaches, i.e., SQL Server Jobs and Windows Task Scheduler.
Background
National Oceanic and Atmospheric
Administration (NOAA)
provides a wide range of public web services which can be accessed
programmatically. Specifically, this link [1] got my attention. NOAA provides access to observed weather
conditions across 1800 locations mostly airports (organized by a station ID) in
the United States. Various weather sensors record the values such as barometric
pressure, wind speed, humidity, temperature, and so on at these locations. The
observations are bundled into a XML file primarily targeted for machine to
machine transfer [2]. For each hour, RSS and XML files are generated for every
observation station. The suggested pick up time is 15 minutes after the hour.
This scenario presented an opportunity to automate the task of downloading the
new XML file which is getting updated every hour. The next section describes
the overview of the application.
Overview
The process is divided into four steps as shown below in
figure 1.
Before I jump into the implementation details, would like to
discuss the boundaries on which the application is designed. The XML file generated
by NOAA is about ~2.25 KB. Normally there should be 24 new files every day
because the file is getting updated every hour but what I observed is file
doesn’t change for hours together. Check out this
link [3] to see the XML file used for the application and I was
interested in the following nodes: ObservationTime, Temp_F, Temp_C, Relative
Humidity, Wind MPH, pressure_mb, dewpoint_F.
The main logic is embedded into the WeatherLogger.exe code
which is step two. This is a console based C# application developed in Visual
Studio 2012 with .NET 4.5 framework. In the next section I will go over the
design strategies incorporated. From the beginning I
wanted to store the values from the XML file to a table in database. For step
three, I wrote two stored procs and created two tables called “Observations”
and “AppErrors” in SQL Server 2012. For the final part, step four, the automation is taken care
in two ways. I used windows task scheduler and SQL Server Job to run the
WeatherLogger.exe every one hour to pick the new file. Both approaches are good
and better when compared with windows service application.
The source code is on GitHub, check out this link [4] Feel free
to pull the code, try it out and I appreciate criticism on the code. If there
is room for improvement, would like to know. The next section describes the
implementation details of step two.
Weather Logger
This was my initial approach which I scrapped. The heavy lifting was done by the console app
and the database portion was light weight. I thought of poling the NOAA server
periodically, download the XML file, parse it, remove the unused nodes and
insert the nodes required into a table. I was able to achieve all of it but the
tricky part was polling the server at a regular interval. I battled through the
Last-Modified and If-Last-Modified http headers [5]. Tuned a timing belt which
runs at the suggested pick time. Although the suggested pick up time was 15
minutes after the hour, the pickup time was not reliable. This messed up my
timing belt, then I resorted for a second option.
For the
second option I decided to go database intensive and keep the console app light
weight. I heard lot of good things about the XML data type and the merge
statement in SQL Server and decided to use it. This choice significantly
reduced the code size and solution became elegant. The main logic is bundled into a stored
procedure CheckAndInsert which takes @xml
file as a parameter.
The idea is convert the XML/URI into C# string and call the
strod proc by passing the string as an argument. The values from the required
nodes are inserted into a #staging table which is identical to the main table
which holds the observations data.
select
current_observation.value('(observation_time)[1]',
'varchar(50)') as ObsTime,
current_observation.value('(temp_f)[1]', 'float') as
Temp_f,
current_observation.value('(temp_c)[1]', 'float') as
Temp_c,
current_observation.value('(relative_humidity)[1]', 'float')
as RelHum,
current_observation.value('(wind_mph)[1]', 'float') as
Wind_mph ,
current_observation.value('(pressure_mb)[1]', 'float')
as Pressure_mb,
current_observation.value('(dewpoint_f)[1]', 'float')
as Dewpt_f
into #Staging
from
@Raw_Xml.nodes('current_observation') as rawdata(current_observation)
Then use merge tsql command by choosing the Observations as target table and #Staging as
source table. Comparison is done on the observation time, which is most
reliable column. If not matched then the data from the XML file is inserted
into the database. Finally the #Staging table is dropped.
;merge into Observations
using #Staging
on #Staging.ObsTime = Observations.ObsTime
when not matched then
insert (ObsTime,Temp_f,Temp_c,RelHum,Wind_mph,Pressure_mb,Dewpt_f)
values
(ObsTime,Temp_f,Temp_c,RelHum,Wind_mph,Pressure_mb,Dewpt_f);
This
approach is scalable and suitable to this scenario as the size of the XML is
really low, the execution time for merge statement is very fast. There is no
need for storing or parsing the XML file. This stored procedure is called in
the InsertData
method from the InsertIntoDb
class. The class diagram for the
console app is shown in figure 2.
There are three main classes in the application, DownloadXML
,
InserIntoDb
, and LogErrors
, the Program
class orchestrates the flow of the
application and it is the entry point. The XML data type in SQL Server is quite
powerful, this data type reads XML file as blob and in C# it should be well formatted
XML file stored as a simple C# string.
The DownloadWeatherXML
method in the DownloadXML
uses the
WebClient
class to download the requested url as a string.
string xml = new WebClient().DownloadString(_url);
string modifiedxml = xml.Replace(" \r\n", "");
If I directly pass this string without replacing the
encoding then I got the following error from the sp_CheckAndInsert stored
procedure. I was unable to resolve it, but it works if I replace the encoding.
"XMLparsing: line 1, character 43, unable to switch the encoding"
LogErrors
is a generic helper class and has a method InsertErrors()
. This method
is used for recording any errors that are caught by any catch block throughout
the code. The data is inserted by InsertIntoAppErros
stored procedure. I was
compelled to use this class after reading the book Joel on Software [6].
Automation
The WeatherLogger.exe has to run every hour, which calls in for an automation. I
wanted to work with both the SQL Server Job and Windows Task Scheduler. I took screenshots
on how to setup the job, check the links below, uploaded them on flickr
Scheduling SQL Server Job
<l>Windows Task Scheduler
Results
I ran the scheduled jobs and task on two different days for about couple hours, below is the screenshot of the observations table.
Conclusion
Overall, I personally feel this approach is
scalable, the choice of executing the exe can be delegated to the system admin
or a database administrator. I have done XML parsing in C# but ease provided by
the merge statement and XML data type from SQL Server is unbeatable. For future, I would like to set up a SSRS web
service and render the report for a client using the observations data. Also,
would like to implement the complete process as SQL CLR stored procedure.
References
[1] http://w1.weather.gov/xml/current_obs/
[2] http://products.weather.gov/PDD/NWS_Current_Observations_RSS_XML.pdf
[3] http://w1.weather.gov/xml/current_obs/KEWR.xml
[4] https://github.com/tkmallik/WeatherLogger.git
[5] http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.ifmodifiedsince.aspx
[6] http://www.amazon.com/Joel-Software-Occasionally-Developers-Designers/dp/1590593898
[7] http://www.flickr.com/photos/90662753@N08/sets/72157632124494853/
[8] http://www.flickr.com/photos/90662753@N08/sets/72157632128725180/