Building on My SQL Storage Engines and API’s to conduct complex analysis using SQL query language

SarmadAsghar

4.76/5 (12 votes)

29 Jan 2014CPOL10 min read

23.3K

This will provide opportunity to calculate aggregates, joins on the data retrieved from API’s which was never done.

Download ProcessInfoEngine.zip - 204.2 KB

Introduction

MySQL provides support for many storage engines which manage storage and retrieval. MyISAM and InnoDB are common. On table creation engine can be specified. A basic storage engine allows read only tables. Most engines read data from files and feed to MYSQL. On the other hand, our idea is to retrieve data from API’s and feed it to MYSQL. This will allow us to conduct complex analysis using sql on data retrieved from API’s. The idea is to write an engine for each set of related API’s but this does not mean that we have to write each engine from scratch but we are proposing a generic component which can be used by all storage engine and it will simplify the task of writing engine very much.

Background

The idea came into my mind when I was thinking about doing analysis on running processes in the system. I needed up to date list of running processes in the system on mysql console. Also I needed to conduct complex analysis like count of modules in the process. In order to provide simplistic view of data, we expose tables and write an engine which calls the API’s to retrieve data. Here API’s were already available. A conventional approach could be to create a table and insert into it row for new process and delete row for terminated process.

One drawback of this approach is that you have to insert and delete rows into the table which are time consuming.
Another drawback is that you need to trigger that logging application which inserts and delete rows from the table before querying the table
Another drawback is that even if you need the information about a single process, all processes information is maintained in table
Another drawback is that you can get conflict in process id’s in case one process was terminated but a new process with the same id came for insertion before the deletion of the terminated process.
Another drawback is the table may be locked during insertion or deletion which can create performance bottlenecks for the retrieval entity.

Instead the proposed approach avoids these problems as follows:

No need to insert and delete rows from the table
No need to trigger the logging application before querying the table
No need to get info about all processes when single process info is required
No inconsistency in process id’s as it represents a state of the system
No bottlenecks, as there are no insertion and deletion.

Why is this topic important to our audience and how will they benefit?

A specific problem which the proposed approach solves better than the conventional techniques is that we have a time series data in tables. When a table size reaches a certain limit a new table is created. The requirement was to do analysis on that data using Tableau Business Intelligence tool. The tableau generates sql queries. For convenience of the user a single table was required instead of the underlying many tables. Programmatically queries can be generated to retrieve data for certain duration and display in web pages but tableau cannot utilize those programs. One conventional technique is to replicate data in a single table but this will void the concept behind maintaining multiple tables. Another conventional technique is to use views but the query behind view need to be programmatically generated and this will void the use of view. The proposed technique exposes a single table and uses available API’s to retrieve data from multiple tables. The solution is to write an engine. The simplest engine is to feed all the data and let MySQL to filter but this is highly inefficient. The proposed storage engine builds on the power of mysql query parser. During MYSQL query parsing the where clause is converted to functions/API calls to retrieve only required data to some extent. The idea can be generalized to any other API’s without having to rewrite the core logic of generating API calls.

What are the key points the reader should remember after reading this article?

This idea will solve problems like joining data from completely disparate sources. This will provide opportunity to calculate aggregates, joins on the data retrieved from API’s which was never done. This will allow simplistic picture of data and hide the complex details of database design. This will allow us to avoid time consuming joins by combining two tables into one. This will build on already written API’s to hide the complex queries behind and provide simple tables to user for even more complex analysis.

How our Storage Engine is different from others

Now I will tell how our proposed engines are different from generally available storage engines. Blackhole engine as the name suggests is used when the applications want the information to disappear automatically. Federated storage engine provides a view of remote MYSQL table. Maria storage engine provides millisecond support. XtraDB storage engine is a replacement for innodb. Falcon aims as a better transactional storage engine. PrimeBase XT does not provide durability in ACID compliance. AWSS3 storage engine is used to access data from Amazon Web Services S3 and exposes the data as tables. What is different in our proposed engines as compared to AWSS3 is that we provide a generic component which can be used by all the engines and simplify writing a storage engine. Mdbtools storage engine can be used to move data from Microsoft Access to mysql. Ritmark FS storage allows to upload files on ftp. Q4M provides message queuing functionality.

Proposed Storage Engines

Our proposed storage engine support tables that are populated on the fly and don’t retain any data. The proposed storage engine builds on the power of mysql query parser. During MYSQL query parsing the where clause is converted to functions/API calls. The functions/APIs which don’t use global or static variables and behave like mathematical functions are the candidate for complex analysis. Those functions/APIs return data structures, xml etc. when traversed can be converted into relations (tables). Different functions can be combined to feed data to a table .The functions input parameters (not necessarily all) are passed in the where-clause and functions result is shown as table. For example a storage engine named ProcessInfoEngine is created along with following tables:

SQL

Create table RunningProcessList(
PROCESS NAME  varchar(2048),
Process ID int,
Thread count int,
Parent process ID  int,
Priority Class int,
)engine= ProcessInfoEngine;

Create table ProcessModulesList
(
Process ID int,
MODULE NAME varchar(2048),
Executable varchar(2048),
Baseaddress varchar (255),
Basesize int
) engine= ProcessInfoEngine;

When the user issues a query select * from RunningProcessList the control passes to ProcessInfoEngine which checks the table name and where clause.

C++

int ha_example::rnd_init(bool scan)
{
	DBUG_ENTER("ha_example::rnd_init");

	if(_stricmp(table_share->table_name.str,"RunningProcessList")==0)
	{
		THD *thd=this->ha_thd();
		SELECT_LEX *select_lex=&thd->lex->select_lex;
		m_processListIndex=0;
		if(select_lex->where==0||select_lex->where->RunningProcessListSize==0)
		{
			GetAllProcessList(m_processList);
		}
		else
		{
			GetProcessListSpecifiedInWhereClause(m_processList,select_lex->where->RunningProcessList,select_lex->where->RunningProcessListSize);
		}
	}
	DBUG_RETURN(0);
}

In this case where clause is empty therefore the engine takes a snapshot of all running processes using following win32 api:

CreateToolhelp32Snapshot
Process32First
Process32Next

Note: Following code is adapted from [1]

C++

BOOL GetAllProcessList(vector< runningprocess> &procList )
{
	HANDLE hProcessSnap;
	HANDLE hProcess;
	PROCESSENTRY32 pe32;
	DWORD dwPriorityClass;

	// Take a snapshot of all processes in the system.
	hProcessSnap = CreateToolhelp32Snapshot( TH32CS_SNAPPROCESS, 0 );
	if( hProcessSnap == INVALID_HANDLE_VALUE )
	{
		return( FALSE );
	}

	// Set the size of the structure before using it.
	pe32.dwSize = sizeof( PROCESSENTRY32 );

	// Retrieve information about the first process,
	// and exit if unsuccessful
	if( !Process32First( hProcessSnap, &pe32 ) )
	{
		CloseHandle( hProcessSnap );          // clean the snapshot object
		return( FALSE );
	}

	// Now walk the snapshot of processes
	do
	{
		// Retrieve the priority class.
		dwPriorityClass = 0;
		hProcess = OpenProcess( PROCESS_ALL_ACCESS, FALSE, pe32.th32ProcessID );
		if( hProcess == NULL )
		{
		}
		else
		{
			dwPriorityClass = GetPriorityClass( hProcess );
			CloseHandle( hProcess );
		}
		struct RunningProcess myRunningProcess;
		myRunningProcess.ProcessName=pe32.szExeFile;
		myRunningProcess.ParentProcessID=pe32.th32ParentProcessID;
		myRunningProcess.PriorityClass=dwPriorityClass;
		myRunningProcess.ProcessID=pe32.th32ProcessID;
		myRunningProcess.ThreadCount=pe32.cntThreads;
		procList.push_back(myRunningProcess);
	} while( Process32Next( hProcessSnap, &pe32 ) );

  CloseHandle( hProcessSnap );
  return( TRUE );
}

The engine feeds the process information rows to mysql one by one. MySQL shows the table RunningProcessList data

C++

int ha_example::rnd_next(uchar *buf)
{
	int rc;
	DBUG_ENTER("ha_example::rnd_next");
	MYSQL_READ_ROW_START(table_share->db.str, table_share->table_name.str,
					TRUE);
	if(m_processListIndex<m_processlist.size()) **field="table-">field;	  
		bitmap_set_bit(table->write_set, (*field)->field_index);
		(*field)->set_notnull();
		(*field)->store(m_processList[m_processListIndex].ProcessName.c_str(),strlen(m_processList[m_processListIndex].ProcessName.c_str()), system_charset_info);
		field++;

		bitmap_set_bit(table->write_set, (*field)->field_index);
		(*field)->set_notnull();
		(*field)->store(m_processList[m_processListIndex].ProcessID,true);
		field++;

		bitmap_set_bit(table->write_set, (*field)->field_index);
		if(m_processList[m_processListIndex].ThreadCount!=-1)
		{
			(*field)->set_notnull();
			(*field)->store(m_processList[m_processListIndex].ThreadCount,true);
		}
		else
		{
			(*field)->set_null();
		}
		field++;

		bitmap_set_bit(table->write_set, (*field)->field_index);
		(*field)->set_notnull();
		(*field)->store(m_processList[m_processListIndex].ParentProcessID,true);
		field++;

		bitmap_set_bit(table->write_set, (*field)->field_index);
		(*field)->set_notnull();
		(*field)->store(m_processList[m_processListIndex].PriorityClass,true);
		field++;

		m_processListIndex++;
		rc=0;
	}
	else
	{
		rc= HA_ERR_END_OF_FILE;
	}
	MYSQL_READ_ROW_DONE(rc);
	DBUG_RETURN(rc);
}

PROCESS NAME	Process ID	Thread count	Parent process ID	Priority base
[System Process]	0x00000000	4	0x00000000	0
smss.exe	0x00000124	3	0x00000004	11
csrss.exe	0x00000194	9	0x0000018C	13
wininit.exe	0x00000274	3	0x0000018C	13
services.exe	0x000002B4	7	0x00000274	9
lsass.exe	0x000002CC	7	0x00000274	9
explorer.exe	0x0000062C	30	0x000005CC	8

When user issues a query like select * from RunningProcessList where processid=0x00000000. One option is to feed all running processes to mysql and let mysql filter the other process id’s. Another option is to call OpenProcess API and pass processid=0x00000000 as parameter and feed only one row.

C++

void GetProcessListSpecifiedInWhereClause(vector < RunningProcess > & procList,  int *RunningProcessList,int RunningProcessListSize)
{
	for(int i=0;i < RunningProcessListSize;i++)
	{
		HANDLE hProcess;
		hProcess = OpenProcess( PROCESS_ALL_ACCESS, FALSE,RunningProcessList[i]  );
		if( hProcess != NULL )
		{
			char lpExeName[MAX_PATH];
			DWORD dwSize=MAX_PATH;
			dwSize=QueryFullProcessImageName(hProcess,0,lpExeName,&dwSize);
			if(dwSize==0) continue;
			ULONG_PTR ppid=GetParentProcessId(hProcess);
			DWORD dwPriorityClass;
			dwPriorityClass = GetPriorityClass( hProcess );
			struct RunningProcess myRunningProcess;
			basic_string < char > str=lpExeName;
			basic_string < char   >::size_type index;
			static const basic_string < char > ::size_type npos = -1;
			index=str.find_last_of('\\');
			if ( index != npos )
			{
				myRunningProcess.ProcessName=str.substr(index+1,str.length()-index);
			}
			else
			{
				myRunningProcess.ProcessName=str;
			}
			myRunningProcess.ParentProcessID=ppid;
			myRunningProcess.PriorityClass=dwPriorityClass;
			myRunningProcess.ProcessID=RunningProcessList[i];
			myRunningProcess.ThreadCount=-1;
			procList.push_back(myRunningProcess);
			CloseHandle( hProcess );
		}
	}
}

Note: Following code is taken from [2]

C++

ULONG_PTR GetParentProcessId(HANDLE hProcess) // By Napalm @ NetCore2K
{

	ULONG_PTR pbi[6];
	ULONG ulSize = 0;
	LONG (WINAPI *NtQueryInformationProcess)(HANDLE ProcessHandle, ULONG ProcessInformationClass,
	PVOID ProcessInformation, ULONG ProcessInformationLength, PULONG ReturnLength); 

	*(FARPROC *)&NtQueryInformationProcess = 
	GetProcAddress(LoadLibraryA("NTDLL.DLL"), "NtQueryInformationProcess");

	if(NtQueryInformationProcess)
	{
		if(NtQueryInformationProcess(hProcess, 0,
		&pbi, sizeof(pbi), &ulSize) >= 0 && ulSize == sizeof(pbi))
		return pbi[5];
	}
	return (ULONG_PTR)-1;

}

The result of the query is:

PROCESS NAME	Process ID	Thread count	Parent process ID	Priority base
[System Process]	0x00000000	4	0x00000000	0

Now the question is from where process list specified in where clause come from? The answer is that during SQL query parsing the where clause is converted to running process list. The SQL grammar is augmented with semantic rules. The changes in semantic rules are highlighted in bold below:

C++

expr:
          expr or expr %prec OR_SYM
          {
            /*
              Design notes:
              Do not use a manually maintained stack like thd->lex->xxx_list,
              but use the internal bison stack ($$, $1 and $3) instead.
              Using the bison stack is:
              - more robust to changes in the grammar,
              - guaranteed to be in sync with the parser state,
              - better for performances (no memory allocation).
            */
            Item_cond_or *item1;
            Item_cond_or *item3;
            if (is_cond_or($1))
            {
              item1= (Item_cond_or*) $1;
              if (is_cond_or($3))
              {
                item3= (Item_cond_or*) $3;
                /*
                  (X1 OR X2) OR (Y1 OR Y2) ==> OR (X1, X2, Y1, Y2)
                */
                SemanticRule($1,$3,$3,false); 
                item3->add_at_head(item1->argument_list());
                $$ = $3;
              }
              else
              {
                /*
                  (X1 OR X2) OR Y ==> OR (X1, X2, Y)
                */
                HandleCase($3);
                SemanticRule($1,$3,$1,false);
                item1->add($3);

                $$ = $1;
              }
            }
            else if (is_cond_or($3))
            {
              item3= (Item_cond_or*) $3;
              /*
                X OR (Y1 OR Y2) ==> OR (X, Y1, Y2)
              */
              HandleCase($1);
              SemanticRule($1,$3,$3,false);
              item3->add_at_head($1);
              $$ = $3;
            }
            else
            {
              /* X OR Y */
			  
              $$ = new (YYTHD->mem_root) Item_cond_or($1, $3);
              if ($$ == NULL)
                MYSQL_YYABORT;				
              HandleCase($1);
              HandleCase($3);
              SemanticRule($1,$3,$$,false);
			}
          }
        | expr XOR expr %prec XOR
          {
            /* XOR is a proprietary extension */
            $$ = new (YYTHD->mem_root) Item_func_xor($1, $3);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | expr and expr %prec AND_SYM
          {
            /* See comments in rule expr: expr or expr */
            Item_cond_and *item1;
            Item_cond_and *item3;
            if (is_cond_and($1))
            {
              item1= (Item_cond_and*) $1;
              if (is_cond_and($3))
              {
                item3= (Item_cond_and*) $3;
                /*
                  (X1 AND X2) AND (Y1 AND Y2) ==> AND (X1, X2, Y1, Y2)
                */
                SemanticRule($1,$3,$3,true);
                item3->add_at_head(item1->argument_list());
                $$ = $3;
              }
              else
              {
                /*
                  (X1 AND X2) AND Y ==> AND (X1, X2, Y)
                */
                HandleCase($3);
                SemanticRule($1,$3,$1,true);
                item1->add($3);
                $$ = $1;
              }
            }
            else if (is_cond_and($3))
            {
              item3= (Item_cond_and*) $3;
              /*
                X AND (Y1 AND Y2) ==> AND (X, Y1, Y2)
              */
              HandleCase($1);
              SemanticRule($1,$3,$3,true);
              item3->add_at_head($1);
              $$ = $3;
            }
            else
            {
              /* X AND Y */
              $$ = new (YYTHD->mem_root) Item_cond_and($1, $3);
              if ($$ == NULL)
                MYSQL_YYABORT;
              HandleCase($1);
              HandleCase($3);
              SemanticRule($1,$3,$$,true);
            }
          }
        | NOT_SYM expr %prec NOT_SYM
          {
            $$= negate_expression(YYTHD, $2);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri IS TRUE_SYM %prec IS
          {
            $$= new (YYTHD->mem_root) Item_func_istrue($1);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri IS not TRUE_SYM %prec IS
          {
            $$= new (YYTHD->mem_root) Item_func_isnottrue($1);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri IS FALSE_SYM %prec IS
          {
            $$= new (YYTHD->mem_root) Item_func_isfalse($1);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri IS not FALSE_SYM %prec IS
          {
            $$= new (YYTHD->mem_root) Item_func_isnotfalse($1);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri IS UNKNOWN_SYM %prec IS
          {
            $$= new (YYTHD->mem_root) Item_func_isnull($1);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri IS not UNKNOWN_SYM %prec IS
          {
            $$= new (YYTHD->mem_root) Item_func_isnotnull($1);
            if ($$ == NULL)
              MYSQL_YYABORT;
          }
        | bool_pri
        ;

where_clause:
          /* empty */  { Select->where= 0; }
        | WHERE
          {
            Select->parsing_place= IN_WHERE;
          }
          expr
          {
            SELECT_LEX *select= Select;
            select->where= $3;
            HandleCase($3);
            select->parsing_place= NO_MATTER;
            if ($3)
              $3->top_level_item();
          }
        ;

The following function extracts process id from equal expression For example processid=0x00000000.

C++

void Item_func_eq_case(Item *item)
{
	Item_func * itemFunction=dynamic_cast<item_func*>(item);
	Item **arguments=0;
	if(itemFunction)
	{
		arguments=itemFunction->arguments();
	}
	else
	{
		return;
	}
	Item *field=0;
	Item *value=0;
	if(dynamic_cast <item_field*>(arguments[0]))
	{
		field = arguments[0];
	}
	else
	{
		return;
	}
	if(dynamic_cast <item_field*>(arguments[1]))
	{
		return;
	}
	else
	{
		value = arguments[1];
	}
	
	if(field&&field->item_name.ptr()&&_stricmp(field->item_name.ptr(),"ProcessID")==0)
	{
		item->RunningProcessList=(int*)sql_alloc(sizeof(int)*1);
		item->RunningProcessListSize=1;
		item->RunningProcessList[0]=value->val_int();
	}
}

The following function extracts list of process ids from in expression. For example processid in (0x00000000,0x00000124)

C++

void Item_func_in_case(Item *item)
{
	Item_func * itemFunction=dynamic_cast<item_func*>(item);
	Item **arguments=0;
	int inArgcount=0;
	if(itemFunction)
	{
		arguments=itemFunction->arguments();
		inArgcount=itemFunction->arg_count;
	}
	else
	{
		return;
	}
	Item *field=0;
	Item *value=0;
	if(dynamic_cast <item_field*>(arguments[0]))
	{
		field = arguments[0];
	}
	else
	{
		return;
	}
	if(field&&field->item_name.ptr()&&_stricmp(field->item_name.ptr(),"ProcessID")==0)
	{
		item->RunningProcessList=(int*)sql_alloc(sizeof(int)*inArgcount);
		item->RunningProcessListSize=0;
		LONG index;
		for (index = 1; index < inArgcount; index++)
		{
			if(dynamic_cast < item_field* >(arguments[index]))
			{
				continue;
			}
			else
			{
				value = arguments[index];
			}
			item->RunningProcessList[item->RunningProcessListSize]=value->val_int();
			item->RunningProcessListSize++;
		}
	}
}

The following function calls one of the above functions depending upon type of Item

C++

void HandleCase(Item *item)
{
	if(dynamic_cast<item_func_in*>(item))
	{
		Item_func_in_case(item);
	}
	else if(dynamic_cast<item_func_eq*>(item))
	{
		Item_func_eq_case(item);
	}
}

The following function performs and between the arrays containing running processes. The process id's common to both arrays are in the resulting running process array.

C++

void CombineAnd(Item *arg1,Item *arg2,int *&resultRunningProcessList,int &resultRunningProcessListSize)
{
	if(arg1->RunningProcessListSize+arg2->RunningProcessListSize>0)
	{
		resultRunningProcessList=(int*)sql_alloc(sizeof(int)*(arg1->RunningProcessListSize+arg2->RunningProcessListSize));
		resultRunningProcessListSize=0;
		for(int i=0;i<arg1->RunningProcessListSize;i++)
		{
			for(int j=0;j<arg2->RunningProcessListSize;j++)
			{
				if(arg1->RunningProcessList[i]==arg2->RunningProcessList[j])
				{
					resultRunningProcessList[resultRunningProcessListSize]=arg1->RunningProcessList[i];
					resultRunningProcessListSize++;
					break;
				}
			}
		}
	}

}

The following function performs or between the arrays containing running processes. The process id's common to both arrays are included one time and rest are also the part of the resulting running process array.

C++

void CombineOR(Item *arg1,Item *arg2,int *&resultRunningProcessList,int &resultRunningProcessListSize)
{
	if(arg1->RunningProcessListSize+arg2->RunningProcessListSize>0)
	{
		resultRunningProcessList=(int*)sql_alloc(sizeof(int)*(arg1->RunningProcessListSize+arg2->RunningProcessListSize));
		resultRunningProcessListSize=0;
		for(int i=0;i<arg1->RunningProcessListSize;i++)
		{
			resultRunningProcessList[resultRunningProcessListSize]=arg1->RunningProcessList[i];
			resultRunningProcessListSize++;
		}
		for(int j=0;j<arg2->RunningProcessListSize;j++)
		{
			resultRunningProcessList[rultRunningProcessListSize]=arg2->RunningProcessList[j];
			resultRunningProcessListSize++;
		}
	}
}

The following function calls above functions depending upon the type of the operator (And,Or).

C++

void SemanticRule(Item *arg1,Item *arg2,Item *result,bool isAnd)
{
	int *resultRunningProcessList;
	int resultRunningProcessListSize;
	if(isAnd)
	{
		CombineAnd(arg1,arg2,resultRunningProcessList,resultRunningProcessListSize);
	}
	else
	{
		CombineOR(arg1,arg2,resultRunningProcessList,resultRunningProcessListSize);
	}
	result->RunningProcessList=resultRunningProcessList;
	result->RunningProcessListSize=resultRunningProcessListSize;
}

The user can enter complex queries. During parsing where clause is converted to calls to OpenProcess API and CreateToolhelp32Snapshot, Process32First, Process32Next when needed. For example if a portion of where clause contains process id the OpenProcess API should be called otherwise CreateToolhelp32Snapshot, Process32First, Process32Next should be called. Similarly, When the user issues a query select * from ProcessModulesList the control passes to ProcessInfoEngine which checks the table name and where clause. In this case where clause is empty therefore the engine takes a snapshot of all running processes using following win32 api

CreateToolhelp32Snapshot
Process32First
Process32Next
Module32First
Module32Next

The engine feeds the process module information rows to mysql one by one. MySQL shows the table ProcessModulesList data

Process ID	MODULE NAME	Executable	Base address	Base size
0x00000000	ntdll.dll	C:\Windows\SysWOW64\ntdll.dll	0x77430000	1572864
0x00000000	kernel32.dll	C:\Windows\syswow64\kernel32.dll	0x75000000	1048576
0x00000ED0	EXCEL.EXE	C:\Program Files (x86)\Microsoft Office\Office12\EXCEL.EXE	0x2F8F0000	18411520
0x00000ED0	ntdll.dll	C:\Windows\SysWOW64\ntdll.dll	0x77430000	1572864
0x00000ED0	kernel32.dll	C:\Windows\syswow64\kernel32.dll	0x75000000	1048576
0x00000ED0	ADVAPI32.dll	C:\Windows\syswow64\ADVAPI32.dll	0x76480000	655360
0x00000ED0	msvcrt.dll	C:\Windows\syswow64\msvcrt.dll	0x765E0000	704512

The user can also join RunningProcessList and ProcessModulesList. For example select * from RunningProcessList, ProcessModulesList where RunningProcessList .Process ID= ProcessModulesList .Process ID The engine has nothing to do with the join it is done byMYSQL.

The idea can be extended to API’s available:

Flickr API(almost all the functionality that runs flickr.com is available through the API)
Graph API
foursquare API
Tumblr API
The Remember The Milk API
Google+ API
FriendFeed API
Geolocation API
The Goodreads API

For each source of API’s/functions a storage engine can be created like Flick Storage engine. Flick Storage engine will call Flickr API [3] to retrieve data from Flickr. The user will write MySQL queries to retrieve data from Flickr. For example: Create a table named “Group” with engine=Flickr_Storage_Engine having following columns:

Nsid
Name
Iconfarm
Iconserver
Admin
Eighteenplus
invitation_only
members
pool_count
api_key
user_id
extras

The user writes following query: select * from Group where api_key=’ABCD’ and user_id=1; During MYSQL parsing the above query generates calls to Flickr API. Flickr storage engine calls api flickr.people.getGroups passing api_key and user_id as parameter. The API returns following xml

XML

<group nsid="17274427@N00" name="Cream of the Crop - Please read the rules" iconfarm="1" iconserver="1" admin="0" eighteenplus="0" invitation_only="0" members="11935" pool_count="12522">
  <group nsid="20083316@N00" name="Apple" iconfarm="1" iconserver="1" admin="0" eighteenplus="0" invitation_only="0" members="11776" pool_count="62438">
  <group nsid="34427469792@N01" name="FlickrCentral" iconfarm="1" iconserver="1" admin="0" eighteenplus="0" invitation_only="0" members="168055" pool_count="5280930">
  <group nsid="37718678610@N01" name="Typography and Lettering" iconfarm="1" iconserver="1" admin="0" eighteenplus="0" invitation_only="0" members="17318" pool_count="130169">
</group></group> </group></group></groups>

Flickr engine parses the XML and feeds the rows to mysql. The result of the query is:

Nsid<o:p>	Name<o:p>	Iconfarm<o:p>	Iconserver<o:p>	Admin<o:p>	Eighteenplus<o:p>	invitation_only<o:p>	members<o:p>	pool_count<o:p>
17274427@N00<o:p>	Cream of the Crop - Please read the rules<o:p> <o:p>	1<o:p>	1<o:p>	0<o:p>	0<o:p>	0<o:p>	11935<o:p> <o:p>	12522<o:p> <o:p>
20083316@N00<o:p>	Apple<o:p> <o:p>	1<o:p>	1<o:p>	0<o:p>	0<o:p>	0<o:p>	11776<o:p> <o:p>	62438<o:p> <o:p>
34427469792@N01<o:p> <o:p>	FlickrCentral<o:p> <o:p>	1<o:p>	1<o:p>	0<o:p>	0<o:p>	0<o:p>	168055<o:p> <o:p>	5280930<o:p> <o:p>
37718678610@N01<o:p> <o:p>	Typography and Lettering<o:p> <o:p>	1<o:p>	1<o:p>	0<o:p>	0<o:p>	0<o:p>	17318<o:p> <o:p>	130169<o:p> <o:p>

This is a simple query but user can write complex queries involving joins and aggregates. The user can even join the data retrieved from different API’s.

Where to get the source code:

Download the attachment with this article. It contains following files:

ha_example.cc
ha_example.h
item.cc
item.h
sql_yacc.yy

Download the MySQL source code from the following link: http://sourceforge.net/projects/mysql.mirror/files/MySQL%205.6.13/

Copy ha_example.cc to mysql-5.6.13\storage\example
Copy ha_example.h to mysql-5.6.13\storage\example
Copy item.cc to mysql-5.6.13\sql
Copy item.h to mysql-5.6.13\sql
Copy sql_yacc.yy to mysql-5.6.13\sql

How to build the source code:

Please follow the instruction given at: http://dev.mysql.com/doc/refman/5.6/en/installing-source-distribution.html

Conclusion

We have found a way to harness the power of mysql parser to do analysis on data that was previously never done. This opens up the horizon for new kinds of analysis This approach can be generalized. I want to write a series of articles on the topic

Tutorial on Storage Engines:

For more information: http://dev.mysql.com/doc/internals/en/custom-engine.html

References:

[1]http://msdn.microsoft.com/en-us/library/windows/desktop/ms686701(v=vs.85).aspx
[2] http://stackoverflow.com/questions/185254/how-can-a-win32-process-get-the-pid-of-its-parent
[3]http://www.flickr.com/services/developer/

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)