The easiest way to check the content of a web page is to download it locally (to a file) and search through it. To accomplish this, we look at the URLDownloadToFile function. The article then looks at the architecture of IntelliLink.
Introduction
Have you ever owned a website? Have you done sysadmin work for somebody else? Have you taken part in a link exchange? If so, you probably want to monitor the internal and external links to and from your site.
Background
The easiest way to check the contents of a web page is to download it locally (to a file) and search through it. To accomplish this, we use the URLDownloadToFile function, which has the following syntax:
HRESULT URLDownloadToFile(
    LPUNKNOWN pCaller,
    LPCTSTR szURL,
    LPCTSTR szFileName,
    _Reserved_ DWORD dwReserved,
    LPBINDSTATUSCALLBACK lpfnCB);
Parameters
- pCaller: A pointer to the controlling IUnknown interface of the calling ActiveX component, if the caller is an ActiveX component. If the calling application is not an ActiveX component, this value can be set to NULL. Otherwise, the caller is a COM object that is contained in another component, such as an ActiveX control in the context of an HTML page. This parameter represents the outermost IUnknown of the calling component. The function attempts the download in the context of the ActiveX client framework, and allows the caller container to receive callbacks on the progress of the download.
- szURL: A pointer to a string value that contains the URL to download. Cannot be set to NULL. If the URL is invalid, INET_E_DOWNLOAD_FAILURE is returned.
- szFileName: A pointer to a string value containing the name or full path of the file to create for the download. If szFileName includes a path, the target directory must already exist.
- dwReserved: Reserved. Must be set to 0.
- lpfnCB: A pointer to the IBindStatusCallback interface of the caller. By using IBindStatusCallback::OnProgress, a caller can receive download status. URLDownloadToFile calls the IBindStatusCallback::OnProgress and IBindStatusCallback::OnDataAvailable methods as data is received. The download operation can be cancelled by returning E_ABORT from any callback. This parameter can be set to NULL if status is not required.
Return Value
This function can return one of these values:
- S_OK: The download started successfully.
- E_OUTOFMEMORY: The buffer length is invalid, or there is insufficient memory to complete the operation.
- INET_E_DOWNLOAD_FAILURE: The specified resource or callback interface was invalid.
So, our implementation using the above function will be:
BOOL ProcessHTML(CString strFileName, CString strSourceURL,
    CString strTargetURL, CString strURLName)
{
    CString strURL;
    CString strFileLine;
    CString strLineMark;
    BOOL bRetVal = FALSE;
    try
    {
        // Read the downloaded page line by line
        CStdioFile pInputFile(strFileName, CFile::modeRead | CFile::typeText);
        while (pInputFile.ReadString(strFileLine))
        {
            // Inspect every href="..." attribute found on the current line
            int nIndex = strFileLine.Find(_T("href="), 0);
            while (nIndex >= 0)
            {
                const int nFirst = strFileLine.Find(_T('\"'), nIndex);
                if (nFirst >= 0)
                {
                    const int nLast = strFileLine.Find(_T('\"'), nFirst + 1);
                    if (nLast >= 0)
                    {
                        // The URL is the text between the two double quotes
                        strURL = strFileLine.Mid(nFirst + 1, nLast - nFirst - 1);
                        if (strURL.CompareNoCase(strTargetURL) == 0)
                        {
                            TRACE(_T("URL found - %s\n"), strTargetURL);
                            // The link text must follow on the same line as >name<
                            strLineMark.Format(_T(">%s<"), strURLName);
                            if (strFileLine.Find(strLineMark, nLast + 1) >= 0)
                            {
                                TRACE(_T("Name found - %s\n"), strURLName);
                                bRetVal = TRUE;
                            }
                        }
                    }
                }
                // Continue with the next href on this line, if there is one
                nIndex = (nFirst == -1) ? -1 : strFileLine.Find(_T("href="), nFirst + 1);
            }
        }
        pInputFile.Close();
    }
    catch (CFileException* pFileException)
    {
        TCHAR lpszError[MAX_STR_LENGTH] = { 0 };
        pFileException->GetErrorMessage(lpszError, MAX_STR_LENGTH);
        pFileException->Delete();
        OutputDebugString(lpszError);
        bRetVal = FALSE;
    }
    // The temporary file is no longer needed once it has been scanned
    VERIFY(DeleteFile(strFileName));
    return bRetVal;
}
BOOL CLinkData::IsValidLink()
{
    BOOL bRetVal = TRUE;
    TCHAR lpszTempPath[MAX_STR_LENGTH] = { 0 };
    TCHAR lpszTempFile[MAX_STR_LENGTH] = { 0 };
    const DWORD dwTempPath = GetTempPath(MAX_STR_LENGTH, lpszTempPath);
    // GetTempPath returns 0 on failure, or the required size if the buffer is too small
    if ((dwTempPath == 0) || (dwTempPath >= MAX_STR_LENGTH))
    {
        TRACE(_T("GetTempPath has failed\n"));
        return FALSE;
    }
    // With uUnique set to 0, GetTempFileName also creates the (empty) file
    if (GetTempFileName(lpszTempPath, _T("html"), 0, lpszTempFile) != 0)
    {
        TRACE(_T("URLDownloadToFile(%s)...\n"), GetSourceURL());
        if (URLDownloadToFile(NULL, GetSourceURL(), lpszTempFile, 0, NULL) == S_OK)
        {
            // ProcessHTML deletes the temporary file when it is done
            if (!ProcessHTML(lpszTempFile, GetSourceURL(), GetTargetURL(), GetURLName()))
            {
                TRACE(_T("ProcessHTML(%s) has failed\n"), lpszTempFile);
                bRetVal = FALSE;
            }
        }
        else
        {
            TRACE(_T("URLDownloadToFile has failed\n"));
            // Nothing was scanned, so remove the temporary file here
            DeleteFile(lpszTempFile);
            bRetVal = FALSE;
        }
    }
    else
    {
        TRACE(_T("GetTempFileName has failed\n"));
        bRetVal = FALSE;
    }
    return bRetVal;
}
The Architecture
What do Source URL, Target URL, and URL Name mean in the above piece of code?
- Source URL = what web page to check
- Target URL = what link should be on the above web page
- URL Name = what name should be for the above link
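To make the three terms concrete, here is a minimal, portable sketch of the per-line check that ProcessHTML performs. It is not the application's code: it swaps CString for std::string so it can run outside MFC, and the names EqualsNoCase and LineHasLink are hypothetical helpers introduced only for this illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive equality; stands in for CString::CompareNoCase(...) == 0.
// (Hypothetical helper, not part of IntelliLink.)
bool EqualsNoCase(const std::string& a, const std::string& b)
{
    return a.size() == b.size() &&
        std::equal(a.begin(), a.end(), b.begin(),
            [](unsigned char x, unsigned char y)
            { return std::tolower(x) == std::tolower(y); });
}

// Returns true when `line` contains href="<targetURL>" followed, later on
// the same line, by the link text written as >urlName<.
// (Hypothetical helper, not part of IntelliLink.)
bool LineHasLink(const std::string& line,
                 const std::string& targetURL,
                 const std::string& urlName)
{
    const std::string mark = ">" + urlName + "<";
    std::size_t index = line.find("href=");
    while (index != std::string::npos)
    {
        const std::size_t first = line.find('"', index);
        if (first == std::string::npos)
            break; // no opening double quote left on this line
        const std::size_t last = line.find('"', first + 1);
        if (last == std::string::npos)
            break; // attribute value is not terminated on this line
        // The URL is the text between the two double quotes
        const std::string url = line.substr(first + 1, last - first - 1);
        if (EqualsNoCase(url, targetURL) &&
            line.find(mark, last + 1) != std::string::npos)
            return true;
        // Move on to the next href on the same line
        index = line.find("href=", first + 1);
    }
    return false;
}
```

For a Source URL whose HTML contains the line `<a href="https://example.com/">Example</a>`, calling LineHasLink with Target URL `https://example.com/` and URL Name `Example` reports a match. Because this sketch looks only for double quotes on a single line, a link written with single quotes or split across lines is not detected.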
Each URL definition is contained in a CLinkData class, with the following interface:
- DWORD GetLinkID(): Gets ID of the current URL definition
- void SetLinkID(DWORD dwLinkID): Sets ID for the current URL definition
- CString GetSourceURL(): Gets Source URL for current URL definition
- void SetSourceURL(CString strSourceURL): Sets Source URL for current URL definition
- CString GetTargetURL(): Gets Target URL for current URL definition
- void SetTargetURL(CString strTargetURL): Sets Target URL for current URL definition
- CString GetURLName(): Gets URL Name for current URL definition
- void SetURLName(CString strURLName): Sets URL Name for current URL definition
- int GetPageRank(): currently not implemented
- void SetPageRank(int nPageRank): currently not implemented
- BOOL GetStatus(): Gets status for current URL definition
- void SetStatus(BOOL bStatus): Sets status for current URL definition
Then, we define CLinkList as typedef CArray<CLinkData*> CLinkList;.
This list is managed inside the CLinkSnapshot class, with the following interface:
- BOOL RemoveAll(): Removes all URL definitions from the list
- int GetSize(): Gets the size of the URL definition list
- CLinkData* GetAt(int nIndex): Gets a URL definition from the list
- BOOL Refresh(): Updates the status of each URL definition in the list
- CLinkData* SelectLink(DWORD dwLinkID): Searches for a URL definition by its ID
- DWORD InsertLink(CString strSourceURL, CString strTargetURL, CString strURLName, int nPageRank, BOOL bStatus): Inserts a URL definition into the list
- BOOL DeleteLink(DWORD dwLinkID): Removes a URL definition from the list
- BOOL LoadConfig(): Loads the URL definition list from an XML file
- BOOL SaveConfig(): Saves the URL definition list to an XML file
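The insert, select, delete, and refresh operations above can be sketched with standard containers. This is an illustration under stated assumptions, not the application's implementation: LinkData and LinkSnapshot are hypothetical stand-ins for CLinkData and CLinkSnapshot, PageRank and the XML load/save are left out, IDs are assumed to come from a simple counter, and Refresh takes a checker function as a parameter so that the sketch needs no network access.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for CLinkData (PageRank omitted).
struct LinkData
{
    std::uint32_t id = 0;
    std::string sourceURL;  // the web page to check
    std::string targetURL;  // the link expected on that page
    std::string urlName;    // the text expected for that link
    bool status = false;    // result of the last check
};

// Hypothetical stand-in for CLinkSnapshot; it owns its entries.
class LinkSnapshot
{
public:
    // Adds a definition and returns its newly assigned ID.
    // Assumption: IDs come from a simple incrementing counter.
    std::uint32_t InsertLink(std::string sourceURL, std::string targetURL,
                             std::string urlName, bool status = false)
    {
        auto link = std::make_unique<LinkData>();
        link->id = m_nextID++;
        link->sourceURL = std::move(sourceURL);
        link->targetURL = std::move(targetURL);
        link->urlName = std::move(urlName);
        link->status = status;
        m_links.push_back(std::move(link));
        return m_links.back()->id;
    }

    // Returns the definition with the given ID, or nullptr if there is none.
    LinkData* SelectLink(std::uint32_t id) const
    {
        for (const auto& link : m_links)
            if (link->id == id)
                return link.get();
        return nullptr;
    }

    // Removes the definition with the given ID; returns false if not found.
    bool DeleteLink(std::uint32_t id)
    {
        const auto it = std::find_if(m_links.begin(), m_links.end(),
            [id](const std::unique_ptr<LinkData>& link) { return link->id == id; });
        if (it == m_links.end())
            return false;
        m_links.erase(it);
        return true;
    }

    // Updates the status of every definition using the supplied checker.
    void Refresh(const std::function<bool(const LinkData&)>& isValid)
    {
        for (auto& link : m_links)
            link->status = isValid(*link);
    }

    std::size_t GetSize() const { return m_links.size(); }

private:
    std::uint32_t m_nextID = 1;
    std::vector<std::unique_ptr<LinkData>> m_links;
};
```

Holding the entries through std::unique_ptr keeps the pointer-based shape of CArray<CLinkData*> while making ownership explicit, so deleting a link or destroying the snapshot releases the entry without a manual delete.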
The Good, the Bad, and the Ugly
The good thing is that I learned to use Windows ribbons. The bad thing is that I still don't know how to get a web page's PageRank value. The ugly thing is that the processing (i.e., checking link validity) should be done in a separate working thread, but I am planning this change for the next release. Stay tuned!
Final Words
The IntelliLink application uses many components that have been published on CodeProject. Many thanks to:
- My CMFCListView form view (see source code)
- Lee Thomason for his TinyXML2 class
- PJ Naughter for his CTrayNotifyIcon class
- PJ Naughter for his CVersionInfo class
Further plans: I would like to add support for Google's PageRank as soon as possible.
History
- Version 1.04 (November 9th, 2014): Initial release
- Moved source code from CodeProject to GitLab (April 10th, 2020)
- Moved source code from GitLab to GitHub (February 23rd, 2022)
- Version 1.05 (April 28th, 2022): Added setup project
- Version 1.06 (May 23rd, 2022): Added program to Startup Apps
- Version 1.07 (August 20th, 2022): Updated font size of About dialog
- Version 1.08 (August 26th, 2022): Removed program from Startup Apps
- Version 1.09 (September 9th, 2022): Added Contributors hyperlink to AboutBox dialog
- Version 1.10 (January 23rd, 2023): Updated PJ Naughter's CVersionInfo library to the latest version available. Updated the code to use C++ uniform initialization for all variable declarations.
- Version 1.11 (January 24th, 2023): Updated PJ Naughter's CInstanceChecker library to the latest version available. Updated the code to use C++ uniform initialization for all variable declarations.
- Replaced NULL throughout the codebase with nullptr. Replaced BOOL throughout the codebase with bool. This means that the minimum requirement for the application is now Microsoft Visual C++ 2010.
- Version 1.12 (May 27th, 2023): Updated About dialog with GPLv3 notice
- Version 1.13 (June 16th, 2023): Made the column widths of the interface persistent
- Version 1.14 (June 24th, 2023): Updated PJ Naughter's CTrayNotifyIcon library to the latest version available
- Version 1.15 (July 20th, 2023): Extended application's functionality with two new buttons: Website Review and Webmaster Tools
- Version 1.16 (August 20th, 2023):
- Changed article's download link. Updated the About dialog (email & website)
- Added social media links: Twitter, LinkedIn, Facebook, and Instagram
- Added shortcuts to GitHub repository's Issues, Discussions, and Wiki
- Version 1.17 (October 29th, 2023): Updated PJ Naughter's CTrayNotifyIcon library to the latest version available. Fixed an issue where the CTrayNotifyIcon::OnTrayNotification callback method would not work correctly if the m_NotifyIconData.uTimeout member variable gets updated during runtime of client applications. This can occur when you call CTrayNotifyIcon::SetBalloonDetails. Thanks to Maisala Tuomo for reporting this bug.
- Version 1.18 (January 27th, 2024): Added ReleaseNotes.html and SoftwareContentRegister.html to GitHub repo
- Version 1.19 (February 21st, 2024): Switched MFC application's theme back to native Windows
- Version 1.20 (September 6th, 2024):
- Replaced old XML library from CodeProject with Lee Thomason's TinyXML2 library.
- Implemented User Manual option into Help menu.
- Implemented Check for updates... option into Help menu.