How to Automate Saving Webpages as Single .MHTML Files using Selenium Webdriver

25 Sep 2023 | CPOL | 3 min read
Saving webpages as single self-contained files using Selenium Webdriver
A simple console application which downloads a set of webpages and saves them as single .MHTML files in a specified folder.

Introduction

EdgeSinglePageDownloader is a simple console application that downloads a set of webpages and saves each one as a single .MHTML file in a specified folder, using the Edge Selenium Webdriver (it should also work with ChromeDriver). It is a proof of concept that shows how to implement this feature with Selenium Webdriver and how to save a set of webpages in batch.

Background

The Edge Selenium Webdriver is a NuGet package that lets you automate Microsoft Edge by simulating user interaction. The page currently shown in the browser is saved by sending CTRL + S to Edge to open the Save As dialog, then entering a filename, selecting the file format (Webpage, single file .mhtml) and clicking the Save button. Unfortunately, the SendKeys method provided by Selenium Webdriver did not work for this (or at least I was not able to make it work on Windows), so after several tries I switched to the VBScript SendKeys method. It works well, with the caveat that it requires Windows as the operating system.

Using the Code

Using the code is simple. All you have to do is:

  • Adjust the DefaultSaveFolder constant to specify the local folder where the .mhtml files are saved; it defaults to C:\temp:
    C#
    const string DefaultSaveFolder = "c:\\temp";
  • Adjust the initialization of the urlsToSave variable with the URLs you would like to save; by default, the URLs of The Verge and Wired are provided:
    C#
    var urlsToSave = new List<string> 
                     { "https://www.theverge.com", "https://www.wired.com/" };

After that, just run the code. Edge starts, every URL in urlsToSave is visited in sequence, and each page is saved in DefaultSaveFolder as Page_1.mhtml, Page_2.mhtml, ..., Page_n.mhtml.
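The naming scheme described above can be sketched as a small helper. This is only an illustration and not part of the original project: OutputNaming and PlanOutputs are hypothetical names, and the numbering simply mirrors the Page_{i++} counter used in Main.

```csharp
using System.Collections.Generic;
using System.IO;

public static class OutputNaming
{
    // Hypothetical helper: pairs each URL with the file it would be saved to,
    // numbering the files Page_1.mhtml, Page_2.mhtml, ... in list order.
    public static List<(string Url, string FilePath)> PlanOutputs(
        string saveFolder, IEnumerable<string> urls)
    {
        var plan = new List<(string Url, string FilePath)>();
        var i = 1;
        foreach (var url in urls)
        {
            plan.Add((url, Path.Combine(saveFolder, $"Page_{i++}.mhtml")));
        }
        return plan;
    }
}
```

Calling OutputNaming.PlanOutputs(DefaultSaveFolder, urlsToSave) before starting the browser makes it easy to print or log which URL ends up in which file.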

The code is short: it is all contained in the Main function and the SaveAsSingleFile helper function.

The Main function performs these steps:

  • Loops over all URLs in the urlsToSave variable and, for each of them:
    • Instantiates a new EdgeDriver class (provided by Selenium Webdriver), which starts a new Edge browser
    • Makes the browser navigate to the URL by calling EdgeDriver.Navigate().GoToUrl
    • Saves the webpage as a single .mhtml file by calling the helper function SaveAsSingleFile
    • Closes the browser window by calling EdgeDriver.Close
C#
static void Main(string[] args)
{
    var options = new EdgeOptions();

    var service = EdgeDriverService.CreateDefaultService();
    service.EnableVerboseLogging = true;

    WshShell = new WshShellClass();

    var urlsToSave = new List<string> 
        { "https://www.theverge.com", "https://www.wired.com/" };

    var i = 1;
    foreach (var url in urlsToSave)
    {
        Driver = new EdgeDriver(service, options);

        Driver.Navigate().GoToUrl(url);

        SaveAsSingleFile(Path.Combine(DefaultSaveFolder, 
                         $"Page_{i++}.mhtml"),url);

        Driver.Close();
    }
}

The SaveAsSingleFile helper function performs these steps:

  • Checks whether the output directory exists and creates it if not
  • Checks whether the output file already exists and is in the MHTML format; if so, it exits without saving again, otherwise it deletes the existing file
  • Sends CTRL (^ character) + S to the Edge browser using WshShell.SendKeys, which opens the "Save as" dialog (image below). Notice that the labels of the "Filename" and "Save as type" controls each have an underlined character, 'n' and 't' respectively; pressing ALT + one of these characters moves the focus to the corresponding control. These shortcuts are localization-dependent (I am using an English-localized Windows), so they may differ on your Windows installation; adapt them if necessary.
  • Sends ALT (% character) + 'n' followed by the filename passed to the function
  • Sends ALT (% character) + 't', DOWN ARROW to open the list of "Save as type" formats, UP ARROW to select "Webpage, Single File (*.mhtml)", and ENTER (~ character) twice to confirm the file format and press the "Save" button
  • Waits up to 1 minute (specified by the MaxWaitForSaveMSec constant) for the saved file to appear and checks that it is in the MHTML format (that is, that it contains the string "Snapshot-Content-Location: {url}"); if the file is present but fails this check, it goes back to the beginning of the function and repeats the whole procedure

    Image 1: the Edge "Save as" dialog
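One detail worth keeping in mind for the filename step: WshShell.SendKeys gives special meaning to the characters + ^ % ~ ( ) { } and expects them, as well as [ ], to be wrapped in braces when they should be typed literally. The default paths such as c:\temp\Page_1.mhtml contain none of these, but a custom folder or file name might. The following is a minimal sketch of an escaping helper; SendKeysText and EscapeLiteral are hypothetical names and are not part of the original project.

```csharp
using System.Text;

public static class SendKeysText
{
    // Characters that SendKeys treats specially (or requires in braces).
    private const string Special = "+^%~(){}[]";

    // Hypothetical helper: wraps each special character in braces so that
    // SendKeys types it literally instead of interpreting it as a modifier.
    public static string EscapeLiteral(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (var c in text)
        {
            if (Special.IndexOf(c) >= 0)
                sb.Append('{').Append(c).Append('}');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With such a helper, the filename step could be written as WshShell.SendKeys("%n" + SendKeysText.EscapeLiteral(filename)), so that only the ALT + 'n' prefix is interpreted as a shortcut.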

C#
static void SaveAsSingleFile(string filename, string url)
{
again:
    if (!Directory.Exists(Path.GetDirectoryName(filename)))
        Directory.CreateDirectory(Path.GetDirectoryName(filename));
        
    if (System.IO.File.Exists(filename))
    {
        // simple check that the existing file format is mhtml, 
        // otherwise delete and re-save it
        if (!System.IO.File.ReadAllText(filename).Contains
                            ($"Snapshot-Content-Location: {url}"))
            System.IO.File.Delete(filename);
        else
            return;
    }
    
    // send ctrl+s to open the "Save as" dialog and give it time to appear
    WshShell.SendKeys("^s");
    Thread.Sleep(1000);
    // send alt+n, enter filename
    WshShell.SendKeys($"%n{filename}");
    Thread.Sleep(20);
    // send alt+t, down arrow, up arrow (to select single mhtml), press enter twice
    WshShell.SendKeys("%t");
    Thread.Sleep(20);
    WshShell.SendKeys("{DOWN}");
    Thread.Sleep(20);
    WshShell.SendKeys("{UP}");
    Thread.Sleep(20);
    WshShell.SendKeys("~~");
    
    // waits up to MaxWaitForSaveMSec to check that the file is saved correctly
    var endtime = DateTime.Now.AddMilliseconds(MaxWaitForSaveMSec);
    
    while (DateTime.Now < endtime)
    {
        Thread.Sleep(1000);
        // simple check that the file is present and its format is mhtml, 
        // otherwise retry again to save
        if (System.IO.File.Exists(filename))
        {
            if (!System.IO.File.ReadAllText(filename).Contains
                                ($"Snapshot-Content-Location: {url}"))
                goto again;
            else
                break;
        }
    }
}
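The wait-and-check step at the end of SaveAsSingleFile reads the file while the browser may still be writing it, which can raise an IOException. A possible variation, sketched below under that assumption, opens the file with shared access and treats any I/O error as "not ready yet". SavedFileCheck, ContainsSnapshotMarker and WaitForSavedFile are hypothetical names; unlike the original function, this sketch only reports the outcome and leaves the decision to retry the save to the caller.

```csharp
using System;
using System.IO;
using System.Threading;

public static class SavedFileCheck
{
    // Returns true if the file exists, can be read, and contains the
    // "Snapshot-Content-Location: {url}" line that the article checks for.
    public static bool ContainsSnapshotMarker(string filename, string url)
    {
        try
        {
            // FileShare.ReadWrite lets us read while another process still has the file open.
            using (var stream = new FileStream(filename, FileMode.Open,
                                               FileAccess.Read, FileShare.ReadWrite))
            using (var reader = new StreamReader(stream))
            {
                return reader.ReadToEnd().Contains($"Snapshot-Content-Location: {url}");
            }
        }
        catch (IOException)
        {
            // Covers a missing file or folder and sharing violations.
            return false;
        }
    }

    // Polls until the marker is found or the timeout elapses; checks at least once.
    public static bool WaitForSavedFile(string filename, string url,
                                        TimeSpan timeout, TimeSpan pollInterval)
    {
        var deadline = DateTime.UtcNow + timeout;
        do
        {
            if (ContainsSnapshotMarker(filename, url))
                return true;
            Thread.Sleep(pollInterval);
        } while (DateTime.UtcNow < deadline);
        return false;
    }
}
```

A caller could, for example, use WaitForSavedFile(filename, url, TimeSpan.FromMilliseconds(MaxWaitForSaveMSec), TimeSpan.FromSeconds(1)) and repeat the key sequence a limited number of times when it returns false.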

Points of Interest

To use the VBScript SendKeys method, you have to create an instance of the WScript.Shell COM object. The easiest way to do this is to reference its ActiveX control file directly: right-click the project file --> Add --> COM Reference --> Browse --> select C:\Windows\SysWOW64\wshom.ocx.

Interop.IWshRuntimeLibrary should now appear under the Dependencies/COM node in Visual Studio. Click on it and change "Embed Interop Types" from Yes to No (if you use .NET Core).

Interop.IWshRuntimeLibrary

After doing this, you can simply instantiate the WScript.Shell COM object by:

C#
var wshShell = new WshShellClass();
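If adding a COM reference is not an option, the same object can probably also be created through late binding. This is a different technique from the one used in this article, and the sketch below has not been tested against the project: it relies on Type.GetTypeFromProgID and reflection, and LateBoundShell, CreateOrNull and SendKeys are hypothetical helper names.

```csharp
using System;
using System.Reflection;
using System.Runtime.InteropServices;

public static class LateBoundShell
{
    // Creates the WScript.Shell COM object by ProgID, or returns null when
    // not running on Windows or when the ProgID is not registered.
    public static object CreateOrNull()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return null;

        var shellType = Type.GetTypeFromProgID("WScript.Shell");
        return shellType == null ? null : Activator.CreateInstance(shellType);
    }

    // Invokes the COM object's SendKeys method by name (late binding),
    // so no interop assembly is needed at compile time.
    public static void SendKeys(object shell, string keys)
    {
        shell.GetType().InvokeMember("SendKeys", BindingFlags.InvokeMethod,
                                     null, shell, new object[] { keys });
    }
}
```

Usage would then look like var shell = LateBoundShell.CreateOrNull(); followed by LateBoundShell.SendKeys(shell, "^s"); after a null check. The trade-off is that there is no compile-time checking of the method name or arguments.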

History

  • V1.0 (22nd September, 2023): Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)