A simple console application which downloads a set of webpages and saves them as single .MHTML files in a specified folder.
Introduction
EdgeSinglePageDownloader is a simple console application that downloads a set of webpages and saves each one as a single .mhtml file in a specified folder, using the Edge Selenium WebDriver (it should also work with ChromeDriver). It is a proof of concept that shows how to implement this feature with Selenium WebDriver, and it also lets you download and save a set of webpages in batch.
Background
The Edge Selenium WebDriver is a NuGet package that lets you automate Microsoft Edge by simulating user interaction. The browsed page is saved by sending CTRL + S to Edge to pop up the Save As dialog, then specifying a filename, selecting the file format (Webpage, single file .mhtml) and clicking the Save button. Unfortunately, the SendKeys method provided by Selenium WebDriver does not work for this (or at least I was not able to make it work on Windows), so after several tries I switched to the VBScript SendKeys method, which works flawlessly, with the caveat that it requires Windows as the operating system.
Using the Code
Using the code is very simple. All you have to do is:
- Adjust the DefaultSaveFolder constant to specify the local folder where the .mhtml files are saved; it defaults to c:\temp:
    const string DefaultSaveFolder = "c:\\temp";
- Adjust the initialization of the urlsToSave variable with the urls you would like to save; by default, The Verge and Wired urls are provided:
    var urlsToSave = new List<string>
        { "https://www.theverge.com", "https://www.wired.com/" };
After that, just run the code. Edge will be started, all the urls in urlsToSave will be browsed sequentially, and they will be saved in the DefaultSaveFolder with the filenames Page_1.mhtml, Page_2.mhtml, ..., Page_n.mhtml.
The code is very simple and it is all contained in the Main function with the SaveAsSingleFile helper function.
The Main function performs these steps:
- Loops over all urls inside the urlsToSave variable and, for each of them:
  - Instantiates a new EdgeDriver class (provided by Selenium WebDriver), which starts a new Edge browser
  - Makes the browser navigate to the url by calling EdgeDriver.Navigate().GoToUrl
  - Saves the webpage as a single .mhtml file by calling the helper function SaveAsSingleFile
    static void Main(string[] args)
    {
        var options = new EdgeOptions();
        var service = EdgeDriverService.CreateDefaultService();
        service.EnableVerboseLogging = true;
        // COM object used to send keystrokes to the browser window
        WshShell = new WshShellClass();
        var urlsToSave = new List<string>
            { "https://www.theverge.com", "https://www.wired.com/" };
        var i = 1;
        foreach (var url in urlsToSave)
        {
            // Start a new Edge browser for each url
            Driver = new EdgeDriver(service, options);
            Driver.Navigate().GoToUrl(url);
            // Save the page as Page_1.mhtml, Page_2.mhtml, ...
            SaveAsSingleFile(Path.Combine(DefaultSaveFolder, $"Page_{i++}.mhtml"), url);
            Driver.Close();
        }
    }
The SaveAsSingleFile helper function performs these steps:
- Checks whether the output directory exists and creates it if not.
- Checks whether the output file already exists. If it does and it is in the .mhtml format, the function exits without saving the page again; otherwise, it deletes the existing file.
- Sends CTRL (^ character) + S to the Edge browser by using WshShell.SendKeys, which pops up the "Save as" dialog (image below). Notice that the labels of the "File name" and "Save as type" controls each have an underlined character, 'n' and 't' respectively; you can focus on a control by pressing ALT + its character. Please pay attention to the fact that these shortcuts are localization-dependent (I am using English localized Windows). In your Windows installation they could be different, so adapt them if necessary.
- Sends ALT (% character) + 'n', followed by the filename passed to the function.
- Sends ALT (% character) + 't', DOWN ARROW to open the list of all possible "Save as type" formats, UP ARROW to select "Webpage, Single File (*.mhtml)", and ENTER (~ character) twice to confirm the file format and press the "Save" button.
- After this, it checks once per second, for up to 1 minute (specified by the MaxWaitForSaveMSec constant), whether the saved file has been created and is in the MHTML format (that is, whether it contains the string "Snapshot-Content-Location: {url}"). If the file exists but does not contain that string, the function goes back to the beginning and redoes everything.
[Image 1: the "Save as" dialog]
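Because the filename is typed into the dialog through SendKeys, any character that SendKeys treats as special (+ ^ % ~ and parentheses, plus braces and brackets) would be interpreted as a key command rather than typed literally. The default Page_n.mhtml names in c:\temp contain none of these, but a custom folder or filename might. The following is a minimal sketch of an escaping helper; the class and method names (SendKeysText, Escape) are hypothetical and are not part of the original project.

```csharp
using System.Text;

// Hypothetical helper, not part of the original project.
static class SendKeysText
{
    // Characters with a special meaning for WshShell.SendKeys; each one
    // has to be wrapped in braces to be typed literally.
    private const string Special = "+^%~(){}[]";

    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (var c in text)
        {
            if (Special.IndexOf(c) >= 0)
                sb.Append('{').Append(c).Append('}');   // e.g. "(" becomes "{(}"
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With such a helper, the filename step could be written as WshShell.SendKeys("%n" + SendKeysText.Escape(filename)), which leaves the ALT + 'n' shortcut intact while escaping only the path.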
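    static void SaveAsSingleFile(string filename, string url)
    {
    again:
        // Make sure the output directory exists
        if (!Directory.Exists(Path.GetDirectoryName(filename)))
            Directory.CreateDirectory(Path.GetDirectoryName(filename));
        // Skip the page if a valid snapshot is already there, otherwise delete the stale file
        if (System.IO.File.Exists(filename))
        {
            if (!System.IO.File.ReadAllText(filename).Contains($"Snapshot-Content-Location: {url}"))
                System.IO.File.Delete(filename);
            else
                return;
        }
        // CTRL + S opens the "Save as" dialog
        WshShell.SendKeys("^s");
        Thread.Sleep(1000);
        // ALT + n focuses the file name box, then the filename is typed
        WshShell.SendKeys($"%n{filename}");
        Thread.Sleep(20);
        // ALT + t focuses the "Save as type" list
        WshShell.SendKeys($"%t");
        Thread.Sleep(20);
        WshShell.SendKeys($"{{DOWN}}");
        Thread.Sleep(20);
        WshShell.SendKeys($"{{UP}}");
        Thread.Sleep(20);
        // ENTER twice: confirm the format, then press "Save"
        WshShell.SendKeys($"~~");
        // Poll once per second until the file appears or the timeout expires
        var endtime = DateTime.Now.AddMilliseconds(MaxWaitForSaveMSec);
        while (DateTime.Now < endtime)
        {
            Thread.Sleep(1000);
            if (System.IO.File.Exists(filename))
            {
                // A file without the marker is not a valid snapshot: start over
                if (!System.IO.File.ReadAllText(filename).Contains($"Snapshot-Content-Location: {url}"))
                    goto again;
                else
                    break;
            }
        }
    }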
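The wait-and-check step at the end of SaveAsSingleFile can also be isolated into a small polling helper. The sketch below is an illustration only: the names SnapshotCheck, IsSnapshotOf and WaitForSnapshot are hypothetical, and treating an IOException as "file not ready yet" is an assumption (the browser may still hold the file open while writing it), not something the original code does. Unlike the original, this helper keeps polling until the timeout instead of restarting the save.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper, not part of the original project.
static class SnapshotCheck
{
    // True if the file exists and contains the marker line that the
    // article's code looks for in a saved page.
    public static bool IsSnapshotOf(string path, string url)
    {
        if (!File.Exists(path)) return false;
        try
        {
            return File.ReadAllText(path)
                       .Contains($"Snapshot-Content-Location: {url}");
        }
        catch (IOException)
        {
            // Assumption: the file may still be locked by the browser.
            return false;
        }
    }

    // Polls until the snapshot is valid or the timeout expires.
    public static bool WaitForSnapshot(string path, string url,
                                       TimeSpan timeout, TimeSpan pollInterval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            if (IsSnapshotOf(path, url)) return true;
            Thread.Sleep(pollInterval);
        }
        // One last check after the deadline has passed.
        return IsSnapshotOf(path, url);
    }
}
```

A caller could then decide what to do when WaitForSnapshot returns false, for example retry the save a limited number of times rather than looping indefinitely.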
Points of Interest
To use the VBScript SendKeys method, you have to create an instance of the WScript.Shell COM object. The easiest way to do this is to reference its ActiveX control file directly: right-click the project file --> Add --> COM Reference --> Browse --> select C:\Windows\SysWOW64\wshom.ocx.
You should then see Interop.IWshRuntimeLibrary under the Dependencies/COM node in Visual Studio. Click on it and change "Embed Interop Types" from Yes to No (if you use .NET Core).
[Image: Interop.IWshRuntimeLibrary]
After doing this, you can simply instantiate the WScript.Shell COM object with:
    var wshShell = new WshShellClass();
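For comparison, the same COM object can also be created through late binding, without adding a COM reference to the project. This is a different technique from the one used in this article, shown only as a sketch: it assumes Windows and a runtime that supports dynamic dispatch on COM objects (.NET Framework or .NET 5 and later), and the LateBoundShell class name is hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the original project.
static class LateBoundShell
{
    public static dynamic Create()
    {
        // "WScript.Shell" is the ProgID of the Windows Script Host shell object.
        var type = Type.GetTypeFromProgID("WScript.Shell")
                   ?? throw new InvalidOperationException(
                          "WScript.Shell is not registered on this machine.");
        return Activator.CreateInstance(type);
    }
}
```

Usage would look like dynamic shell = LateBoundShell.Create(); shell.SendKeys("^s");. The trade-off is that method names are resolved at run time, so a typo in SendKeys only surfaces when the call executes, whereas the COM reference approach gives compile-time checking.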
History
- V1.0 (22nd September, 2023): Initial version