Introduction
I wrote this function when I was asked to build an app that modifies the search criteria of a search grid on someone else's site and then displays only the grid to the user. The long, complicated way of doing this would be to write a screen-scraper application and modify the DOM. The simpler way was to copy the entire site, add minimal JavaScript to hide/show specific panels, and hard-set the information passed to the search grid on document load, without any screen scraping at all.
Background
It was a given that the site I copied had jQuery implemented, but either way you can add jQuery before or after the copy in case the site you are referencing does not do so (*see the new Points of Interest). You will also be required to add the attributes [id="body" runat="server"] to the <body> tag, so that the copy method can change the contents of the body tag server-side.
Using the Code
The C# code is as follows:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Linq;
using Microsoft.Ajax.Utilities;
namespace WebStuff
{
public partial class Utilities
{
const RegexOptions _defaultRxFlags = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline;
const StringComparison _ic = StringComparison.CurrentCultureIgnoreCase;
static Regex _rxPageScriptsOnly = new Regex(@"<(script|style|link)\b[^>]*>[\s\S]*?<\/\1[^>]*>|(<(link|script)[^>]*((\/>)|(>)))", _defaultRxFlags);
static Regex _rxScriptTags = new Regex(@"<(script|style|link)[^>]*>|<\/(script|style|link)[^>]*>", _defaultRxFlags);
static Regex _rxBodyOnly = new Regex(@"<(body)\b[^>]*>[\s\S]*?<\/\1[^>]*>", _defaultRxFlags);
static Regex _rxBodyTags = new Regex(@"<body[^>]*>|<\/body[^>]*>", _defaultRxFlags);
static Regex _rxScriptVersion = new Regex(@"(\p{P}[\d]+)+(.min)?", _defaultRxFlags);
static Regex _rxScriptPath = new Regex(@"[^\/]+$", _defaultRxFlags);
public static void CopyHtmlPage(string url)
{
Page page = HttpContext.Current.Handler as Page;
List<string> parentResidentScripts = new List<string>();
string resListStr = page.Request.Params["ResidentScripts"];
if (!string.IsNullOrEmpty(resListStr))
{
string[] splits = resListStr.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
foreach (string split in splits)
{
parentResidentScripts.Add(GetScriptBaseName(split));
}
}
HtmlGenericControl body = (HtmlGenericControl)page.FindControl("body");
if (body == null)
throw new Exception("No access to modify local <body> tag. Add [id='body' runat='server'] attributes to the <body> tag.");
Uri location = new Uri(url, UriKind.RelativeOrAbsolute);
string htmlText = GetResponseText(url);
Match bodyMatch = _rxBodyOnly.Match(htmlText);
if (!bodyMatch.Success)
throw new Exception("Rendered html has no complete <body>[content]</body> element at [" + url + "]");
{
StringBuilder bodyText = new StringBuilder(
_rxPageScriptsOnly.Replace(
_rxBodyTags.Replace(bodyMatch.Value, "").Trim()
, "")
);
FixAllLinks(ref bodyText, location);
System.Web.UI.Control newBody = page.ParseControl(bodyText.ToString(), true);
if (newBody != null)
body.Controls.Add(newBody);
}
Minifier minifier = new Minifier();
CodeSettings scriptSettings = new CodeSettings();
scriptSettings.MinifyCode = true;
scriptSettings.OutputMode = OutputMode.MultipleLines;
scriptSettings.CollapseToLiteral = true;
scriptSettings.PreserveImportantComments = false;
scriptSettings.EvalTreatment = EvalTreatment.Ignore;
scriptSettings.InlineSafeStrings = true;
scriptSettings.LocalRenaming = LocalRenaming.CrunchAll;
// UserAgent strings capitalize browser names (e.g. "Safari", "Mozilla"), so compare case-insensitively; the header may also be absent.
scriptSettings.MacSafariQuirks = (new string[] { "safari", "apple" }).Any(w => (page.Request.UserAgent ?? "").IndexOf(w, _ic) > -1);
scriptSettings.ConstStatementsMozilla = (page.Request.UserAgent ?? "").IndexOf("mozilla", _ic) > -1;
scriptSettings.PreserveFunctionNames = true;
scriptSettings.RemoveFunctionExpressionNames = true;
scriptSettings.RemoveUnneededCode = false;
scriptSettings.StripDebugStatements = true;
scriptSettings.ReorderScopeDeclarations = true;
MatchCollection pageScripts = _rxPageScriptsOnly.Matches(htmlText);
int controlIndex = 0;
foreach (Match pageScript in pageScripts)
{
string hrefType = " src=";
Control newScript = null;
int checkScriptIndex = pageScript.Value.IndexOf("script", _ic);
int checkStyleIndex = pageScript.Value.IndexOf("style", _ic);
int checkLinkIndex = pageScript.Value.IndexOf("link", _ic);
bool isLinkTag = (checkLinkIndex > -1 && checkLinkIndex < 3);
bool isScriptTag = (checkScriptIndex > -1 && checkScriptIndex < 3);
bool isStyleTag = (checkStyleIndex > -1 && checkStyleIndex < 3);
if (isScriptTag)
{
newScript = new HtmlGenericControl("script");
((HtmlGenericControl)newScript).Attributes.Add("type", "text/javascript");
}
else if (isStyleTag || isLinkTag)
{
newScript = new HtmlGenericControl("style");
((HtmlGenericControl)newScript).Attributes.Add("type", "text/css");
hrefType = " href=";
}
else
continue;
StringBuilder scriptText = new StringBuilder(pageScript.Value);
string workingText = scriptText.ToString();
int srcLength = hrefType.Length + 1;
int srcIndex = workingText.IndexOf(hrefType) + srcLength;
string encap = workingText.Substring(srcIndex - 1, 1);
int endIndex = workingText.IndexOf(encap, srcIndex);
if (isLinkTag && (workingText.IndexOf("text/css", srcIndex, _ic) < 0 || workingText.IndexOf("stylesheet", srcIndex, _ic) < 0)) {
if (srcIndex > 0)
{
string srcUrl = workingText.Substring(srcIndex, endIndex - srcIndex).Trim();
Uri resourceLocation = ResolveUrl(srcUrl, location);
scriptText = scriptText.Replace(srcUrl, resourceLocation.ToString());
}
newScript = new LiteralControl();
}
else if (srcIndex > srcLength && srcIndex < workingText.IndexOf(">") && endIndex > srcIndex)
{
string srcUrl = workingText.Substring(srcIndex, endIndex - srcIndex).Trim();
string baseName = GetScriptBaseName(srcUrl);
if (parentResidentScripts.Contains(baseName, StringComparer.CurrentCultureIgnoreCase))
continue;
Uri resourceLocation = ResolveUrl(srcUrl, location);
((HtmlGenericControl)newScript).Attributes.Add("original", resourceLocation.ToString());
try
{
scriptText = new StringBuilder((isScriptTag) ?
minifier.MinifyJavaScript(
GetResponseText(resourceLocation.ToString())
, scriptSettings)
: minifier.MinifyStyleSheet(
GetResponseText(resourceLocation.ToString())
)
);
FixAllLinks(ref scriptText, resourceLocation);
}
catch (Exception ex)
{
scriptText.Length = 0;
((HtmlGenericControl)newScript).Attributes.Add("error", ex.Message);
}
}
else
{
scriptText = new StringBuilder((isScriptTag) ?
minifier.MinifyJavaScript(
_rxScriptTags.Replace(pageScript.Value, "").Trim()
, scriptSettings)
: minifier.MinifyStyleSheet(
_rxScriptTags.Replace(pageScript.Value, "").Trim()
)
);
FixAllLinks(ref scriptText, location);
}
if (scriptText.Length > 0)
{
if (isScriptTag)
{
scriptText.Insert(0, "<!-- \n");
scriptText.Append("\n -->");
}
if (newScript.GetType() == typeof(HtmlGenericControl))
((HtmlGenericControl)newScript).InnerHtml = scriptText.ToString();
if (newScript.GetType() == typeof(LiteralControl))
((LiteralControl)newScript).Text = scriptText.ToString();
}
if (pageScript.Index < bodyMatch.Index)
{
page.Header.Controls.AddAt(controlIndex, newScript);
controlIndex++;
}
else
{
body.Controls.Add(newScript);
}
}
}
private static string GetScriptBaseName(string scriptUrl)
{
string baseName = _rxScriptPath.Match(scriptUrl).Value;
baseName = _rxScriptVersion.Replace(baseName, "");
return baseName;
}
private static void FixAllLinks(ref StringBuilder fixText, Uri siteUrl)
{
FixLinks("url(", ref fixText, siteUrl);
FixLinks("src=", ref fixText, siteUrl);
FixLinks("href=", ref fixText, siteUrl);
}
private static void FixLinks(string searchType, ref StringBuilder fixText, Uri siteUrl)
{
int urlIndex = 0;
while (urlIndex > -1)
{
string workingText = fixText.ToString();
urlIndex = workingText.IndexOf(searchType, urlIndex, _ic);
if (urlIndex < 0) continue;
urlIndex = urlIndex + searchType.Length;
string urlEncap = fixText[urlIndex].ToString();
if (urlEncap.Equals(@"\"))
{
urlIndex++;
urlEncap += fixText[urlIndex].ToString();
urlIndex++;
}
else if (!urlEncap.Equals("'") && !urlEncap.Equals("\""))
urlEncap = ")";
else
urlIndex++;
int endIndex = workingText.IndexOf(urlEncap, urlIndex);
string srcUrl = workingText.Substring(urlIndex, endIndex - urlIndex);
if (string.IsNullOrEmpty(srcUrl.Trim()) ||
srcUrl.Trim().Equals("#") ||
srcUrl.Trim().StartsWith("javascript:", _ic) ||
srcUrl.Trim().Equals("/a", _ic))
continue;
Uri resourceLocation = ResolveUrl(srcUrl, siteUrl);
fixText = fixText.Remove(urlIndex, endIndex - urlIndex);
fixText = fixText.Insert(urlIndex, resourceLocation.ToString());
}
}
private static Uri ResolveUrl(string srcUrl, Uri siteUrl)
{
    Uri resourceLocation = null;
    // Some copied sites use backslash separators in their resource paths.
    string pathSeparator = "/";
    int sepIndex = srcUrl.IndexOf("\\");
    if (sepIndex > -1 && sepIndex < 10) pathSeparator = "\\";
    bool wellFormed = Uri.TryCreate(srcUrl, UriKind.RelativeOrAbsolute, out resourceLocation);
    try
    {
        // Scheme throws on a relative Uri, which flags the url as not well formed.
        wellFormed = (resourceLocation.Scheme != "");
    }
    catch
    {
        wellFormed = false;
    }
    if (!wellFormed)
    {
        // Root-relative urls resolve against the site root; others against the current folder.
        int lastSep = siteUrl.OriginalString.LastIndexOf(pathSeparator);
        int rootSep = siteUrl.OriginalString.IndexOf(pathSeparator, siteUrl.Host.Length);
        string resourcePath = ((srcUrl.StartsWith(pathSeparator))
            ? siteUrl.OriginalString.Substring(0, rootSep)
            : siteUrl.OriginalString.Substring(0, lastSep + 1)) + srcUrl;
        resourceLocation = new Uri(resourcePath);
    }
    return resourceLocation;
}
public static string GetResponseText(string url)
{
    // Dispose both the response and the reader, even if reading fails.
    WebRequest request = WebRequest.Create(url);
    using (WebResponse response = request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}
}
}
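The ResolveUrl helper above hand-rolls the resolution that System.Uri's (baseUri, relativeUri) constructor already performs for forward-slash paths. As a quick standalone check (the addresses below are made up):

```csharp
using System;

class ResolveUrlDemo
{
    static void Main()
    {
        Uri site = new Uri("http://example.com/finance/page.html");
        // Root-relative urls resolve against the site root...
        Console.WriteLine(new Uri(site, "/images/logo.png"));
        // http://example.com/images/logo.png
        // ...document-relative urls against the current folder...
        Console.WriteLine(new Uri(site, "grid.css"));
        // http://example.com/finance/grid.css
        // ...and absolute urls are kept as-is.
        Console.WriteLine(new Uri(site, "http://cdn.example.com/x.js"));
        // http://cdn.example.com/x.js
    }
}
```

For well-formed input you could reduce ResolveUrl to this constructor; the hand-rolled branch mainly earns its keep on backslash-separated paths.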
To use the copy method, override the OnPreRender method of a page so that your own content is rendered and processed before the off-site web page is copied.
protected override void OnPreRender(EventArgs e)
{
base.OnPreRender(e);
CopyHtmlPage("https://www.google.com/finance");
}
By placing your scripts in the correct location, you can precede or follow the copied HTML code.
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SiteCopy.aspx.cs" Inherits="WebStuff.SiteCopy" %>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<script type="text/javascript">
</script>
<title>Site Copy Finance Grid</title>
<script type="text/javascript">
$(document).ready(function () {
var firstTable = $("table:first");
var mainRow = firstTable.find("tr:first");
var columns = mainRow.children();
columns.eq(0).hide();
columns.eq(1).hide();
searchGroup = "Fortune500"; search();
});
</script>
</head>
<body id="body" runat="server">
</body>
</html>
*[New] This simple HTML page shows how you can call the site-copy page and insert the resulting page into a div panel. The ability to ignore duplicated scripts is shown here by passing a list of the scripts already loaded by the current page.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Offsite Copy, Ajax Panel</title>
<script src='/scripts/jquery-2.1.1.js' type='text/javascript'></script>
</head>
<body>
<div id="offsite" style="background:url(loading.gif) no-repeat center center; -moz-min-width:20px; -ms-min-width:20px; -o-min-width:20px; -webkit-min-width:20px; min-width:20px;min-height:20px;">
</div>
<script type="text/javascript">
var scripts = $('script[src]');
var scriptList = "";
$.each(scripts, function (k, v) {
scriptList += $(v).attr('src') + ",";
});
jQuery.support.cors = true;
$.ajax({
type: "GET",
url: 'http://mysite.com/SiteCopy.aspx',
data: "ResidentScripts=" + encodeURIComponent(scriptList),
dataType: "html",
contentType: "text/html; charset=utf-8",
cache: false,
crossDomain: true,
isLocal: true,
success: function(data) {
$('#offsite').html(data);
},
error: function(request, error) {
alert(error + ": " + request.status);
},
complete: function () {
$('#offsite').css('background','none');
}
});
</script>
</body>
</html>
Points of Interest
Now the source site is copied and all script/style elements are unraveled as in-line code. The server copy also takes into account whether a script belongs in the header or the body, and places it accordingly on the duplicate. If an error occurs, the script/style element will carry an [error="?"] attribute describing the problem, as well as an [original="?"] attribute recording the src or href path the script was unraveled from on the original site.
As a note, after I loaded and modified the page to show only what I wanted to see, I had to write JavaScript that parsed out any links and images with relative URL references and changed them to absolute references pointing at the copied site, so that they would display and navigate properly. I could have added server-side code to do this, but I wanted the code to remain client-side configurable after rendering; for example, you could swap in your own images simply by naming a local file the same as an image on the source site. *[New] After seeing the advantages of doing this at parse time, I changed the code to replace all links with absolute links resolved against the copied site.
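That parse-time replacement can be sketched as a single regex pass. This is a simplification of the FixAllLinks/FixLinks pair above (the real code also skips "#", "javascript:" and empty values, and the site address here is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

class FixLinksDemo
{
    // Rewrite every url(...), src=... and href=... value to an absolute URL,
    // the way FixAllLinks does for the copied markup and stylesheets.
    public static string MakeAbsolute(string text, Uri site)
    {
        return Regex.Replace(text,
            @"(url\(|src=|href=)(['""]?)([^'"")\s>]+)\2",
            m => m.Groups[1].Value + m.Groups[2].Value
               + new Uri(site, m.Groups[3].Value) + m.Groups[2].Value,
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Uri site = new Uri("http://example.com/finance/page.html");
        Console.WriteLine(MakeAbsolute("background: url(images/bg.png);", site));
        // background: url(http://example.com/finance/images/bg.png);
        Console.WriteLine(MakeAbsolute("<img src=\"pics/chart.png\">", site));
        // <img src="http://example.com/finance/pics/chart.png">
    }
}
```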
*[New] Code was added to process <link> tags that don't need to be unraveled (such as a favicon reference).
*[New] After trying to drive this code from an Ajax call, I found that when certain scripts were duplicated on both the calling page and the Ajax panel the content was added to, some of them would fail to load properly (jQuery in particular). I added a version-independent script checker that compares the script file names on the request against those on the copied site and cancels including any script that a similar one on the calling page already covers. To use it, pass a form or query variable named "ResidentScripts" whose value is a comma-separated list of the script paths already loaded.
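The version-independent comparison can be exercised on its own with the same two regular expressions GetScriptBaseName uses:

```csharp
using System;
using System.Text.RegularExpressions;

class ScriptNameDemo
{
    // Same two patterns as GetScriptBaseName: keep only the file name,
    // then strip version digits and a trailing ".min".
    static readonly Regex PathRx = new Regex(@"[^\/]+$");
    static readonly Regex VersionRx = new Regex(@"(\p{P}[\d]+)+(.min)?");

    public static string BaseName(string url)
    {
        return VersionRx.Replace(PathRx.Match(url).Value, "");
    }

    static void Main()
    {
        // Both of these reduce to "jquery.js", so a resident jquery-1.9.0.min.js
        // on the calling page cancels the copy of jquery-2.1.1.js.
        Console.WriteLine(BaseName("/scripts/jquery-2.1.1.js"));
        Console.WriteLine(BaseName("jquery-1.9.0.min.js"));
    }
}
```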
*[New] I added server-side script minification using the WebGrease toolkit, added to the project through the NuGet package manager. You can remove the minification code if you like; it really doesn't save much load time, since minifying takes time too. I added it to cut down on the amount of data flowing across the web.
I also had to debug the flow of JavaScript on the copied site to see which variables to change in order to hard-set the search() function. This technique is aimed at more advanced coders, but it is not out of the realm of an intermediate coder's understanding.
The real beauty of this method is that the entire site is copied, so if the site you are copying changes, your duplicate page displays those changes in real time. You may still have to make some JavaScript changes, but that is a minor and easier fix than re-coding a scraper to look for different element names and formats.
History