I finally did it: I bought LINQPad's Code Completion so I could write C# scripts easily. Now, sure, I could have used C# script for free, but... wait, why didn't I do that?
Anyway, in my previous post I explained in detail how to set up a blog on GitHub. Now it's time to convert my old Blogger blog to the shiny new GitHub format.
To export your blog from Blogger, log into Blogger, go to your blog control panel, go to the Settings | Other tab, and click "Export Blog".
You get an XML file that is basically in Atom format. It's hard to understand because it doesn't have any line breaks (apart from the line breaks in your blog template, which just serve to distract you). It's a <feed> root element containing a bunch of <entry> elements, some of which contain your posts and others of which contain metadata.
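Before converting anything, it can help to make the file readable and see what kinds of entries it holds. The following is only a rough sketch: the tiny feed in `Main` is made up for illustration, `FeedInspector` and `CountEntryKinds` are hypothetical names, and it assumes each entry's kind is the part of a category `term` after the `#` (as in `...kind#post`).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FeedInspector
{
    static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";

    // Counts <entry> elements by kind, where the kind is assumed to be the
    // text after '#' in the first category term that contains a '#'.
    public static string[] CountEntryKinds(XDocument doc)
    {
        return doc.Root.Elements(Atom + "entry")
            .Select(e => e.Elements(Atom + "category")
                .Select(c => (string)c.Attribute("term") ?? "")
                .FirstOrDefault(t => t.Contains("#")) ?? "(no kind)")
            .Select(term => term.Substring(term.LastIndexOf('#') + 1))
            .GroupBy(kind => kind)
            .OrderBy(g => g.Key)
            .Select(g => g.Key + ": " + g.Count())
            .ToArray();
    }

    public static void Main()
    {
        // Made-up sample, written on one line like the real export.
        string xml =
            "<feed xmlns='http://www.w3.org/2005/Atom'>" +
            "<entry><category term='http://example.com/kind#settings'/></entry>" +
            "<entry><category term='http://example.com/kind#post'/><title>Hello</title></entry>" +
            "</feed>";
        XDocument doc = XDocument.Parse(xml);

        // ToString() re-indents the document, one element per line.
        Console.WriteLine(doc.ToString());

        foreach (string line in CountEntryKinds(doc))
            Console.WriteLine(line);
    }
}
```

For a real export, `XDocument.Load(path)` in place of `XDocument.Parse(xml)` should work the same way.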
Here's the code I came up with to export to GitHub. Just paste this into LINQPad or whatever, change the filepath to point to your xml file, and run it!
// Uses System.IO, System.Linq and System.Xml.Linq (LINQPad imports these by default).
void Main()
{
    string filepath = @"C:\Downloads\Blog.xml";
    string text = File.ReadAllText(filepath);
    XDocument doc = XDocument.Parse(text);
    var _ = XNamespace.Get("http://www.w3.org/2005/Atom");
    var app = XNamespace.Get("http://purl.org/atom/app#");
    // Keep entries that have a "#post" category (checking every category,
    // not just the first one), and skip drafts.
    var posts = doc.Root.Elements(_+"entry")
        .Where(entry => entry.Elements(_+"category")
            .Any(cat => ((string)cat.Attribute("term") ?? "").Contains("#post")))
        .Where(entry => !entry.Descendants(app+"draft").Any(draft => draft.Value != "no"));
    var outfolder = Path.Combine(Path.GetDirectoryName(filepath), Path.GetFileNameWithoutExtension(filepath));
    Directory.CreateDirectory(outfolder);
    foreach (var entry in posts)
    {
        DateTime published = DateTime.Parse(entry.Element(_+"published").Value);
        DateTime updated = DateTime.Parse(entry.Element(_+"updated").Value);
        string title = entry.Element(_+"title").Value;
        string content = entry.Element(_+"content").Value;
        // The cast yields null when the attribute is missing, so ?? can apply the default.
        string type = (string)entry.Element(_+"content").Attribute("type") ?? "html";
        // URL of the original post: href of the first rel="alternate" link, or "" if none.
        string originalLink = entry.Elements(_+"link")
            .Where(e => (string)e.Attribute("rel") == "alternate")
            .Select(e => (string)e.Attribute("href"))
            .FirstOrDefault() ?? "";
        string outFileName = string.Format("{0:yyyy-MM-dd}-{1}.{2}", published,
            Path.GetFileNameWithoutExtension(originalLink), type);
        var outPath = Path.Combine(outfolder, outFileName);
        if (content.Count(c => c == '\n') <= 3)
            content = AddLineBreaks(content);
        using (StreamWriter output = File.CreateText(outPath)) {
            // Jekyll front matter. Backslashes and quotes in the title are
            // escaped so the double-quoted YAML string stays valid.
            output.WriteLine("---");
            output.WriteLine("title: \"{0}\"", title.Replace("\\", "\\\\").Replace("\"", "\\\""));
            output.WriteLine("layout: post");
            output.WriteLine("# Pulled from Blogger. Last updated there on: {0:yyyy-MM-dd}", updated);
            output.WriteLine("---");
            if (originalLink != "")
                output.WriteLine("<small><p><i>This post was imported from "+
                    "<a href='{0}'>blogspot</a>.</i></p></small>", originalLink);
            output.WriteLine("");
            output.Write(content);
            output.WriteLine("");
        }
    }
}
It will create a folder named after the xml file, and inside that folder it will create an html file for each post, like this:
2007-09-03-hello-no-one.html
2011-07-05-why-wpf-sucks.html
2012-06-07-smart-tabs.html
2013-05-28-onward.html
These filenames are in the correct format for Jekyll, so if you're moving to GitHub, just move all these files to your /_posts folder, commit, and you're done! If you want "proper" HTML files, modify the code above to produce proper code like <html><head>...</head>... instead of Jekyll front-matter.
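As a rough illustration of that modification, here is one way the front-matter lines could be swapped for a full page. `HtmlPage.Wrap` is a hypothetical helper, not part of the script above, and it assumes the title is plain text (so it gets HTML-encoded) while the content is already HTML.

```csharp
using System;
using System.Net;

public static class HtmlPage
{
    // Hypothetical helper: wraps a post's HTML fragment in a complete page.
    // The title is treated as plain text and encoded; the content is assumed
    // to be HTML already and is inserted unchanged.
    public static string Wrap(string title, string content)
    {
        string safeTitle = WebUtility.HtmlEncode(title);
        return "<!DOCTYPE html>\n" +
               "<html>\n<head>\n" +
               "<meta charset=\"utf-8\">\n" +
               "<title>" + safeTitle + "</title>\n" +
               "</head>\n<body>\n" +
               "<h1>" + safeTitle + "</h1>\n" +
               content + "\n" +
               "</body>\n</html>\n";
    }

    public static void Main()
    {
        Console.Write(Wrap("Tips & tricks", "<p>Hello</p>"));
    }
}
```

In the export script, the idea would be to replace the `output.WriteLine("---")` block and `output.Write(content)` with a single `output.Write(HtmlPage.Wrap(title, content))`.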
Oh, by the way, Blogger's exported HTML contains no line breaks at all in your posts. So I wrote this little method to add some line breaks at appropriate spots:
string AddLineBreaks(string content)
{
    var sb = new StringBuilder(content.Length + 100);
    bool pre = false, fail;
    for (UString rest = content; !rest.IsEmpty;) {
        if (rest.StartsWith("<pre")) pre = true;
        if (rest.StartsWith("</pre")) pre = false;
        bool s;
        if ((s = rest.StartsWith("<br />")) || rest.StartsWith("<br/>")) {
            sb.Append(pre ? "\n" : "<br/>\n");
            rest = rest.Substring(s ? 6 : 5);
            continue;
        }
        if (rest.StartsWith("<li>") || rest.StartsWith("<p>") || rest.StartsWith("<tr>") ||
            rest.StartsWith("<pre>") || rest.StartsWith("<blockquote>") || rest.StartsWith("<img"))
            sb.Append('\n');
        if (rest.StartsWith("</ul>") || rest.StartsWith("</ol>") || rest.StartsWith("</blockquote>"))
            sb.Append('\n');
        char c = (char)rest.PopFront(out fail);
        if (!fail) sb.Append(c);
    }
    return sb.ToString();
}
This relies on UString in my Loyc.Essentials.dll library though (it's a kind of string slice). If you want to use this function, download LoycCore from NuGet.
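If you would rather not add the dependency, the same idea can be sketched with a plain string and an index. This is an approximation of the method above, not a tested drop-in replacement: the `LineBreaker` class and the `At` helper are made-up names, and it copies UTF-16 chars one at a time.

```csharp
using System;
using System.Text;

public static class LineBreaker
{
    // Tags that get a newline inserted in front of them.
    static readonly string[] BreakBefore = {
        "<li>", "<p>", "<tr>", "<pre>", "<blockquote>", "<img",
        "</ul>", "</ol>", "</blockquote>"
    };

    // True if `tag` occurs in `s` exactly at index `i` (ordinal comparison).
    static bool At(string s, int i, string tag)
    {
        return string.CompareOrdinal(s, i, tag, 0, tag.Length) == 0;
    }

    public static string AddLineBreaks(string content)
    {
        var sb = new StringBuilder(content.Length + 100);
        bool pre = false;
        for (int i = 0; i < content.Length; ) {
            if (At(content, i, "<pre")) pre = true;
            if (At(content, i, "</pre")) pre = false;

            // Inside <pre>, a <br/> becomes a bare newline; elsewhere it is
            // normalized to "<br/>" followed by a newline.
            int brLen = At(content, i, "<br />") ? 6 : At(content, i, "<br/>") ? 5 : 0;
            if (brLen > 0) {
                sb.Append(pre ? "\n" : "<br/>\n");
                i += brLen;
                continue;
            }

            foreach (string tag in BreakBefore)
                if (At(content, i, tag)) { sb.Append('\n'); break; }

            sb.Append(content[i++]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(AddLineBreaks("<p>a</p><p>b<br />c</p>"));
    }
}
```

To try it in the export script, the call would become `LineBreaker.AddLineBreaks(content)`.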
The code was good enough for me, and hopefully it will be good enough for you... but I don't know if images work (certainly Blogger doesn't include images in the export file). Note: by default the HTML in blogspot has an "implicit" line breaks feature in which newlines are converted to <br/> for you. I turned off that feature because it often screws up formatting of nontrivial posts; if you left that option on, I'm not sure whether the HTML that Blogger gives you in the XML file includes those auto-inserted <br/> tags.