Loot a Blogspot Blog and Save the Contents as Markdown (for a Jekyll backup)

This script is meant as inspiration, not as a finished tool :)

If you need your content converted in a more sophisticated way, try the Upmark gem, which converts HTML to Markdown. In that case, replace nokogiri’s content and text calls with inner_html, so that the markup is passed on instead of being stripped.

require 'nokogiri' # gem install nokogiri
require 'open-uri'
require 'date'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
html = Nokogiri::HTML(URI.open('http://any_blog-you-can-think-of3.blogspot.com'))
posts = html.css('a').map { |l| l["href"] }
# RegEx for post links: http(s)://blogname.blogspot.com/year/month/name-of-post
# compact drops anchors without an href, uniq avoids fetching a post twice
posts = posts.compact.uniq.select { |e| %r{https?://.+/.+/.+/.+} =~ e }

# iterate over posts
posts.each do |post|
  # nokogiri / content
  content = Nokogiri::HTML(URI.open(post))
  title = content.title.gsub("Blog Title: ", "")
  # the hyphen is escaped so it is a literal character, not the range ")" to ","
  clean_title = title.gsub(/[()\-,.:;?$§"'\/\\]/, '').strip
  file_title = clean_title.gsub(" ", "-")
  text = content.css('.post-body').first.content.strip
  time_published = content.css('.published').first.text

  # date related operations
  date = "#{content.css('.date-header').first.content} - #{time_published}".strip
  content_date_de = date.split(",").last.split("-").first.strip
  # dates and translations :-/
  months_en = %w(January February March April May June July August September October November December)
  months_de = %w(Januar Februar März April Mai Juni Juli August September Oktober November Dezember)
  translations = Hash[months_de.zip(months_en)]
  months_de.each { |m| content_date_de.gsub!(m, translations[m]) }
  file_date = DateTime.parse(content_date_de << " " << time_published)
  pub_date = file_date.strftime("%Y-%m-%d")

  # IO for Markdown Export
  File.open("#{pub_date}-#{file_title}.md", 'w') do |file|
    # separate name so the parsed page in `content` is not overwritten
    markdown = <<~CONTENT
    ---
    layout: post
    title:  "#{title}"
    ---
    __#{date} Uhr__

    #{text}
    CONTENT
    file.write(markdown)
  end
end
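
The date handling is the most fragile part of the script above, so here is a minimal sketch that isolates it in one function. The helper name parse_german_date and the constant GERMAN_TO_ENGLISH_MONTHS are hypothetical and not part of the original script. The sketch also assumes that the blog's date header looks like "Sonntag, 3. März 2013" and that the published time looks like "14:30". Your blog's template may differ, so adjust the format string accordingly.

```ruby
require 'date'

# Hypothetical lookup table (not in the original script); only the month
# names that differ between German and English need an entry.
GERMAN_TO_ENGLISH_MONTHS = {
  'Januar' => 'January', 'Februar' => 'February', 'März' => 'March',
  'Mai' => 'May', 'Juni' => 'June', 'Juli' => 'July',
  'Oktober' => 'October', 'Dezember' => 'December'
}.freeze

# Hypothetical helper: turns a German date header plus a time string into
# a DateTime. Assumed input shapes: "Sonntag, 3. März 2013" and "14:30".
def parse_german_date(date_header, time_published)
  # Drop the weekday: "Sonntag, 3. März 2013" -> "3. März 2013"
  day_month_year = date_header.split(',').last.strip
  # Swap each whole word that is a German month name for its English name
  english = day_month_year.gsub(/\p{L}+/) do |word|
    GERMAN_TO_ENGLISH_MONTHS.fetch(word, word)
  end
  # An explicit format fails loudly if the blog uses a different layout
  DateTime.strptime("#{english} #{time_published.strip}", '%d. %B %Y %H:%M')
end

parsed = parse_german_date('Sonntag, 3. März 2013', '14:30')
puts parsed.strftime('%Y-%m-%d')
```

Compared with DateTime.parse, strptime raises an error on an unexpected layout instead of guessing, which makes a template mismatch easier to spot during a backup run.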
