I find it much easier to learn a new API while solving some concrete problems instead of trying to absorb everything that's possible by reading through the API without ever using it for anything. Those attempts always end with most of the API's features being dropped on the floor and forgotten. I still find it useful to read through documentation to learn the lay of the land, but without a goal to pursue by writing some actual code and solving a real problem, it ends up being a wasted effort.
With that in mind, it looks like we have the makings of a useful learning experience here. We have a potentially new API to learn; we have some specific problems to solve using this API; and we can learn more about the general possibilities of the API in the process so that we could use it again in the future. The rest of this post will keep a running account of how I go about breaking the problems of downloading and word-counting my blog posts into small, manageable pieces and using the Google Blogger API to solve them.
First things first, we should find some documentation on this API that I'm pretty sure exists and figure out how to start using it. Doing a Google search for "google blogger API" brings up just what we need as the first hit: the Blogger API v3.0. After reading through the getting started section, things are looking promising and pretty straightforward. There's even a Ruby Gem for the Google API, along with support for most popular languages. I wanted to do this with a Ruby script, so this is perfect. I can install the Gem with this command:
$ gem install google-api-client
Next, we move on to the Using the API section, where we learn how to access posts on a blog with an API key. I should only be doing read operations on a public blog, even though it's mine, so I shouldn't need authorization with OAuth 2.0, only an API key. These API keys are easy to generate from the Google credentials page of your Google account. After generating an API key, I restricted it for use only by my own IP address.

Now it's time to start playing around with this API. The Ruby Google API client is actually the full suite of Google APIs. That means we have access to Drive, Calendar, Analytics, DfareportingV2_6 (whatever that is), and much MUCH more, but we're going to focus on Blogger. It's nice to know I have access to all of the other Google tools from this same API, though. Well, let's start with trying to download a post. After looking through the documentation a bit, I fired up an interactive Ruby session and started experimenting with this code:
require 'google/apis/blogger_v3'
blogger = Google::Apis::BloggerV3::BloggerService.new
blogger.key = my_api_key
post = blogger.get_post(blog_id, post_id)
The blog_id and post_id can be found in the URL when you go to edit a post in Blogger. I'm not going to tell you mine because they're kind of private, I think. Unfortunately, that returned an error:

Google::Apis::ClientError: accessNotConfigured: Access Not Configured. Blogger API has not been used in project 'my_project_id' before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/blogger.googleapis.com/overview?project=my_project_id then retry.
I followed the handy-dandy link that the error provided, and it led me back to my credentials dashboard, where I enabled the Blogger API. After trying again, I get the content of my actual blog post. Sweet! Next, I want to figure out how to get a list of all of my posts so that I can iterate through them and download them all. Luckily, there's a simple method for that:

posts = blogger.list_posts(blog_id, max_results: 200)
This method actually does more than just return a list of the post IDs. It returns the content of each post as well. The returned posts object is full of all kinds of data that I won't get into here, but you can look it all up in Google's relatively good documentation. I can iterate through all of the posts, doing whatever I need to do, without needing to download each one separately. If I wanted to download just the metadata for each post, I could set the parameter fetch_bodies to false in the method call. I set max_results to 200 because I know I have 188 published posts at this point, so I'll get them all back in one shot.

The next thing I want to do is write all of this content to my local drive, so I need to iterate through each post and write the content to a file. However, I also want to download any pictures that were linked in each post, and those pictures weren't part of the post download. So I'm going to have to create a folder for each post, write the post content to a file, search the content for any images, and download those images to the same folder. Let's figure out how to download an image in Ruby first. The rest of the plan should be pretty easy.
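As an aside, if a blog had more posts than one max_results page can return, the results would come back in pages. I believe the generated client takes a page_token parameter on list_posts and puts a next_page_token field on the response, so collecting everything would look roughly like the loop below. This is just a sketch: Page, fetch_page, and the fake two-page data are all made up for illustration, with the lambda standing in for the real blogger.list_posts call.

```ruby
# Stand-in for the real API response, which should expose
# items and next_page_token the same way.
Page = Struct.new(:items, :next_page_token)

# Keep fetching pages, passing the previous page's token,
# until there are no more pages.
def all_posts(fetch_page)
  posts = []
  token = nil
  loop do
    page = fetch_page.call(token)
    posts.concat(page.items)
    token = page.next_page_token
    break if token.nil?
  end
  posts
end

# Fake two-page response for illustration:
pages = {
  nil   => Page.new(%w[post1 post2], 'tok'),
  'tok' => Page.new(%w[post3], nil)
}
result = all_posts(->(token) { pages[token] })
```

With a real client, the lambda would be something like `->(token) { blogger.list_posts(blog_id, max_results: 50, page_token: token) }`. With only 188 posts, I don't need any of this, but it's good to know the shape of it.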
After a quick Google search, it seems that downloading an image using Ruby is pretty easy, too. We simply scan for the image name, use open-uri to download it, and then stream it to a file:
require 'open-uri'  # lets open() accept URLs, not just file paths

download = open(url)
IO.copy_stream(download, filename)
We just have to fill in the details of the url and filename by scanning through each post's content, looking for the image sources. Here's a quick cut at something that should work:

require 'open-uri'

posts.items.each do |post|
  # Name each post's folder with its publish date and the start of its title.
  dir_name = post.published.strftime('%Y%m%d-') + post.title.slice(0, 50).gsub(' ', '-')
  Dir.mkdir(dir_name)
  File.write(dir_name + '/content.html', post.content)

  # Find the JPEG and PNG sources linked in the post, then download each one.
  images = post.content.scan(/src="(http\S+\.jpe?g)"/).flatten
  images += post.content.scan(/src="(http\S+\.png)"/).flatten
  images.each do |image|
    download = open(image)
    IO.copy_stream(download, dir_name + '/' + File.basename(image))
  end
end
After trying to run it and inspecting the output, it looks like this indeed did the job. That's awesome. Google's API's awesome! Ruby's awesome!! In less than twenty lines of code I essentially backed up my blog in a way that would make it more portable to any other platform. Each post is put in its own directory that's labeled with the date that it was published and the first 50 characters of the title, making for easier searching later on. Granted, there are some things that would still need to be done, like fixing all of the image source URLs to point to the new images if I actually moved the blog, but that would have to be done specifically for wherever they were moved to anyway. The important part is that I have a local copy of the images.

This code is still pretty rough and not at all safe for general use. I wouldn't recommend using it for an automated backup application. Let's quickly look over things that should be improved to harden it a bit. First, the directory name can potentially have a lot of ugly characters in it. It should remove colons, apostrophes, and other such nonsense for a nicer name. Then, the directory is created without checking if it already exists, right in the current working directory. As it stands, this script should only be run in an empty directory so that the new directories are guaranteed to be created without issue. That minor gotcha could be improved.
Next, the regular expression I used for finding image source URLs is fairly crazy. It will match on anything that starts with 'http', ends with '.jpg', '.jpeg', or '.png', and doesn't contain any whitespace characters. A regex like that is sure to cause all kinds of trouble with more arbitrary input, and I should use a more constrained regex for URLs. This was a quick-and-dirty script for a known and trusted input, though, so I figured I could take a shortcut. Finally, along the same lines, I shouldn't use the image URLs directly. I should at least try to parse or encode/decode them to make sure they're valid and safe, but again, I'm pulling them out of my own blog content, so I figured it was alright.
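For the record, here's roughly what those fixes could look like. This is just my sketch, not production code; sanitize_dir_name and image_urls are made-up helper names:

```ruby
require 'uri'

# Strip colons, apostrophes, and other nonsense from directory names,
# collapsing any runs of dashes that result.
def sanitize_dir_name(name)
  name.gsub(/[^0-9A-Za-z\-_]/, '-').squeeze('-')
end

# Pull image sources out of the post HTML, keeping only ones that
# actually parse as http(s) URIs instead of trusting the raw string.
def image_urls(content)
  content.scan(/src="([^"]+\.(?:jpe?g|png))"/i).flatten.select do |url|
    uri = URI.parse(url) rescue nil
    uri.is_a?(URI::HTTP)  # URI::HTTPS is a subclass, so https passes too
  end
end
```

A mkdir call guarded by Dir.exist? (or Dir.mkdir wrapped in a rescue for Errno::EEXIST) would take care of the already-exists gotcha as well.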
Now that I've got all of this content from my blog, it's time to satisfy my curiosity. How much have I written in the past four years? This little script should give me a rough estimate:
posts.items.reduce(0) do |word_count, post|
  word_count + post.content.gsub(/<[^>]+>/, '').split.count
end
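To see the tag stripping in action on its own, here's the same gsub-and-split expression applied to a made-up bit of HTML:

```ruby
# Same strip-tags-then-count expression as above, on a sample string.
sample = '<p>Hello <b>brave</b> new world</p>'
count = sample.gsub(/<[^>]+>/, '').split.count
# The tags vanish, leaving "Hello brave new world", so count is 4.
```

One caveat: because tags are simply deleted, two words separated only by tags with no whitespace between them would get glued together and counted as one. That's fine for a rough estimate.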
The reduce snippet accumulates a simple count of all words separated by whitespace in all of the post content. It also removes all HTML tags before doing each word count so that I don't erroneously count the long strings of stuff that the Google Blog editor puts in tags. It does leave in JavaScript code that I wrote, and I figured that's fine because I did write it, after all, even if it's viewed instead of read. After running the script, I come up with nearly 385,000 words. Not too shabby for four years of writing. I bet if I had kept up the once-per-week writing schedule I had going for the first two years, I would be over half a million words by now. There's always a higher target to shoot for, I guess.

That's probably enough for a Barely Adequate Guide. We explored how to access the Google Blogger API in Ruby code using an API key, download an entire blog's content and images, and do a word count of the content. The Google APIs have tons more features and give you access to the whole suite of Google tools from within your application code. If you get into accessing your private stuff through the API, like Drive or Analytics, you'll need to use OAuth 2.0 authentication. The easiest way to do that in Ruby is with the googleauth library. It's pretty amazing what's available now. Have fun exploring.