using hpricot to auto-populate link information
a nifty way to automatically grab title or description tags from the destination page of a url
So I recently discovered a nifty feature on Facebook where you can enter a link into your status bar and it automatically populates the title and description for the web page that can be found at that link. Smart.
Concurrent to but not as a result of this, at some point, it occurred to me that it might be useful to attach links, raw links, to the articles here on do{block}.
I remembered too, reading over and over again about an HTML parser called Hpricot, developed by Ruby's resident rockstar _why the lucky stiff, which, while I've never used it before, seemed like it would get the job done for me.
And it did. Allow me to elaborate:
hpricot
Either a hip apricot, or a hyper-icot, or for you purists, just a hpricot and I should go jump off a bridge for suggesting otherwise. Whatever the case, it's a pretty useful HTML parser which, astonishingly enough, you can install like this:
$ sudo gem install --remote hpricot
Gee, wacky. This, along with Ruby's built in "open-uri" library, will give us ample tools to grab the html from a remote location, parse out that html into a Hpricot object/array, and extract specific bits of data from whichever tags our hearts desire.
Lovely.
But before we take another bite out of said hpricot, or elaborate upon why you can't find any literature about it at all except for its rdoc and this simple documentation page, let's talk a little bit about _why, the mysterious rubyist who ran away.
why, _why?
_why the lucky stiff, for those of you unfamiliar with him, has been a bit of an icon in the Ruby community for the past several years.
He published and illustrated _why's poignant guide to ruby, an insightful demi-fictional-Ruby-how-to-novella that was touted on the Rails homepage and was how I and many others were first drawn into the language and the Ruby community.
He's responsible for the try ruby! interactive Ruby demo on the Ruby homepage, and myriad open source applications such as camping, RedCloth, Shoes (which I've used extensively and love), and many others that you've heard of and have long been staples in the Ruby community.
Anyway, he's gone. Apparently at some point he kind of closed up shop and left town without a word, having shut down ALL of his web properties in the meantime, and leaving many (including us) in the dark without any of his well-illustrated, post-modern instruction.
Just telling you in case you go looking for any trace of him. All that's left are some mirrors that dutiful ruby-ites have put up of his past work. But that's it, all else has vanished. Weird.
getting on with our lives and this tutorial
Okay, so _why has taken something of an abrupt sabbatical, but he's left hpricot in his wake and that's good enough for me for the time being. The weight of the world gets to all of us at some point.
Now -- suppose that you've got a link model that you want to attach to articles, or even that you want to devise some nifty system in which links attached to or within articles automatically input the title of the destination page as the title for that link tag. Both of these would be very useful, but for either you'd probably want to integrate the action into the Link Model itself. It might look something like this:
    # app/models/link.rb
    class Link < ActiveRecord::Base

      # Build the Link::HTML object on first use, or again when sync is true
      def html(sync=false)
        @html = Link::HTML.new(url) if sync || @html.nil?
        @html
      end

      class Link::HTML
        # require html-parsing library
        require 'rubygems'
        require 'hpricot'
        require 'open-uri'

        attr_reader :doc

        def initialize(link)
          begin
            @result = open(link)
            @doc = Hpricot(@result)
            @result.close
          rescue
            @doc = nil
          end
        end

        def title
          unless @doc.nil?
            # #at returns nil when the page has no title tag
            tag = @doc.at("title")
            @title = tag.inner_html unless tag.nil?
            return @title unless @title.nil?
          end
          return ""
        end

        def desc
          unless @doc.nil?
            @desc = @doc.search("//meta[@name='description']").first
            return @desc["content"] unless @desc.nil?
          end
          return ""
        end
      end
    end
Let's look at the Link::HTML class first. The first thing that we do is to ensure that rubygems, hpricot, and the open-uri library have all been required for our use. With that in place we can set up the Link::HTML#initialize method, which will grab and parse all of the html for the destination page at the link provided:

    def initialize(link)
      begin
        @result = open(link)
        @doc = Hpricot(@result)
        @result.close
      rescue
        @doc = nil
      end
    end

The open() method is really Ruby's own Kernel#open, which the open-uri library extends to accept urls; it hands back a file-like object holding the html from link. We then pass that file through the Hpricot() method (it's a method, despite the capital letter), which parses the contents into a nifty Hpricot document containing all of the page's html tags in order. That document is then stored in the @doc instance variable so we can play with it later.
The method will also rescue this process from any errors that might occur in the event that (link) is not a valid url or if open() fails in grabbing the html from that destination.
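To make that failure path concrete, here's a minimal sketch of just the fetching half, pulled out into a standalone helper. The names fetch_html and DEFAULT_FETCHER, and the fetcher: keyword, are my own inventions for illustration and not part of the Link model above; the swappable fetcher is only there so the rescue can be exercised without touching the network. It also uses URI.open, which is how newer Rubies spell open-uri's open().

```ruby
require 'open-uri'

# Hypothetical default: read the whole response body as a String.
DEFAULT_FETCHER = lambda { |url| URI.open(url) { |io| io.read } }

# Hypothetical helper (not from the Link model above): returns the page body,
# or nil when the url can't be opened for any reason.
# The fetcher: argument is an assumption added so the failure path can be
# tried out offline.
def fetch_html(url, fetcher: DEFAULT_FETCHER)
  fetcher.call(url)
rescue StandardError
  # Malformed urls, DNS failures, HTTP errors and timeouts all end up here.
  nil
end
```

Handing it a fetcher that raises, say `fetch_html(url, fetcher: ->(_u) { raise SocketError })`, gives you nil instead of an exception, which is the same contract that @doc = nil provides in the initializer.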
But before we go into actually fetching the title and description for the page we just fetched, let's first look at the html method in the Link model that we created to access Link::HTML. The html method, or Link#html, is used to ensure that a new Link::HTML object is created only when one doesn't exist yet, or when it's called as html(true). This saves us from fetching the destination html anew every time you want to perform actions on it.
Also, and this is really, REALLY IMPORTANT, this function assumes that you've got a field called url in your Link model, so that you can fetch the destination html by calling:
@html = Link::HTML.new(url)
If you don't have a url field in your Link model, or if you don't have a Link model at all, just replace url with whatever variable contains the url for the destination page you're trying to retrieve.
Whew, I'm glad we cleared that up. With this in place you can now call @link.html from a Link object and get back a Link::HTML object wrapping the parsed html from the destination page. Nifty.
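If the fetch-once-unless-forced behavior is hard to picture, here's a small sketch of it in plain Ruby, no Rails required. CachedPage, its loader block, and the loads counter are all hypothetical stand-ins: the block plays the part of Link::HTML.new(url), and the counter exists only to show how many times the "fetch" actually ran.

```ruby
# A plain-Ruby stand-in for the Link model so the caching can be run anywhere.
# CachedPage and its loader block are hypothetical names for illustration.
class CachedPage
  attr_reader :loads

  def initialize(url, &loader)
    @url = url
    @loader = loader   # stands in for Link::HTML.new(url)
    @loads = 0
  end

  # Builds the page object on first use, or again when sync is true.
  def html(sync = false)
    if sync || @html.nil?
      @loads += 1
      @html = @loader.call(@url)
    end
    @html
  end
end

page = CachedPage.new("http://example.com/") { |url| "parsed #{url}" }
page.html        # first call: the loader runs
page.html        # second call: the cached copy comes back
page.html(true)  # forced refresh: the loader runs again
puts page.loads  # prints 2
```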
describing the indescribable
We've now got this massive array of html bits in the form of a Hpricot object stored in @doc which, while perhaps not immediately useful to us, will pay dividends very soon. Specifically, we'll be able to use methods within the @doc/Hpricot object to return useful bits like the title and the meta description from the page. Here are those methods from the above code:
    def title
      unless @doc.nil?
        # #at returns nil when the page has no title tag
        tag = @doc.at("title")
        @title = tag.inner_html unless tag.nil?
        return @title unless @title.nil?
      end
      return ""
    end

    def desc
      unless @doc.nil?
        @desc = @doc.search("//meta[@name='description']").first
        return @desc["content"] unless @desc.nil?
      end
      return ""
    end
To get the title we use the built-in Hpricot#at method, which returns just the first instance of the tag that we pass in as a parameter, in this case "title". We use #at here because we know the page should have only one title tag, and we don't want it to work any harder than it needs to. Then, once we've got that title tag, we use the inner_html method to just grab the text within the tag, and not the tags themselves. Simple.
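Getting the description, or desc as I've named the method above, is a little more convoluted. As opposed to there being one single <title> tag, a page can carry any number of <meta> tags, so we use #search with an XPath expression to collect only the meta tags whose name attribute is "description", and then take the first of them.
Once we've accessed the proper meta tag and assigned it to @desc, we return @desc["content"] instead of #inner_html, because all we care about is the "content" attribute of the "description" meta tag; <meta> tags aren't closable anyway, so there's no inner html to grab.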
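The one thing worth double-checking in both lookups is what happens when the tag simply isn't there. Here's a hedged sketch of the same two extractions written as free-standing helpers. page_title and page_description are hypothetical names, and they're written against anything that behaves like a Hpricot document (an #at that returns one element or nil, a #search that returns a list), so you can poke at them without the gem installed.

```ruby
# Both helpers accept any object that behaves like a Hpricot document:
# #at returns one element or nil, #search returns a list of elements.
# page_title and page_description are hypothetical names for this sketch.
def page_title(doc)
  return "" if doc.nil?
  tag = doc.at("title")
  # No title tag on the page means #at gave us nil.
  tag ? tag.inner_html.to_s.strip : ""
end

def page_description(doc)
  return "" if doc.nil?
  meta = doc.search("//meta[@name='description']").first
  # The text lives in the content attribute, not between tags.
  meta ? meta["content"].to_s : ""
end
```

With Hpricot loaded you would pass in the result of Hpricot(html) as doc; either way a page with no title or no description yields an empty string rather than an error.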
For reference, you can also use a css selector syntax and even a path constructed of Ruby symbols to access items like these in the Hpricot object. To get more info on this process and/or the other myriad capabilities of Hpricot, I suggest that you read through the Hpricot Rdoc files, or check out the README file, hosted as a mirror of _why's original hpricot project on github.
With the title and desc methods squared away, we can grab the title or meta description for the link's destination page by calling @link.html.title or @link.html.desc respectively. Simple, easy, powerful.
the (marginally) grand finale
Okay, so we've smartly relegated all of this business logic into our Link model so that we can keep our controllers uncluttered, and also to prevent the need to repeat any of this elsewhere in our application (vewy, vewy DRY).
Now -- you're obviously a wise and cunning individual or you wouldn't have otherwise found this article, let alone continued to read through the plethora of tangents and diatribes with which I've cluttered it. As such, I'm sure that you've already dreamed up a hundred different ways to use this, and more power to you.
However! Because I'm an unrelenting narcissist, I'm going to show you what I did with it, to perhaps urge along your own creative juices, or, at the very worst, monopolize and waste your precious time. Here's a snippet of my Links controller:
    # app/controllers/links_controller.rb
    def create
      @article = Article.find params[:article_id]
      @link = @article.links.build(
        params[:link].merge(:user_id => session[:user_id]))

      # Grab title and description from link landing page unless the html is invalid
      unless @link.html.doc.nil?
        # blank? also covers nil, which .size would choke on
        @link.title = @link.html.title if @link.title.blank?
        @link.description = @link.html.desc if @link.description.blank?
      else
        flash.now[:warning] = "the page at url \"#{@link.url}\" could not be loaded"
      end

      if @link.save
        redirect_to @article
      else
        render :action => "new"
      end
    end
As you'll see, it's pretty standard. The Article model builds the actual link (it's polymorphic in my application -- belongs_to :linkable), and assigns the active user's id to it. Then, as a surge of fresh, Atlantic air rushes through the treetops, and an ascension of larks (that's a proper collective noun for larks, also "exaltation") scatters from the trees, our neato Link#html method automatically fetches and populates the fields for the link title and description unless they've already been filled in by the user.
Also, in the event that no url was provided, or that the url couldn't be opened by the open-uri library, we provide a flash.now[:warning] message to alert the user as to what went wrong. Done.
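The "only fill in what the user left empty" rule is the part most worth getting right, so here it is on its own as a rough sketch. LinkFields and fill_blanks are hypothetical names, and the Struct is just a stand-in for the Link record so that this runs outside of Rails.

```ruby
# LinkFields is a hypothetical stand-in for the Link record's attributes.
LinkFields = Struct.new(:title, :description)

# Copies fetched values into fields the user left empty; never overwrites
# something the user typed in.
def fill_blanks(link, fetched_title, fetched_desc)
  link.title = fetched_title if link.title.to_s.strip.empty?
  link.description = fetched_desc if link.description.to_s.strip.empty?
  link
end

link = LinkFields.new("My own title", nil)
fill_blanks(link, "Page title", "Page description")
puts link.title        # prints: My own title
puts link.description  # prints: Page description
```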
in memoriam of this article, and _why
Wow, that was touching.
You might even want to reiterate that functionality in the edit method of your LinksController, but who am I to boss you around? Obviously there's a lot of opportunity here to use Ajax to implement these methods on the fly, or whatever you want to do. We've already established that you're the genius, and I'm just some weird solipsistic Ruby demagogue trying to indoctrinate anybody willing to lend an ear. In the meantime you can save yourself and your application's users the hassle of having to manually populate title, description, or any other fields that Hpricot and open-uri can handle for them.
Indeed, _why is a great man, and his Hpricot a robust and powerful tool, and clearly there's a great deal more to say on both of them, but that'll be for another day.