using hpricot to auto-populate link information

by Mike Zazaian at 2009-09-05 03:57:10 UTC in gems

a nifty way to automatically grab title or description tags from the destination page of a url

no comments 7 links

So I recently discovered a nifty feature on Facebook where you can enter a link into your status bar and it automatically populates the title and description for the web page that can be found at that link. Smart.

Around the same time, though not as a result of this, it occurred to me that it might be useful to attach links, raw links, to the articles here on do{block}.

I also remembered reading over and over again about an HTML parser called Hpricot, developed by Ruby's resident rockstar _why the lucky stiff, which, while I'd never used it before, seemed like it would get the job done for me.

And it did. Allow me to elaborate:

hpricot

Either a hip apricot, or a hyper-icot, or for you purists, just a hpricot and I should go jump off a bridge for suggesting otherwise. Whatever the case, it's a pretty useful HTML parser which, astonishingly enough, you can install like this:

$ sudo gem install --remote hpricot

Gee, wacky. This, along with Ruby's built-in "open-uri" library, will give us ample tools to grab the html from a remote location, parse that html into a Hpricot object, and extract specific bits of data from whichever tags our hearts desire.

Lovely.

But before we take another bite out of said hpricot, or elaborate upon why you can't find any literature about it at all except for its rdoc and this simple documentation page, let's talk a little bit about _why, the mysterious rubyist who ran away.

why, _why?

_why the lucky stiff, for those of you unfamiliar with him, has been a bit of an icon in the Ruby community for the past several years.

He published and illustrated _why's poignant guide to ruby, an insightful demi-fictional-Ruby-how-to-novella that was touted on the Rails homepage and was how I and many others were first drawn into the language and the Ruby community.

He's responsible for the try ruby! interactive Ruby demo on the Ruby homepage, and myriad open source applications such as camping, RedCloth, Shoes (which I've used extensively and love), and many others that you've heard of and have long been staples in the Ruby community.

Anyway, he's gone. Apparently at some point he kind of closed up shop and left town without a word, having shut down ALL of his web properties in the meantime, and leaving many (including us) in the dark without any of his well-illustrated, post-modern instruction.

Just telling you in case you go looking for any trace of him. All that's left are some mirrors that dutiful ruby-ites have put up of his past work. But that's it, all else has vanished. Weird.

getting on with our lives and this tutorial

Okay, so _why has taken something of an abrupt sabbatical, but he's left hpricot in his wake and that's good enough for me for the time being. The weight of the world gets to all of us at some point.

Now -- suppose that you've got a link model that you want to attach to articles, or even that you want to devise some nifty system in which links attached to or within articles automatically input the title of the destination page as the title for that link tag. Both of these would be very useful, but for either you'd probably want to integrate the action into the Link Model itself. It might look something like this:

#!/app/models/link.rb

class Link < ActiveRecord::Base

  # Build the Link::HTML object on first use, or rebuild it when sync is true
  def html(sync=false)
    @html = Link::HTML.new(url) if sync || @html.nil?
    @html
  end
  
  class Link::HTML
    # require html-parsing library
    require 'rubygems'
    require 'hpricot'
    require 'open-uri'
    
    # title and desc are defined explicitly below, so only doc needs a reader
    attr_reader :doc
    def initialize(link)
      begin
        @result = open(link)
        @doc = Hpricot(@result)
        @result.close
      rescue
        @doc = nil
      end        
    end

    # Text of the first <title> tag, or "" if the page or the tag is missing
    def title
      unless @doc.nil?
        @title = @doc.at("title")
        return @title.inner_html unless @title.nil?
      end
      return ""
    end

    def desc
      unless @doc.nil?
        @desc = @doc.search("//meta[@name='description']").first
        return @desc["content"] unless @desc.nil?
      end
      return ""
    end
  end


end

Let's look at the Link::HTML class first. The first thing that we do is ensure that rubygems, hpricot, and the open-uri library have all been required. With that in place we can set up the Link::HTML#initialize method, which will grab and parse all of the html for the destination page at the link provided:

def initialize(link)
  begin
    @result = open(link)
    @doc = Hpricot(@result)
    @result.close
  rescue
    @doc = nil
  end
end

The open-uri library extends Ruby's open() method so that it accepts a url and returns a file-like object holding the destination html for link. We then pass that object to the Hpricot() method, which parses its contents into a nifty Hpricot document object that contains all of the page's html tags in order. This Hpricot object is then stored in the @doc instance variable so we can play with it later.

The method will also rescue this process from any errors that might occur in the event that (link) is not a valid url or if open() fails in grabbing the html from that destination.
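
One caveat is worth a quick sketch here. In the Ruby versions this article was written for, open() will also happily open local files, and it treats a string that starts with a pipe character as a command to run, so a user-supplied link probably deserves a check before it gets fetched. The helper below is only an illustration under my own assumptions: fetchable_url? is a hypothetical name that is not part of Hpricot, open-uri, or the Link model above, and it relies on nothing but Ruby's standard uri library.

```ruby
require 'uri'

# Hypothetical helper (not part of the original Link model):
# returns true only for absolute http/https urls that include a host.
def fetchable_url?(candidate)
  uri = URI.parse(candidate.to_s)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # Anything the parser rejects outright is not fetchable either
  false
end

fetchable_url?("http://example.com/page")  # => true
fetchable_url?("/etc/passwd")              # => false
fetchable_url?("| ls")                     # => false
```

One way to use it, if you like the idea, would be to call it at the top of Link::HTML#initialize and leave @doc as nil whenever it returns false.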

But before we go into actually fetching the title and description for the page we just fetched, let's first look at the html method in the Link model that we created to access Link::HTML. The html method, or Link#html, is used to ensure that a new Link::HTML object is created only when one doesn't exist, or when it's called with the html(true) parameter. This saves us from fetching the destination html anew every time you want to perform actions on it.

Also, and this is really, REALLY IMPORTANT, this function assumes that you've got a field called url in your Link model, so that you can fetch the destination html by calling:

@html = Link::HTML.new(url)

If you don't have a url field in your Link model, or if you don't have a Link model at all, just replace url with whatever variable contains the url for the destination page you're trying to retrieve.

Whew, I'm glad we cleared that up. With this in place you can now call @link.html from a Link object and get back a Link::HTML object wrapping the parsed html from the destination page. Nifty.
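
If you'd like to see the build-once, rebuild-on-request behaviour of Link#html without Rails or a network connection, here's a minimal sketch. Everything in it is a stand-in I've made up for illustration: FakeFetch plays the part of Link::HTML and simply counts how often it gets built, and CachedLink is a plain Ruby object in place of the ActiveRecord model.

```ruby
# Hypothetical stand-in for Link::HTML: records how many times it is built
class FakeFetch
  @built = 0
  class << self
    attr_accessor :built
  end

  attr_reader :url

  def initialize(url)
    @url = url
    self.class.built += 1
  end
end

# Hypothetical stand-in for the Link model (no ActiveRecord required)
class CachedLink
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # Build the fetch object on first use, or again when sync is true
  def html(sync = false)
    @html = FakeFetch.new(url) if sync || @html.nil?
    @html
  end
end

link = CachedLink.new("http://example.com/")
link.html        # first call builds the object
link.html        # second call reuses @html
link.html(true)  # forces a rebuild
puts FakeFetch.built  # prints 2
```

The point of the design is that the expensive part (the remote fetch) only happens when you ask for it, either implicitly on first use or explicitly with html(true).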

describing the indescribable

We've now got this massive array of html bits in the form of a Hpricot object stored in @doc which, while perhaps not immediately useful to us, will pay dividends very soon. Specifically, we'll be able to use methods within the @doc/Hpricot object to return useful bits like the title and the meta description from the page. Here are those methods from the above code:

def title
  unless @doc.nil?
    @title = @doc.at("title")
    return @title.inner_html unless @title.nil?
  end
  return ""
end

def desc
  unless @doc.nil?
    @desc = @doc.search("//meta[@name='description']").first
    return @desc["content"] unless @desc.nil?
  end
  return ""
end

To get the title we use the built-in Hpricot#at method, which returns just the first instance of the tag that we pass in as a parameter, in this case "title". We use #at here because we know the page should have only one title tag, and we don't want it to work any harder than it needs to. Then, once we've got that title tag, we use the inner_html method to just grab the text within the tag, and not the tags themselves. Simple.
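
The two lookups listed above can be tried out without fetching anything. Fair warning: the sketch below does not use Hpricot at all. It swaps in REXML, the XML parser that ships with Ruby, so that it runs without installing a gem, and REXML only accepts well-formed markup, which real-world pages often aren't. The page snippet and the page_title and page_description names are made up for the example. What carries over is the shape of the logic: find the first matching tag, and fall back to an empty string when it isn't there.

```ruby
require 'rexml/document'

# A made-up, well-formed page used only for this illustration
SAMPLE_PAGE = <<~XHTML
  <html>
    <head>
      <title>do{block}</title>
      <meta name="keywords" content="ruby, rails" />
      <meta name="description" content="ruby and rails tutorials" />
    </head>
    <body><p>hello</p></body>
  </html>
XHTML

# Text of the first <title> tag, or "" when there is none
def page_title(doc)
  node = REXML::XPath.first(doc, "//title")
  node ? node.text.to_s : ""
end

# "content" attribute of the meta tag whose name is "description", or ""
def page_description(doc)
  node = REXML::XPath.first(doc, "//meta[@name='description']")
  node ? node.attributes["content"].to_s : ""
end

sample = REXML::Document.new(SAMPLE_PAGE)
puts page_title(sample)        # prints do{block}
puts page_description(sample)  # prints ruby and rails tutorials
```

Note that the keywords meta tag is skipped because the expression filters on the name attribute, which is the same job the expression in desc does.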

Getting the description, or desc as I've named the method above, is a little more convoluted. Rather than there being one single tag, there are usually MANY meta tags, so we're going to use Hpricot#search instead of Hpricot#at. Also, we need the tag whose name attribute is "description". To do this Hpricot uses the strange but reasonably succinct XPath syntax (http://www.w3schools.com/XPath/default.asp). I'd never really used XPath before, but it's easy enough to understand and use.

Once we've accessed the proper meta tag and assigned it to @desc, we return @desc["content"] instead of #inner_html because all we care about is the "content" attribute of the "description" meta tag; <meta /> tags are self-closing and have no inner html anyway.

For reference, you can also use a css selector syntax and even a path constructed of Ruby symbols to access items like these in the Hpricot object. To get more info on this process and/or the other myriad capabilities of Hpricot, I suggest that you read through the Hpricot Rdoc files, or check out the README file (http://github.com/whymirror/hpricot/blob/2c961095954d5aaa5c046f4c773c62c3d5902ef4/README), hosted as a mirror of _why's original hpricot project on github.

Now that we've got the title and desc methods squared away, we can grab the title or meta description for the link's destination page by calling @link.html.title or @link.html.desc respectively.
Simple, easy, powerful.

the (marginally) grand finale

Okay, so we've smartly relegated all of this business logic into our Link model so that we can keep our controllers uncluttered, and also to prevent the need to repeat any of this elsewhere in our application (vewy, vewy DRY).

Now -- you're obviously a wise and cunning individual or you wouldn't have otherwise found this article, let alone continued to read through the plethora of tangents and diatribes with which I've cluttered it. As such, I'm sure that you've already dreamed up a hundred different ways to use this, and more power to you.

However! Because I'm an unrelenting narcissist, I'm going to show you what I did with it, to perhaps urge along your own creative juices, or, at the very worst, monopolize and waste your precious time. Here's a snippet of my Links controller:

#!/app/controllers/links_controller.rb

def create
  @article = Article.find params[:article_id]
  @link = @article.links.build(
    params[:link].merge(:user_id => session[:user_id]))

  # Grab title and description from link landing page unless the html is invalid
  if @link.html.doc
    # blank? also covers nil, so fields the user left empty get filled in
    @link.title = @link.html.title if @link.title.blank?
    @link.description = @link.html.desc if @link.description.blank?
  else
    # the url arrives inside params[:link], so read it back from the link itself
    flash.now[:warning] = "the page at url \"#{@link.url}\" could not be loaded"
  end

  if @link.save
    redirect_to @article
  else
    render :action => "new"
  end
end

As you'll see, it's pretty standard. The Article model builds the actual link (it's polymorphic in my application -- belongs_to :linkable), and assigns the active user's id to it.
Then, as a surge of fresh, Atlantic air rushes through the treetops, and an ascension of larks (that's a proper collective noun for larks, also "exaltation") scatters from the trees, our neato Link#html method automatically fetches the link title and description, and the controller populates those fields unless they've already been filled in by the user.

Also, in the event that no url was provided, or that the url couldn't be opened by the open-uri library, we provide a flash.now[:warning] message to alert the user as to what went wrong. Done.

in memoriam of this article, and _why

Wow, that was touching.

You might even want to reiterate that functionality in the edit method of your LinksController, but who am I to boss you around? Obviously there's a lot of opportunity here to use Ajax to implement these methods on the fly, or whatever you want to do. We've already established that you're the genius, and I'm just some weird solipsistic Ruby demagogue trying to indoctrinate anybody willing to lend an ear. In the meantime you can save yourself and your application's users the hassle of having to manually populate title, description, or any other fields that Hpricot and open-uri can handle for them.

Indeed, _why is a great man, and his Hpricot a robust and powerful tool, and clearly there's a great deal more to say on both of them, but that'll be for another day.

7 links

whymirror's hpricot at master - GitHub: http://github.com/whymirror/hpricot/tree/master
A swift, liberal HTML parser with a fantastic library

try ruby! (in your browser): http://tryruby.sophrinix.com/
an interactive ruby tutorial developed by _why for the Ruby nascents

hpricot.com: http://hpricot.com/
an odd little instructional site for hpricot

why the lucky stiff - Wikipedia, the free encyclopedia: http://en.wikipedia.org/wiki/Why_the_lucky_stiff
a wikipedia entry on _why the lucky stiff

why’s (poignant) guide to ruby: http://mislav.uniqpath.com/poignant-guide/book/
a mirror of _why's original poignant guide site

XPath Tutorial at W3Schools: http://www.w3schools.com/XPath/default.asp
a tutorial on using the XPath syntax implemented by Hpricot

“Why The Lucky Stiff” Is Missing: http://www.rubyinside.com/why-the-lucky-stiff-is-missing-2278.html