Improve PDF Search by Making HTML Clones

By: Sid Steward

A complete discussion of this technique will be published in the February issue of MacTech magazine.


Create a convenient interface between your rich PDF documents and impatient search engine users. An "HTML clone" is a loose conversion of your PDF into a set of HTML pages. Each HTML page links to the full PDF, so users can dive into the original when they're ready. Each HTML page also has navigation links, so users can easily skim this light version of your document. Try our online demo.

Why bother?

  • First, Google prefers indexing HTML pages over PDF pages. In one experiment, Google appeared to index every page in an HTML clone of my PDF, but it indexed the PDF itself only partially, omitting about 20% of the pages. I discuss this in my online article Internet Search Engines: PDF vs. HTML.

  • Second, an HTML clone uses one HTML page per PDF page, so search engine results link users directly to the relevant page in your document.

  • Third, you can add nifty stuff to HTML clone pages, such as navigation links, a search box, even Google Ads.

  • Finally, your online document will look more attractive to search engine users when results link to HTML pages instead of to a PDF.
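
The page layout described above (one HTML page per PDF page, each with navigation links and a link back to the full PDF) can be sketched roughly as follows. This is only an illustration, not the downloadable HTML clone code: the function name build_clone_pages, the pageN.html file naming, and the assumption that the text of each PDF page has already been extracted (for example with pdftohtml) are all hypothetical choices made for this example.

```python
from html import escape


def build_clone_pages(page_texts, pdf_url, title):
    """Return a dict mapping file names to HTML, one page per PDF page.

    page_texts: list of plain-text strings, one per PDF page. How the text
    is extracted is outside this sketch (assumed done beforehand).
    pdf_url: link to the full PDF, placed on every clone page.
    """
    total = len(page_texts)
    pages = {}
    for number, text in enumerate(page_texts, start=1):
        # Navigation links: previous/next where they exist, plus the PDF.
        nav = []
        if number > 1:
            nav.append(f'<a href="page{number - 1}.html">Previous</a>')
        if number < total:
            nav.append(f'<a href="page{number + 1}.html">Next</a>')
        nav.append(f'<a href="{escape(pdf_url)}">Full PDF</a>')

        # Loose conversion: each blank-line-separated chunk becomes a paragraph.
        chunks = [c for c in text.split("\n\n") if c.strip()]
        body = "\n".join(f"<p>{escape(c)}</p>" for c in chunks)

        pages[f"page{number}.html"] = (
            "<html><head>"
            f"<title>{escape(title)} - page {number} of {total}</title>"
            "</head><body>\n"
            f"<p>{' | '.join(nav)}</p>\n"
            f"{body}\n"
            "</body></html>\n"
        )
    return pages
```

Writing each entry of the returned dict to a file of the same name yields a small site that a search engine can index page by page. Extras such as a search box or ads would go into the same page template.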

Enjoy!

Online Demo

Download the HTML Clone Code

Download the pdftohtml Mac OS X Installer

The pdftohtml Home Page


January 10, 2005 ~ Copyright © 2005, Sid Steward