So much information is stored in PDF, yet a little experimentation suggests that PDF gets short shrift from Google. So let's dig a little deeper in this PDF vs. HTML Google shootout.
I've been wondering just how well Google indexes PDF content. So I experimented by performing some Google searches on the PDF pages hosted at http://www.pdfhacks.com/eno/. That site hosts a single, 216-page PDF (BE.pdf). It also hosts this same PDF as a sequence of single PDF pages linked together using HTML frames.
If you search the full PDF using its pdfportal page, you get about as good a search as possible. Searching BE.pdf for "beatles" using pdfportal yields 25 pages, or 40 specific occurrences. That's my baseline.
Now, let's see what Google turns up. I'll search the full PDF, BE.pdf, using these search terms:
site:pdfhacks.com filetype:pdf inurl:BE.pdf beatles
Google has no trouble locating BE.pdf. The problem is that it returns only one hit: BE.pdf. (Actually, it gave me three hits, each the same BE.pdf but at a different URL.) To clarify my search for "beatles" in this 216-page PDF, I click the "View as HTML" link provided by Google. This also highlights my search term. Turns out that Google's HTML cache of this PDF stops at page 62. So, it only catches the first 11 occurrences of "beatles."
I wonder: did Google index any of this full PDF after page 62? A search for "beatles provided the most" (page 50) yields my BE.pdf. A search for "manner of the beatles" (page 160) does not. "showing the beatles that" (page 87) also fails. "john lennon of the beatles" (page 44) works. "breakup of the beatles" (page 169) fails. Every phrase from before page 62 is found and every phrase from after it is missed, so it seems that Google indexed less than 1/3 of BE.pdf (roughly the first 62 of its 216 pages). That's funny.
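Here is a rough sketch of that reasoning in Python. The page numbers and hit/miss results are the ones reported above; the variable names are my own, and the script only tabulates my manual searches (it does not query Google).

```python
# Probe phrases from the experiment: (page in BE.pdf, did Google return BE.pdf?)
probes = [
    (44, True),    # "john lennon of the beatles"
    (50, True),    # "beatles provided the most"
    (87, False),   # "showing the beatles that"
    (160, False),  # "manner of the beatles"
    (169, False),  # "breakup of the beatles"
]

TOTAL_PAGES = 216   # length of BE.pdf
CACHE_CUTOFF = 62   # last page shown in Google's "View as HTML" cache

# Deepest page that was found, and shallowest page that was missed.
last_found = max(page for page, hit in probes if hit)
first_missed = min(page for page, hit in probes if not hit)

# Do all probes agree with "indexed up to the cache cutoff, nothing after"?
consistent = all(hit == (page <= CACHE_CUTOFF) for page, hit in probes)

# Fraction of the document covered if indexing stops at the cache cutoff.
coverage = CACHE_CUTOFF / TOTAL_PAGES

print(f"last page found: {last_found}, first page missed: {first_missed}")
print(f"probes consistent with a cutoff at page {CACHE_CUTOFF}: {consistent}")
print(f"estimated coverage: {coverage:.1%}")
```

Five probes only bracket the cutoff somewhere between pages 50 and 87, so the page-62 figure rests on the HTML cache rather than on the probes alone.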
Now let's try searching for "beatles" within the directory of single PDF pages:
site:pdfhacks.com filetype:pdf inurl:skinned_php "beatles"
That search yields six PDF pages in Google (44, 101, 119, 125, 207, and 216). Two of those pages show "beatles" twice, so that makes eight hits total. Let's take the phrase "different shades of meaning" from page 119 and perform a broader search:
site:pdfhacks.com filetype:pdf "different shades of meaning"
Consistent with our earlier experiments, this catches our single page 119, but it does not catch the full PDF, BE.pdf.
So it didn't matter whether our PDF document was published in full or in pieces. In both cases, Google indexes only part of the document.
For my next set of experiments, I took BE.pdf and created two parallel editions: the burst PDF edition, where every document page is a single PDF file, and the burst HTML edition, where every document page is a single HTML file. I linked to every page of these two editions from http://accesspdf.com/pdf_html_test, which shuffles the two editions together. After I left this material online for a while, Google finally indexed it.
First, let's search the PDF pages for "beatles", as we did before. Running the Google query:
site:accesspdf.com filetype:pdf inurl:pdf_pages "beatles"
yields 20 PDF pages, which hold a total of 32 occurrences of "beatles." This falls short of our baseline of 40 occurrences, established above. One of the pages overlooked by Google is page 44 ("John Lennon of the Beatles"). A direct search for page 44:
site:accesspdf.com filetype:pdf inurl:pdf_pages inurl:pg_044
indicates that Google did not index it.
Now let's run the same search on the HTML pages:
site:accesspdf.com filetype:html inurl:html_pages "beatles"
This turns up 25 pages with a total of 38 occurrences of "beatles." But our baseline is 25 pages with 40 occurrences! The difference is in how the two searches match words: our 40 baseline hits include "beatlesesque" (page 119) and "beatless" (page 192). Google does not lump these in with "beatles," as our baseline search did.
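To illustrate the gap, here is a small Python sketch contrasting two ways of counting a term. It assumes the baseline search behaves like a plain substring match and that Google behaves like a whole-word match; I have not confirmed how either one works internally. The sample sentence and function names are made up for the example.

```python
import re


def count_substring(text: str, term: str) -> int:
    """Count case-insensitive occurrences of term anywhere, even inside longer words."""
    return text.lower().count(term.lower())


def count_whole_word(text: str, term: str) -> int:
    """Count case-insensitive occurrences of term only where it stands as a whole word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical sample text, not taken from BE.pdf.
sample = (
    "The Beatles recorded it. A beatlesesque chorus follows, "
    "then a beatless interlude."
)

substring_hits = count_substring(sample, "beatles")
whole_word_hits = count_whole_word(sample, "beatles")

print("substring match: ", substring_hits)
print("whole-word match:", whole_word_hits)
```

The substring count picks up "beatlesesque" and "beatless" as well as "Beatles," while the whole-word count does not, which is the same kind of two-hit difference seen between the baseline (40) and Google's HTML results (38).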
Given two nearly identical documents, one in PDF and one in HTML, Google seems to have completely indexed the HTML document while indexing only about 80% of the PDF. How about that?
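For anyone who wants to check the arithmetic, here is a back-of-the-envelope sketch in Python. All counts come from the searches above; the variable names are mine, and the "adjusted baseline" assumes that "beatlesesque" and "beatless" account for exactly one baseline hit each.

```python
# Counts reported in the experiments above.
baseline_pages = 25
baseline_hits = 40          # includes "beatlesesque" and "beatless"
non_word_hits = 2           # assumption: one hit each for those two words

pdf_pages, pdf_hits = 20, 32
html_pages, html_hits = 25, 38

# Baseline restricted to the exact word "beatles".
adjusted_baseline = baseline_hits - non_word_hits

pdf_page_coverage = pdf_pages / baseline_pages
pdf_hit_coverage = pdf_hits / baseline_hits
pdf_vs_adjusted = pdf_hits / adjusted_baseline
html_vs_adjusted = html_hits / adjusted_baseline

print(f"PDF pages found:            {pdf_page_coverage:.0%}")
print(f"PDF hits vs. baseline:      {pdf_hit_coverage:.0%}")
print(f"PDF hits vs. exact-word:    {pdf_vs_adjusted:.1%}")
print(f"HTML hits vs. exact-word:   {html_vs_adjusted:.0%}")
```

Measured either by pages or by raw hits, the PDF edition comes out at 80%; against the exact-word baseline it is a little higher, while the HTML edition is complete.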