<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Webdogs 2.0 &#187; pdf</title>
	<atom:link href="http://www.webdogs.org/tag/pdf/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.webdogs.org</link>
	<description>Webdogs 2.0 ~ data, design and derring-do since, uh, whenever</description>
	<lastBuildDate>Mon, 26 Jul 2010 21:04:16 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Converting hard-copy documents for addition to the shared repository</title>
		<link>http://www.webdogs.org/2008/11/12/converting-hard-copy-documents-for-addition-to-the-shared-repository/</link>
		<comments>http://www.webdogs.org/2008/11/12/converting-hard-copy-documents-for-addition-to-the-shared-repository/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 21:19:56 +0000</pubDate>
		<dc:creator>Brian Lawlor</dc:creator>
				<category><![CDATA[planning]]></category>
		<category><![CDATA[findability]]></category>
		<category><![CDATA[gsa]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[tfp]]></category>

		<guid isPermaLink="false">http://www.webdogs.org/findability/?p=296</guid>
		<description><![CDATA[A late October post at the Official Google Blog entitled A picture of a thousand words? prompts me to draw attention to an analogous TFP document protocol we worked out a few months ago. It is worth highlighting because it is so practical and will be an invaluable source of additional knowledge content targeted by our [...]]]></description>
			<content:encoded><![CDATA[<p>A late October post at the Official Google Blog entitled <a href="http://googleblog.blogspot.com/2008/10/picture-of-thousand-words.html">A picture of a thousand words?</a> prompts me to draw attention to an analogous TFP document protocol we worked out a few months ago. It is worth highlighting because it is so practical and will be an invaluable source of additional knowledge content targeted by our GSA.</p>
<p>But first, the Google post: <a href="http://googleblog.blogspot.com/2008/10/picture-of-thousand-words.html">Read it</a> and you&#8217;ll discover that &#8220;In the past, scanned documents were rarely included in search results as we couldn&#8217;t be sure of their content. We had occasional clues from references to the document, so you might get a search result with a title but no snippet highlighting your query. Today, that changes. <em>We are now able to perform OCR on any scanned documents that we find stored in Adobe&#8217;s PDF format</em>.&#8221; (As lawyers are so fond of saying, emphasis added.) As the post illustrates by example, do a Google search for <a href="http://www.google.com/search?q=repairing+aluminum+wiring">repairing aluminum wiring</a> and at the top you&#8217;ll see a PDF listed. If you download the PDF and open it, you&#8217;ll discover it is an image of a text document. The downloaded file is itself not text searchable. But click <a href="http://74.125.95.104/search?q=cache:UKOp9FetcNIJ:www.cpsc.gov/CPSCPUB/PUBS/516.pdf+repairing+aluminum+wiring&#038;hl=en&#038;ct=clnk&#038;cd=1&#038;gl=us">View as HTML</a> for that same result and you&#8217;ll discover that the text is actually indexed and searchable via Google.</p>
<p>Essentially, we are doing the same thing within our own enterprise search ecosystem, but with an added advantage. Not only have we adopted a <a href="http://www.webdogs.org/2008/10/23/going-forward-document-best-practices-and-protocols/">document handling protocol</a> for using our networked printers/scanners to convert select hard-copy text documents to PDF image files, we also process the resulting PDF images through Adobe Acrobat&#8217;s native &#8220;OCR Text Recognition&#8221; tool, and then save them with some basic metadata added.</p>
<p>Once added to the shared document repository, the scanned and OCR&#8217;d text document is then fully indexed and searchable by the GSA. And when the user finds and downloads the file, it is fully text searchable itself when opened in Adobe Acrobat or Adobe Reader. That is one better than what Google itself now does, superbly.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webdogs.org/2008/11/12/converting-hard-copy-documents-for-addition-to-the-shared-repository/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
