In an earlier post about selecting Google Search Appliance (GSA) targets for this project, the narrative definitely edged toward the more abstract. We highlighted four principal sets of GSA targets: files on our newly created “shared document repositories”; repurposed intranet content being moved over from a MediaWiki installation on an old server; cherry-picked content available on LSNC’s varied public websites (LSNC maintains 13 distinct public websites and subsites); and select records in Pika CMS, the secured, web-based case management system used by all our advocates.
As far as it goes, this abstract list of GSA target sets fairly summarizes what we, as an organization, want to make transparent via enterprise search, which is to say make “findable” in ways not practicable without the GSA. The list, however, fails to convey what we have done at a practical level to make those targets useful to our larger search goals.
So, let me hit a few notes about several practical decisions we’ve made at launch as we target the GSA at real files offering real search results.
As described in Part One, when we first unpacked our GSA and aimed it, uh, somewhat aimlessly at any and every file on one of our local servers, the GSA did its job in killer fashion… and blew out our file limit. While one can proceed that way, we were always mindful that we had to sort out how to organize and structure the shared content that we seriously wanted to make searchable and findable. So, one of the first tasks we confronted on this project was to work out our thinking about “taxonomy,” resulting in the basic directory structures we have adopted.
That taxonomic “organization” step was essential to this project, but completing that particular project objective doesn’t translate directly into searchable content organized in a particular way. You see, there is this pesky little detail: Real people need to actually identify the existing and/or newly created files to be included and then somehow get the files in the directories on the shared document repository that are the target of the GSA.
Easier said than done.
In our case, particularly given the limited IT and support staff resources available to us as a typical legal services field program, we had to come up with some practical approaches to move existing files from any number of different locations to the designated shared locations or document repositories. (I will discuss how we handle adding newly created files in a later post.) Here’s what we did with our existing files to fold them into the content targeted by our GSA:
1. Initially, include all existing “staff-specific” content, with an opt-out
We did find ourselves on the receiving end of a lot of staff enthusiasm about this project. Truly. But it is impractical and unrealistic to expect your individual legal services advocates and other staff to comb through all their thousands of files and then move them over to a different file server location. (Maybe it should be realistic to expect them to do this, but in our experience it just ain’t gonna happen. No way.) But there are tons of content gold in them thar files, so we had to figure out a way to initially get all that good stuff in place, even if not parsed out in a taxonomic sense, so we could target it.
To accomplish this, we first vetted with, and got buy-in from, all our local offices to do the following:
On the local project file server for each local office, we created a special project “archive” directory. Then each local office manager copied each individual staff member’s files wholesale over to a user-specific directory in this so-called archive directory. Having an unequivocal “opt-out” option was important to the success of this approach. Again and again, in formal meetings and informal discussions, we reminded office staff that they could ask that any or all files be removed from these initial archive-file targets. No questions asked. There were a few such requests, but not a lot: One advocate asked that her files not be targeted at all, so we removed all her files; two others had fewer than a dozen files they wanted removed as targets, so we did so. No biggie.
The net effect is that this makes the targeted advocate files initially non-taxonomic, but in short order you have a huge repository that has a (allow me to exaggerate here, for literary effect) 99% chance of including pretty much everything the individual staff members would add if they “woulda, coulda, shoulda,” so to speak. In our case, this initially amounts to about 300,000 document files, the vast majority of which are advocate-generated files.
At launch, this does mean that these office-specific, bulk compilations of existing files added as targets include a significant number of drafts and duplicates that one would normally not include as a shared file if it were being added as a newly created file. For example, within our office culture it is not only common but actually expected that advocates not work in isolation on major cases. (We discourage the “lone eagle” model.) So, our early search-results testing shows that often the same file shows up in more than one target location because more than one advocate has a copy of the file in their archive.
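To make the duplicate problem concrete, here is a minimal sketch of how byte-identical copies across the per-user archive directories could be spotted. This is not the tooling we used on the project; the function names and the idea of hashing file contents are assumptions for illustration only. It catches exact copies, not near-duplicate drafts.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(archive_root: Path) -> dict[str, list[Path]]:
    """Group files under archive_root by content hash.

    archive_root is a hypothetical "archive" directory holding one
    subdirectory per staff member. Only groups with two or more
    byte-identical files are returned.
    """
    groups = defaultdict(list)
    for path in sorted(archive_root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each returned group lists every location where the same file content lives, which is roughly what shows up as repeated hits in search results.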
Two other factors we kept in mind as part of this initial targeting of shared files bear mentioning: We double- and triple-checked with all management staff to ensure nothing management-sensitive or -confidential was moved to a location where it could be targeted. Also, before we moved anything over wholesale, as described, we asked all staff to remove certain types of files that no one would reasonably expect to be part of the searchable content. Examples: Family photos, MP3 music downloads, YouTube videos, yada yada yada. Enough said.
We do have an approach in mind for “peeling off” these office-specific archives over time, to separate out the drafts and duplicates and place them within our taxonomic directory structures. More about that later.
2. Using Google Sites as the platform for our existing intranet content
I recall having a passing conversation with Gabrielle Hammond at last year’s TIG conference about how we were holding off on further intranet development while waiting to see how Google implements its JotSpot-based wiki application, now known as Google Sites.
Well, people, we now know what Google Sites is all about and we love it! For the last several years we had been using MediaWiki as the publication platform for our intranet, but we are in the process of replacing it for our internal wiki needs. We are about halfway through that process, which should be completed shortly after the first of the year.
One big bonus of moving our intranet content to Google Sites is that it is quasi-tailor made to work with both the GSA and Google Analytics. I say “quasi” because the interactions between them are good but hardly optimal at the moment. For example, only days ago Google Analytics for Google Apps was rolled out, but the quality of the data we are getting so far is not so easy to get a handle on. More importantly, Google promises GSA integration with Google Sites, but it is still a buggy implementation. We have easily targeted test site pages within our domain’s Google Sites, but have hit a wall with getting the GSA to properly return search results on the indexed content within files uploaded to Google Sites. Turns out we are one of several organizations that have identified this problem, and Google Enterprise support assures us it will have a fix with its next software upgrade, in a month or two. We (and our GSA consultant) are confident this will work in due time, but it’s one of those details we have to wait on for now.
3. Updating our public web content
Over the last 10 years, LSNC has placed an enormous amount of its advocate content out on the public Web. One recent example is the California Food Stamp Guide: public content that our advocates can already search at that site, but would want to be able to search directly via our GSA shared portal. It is also an example of a content cluster that can be part of a larger GSA collection or a collection of its own. (“Collections are logical views of information in the index, as defined by URL patterns. This allows you, for example, to index the entire contents of your intranet, but then divide it up into logical groups of content.”)
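For readers who think better in code, here is a rough sketch of the idea behind collections as “logical views” defined by URL patterns. It is only an illustration: the collection names and example.org URLs are made up, and simple prefix matching stands in for the GSA’s own pattern syntax, which is configured in the appliance and not reproduced here.

```python
# Hypothetical collections: each name maps to a list of URL prefixes.
# Neither the names nor the URLs come from our actual GSA configuration.
EXAMPLE_COLLECTIONS = {
    "public-guides": ["https://www.example.org/foodstamps/"],
    "intranet": ["https://sites.example.org/"],
}


def collections_for(url: str, collections: dict[str, list[str]]) -> list[str]:
    """Return the names of every collection whose prefixes match the URL.

    A document is indexed once; a collection is just a filtered view,
    so one URL may belong to several collections or to none.
    """
    return [
        name
        for name, prefixes in collections.items()
        if any(url.startswith(prefix) for prefix in prefixes)
    ]
```

The point of the sketch is that nothing gets copied or re-indexed to form a collection; membership falls out of where the content already lives.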
Implementation of The Findability Project has prompted some public housekeeping. Our target testing of our public content, predictably, reveals that we have stuff out there that is, well… past its shelf-life, shall we say. So we are working on a systematic way to thoroughly review and clean up that public content. It is obvious but important: Current and correct public content means better search results via the GSA. (Apologies to the larger legal services community for not doing it sooner.)
4. Targeting our case management system
We consider our Pika case management system a key, long-term GSA target. But we are not there yet. We have prioritized getting all the other targeted content organized and in position, with clear protocols in place. We are also busy reworking our shared portal, which will integrate the GSA search functions and give users (hopefully) intuitive ways to filter their search results and search select content collections, along with some nice GSA touches like OneBox searches, among other features.
That all said, being able to target our case management system is a total no-brainer and perhaps the most practical of necessities. In a given day, there is likely nothing more common or more vital to our work for clients than the search for information within our case management system. The native search functions built into the current version 3.07 of Pika are good. But we are optimistic that we can exploit the GSA to make those searches even better. And certainly more integrated with everything else in our new enterprise search universe.