External search functionality

I realise this isn’t a Percussion question exactly, but I was wondering how users were implementing their website search functionality. And, in particular, if there was anyone who had integrated a front-facing search which effectively queried the Percussion database in a meaningful way.
We currently have a Google Search Appliance which obviously indexes the published HTML from Percussion. What our business would prefer is some way of searching on all the meta-data associated with our content directly. So, as a simple example, finding all the news items published in the last week by a certain department.


Our most recent use case is to implement multi-faceted search against CMS content. We will be publishing the content out to several database tables that preserve the data model of the content item and its multiple facets (probably better to call them attributes). A Lucene collection will then be built from the data in those tables, and a Ruby app will act as the front-end UI to the Lucene collection. Sure, the Ruby app could just query the database tables directly instead of using Lucene as the middle-man, but there’s a general desire to keep our front-end search applications consistent, in that they all query against Lucene. Most of our Lucene collections are built by directly crawling the HTML pages of the site, but to implement a multi-faceted search you need the Lucene collection’s data model to closely mirror the CMS data model; the HTML format is too flat to ensure facet integrity.
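To make the "too flat" point a bit more concrete, here is a rough sketch of what faceted filtering and counting looks like once each item keeps its attributes as separate fields. This is plain Python standing in for the idea, not Lucene and not our actual implementation; the row layout, the facet names (type, department, year) and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what database publishing might produce.
ROWS = [
    {"id": 1, "type": "news", "department": "Physics", "year": 2010},
    {"id": 2, "type": "news", "department": "Chemistry", "year": 2010},
    {"id": 3, "type": "event", "department": "Physics", "year": 2009},
    {"id": 4, "type": "news", "department": "Physics", "year": 2009},
]
FACETS = ("type", "department", "year")  # assumed attribute names


def build_index(rows):
    """Map (facet, value) to the set of item ids carrying that value."""
    index = defaultdict(set)
    for row in rows:
        for facet in FACETS:
            index[(facet, row[facet])].add(row["id"])
    return index


def search(index, all_ids, **filters):
    """Return ids matching every filter, plus counts for the other facets."""
    hits = set(all_ids)
    for facet, value in filters.items():
        hits &= index.get((facet, value), set())
    counts = defaultdict(dict)
    for (facet, value), ids in index.items():
        if facet in filters:
            continue  # no need to count the facet we already filtered on
        matching = len(ids & hits)
        if matching:
            counts[facet][value] = matching
    return hits, dict(counts)


INDEX = build_index(ROWS)
ALL_IDS = {row["id"] for row in ROWS}
hits, counts = search(INDEX, ALL_IDS, department="Physics")
```

Filtering on one attribute while counting the others only works because every attribute is a distinct field per item, which is exactly what gets lost when the index is built by crawling rendered HTML.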

Another benefit of this approach: you don’t have to monkey with the Rhythmyx repository in any way. You definitely don’t want to get into the game of trying to filter the content you’re searching by public vs. archive vs. preview, etc. Let the edition content lists do that heavy lifting for you. You’d just need to set up a JDBC resource (database) to publish to, and implement database publishing templates. Downside: I don’t know whether the Google Search Appliance can populate collections by querying a database, or whether you could fake it.
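For what it’s worth, once the metadata has been published to its own table, the original example (news items from the last week for a given department) becomes a very ordinary query. The sketch below uses an in-memory SQLite database purely so it runs on its own; the table name, column names and sample rows are assumptions for illustration, not anything Rhythmyx generates for you.

```python
import sqlite3
from datetime import date, timedelta


def make_db():
    """Create an in-memory stand-in for the published metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE news_item (
               content_id   INTEGER PRIMARY KEY,
               title        TEXT NOT NULL,
               department   TEXT NOT NULL,
               publish_date TEXT NOT NULL  -- ISO 8601 date, e.g. 2010-06-14
           )"""
    )
    # Hypothetical sample rows.
    conn.executemany(
        "INSERT INTO news_item VALUES (?, ?, ?, ?)",
        [
            (1, "New lab opens", "Physics", "2010-06-14"),
            (2, "Old announcement", "Physics", "2010-06-01"),
            (3, "Grant awarded", "Chemistry", "2010-06-13"),
        ],
    )
    return conn


def recent_news(conn, department, today, days=7):
    """Return (content_id, title) for items published within the last `days`."""
    cutoff = (today - timedelta(days=days)).isoformat()
    # ISO dates sort correctly as text, so a string comparison is enough here.
    return conn.execute(
        "SELECT content_id, title FROM news_item "
        "WHERE department = ? AND publish_date >= ? "
        "ORDER BY publish_date DESC",
        (department, cutoff),
    ).fetchall()


conn = make_db()
rows = recent_news(conn, "Physics", today=date(2010, 6, 15))
```

Because only published content ever lands in the table, the query needs no public/archive/preview filtering of its own.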

Given that Lucene is the built-in search engine behind Rhythmyx 6.7, I’ve often pondered whether those existing search indices could be leveraged. Since it’s a search engine that is tightly coupled to the application, and those search indices are configured by the software, I’m generally leery of trying to use them for any kind of front-facing search.

Your approach is interesting and touches a lot of the issues we’d started to discuss.
We’d thought about publishing out the meta-data and then writing a separate application to query this meta-data in a similar way to what you’ve suggested. The implications of running queries against the database itself with all the filtering and relationships to deal with would, as you say, be a significant challenge.
The idea of adding Lucene into the mix was something we hadn’t considered, but it’s definitely worth looking into.
It’s good to know that your solution proved itself in a concrete implementation.

Has this ever been implemented?