Cuil is cool
By now you’ll have heard of Cuil, the new search engine designed by former Googlers, announced in a massive press splash on Monday (July 28th, 2008).
I’m certain it will unseat Google as the leading search engine, and quickly too. There are a few reasons I think this:
A new approach
- These people split no infinitives. What other major site can claim that these days? None I’ve seen except Cuil :)
- We now have a tool that enables us to interact with more of the Web than heretofore possible. On their site, Cuil claims their index includes “three times as many [pages] as Google.” Just think: ever since Google became the dominant ‘choice’ in search, what we think of as the Web has actually been just a small set of the total number of pages out there, filtered through the view of a single corporation (it sounds like AOL all over again ;)
Not only does Google return a smaller set of Web pages than the total number possible, but their algorithms also favor the same set of pages from within their already smaller set, all of which rank highly (in the top 10) time and again even on varied search queries. This is a natural result of applying their proprietary PageRank algorithm to determine each page’s importance (i.e. position in the rankings) based on that page’s popularity, which Google understands to mean the number of links from other pages that are themselves popular (i.e. those linked to by other popular pages). Google’s current approach has worked well in the past, especially some years back when there were fewer established sites and popularity networks (consider blogrolls). But by relying on PageRank today, Google ends up heavily favoring larger businesses — Fortune 500 companies, major news carriers and the same popular blog sites — in its top results (these being the only sites capable of holding the most ‘important’ links from the other most ‘important’ sites on the web).
With Cuil, on the other hand, we see a far wider variety of search results. In fact, any Web page now has a chance to be seen in the top search results (a fact that has caused rather a shake-up in the blogosphere, as well as in the SEO community, these last couple of days). Cuil ranks pages based entirely on the relevancy of their actual content. Without popularity as a requirement, any site with fresh and tasty content can begin attracting visits as soon as it sets out on the web. This is an incredibly exciting (not to mention long overdue) step forward in Web search, especially coinciding as it does so happily with Cuil’s amassing the largest ever searchable pool of Web pages. As a side note, it’s interesting that on its first day at least, Cuil has been serving up different results for the same queries on purpose (according to their VP of Communications Vince Sollitto, quoted here), presumably another way Cuil is “providing different and more insightful answers that illustrate the vastness and the variety of the Web” (in CEO and co-founder Tom Costello’s words, quoted in Monday’s press release).
- Context is vital to disambiguating certain linguistic expressions (Wikipedia already addresses this with its ‘disambiguation pages’, which link to the various possible meanings for any given query, e.g. a search for SEO yields this intermediary page). When Web pages rely on keywords that (completely by chance) have other, more popular meanings, these pages have much less chance of ever being seen on Google (or other search engines). Cuil gives us back the ability to use expressions with multiple meanings in our Web searches, by separating out all the different contexts for a given search query into tabs… click on a tab and it lists the top 10 results just for that given context. Although it appears this feature is still only partially implemented (not all terms are disambiguated yet, presumably because it requires an AI-type approach of feeding all known ambiguities into the system), it’s certain Cuil will be refining this feature over time. If you don’t know what you’re looking for, Cuil also offers helpful suggestions in the upper right… for those times when you don’t know enough about your search to know what words to use in searching. This is another feature that will undoubtedly be improved as the folks at Cuil learn from recent usage of their system.
- Cuil’s search interface itself is a vast improvement over other search engines. For instance, their results page renders in a pleasing multi-column format (something we’ve been moving to with our own sites when possible). As soon as I returned to Google I realized how inadequate those few lines of bare ‘description’ that Google includes with each link really are. Most often these are just fragments of longer sentences whose context is entirely missing (sometimes these fragments even come from fleeting ad text on the page, skewing the search results further). Cuil instead displays a whole section of the page’s text, along with a representative image taken from the page, both of which combine to give a much more immediate impression of the relevance of a given result (actually there has been some trouble associating the image results properly, with many sites including our own showing images from other sites, but this is a well-reported bug sure to be addressed soon).
I would imagine that what we see right now is just the beginning… after all, one of the first things you see when you explore Cuil’s site is the line “The Internet has grown. We think it’s time search did too.” One thing I would be interested in seeing is a modification to their search results page enabling new columns to appear (or disappear) on-the-fly to accommodate as many results as fit comfortably on the screen without scrolling (rather than limiting search results to the traditional 10-at-a-time for any given context). Or is 10… ahem, 11 in Cuil now that I count ;)… the perfect number, with a larger number (even on a widescreen display) causing usability issues, and slowing down the querier?
- Whatever you think of their blend of “content-based relevance methods” and “results organized by ideas” (I happen to think very highly of them), Cuil is also the only search engine to be able to claim “complete user privacy” (quoting Cuil’s press release). Privacy has been a hot topic for search engines, something first apparent when Google, Yahoo, MSN, and AOL were asked by the U.S. Department of Justice to hand over “a random sampling of 1 million search queries” submitted over a one-week period. Seven months later, AOL chose to publish 20 million search queries from its database “for research purposes,” unwittingly exposing the personally identifiable information of 657,000 AOL users (including clues to home addresses, search habits, and other personal details). In light of the current climate of logging more information on everyone, Cuil’s Privacy Policy sets a new and refreshing standard in its industry. Intelligible, reasonable and sensible… not words you often associate with legal pages on the Web (or anywhere else, for that matter). Cuil is the first search engine not to keep any logs on its users or their search histories… none whatsoever. This important comfort (some might even say ‘right’) comes at a perfect time.
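For anyone curious about the popularity-based ranking discussed in the second reason above, here is a minimal sketch of the textbook, publicly described form of PageRank. It is certainly not Google’s production system: the three-page ‘web’, the page names, the damping factor of 0.85 and the fixed number of iterations are all assumptions made purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified power-iteration PageRank.

    links maps each page to the list of pages it links to.
    All names and parameters here are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web: two big sites link to each other; a small shop links
# to both of them but receives no links in return.
web = {
    "bignews": ["megablog"],
    "megablog": ["bignews"],
    "smallshop": ["bignews", "megablog"],
}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy example the two sites that link to each other end up sharing nearly all of the rank, while the newcomer with no inbound links stays at the baseline however good its content is, which is the effect described above.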
Google’s reaction
It’s worth taking note of Google’s reaction to its new competitor, which has been a little less nonchalant than usual. On Google’s Official Blog, the following claim just happened to be posted by Software Engineers Jesse Alpert & Nissan Hajaj on Friday (July 25th, 2008): Google had recently “hit a milestone: 1 trillion… unique URLs on the web at once” (We knew the web was big…)
This aptly timed announcement has generated a great deal of confusion over whether or not we can trust Cuil’s claim that its index of 120 billion web pages is in fact three times larger than that of any other search engine (although we can imagine that former Googlers of such stature probably have a pretty good clue about the actual size of Google’s index).
But if Google’s index is really so much larger, why doesn’t Google just tell everyone (or at least the press)? Even in this blog post, the engineers don’t actually claim Google has the biggest index, although they come close (in the fourth paragraph) by saying that “we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.” Well, most comprehensive doesn’t necessarily mean largest, and goals aren’t always reached. It’s worth investigation…
A closer look at Cuil’s FAQs reveals they’ve actually “crawled 186 billion pages and have included 120 billion in our index;” the missing 66 billion are accounted for as “a number of duplicate pages that we didn’t include in our index,” as well as a small amount of spam. In Google’s post they claim to have found 1 trillion URLs, even after subtracting an undisclosed number of “URLs with exactly the same content or URLs that are auto-generated copies of each other”. So it would seem we’re comparing a mere 120 billion unique indexed pages to Google’s far larger 1 trillion.
Or are we? Google hasn’t actually said they’ve indexed 1 trillion pages… they’ve only admitted in their blog post to having “hit,” “found,” or “seen” this number of unique links. On the face of it, a seemingly trivial omission… after all, why wouldn’t they index all the links they found, so that everyone can reach these pages? And herein lies the key to the confusion, admitted but not emphasized in the same Google post (again in the fourth paragraph): “We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content…”
Not identical, but too similar? Leaving aside the concern that we’d rather not have one company setting itself up as a judge of what people can find on the web (particularly a concern with Google’s ever-increasing market share compared to its next nearest competitors¹), does Google have any reasons not to index all the Web pages to which it finds links? I rather think it might, because many things in this world come down to cost, and for Google indexing all these pages is costly on two counts:
- From the beginning (at first in order to save money, and then perhaps as a matter of habit) Google has been guided by the principle that larger numbers of cheaper servers can do the work of smaller numbers of more expensive servers. This early decision has shaped the system architecture that Google has designed from scratch since its inception (from file system to database engine to distributed programming technology).² With a now massive parallel search network (with countless machines involved in the indexing of any one page, each processing highly atomized instructions), it is likely Google has more incentive than you’d think to dedicate processing power to indexing sites that bring in its revenue, notably popular news and blog sites that change frequently (and happen to be reindexed every 15 minutes or so by Google).
- On the subject of revenue, another point to consider is that 99% of Google’s revenue comes from advertising. These days, many people feel they have to buy into Google’s ad platform just to be seen on the Web… especially those with small businesses (and others with content they wish to share) who otherwise haven’t a chance (because they’re not yet popular or well-known) of appearing in Google’s “organic listings” (these are the main 10 links in each search results page). Businesses with large marketing budgets pay for ads on Google simply to cover their bases (and often show up twice, once in Google’s top 10, then again in the ad section). But would small businesses and individuals (who combined likely represent a fair chunk of Google’s revenue) pay for all this AdWords nonsense, if their pages were already visible to their intended audience? If searchers found what they were looking for quickly in the top results, would they be as likely to look through the ad-supported suggestions? Is it possible that Google’s revenue relies to some extent on the irrelevancy of its core search results?
I’ll leave you to ponder these questions… and to consider one last tasty tidbit, the only hint we have on Cuil’s potentially revolutionary server network. There were many reports on Monday that Cuil returned irrelevant results (or worse, none at all) for common search queries. These odd or nonexistent results were in fact the symptoms of a new kind of web application crash. According to Cuil’s VP of Communications Vince Sollitto (in his comments Monday), the reason the overload affected search results the way it did lies in the fact that Cuil’s servers are apparently specialists with expertise in certain sorts of queries. So as various servers dropped offline, certain types of expertise would drop from the search engine’s consciousness, as it were, just for the time being until coming back up (only for others to drop off, or all to drop off at once, leaving raving reviewers with no results at all).
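To picture how ‘specialist’ servers could produce odd or empty results as they drop offline, here is a toy sketch. Cuil has not published its architecture, so everything in it (the class name, the topic labels, the sample results, and the idea of keying servers by topic) is a hypothetical illustration of the behaviour described above, not a description of the real system.

```python
class SpecialistCluster:
    """Hypothetical cluster in which each server only knows one topic."""

    def __init__(self, shards):
        # shards maps a topic name to that server's {query: [results]} table.
        self.shards = shards
        self.online = set(shards)

    def take_offline(self, topic):
        # Simulate an overloaded server dropping out.
        self.online.discard(topic)

    def bring_online(self, topic):
        if topic in self.shards:
            self.online.add(topic)

    def search(self, query):
        # Only the servers that are currently up can contribute results.
        results = []
        for topic in sorted(self.online):
            results.extend(self.shards[topic].get(query, []))
        return results


cluster = SpecialistCluster({
    "animals": {"jaguar": ["big-cat facts"]},
    "cars": {"jaguar": ["luxury car reviews"]},
})

full = cluster.search("jaguar")      # both specialists answer
cluster.take_offline("cars")
partial = cluster.search("jaguar")   # the car results silently vanish
cluster.take_offline("animals")
empty = cluster.search("jaguar")     # nothing left to answer at all

print(full, partial, empty)
```

The point of the sketch is that the search itself never fails outright: it simply answers with whatever expertise happens to be online, which to a reviewer looks like irrelevant results or none at all.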
Final thoughts
I cannot agree with the many people who simply wrote off Cuil because of its rocky launch on Monday, condemning it before it had been live even 12 hours. To be honest, as soon as I saw Cuil’s press release Sunday night (July 27th, 2008), I wished them luck with going offline the next day… it’s become rather a rite of passage for successful web applications these days, a fact that hasn’t escaped the more astute news sources.
I’ve also read a fair few people railing on about Cuil without even troubling themselves to read about its new features, instead judging it by the great Google Yardstick. I even came across posts writing Cuil off simply because it doesn’t show up as a result in a search for itself. One has to wonder why it would be useful, if I’m already on the Cuil website, to find it again, via itself? ;)
And so, without further ado, I wish Cuil the best of luck (although they won’t need it :) in their unending quest… boldly “to index the whole Web, to analyze deeply its pages and to organize results in a rich and helpful way that allows you to explore fully the subject of your search.” I for one am breaking my Google habit.
¹ 61.6% of the U.S. market in April 2008, compared with 20.4% for Yahoo, 9.1% for Microsoft, 4.6% for AOL, and 4.3% for Ask in the same month, according to comScore, Inc.
² For more (much more) on the little that has been gleaned about Google’s proprietary systems in detail over the years, How Google Works is an interesting read.
Filed under: Search Engines — elise @ 11:55 pm

