Tuesday, September 23, 2014

The transition of web search - From open to fragmented and proprietary



The beginning: open-web indexes
Web indexing (also called crawling or spidering) was a means of creating a private index of published HTML documents on the web in the 1990s.  Popular search engines would use keyword position in the document, frequency of keyword mention, font style and meta-tag data to determine a public document's relevance to a given keyword query.  This manner of indexing and searching was the first wave of search engine mechanics.  Crawlers could start at the top node of any public domain and follow every link they encountered to make a replica of the content being posted daily.  (DNS, the domain name lookup service managed by internet service providers, was also a public resource, translating legible domain names into internet protocol addresses and making newly registered sites discoverable every day.)  The openness of the web allowed multiple companies to launch differentiated services indexing the same public content.  AOL, Yahoo!, Altavista, HotBot, Lycos and Excite were all able to promote distinct services through their own portals and search interfaces, and competition thrived.  Internet users, publishers and software companies all benefited from this openness.
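To make that first-wave ranking approach concrete, here is a minimal sketch in Python of how an early engine might have scored a document against a query using only what the publisher itself supplied.  The field names and weights are my own illustrative choices, not any specific engine's formula.

    # Hypothetical sketch of first-wave keyword relevance scoring.
    # Field names and weights are illustrative, not any engine's real formula.

    def score_document(query_terms, doc):
        """Score a document dict with 'body', 'title' and 'meta_keywords' fields."""
        body_words = doc["body"].lower().split()
        title_words = doc["title"].lower().split()
        meta_words = [w.strip().lower() for w in doc["meta_keywords"].split(",")]

        score = 0.0
        for term in query_terms:
            term = term.lower()
            # Frequency of mention in the body text.
            frequency = body_words.count(term)
            # Position bonus: terms appearing earlier in the document count more.
            position_bonus = 0.0
            if term in body_words:
                position_bonus = 1.0 / (1 + body_words.index(term))
            # Publisher-supplied signals: title and meta-tag keywords.
            title_bonus = 2.0 if term in title_words else 0.0
            meta_bonus = 1.0 if term in meta_words else 0.0
            score += frequency + position_bonus + title_bonus + meta_bonus
        return score

    doc = {
        "title": "Cheap flights and travel deals",
        "body": "find cheap flights to popular destinations around the world",
        "meta_keywords": "flights, travel, deals",
    }
    print(score_document(["cheap", "flights"], doc))

Everything in this scoring comes from the publisher's own page, which is exactly the weakness the next section describes.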


Early spam-evasion
With the emergence of search engine optimization (SEO), many of these engines had problems with irrelevant content surfacing for specific popular keywords, because the source of the ranking was what a single publisher posted on its own domains about itself.  This approach was obviously vulnerable to abuse.  Some publishers would "keyword-stuff" their meta-tags or put footers of keywords in invisible fonts, strategies meant to dupe users into visiting irrelevant pages and inflate traffic and ad dollars through extra impression volume.  Each portal used manual intervention models to address SEO spam.  Yahoo! had its own editorial directory that served the top layer of results, which prevented SEO spam from surfacing to the top of a results page.  Microsoft used LookSmart.  AOL and Google used the Netscape Open Directory Project (aka DMOZ) as human validation of site relevance to a specific subject category.

While Yahoo!’s directory was proprietary and LookSmart’s was licensed in a syndication model, DMOZ was provided as a free tool to any site that needed a search directory.  It was curated by a group of volunteers who took prestige from being category editors of different subject-matter branches of the directory tree.  This model of human attention as a "relevancy multiplier" for the directories was leveraged by Google to pull ahead in the anti-spam competition.  Larry Page's "PageRank" algorithm included a mechanism to count how many web publishers linked to a specific domain, and indexed heavily linked pages as subject-matter authorities for the topics of the referring links.  Links that publishers embedded to other pages were treated as an indication of curatorial attention by the linking site, meriting a higher position in the results for the referenced page.  The unique advantage of PageRank was that it mostly ignored what publishers said about themselves and focused on what others said about them, subverting the SEO spam practices of the era.  (The human intervention theme recurs later.)
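As a rough illustration of the link-counting idea (a simplified power iteration over a toy link graph, not Google's actual implementation), the mechanism can be sketched like this:

    # Simplified PageRank power-iteration over a toy link graph.
    # Illustrative sketch only, not Google's production algorithm.

    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links out to."""
        pages = list(links)
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] = new_rank.get(target, 0.0) + share
            rank = new_rank
        return rank

    # Pages that many others link to accumulate authority,
    # regardless of what they say about themselves.
    toy_web = {
        "blog-a": ["reference-site"],
        "blog-b": ["reference-site", "blog-a"],
        "reference-site": ["blog-a"],
        "spammy-page": ["reference-site"],  # nobody links to the spammy page
    }
    print(pagerank(toy_web))

In this toy graph the spammy page earns only the baseline rank, because no one else chooses to link to it.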

PageRank's peer-review system, in contrast to the editor-review system of the directories, enabled Google to pull ahead in relevance as the web grew in scale, thereby unseating Inktomi as the algorithmic engine of the largest portal at the time, Yahoo.com.  Building on this success, Google transitioned from a backfill-search provider, like Inktomi, Fast and LookSmart, into a destination website like Altavista, Excite, Lycos and HotBot.  After winning the AOL portal as a distribution partner, Google's brand was finally entrenched.  (Unlike the other white-label search providers, Google insisted on "powered by" brand messaging that established brand familiarity.)


The emergence of paid search and subsequent market consolidation
The growth in scale of the web through the early 2000s required a significant investment in infrastructure.  It also forced the evolution of the business models of existing search providers to subsidize this growth.  LookSmart, Inktomi and Overture offered licensed search engines, in a hosted or feed-based model, which consolidated thousands of small-scale sites using single-domain landing pages.  Unlike the destination-site search engines Altavista, Lycos, HotBot and Excite, these distributed search engines could be integrated into the look and feel of the host site's design.  (Integrated look and feel was not supported by Google.)  Based on the consolidated market share that these companies developed, pay-to-index (sponsored search) emerged.  LookSmart's and Inktomi's model was pay-for-inclusion in the index.  GoTo/Overture offered a bid-for-placement layer covering only the top 3-5 positions above the other search engines' algorithmic results.  Bid-for-placement yielded higher returns for Overture and enabled price competition between the hosted search providers, which ultimately favored portals turning to the Overture sponsored model.
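As a rough sketch of the bid-for-placement layout (the advertisers, bids and three-slot limit are hypothetical, not Overture's actual system), the top of the page is filled by bid price before the algorithmic results are appended:

    # Illustrative sketch of a bid-for-placement results page.
    # Advertiser names, bids and the 3-slot limit are hypothetical.

    SPONSORED_SLOTS = 3

    def build_results_page(keyword, bids, algorithmic_results):
        """bids maps advertiser URL -> bid (price per click) for this keyword."""
        # Highest bids win the sponsored slots at the top of the page.
        sponsored = sorted(bids.items(), key=lambda item: item[1], reverse=True)
        page = [f"[sponsored] {url} (bid ${bid:.2f})"
                for url, bid in sponsored[:SPONSORED_SLOTS]]
        # The remaining positions come from the unpaid algorithmic index.
        page += [f"[organic] {url}" for url in algorithmic_results]
        return page

    bids = {"flights-example.com": 0.42, "travel-example.com": 0.35,
            "deals-example.com": 0.55, "late-bidder.com": 0.10}
    organic = ["wikipedia.org/wiki/Airline", "faa.gov"]
    for line in build_results_page("cheap flights", bids, organic):
        print(line)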

Google built its own version of the Overture style of paid search, resulting in a lawsuit over Overture's US patent (USPTO patent number 6,269,361).  Yahoo!, then powered by Google, acquired Overture and resolved the suit in a pre-IPO stock trade with Google.  It then dropped Google as an algorithmic backfill provider and used the Overture revenue to build its own in-house search engine based on the acquired tools of Altavista, Inktomi and Fast, combined with its existing directory.  Microsoft ultimately ended its dependency on Overture and Inktomi to launch its own search engine, which it later cross-licensed to Yahoo! in exchange for ad technology, sales collaboration and placement commitments.  This technology struggle and pricing battle produced the market we have at present: two major search engines, with all portals using bid-for-placement advertising as the business model.


Shift of landscape to users as publishers (aka “Web 2.0”)
With the boom of small publishers, blogs and the emergence of social media, the publisher-centric web faced a significant challenge.  Google's PageRank is, after all, a publisher-centric signal.  Google and Bing indexing took weeks to surface new content in the web index, so there was a significant lag from publication time to spidering by the crawler.  A need for a more timely experience of the web was emerging.  The fragmenting of publication platforms into different content types also presented a challenge.  Some content was published behind login "walled gardens" no longer accessible to web crawlers.  Many sites actively blocked crawlers with "robots.txt" files or rotated their domains daily to hide new content from indexing.  (New companies were intentionally hiding their content from search engines because they feared the global dominance of US search engines.)  Some publishers transitioned to dynamic pages, meaning that a web page was not a static indexable entity but an assembled collage of content sources served in a single view at page-load time.  Finally, the seat of relevancy for new content couldn't depend on an algorithm that took weeks to discover new signals of relevancy by waiting for publishers to cross-link to them.  A Blogspot or WordPress author proved to be just as likely to be an authority on a granular topic as an established site with a high popularity rating in Alexa or Comscore, the popular site-ranking services.  For any trending real-time topic, it might be difficult for a conventional search engine to come up with a means to designate rank-authority from an algorithmic perspective.  New search technologies were needed.
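For readers unfamiliar with the crawler-blocking mechanism, here is a small sketch using Python's standard urllib.robotparser and a hypothetical site and user-agent, showing how a polite crawler decides whether robots.txt allows it to fetch a page:

    # Sketch of how a polite crawler consults robots.txt before fetching.
    # The site and user-agent below are hypothetical examples.
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetch and parse the site's crawl rules

    # A publisher that wants to stay out of search indexes can simply declare
    #   User-agent: *
    #   Disallow: /
    # and well-behaved crawlers will skip the whole domain.
    if parser.can_fetch("ExampleCrawler/1.0", "https://example.com/members/feed"):
        print("allowed to crawl")
    else:
        print("blocked by robots.txt")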


User-centric data and indexing
With the user-centric publishing trend (along with encrypted data-exchange mechanisms that made it easy for servers to transfer data within logged-in states), many companies that hosted user publishing, such as Blogger, WordPress, Facebook, MySpace and Twitter, enabled open data access via web protocols like RSS, ATOM and REST API feeds.  And because an account-authentication token could be passed at the time of query, users could have a highly personalized view of the real-time web.  Bing, Google and Yahoo! experimented with their own personalized search services during this period.
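As a minimal sketch of that pattern (the feed URL, token and header are placeholders; the exact authentication scheme varied by service), pulling a user-scoped Atom feed might have looked something like this:

    # Minimal sketch of pulling a user-scoped Atom feed with an auth token.
    # The endpoint, token and header are placeholders, not a real service's API.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://social.example.com/users/me/updates.atom"
    ACCESS_TOKEN = "user-oauth-token"  # obtained when the user logged in

    request = urllib.request.Request(
        FEED_URL,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    with urllib.request.urlopen(request) as response:
        feed = ET.fromstring(response.read())

    # Atom entries carry a title, a link and a timestamp that can be indexed
    # in real time, without crawling the publisher's pages at all.
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in feed.findall("atom:entry", ns):
        title = entry.findtext("atom:title", default="", namespaces=ns)
        updated = entry.findtext("atom:updated", default="", namespaces=ns)
        print(updated, title)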


Real-time Search concepts
This open-data trend allowed new algorithmic search engines to emerge that did not need to build the complex offline indexes of the former large-publisher world.  In fact, by accessing social mentions of mainstream publisher content, these search engines gained a secondary signal of user interest and attention that mapped the "Web 1.0" web as effectively as PageRank, but made content discoverable faster than traditional search engines could.  The unique advantage of these new approaches was that they could combine the breadth of interest across the global audience for a subject with the amplitude and duration of its impact on that audience over time.  (Visualize social media impact signals across these three dimensions and you will grasp the significance in algorithmic terms.)  This is because the nature of microblog sites enabled the capture of reactions at the moment of discovery by readers, across disparate feedback channels.
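To illustrate the three dimensions, a real-time trending score for a topic could be sketched as follows; the formula and weights are my own simplification for exposition, not any engine's published method:

    # Illustrative trending score combining breadth, amplitude and duration.
    # The weights and the formula itself are hypothetical simplifications.
    import math
    import time

    def trending_score(mentions, now=None):
        """mentions: list of (user_id, timestamp) pairs referencing one topic."""
        if not mentions:
            return 0.0
        now = now or time.time()
        timestamps = [ts for _, ts in mentions]

        # Breadth: how many distinct users reacted, not just how many posts exist.
        breadth = len({user for user, _ in mentions})
        # Amplitude: total volume of reactions, dampened logarithmically.
        amplitude = math.log1p(len(mentions))
        # Duration: how long the topic has sustained attention (in hours),
        # discounted so stale conversations fade from the real-time view.
        span_hours = (max(timestamps) - min(timestamps)) / 3600.0
        staleness_hours = (now - max(timestamps)) / 3600.0
        duration = span_hours * math.exp(-staleness_hours / 6.0)

        return breadth * amplitude * (1.0 + duration)

    now = time.time()
    burst = [(f"user{i}", now - i * 60) for i in range(500)]  # 500 people over ~8 hours
    print(round(trending_score(burst, now), 1))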

Because of the brevity and immediacy of sharing that was taking off on Twitter specifically, and because of its inherently public nature, it became the focus of the data mining of real-time social signals.  (Friendster, MySpace, Yahoo! 360 and Facebook were more follower- and private-circle-oriented.)  Early social-search start-ups Summize and Topsy introduced means to rapidly search the public corpus of Twitter content, leveraging the symbols users had invented to demarcate identity handles (@...) and conversation threads (#...) in tweets.  Topsy focused on who was sharing certain topics and on the authority of the source.  Summize focused on who was sharing and what was being shared.  Tweetdeck introduced a tool to sift multiple threads in parallel based on these topic and author threads.  Collecta introduced a means to sift the broad corpus of real-time media, including and beyond Twitter, using XMPP to filter by subject matter in an ephemeral, immediate relevancy analysis.  Google launched "Google Realtime" to add social media queries to its traditional web content search.  Bing enabled users to connect a Facebook account to see content that their friends had shared on that social network.
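A rough sketch of the lightweight parsing those symbols made possible (the regular expressions here are simplified illustrations, looser than the exact rules Twitter or these start-ups used):

    # Simplified extraction of handles (@) and hashtags (#) from tweet text.
    # The regexes are illustrative and looser than real tweet-parsing rules.
    import re

    HANDLE_RE = re.compile(r"@(\w+)")
    HASHTAG_RE = re.compile(r"#(\w+)")

    def parse_tweet(text):
        """Return (who is mentioned, which conversation threads) for one tweet."""
        handles = HANDLE_RE.findall(text)
        hashtags = HASHTAG_RE.findall(text)
        return handles, hashtags

    tweet = "Great #realtime search demo by @summize at the meetup #search"
    handles, hashtags = parse_tweet(tweet)
    print("mentions:", handles)   # ['summize']
    print("threads:", hashtags)   # ['realtime', 'search']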


Social Sharing as currency of the moment
One advantage that social networks (like Facebook, Twitter, Google+, Ning, LiveJournal and Nextdoor) have in understanding the real-time web is the nature of the public act of sharing.  When a social media user makes a public or limited-circle share of a piece of web content, it is a vote of interest for that specific piece of content.  Contrast this with the PageRank vote a publisher makes when it embeds an anchor-text link.  Every user in the new schema is elevated to the level of attention-measuring that PageRank had previously attributed only to webmasters.  Consider also how this distributed feedback system creates the ultimate anti-spam data point for algorithmic engines.  It is fairly easy to spam anchor-text links these days; it is very difficult to simulate 100,000 distributed people reacting to a piece of web content.  When a piece of content is tweeted, liked or +1'ed by 1,000 or more users who are not connected to each other within a given social network, that can be taken as an unbiased popularity and legitimacy measure.
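A sketch of that legitimacy test (the follow graph, sharer lists and the 1,000-user threshold here are hypothetical): count only sharers who have no follow relationship to any other sharer before treating a share count as organic.

    # Sketch of counting "independent" sharers of a URL as an anti-spam signal.
    # The follow graph, sharer lists and threshold are hypothetical examples.

    ORGANIC_THRESHOLD = 1000

    def independent_sharers(sharers, follows):
        """Count sharers with no follow relationship to any other sharer.

        sharers: set of user ids who shared the content.
        follows: dict mapping user id -> set of user ids they follow.
        """
        independent = 0
        for user in sharers:
            connections = follows.get(user, set())
            others = sharers - {user}
            if not (connections & others) and not any(
                user in follows.get(other, set()) for other in others
            ):
                independent += 1
        return independent

    sharers = {"alice", "bob", "carol"}
    follows = {"alice": {"bob"}, "bob": set(), "carol": {"dave"}}
    count = independent_sharers(sharers, follows)
    print(count, "independent sharers; organic =", count >= ORGANIC_THRESHOLD)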


Closed data silos: the web's current challenge for the next generation of search engines
While this data is very valuable for the social network itself in tailoring its own engagement approaches for users, it is also of value outside the network.  These days, however, the data tends to be coveted by the platform owners, for good reason.  There are the privacy concerns of users, of course, whom you want to encourage to use the system with the privacy settings they expect around their shared content.  Companies also like to keep their internal network data closed because it can unleash future economic potential in the form of relationships with marketers.


The Twitter APIs, as an example, generated a particularly interesting signal from an external perspective because of the public viewability of any single post.  Most users understood that "tweets" were inherently public.  This is why so many developers launched tools to process and digest the feed.  When Twitter shut down API access for Google and other web crawlers, there arose a need for Google to have its own microblog feature to replicate this real-time signal.  Twitter needed to shut down access to the broader web to prepare for its IPO, its shareholders' sense of ownership and its upcoming in-network advertising platforms.  But its success as an open platform shows the promise for future entrants to replicate the strategy of its early days of openness.


A real-time barometer of what is popular with internet users is an invaluable tool for refining any search engine.  Keeping up with the rapidly evolving nature of the web requires more unconstrained sources of interest and relevancy signals as a feedback mechanism to web publishers.  Web publishers of tomorrow also need more ways to get their content in front of new audiences.  As the web is a dynamically growing entity, the opening up of social media services to allow users to publish more feedback on the web in public view is a promising step for the future of other search platforms.

Though we face a highly siloed web ecosystem now, a continued drive to revive openness will be a boon to the tools that emerge in the coming years.



(Perspective of the author: I worked at LookSmart, Overture, Yahoo! and Collecta over a span from 1998 to 2010, so these industry shifts were witnessed from the perspective of someone representing these companies through the changes in the technologies available to us.)