User-agent-enhanced Websites

By alpers at 10:53 pm on February 10, 2008 | 2 Comments

Gradually over 2007, I’ve been turning to Google to help me get through sticky problems in open-ended programming projects. As I’ve moved from Java to actual implementable languages such as Python and C#, I’ve found that more and more of my answers end up at places such as experts-exchange.com. I’m of course ecstatic that my exact problem has been found on the great big interweb; the Google summary shows me part of a solution! Of course, when I actually navigate to the site, I’m greeted with a greatly reduced page full of ‘trial options’ (example). What happened to the content I just saw highlighted on Google? It’s nowhere to be found.

After a little bit of self-education in Spring 2007, I discovered some meta-data called user-agents. Apparently, whenever a browser visits a website, server-side scripting languages can query for the client’s browser type and format the page to best suit that browser. This meta-data was introduced to improve compatibility between web browsers in terms of emerging technology and legacy support. Search bots, like Google’s googlebot and Microsoft’s msnbot, also carry around their own special user-agent names, distinguishing themselves from a client’s browser. Using this implicit ‘feature’ of user-agents, sites can choose to show content to some user-agents but hide it from others.
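
To make the idea concrete, here is a minimal sketch in Python of a server branching on the User-Agent header. This is purely illustrative; the crawler allow-list, teaser text, and port are my own inventions, not experts-exchange’s actual code.

```python
# Minimal sketch (illustrative only): serve the full answer when the
# User-Agent header looks like a known search crawler, a teaser otherwise.
from wsgiref.simple_server import make_server

CRAWLER_TOKENS = ("googlebot", "msnbot")  # hypothetical allow-list

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    is_crawler = any(token in ua for token in CRAWLER_TOKENS)
    body = (b"Full answer text, visible to crawlers."
            if is_crawler else
            b"Teaser only. Sign up for a trial to see the answer.")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```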

Using this strategy, the two assets are:

  • Gaining search-engine web-crawlers’ attention with diverse content and luring potential customers to pay for the answers, and
  • Keeping private content for the greater benefit of their customers

Given this, there are several adversaries that would like to take control of the assets:

  • Programmers
    • are looking for the content of one article for personal or corporate use, or
  • Competing self-help programming communities
    • are looking at the content of the entire site at once (stripping content for external use) to enhance their own communities.

Since the user-agent meta-data was designed to help improve compatibility between web browsers, no security was built in to enforce that the user-agent actually describes the client’s web browser. As a result, there are multiple extensions for Firefox, and some for IE, that let you spoof your own user-agent.
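
Spoofing doesn’t even require an extension; any HTTP client can claim to be whatever it likes simply by setting the header. A quick Python sketch (the URL is a placeholder, and the user-agent string is just an example of the form crawlers advertise):

```python
# Sketch: pretend to be a search crawler by overriding the User-Agent header.
import urllib.request

req = urllib.request.Request(
    "http://example.com/some-question-page",  # placeholder URL
    headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
)
with urllib.request.urlopen(req) as resp:
    # Print the start of whatever version of the page the server chose to serve.
    print(resp.read()[:200])
```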

Given this information, the two weaknesses are:

  • Too much trust in the incoming user-agent data
    • Solely based on the user-agent data, EE can display the full version of the content, and
  • User-agents change over time
    • That is, when newer versions of googlebot or Firefox come out, the user-agent string will change (see the sketch after this list)
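
Here is a small sketch of that second weakness: an exact-match check on a known user-agent string silently breaks the day the crawler bumps its version, while a looser substring check survives version bumps but is trivially satisfied by a spoofed header. The strings below are illustrative.

```python
# Exact matching vs. substring matching of a crawler's user-agent string.
KNOWN_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def allow_exact(ua):
    return ua == KNOWN_UA             # breaks the day Googlebot/2.2 ships

def allow_substring(ua):
    return "googlebot" in ua.lower()  # survives version bumps, but is easy to spoof

newer = "Mozilla/5.0 (compatible; Googlebot/2.2; +http://www.google.com/bot.html)"
print(allow_exact(newer))      # False: a legitimate crawler gets locked out
print(allow_substring(newer))  # True
```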

We’re beginning to see that depending solely on user-agents to determine the content of a page isn’t very robust, since the underlying user-agent standard isn’t robust itself. Stemming from this, it’s time to recognize the risks of deciding to depend on user-agents when constructing a site.

A huge advantage of using user-agents is that they allow search bots to crawl a site and notice a high amount of inter-linking between content pages. With many unique keywords popping up in various question-and-solution pages, it’s a ripe place for a search crawler to spend many hours indexing. It’s a very easy, inexpensive way of luring potential customers toward purchasing membership by gaining popularity in the search results. No extra servers are needed to lure these new people; the search engines do all the work of indexing. When a user searches for a particular problem on a search engine whose crawler has access to the content, the site ranks very highly in the results, and therefore its relevance increases.

Since this is an inexpensive way to get more users fast, it’s okay if some users circumvent the restriction by spoofing their user-agent, because the resulting registration numbers are still higher than before!

This remains an interesting example to me of how something intended for one purpose is exploited and used for another. In doing so, we see issues come up that we couldn’t possibly have imagined before! 🙂

Filed under: Miscellaneous, Security Reviews

2 Comments

  • 1

    Comment by cbhacking

    February 11, 2008 @ 1:53 am

    Side note: Konqueror and Opera also support User-agent spoofing. I don’t know about Safari, however.

    In addition to getting around restrictions by pretending to be a web crawler, user-agent spoofing can be used to handle web sites that present different HTML to different browsers. For example, some browsers such as Konqueror and Opera have very capable rendering engines (better than the Gecko engine used by current versions of Firefox, in many cases), but because their user-agents are unknown they get fed very basic code by some sites. User-agent spoofing often allows this to be worked around; for example, Konqueror by default identifies itself as Firefox when visiting Gmail, to avoid being redirected to the basic HTML view.

    This is also useful when a browser (usually IE) is intentionally given reduced-functionality code. A great many sites that had to be “dumbed down” for IE6 work fine in IE7, but generally all that the server looks for is the “MSIE” substring in the user agent. Using a plug-in (such as IE7Pro), it’s possible to work around these and get the full site.

  • 2

    Comment by Ole Hansen

    March 24, 2008 @ 3:03 am

    In the case of experts-exchange, I don’t think they actually use the user agent string for this. Try following your own link and scroll all the way to the bottom.
