Web crawling is the last resort

Data source analysis is one of the crucial parts of an enterprise search deployment project. The quality of search results depends strongly on the quality of the indexed data. For web-based sources, there are two basic ways of reaching the data: internal and external. The internal method reads the data directly from where it is stored, such as a database, files on a filesystem or an API. Depending on requirements, either all documents are read or only those matching some criteria. The external technique reads the rendered HTML over HTTP, the same way human users do. Further documents are reached (so-called content discovery) by following hyperlinks found in the content or by using a sitemap. This method is called web crawling.
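To make the content discovery step concrete, here is a minimal sketch of link-following crawling in Python. It is an illustration under stated assumptions, not how any particular crawler works: the `crawl` function, the `fetch` callback and the `example.test` pages are all hypothetical, and a small in-memory dictionary stands in for real HTTP requests. It also does not restrict the crawl to one host or honour robots.txt, which a real crawler would need to do.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first content discovery starting from start_url.

    fetch is a caller-supplied function (hypothetical here) that returns
    the HTML of a URL as a string, or None if the page is unavailable.
    Returns the URLs of the pages that were fetched, in visiting order.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" stands in for real HTTP requests (assumption).
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/c">C</a>',
    "http://example.test/b": "<p>no links here</p>",
}

pages = crawl("http://example.test/", SITE.get)
```

Note that the page `/c` is linked but never returned, because the stand-in `fetch` cannot retrieve it. The crawler only ever sees what is both linked and reachable.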

Crawling, in contrast to reading a source directly, requires no particular preparation. In the minimal variant, a starting URL is all that is needed. Content encoding is detected automatically, and off-the-shelf components extract text from the HTML. Web crawling may therefore look like a quick and easy way to collect content for indexing. On closer analysis, however, it turns out to have several serious drawbacks.

How to Create Knowledge Sharing Intranets and the Role of Search

“If only HP knew what HP knows, we would be three times more productive”

The quote comes from Lew Platt, the former chief executive of Hewlett-Packard, and sums up this week’s discussion on knowledge sharing intranets at the conference “Sociala intranät” (Social Intranets) in Stockholm.

For two days, intranet managers, editors, web strategists and communication managers gathered in Stockholm to talk about the benefits (and pitfalls) of knowledge sharing intranets, where end users share and contribute their own and their colleagues’ information, and about the role search plays in a social intranet.

A number of larger companies and organizations, such as TeliaSonera, Thomas Cook, Manpower and Perstorp, have started on their second generation of intranets, where blogs, collaborative areas, wikis, personalization, micro blogging (see the twitter flow from the conference) and Facebook-inspired solutions finally seem to work on a larger scale.

The pioneers, such as Fredrik Heidenholm from Skånemejerier, have been doing it without a large budget, proving that social intranets are more about users than about expensive technical solutions.

Read interviews with Fredrik Heidenholm, Gunilla Rehnberg (Röda Korset), Hans Gustafsson (Boverket) and Lisa Thorngren (Thomas Cook Northern Europe – Ving).

In general, the speakers and the attendees seemed to agree with one another: having the whole organization contribute its knowledge is a prerequisite for keeping knowledge sharing intranets alive.

But letting everyone create information requires a good enterprise search solution, something some of Findwise’s customers, such as Ericsson and Landstinget i Jönköping, talked about: “Search promotes the value of our social intranet,” said Karin Hamberg, Enterprise Architect at Ericsson. Search makes it possible to gather information from all kinds of sources and make it accessible from one entry point. However, this also requires strategies for handling security restrictions (who should have access to what?), metadata models, user experience (expectations and behavior) and ranking (who determines which results should appear at the very top?).

Sven-Åke Svensson, from Landstinget i Jönköping, had had the same experience and emphasized the need for a good pre-study (workshop method) and for tools for the editors, such as a metadata service that helps contributors write good metadata tags. Sven-Åke also gave a demo of the new intranet (if you read Swedish, the blog post “Landsting på väg mot det social intranätet” gives a great overview of the solution).

The two days covered most angles of Lew Platt’s vision. Apart from a number of good speakers, the informal talk at coffee breaks and lunch made it clear that Swedish companies are working hard to provide knowledge sharing intranets that serve consumers as well as contributors.

Did you visit the conference? Was there anything in particular you found interesting? Please feel free to comment and share your thoughts.

P.S. If you want to read more about social intranets, take a look at Oscar Berg’s blog post “The business case for social intranets”, an inspiring summary of the topic.

Basic Enterprise Search is a Commodity – Let’s Go Further! Conclusions of Trends 2007

Looking back at the search trends that were predicted for 2007, one can conclude that many of the larger research institutes, such as Forrester and Gartner, made accurate forecasts.

2007 was supposed to raise the question of 2.0 for search technology within companies (and it seems like wikis, blogs and collaborative tools were all that we heard about for some months). There was also a discussion of integrating search with business tools such as BI, to create more powerful ways of extracting critical data from several sources. The fact that IBM bought Cognos and FAST bought Radar says something about what we can expect in the future.

During the last few months there has been a discussion of more sophisticated ways to develop enterprise search. The leading niche vendors such as Autonomy, FAST and Endeca have for the last few years been evangelists for search that goes beyond spellchecking and synonyms, talking in terms of information retrieval and knowledge management. It seems that a lot of companies have evaluated search capabilities, starting with basic functions, only to realise that search actually solves a lot of problems they hadn’t even considered.

It’s not a coincidence that giants such as IBM and Microsoft have strategies that will bring their enterprise search capabilities to new levels in 2008.

Internet life in the Future

I always think it’s nice when I hear people talking about the same things that are on my mind. It makes me reflect on things in new ways and also makes me realize that I’m on to something. I attended a presentation by Björn Jeffery from Good Old (hosted by Region Västra Götaland). His talk on internet strategy was interesting and had many things in common with the keynote by Elizabeth Churchill (Yahoo) that I recently heard at the HCI2007 conference. Two things interested me most: the future of mobility and the inevitable question of integrity, in the sense of personal privacy. So here are my thoughts today on internet strategy and the future of internet usage.

Integrity

Today young people have become used to different web 2.0 technologies such as Flickr, Facebook, Delicious etc., and we have seen the emergence of things such as social search and folksonomies. People gladly contribute information about themselves and about what they think and like. I believe this is a good thing, but it also carries a risk: once something is on the internet and has been indexed, it’s out there and it stays there. Many people are not aware of that fact. How do you keep your integrity when everything about you can be found online? Integrity is very important when implementing these solutions in an enterprise setting.

How can people contribute without having to share their stuff with everyone else if they don’t want to? Björn Jeffery mentioned that we’ve gone from sharing nothing with no one to sharing everything with everyone, and that he thought this would swing back to sharing a lot of things with many people. I hope he’s right. Teenagers might not care who they share their stuff with, but security and integrity are vital issues when considering enterprise solutions.

Mobility

These days mobility has become important. We expect not only to be able to find the information we need, but to find it whenever we want and from wherever we are. I am actually writing this blog post on a train, and of course I expect to have access to all Findwise and other resources from here as well. As technology changes, our behavior and expectations change with it, and so does society. (I covered excitement generators in a previous post about Jared Spool’s keynote at HCI2007.)

“I don’t use computers, love. This is just the internet”.

Quote from Elizabeth Churchill’s keynote

Today the internet is no longer associated with the computer screen. Mobile phones have become an increasingly popular way of accessing it. You can already use search to access all your company’s information from a single point of access whenever you need it, so maybe the next step is mobile search on your intranet? That would make information available not only at all times but also wherever you might be, exactly when you want it.

So, to conclude from these talks: I think that in the future we will want to be able to access everything from everywhere at any time. We used to talk about the time we spent online. That distinction isn’t really there any more. Today our tasks are interwoven, and we don’t separate the time we spend online from the time we spend offline. (Something that becomes painfully obvious when trying to work on the train when you’ve forgotten the USB connection for the mobile internet.) And for the time we spend online we also need to define what we want to share with whom. If we as designers can solve these things, I think we’re on to something promising.

How Many Users Can You Afford to Annoy?

The second keynote at the Human Computer Interaction conference in Lancaster was given by Jared Spool, who talked about “Breaking through the invisible walls of usability research”. Jared is a very inspiring and entertaining speaker. If you have the chance to listen to him, take it!

One of the things he talked about was the fact that the usability techniques widely used today were not designed for large numbers of users. We have all kinds of data about users’ behavior online, but can we really use that data in a productive way? As Jared said:

“there is a big difference between data and information, we don’t know what inferences to make from the data we have.”

He also gave examples from a couple of large American ecommerce sites that have millions of users every day. According to Jakob Nielsen’s report, traditional usability testing can identify 80% of the usability problems with as few as five users. But that leaves 20% undiscovered, and if you have one million customers and the remaining problems affect a corresponding share of them, you could say that 200,000 customers would be annoyed. Imagine how much lost revenue 200,000 users represent. So how many nines do we need (90%, 99%, 99.999%)? What percentage is enough? It is apparent that we need methods that address these limitations of usability evaluation and testing.
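The arithmetic behind the one-million-customer example can be sketched in a few lines of Python. This is only an illustration: the function name `annoyed_users` is made up here, and it bakes in the talk’s simplifying assumption that the share of uncovered problems translates directly into the same share of annoyed users.

```python
from fractions import Fraction


def annoyed_users(total_users, coverage_percent):
    """Users still affected if coverage_percent of users are served well.

    coverage_percent is passed as a string such as "99.999" so that it is
    converted exactly, without floating-point rounding.
    Assumption: uncovered share of problems == share of annoyed users.
    """
    uncovered = 1 - Fraction(coverage_percent) / 100
    return int(total_users * uncovered)


TOTAL = 1_000_000
table = {c: annoyed_users(TOTAL, c) for c in ("80", "90", "99", "99.999")}

for coverage, annoyed in table.items():
    print(f"{coverage}% coverage leaves {annoyed} users annoyed")
```

Under that assumption, each additional nine cuts the number of annoyed users by a factor of ten, which is why the question of how many nines is enough depends on what each lost customer costs.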

Jared Spool visualizes how few users actually spend money on an ecommerce site, and how few users the company relies on for their revenue.

Jared also talked about the consequences that web 2.0 has had for web applications and communities. He discussed what makes people want to use “extra functionality”, such as review functionality, and what delights people. Things that are excitement generators today soon come to be expected in every application. And when, as Jared said, HCI becomes HHHHHCI, that is, when social networks are widely used, things that delight or aggravate us spread very fast. So instead of thinking about the five-user rule, think about this the next time you plan a release of a new product or application: How many users can you afford to annoy?

Usability 2.0

A lot is happening in the world of enterprise search. Recent blog posts include discussions of how enterprise 2.0 tools can be integrated into corporate systems; see the discussions of taxonomies or integration on Social Glass, for example. Or take a look at Bill Ives’s examples of people who achieved success with E 2.0, on the FAST Forward blog. These new trends are also starting to affect how we talk about usability. A couple of months ago there was a seminar about the consequences of web 2.0 technology for usability. (Watch the video from the seminar Usability 2.0.)

This week I am attending the 2007 HCI conference in Lancaster, UK. I will present an article, written together with Mattias Arvola, discussing how prototyping techniques can structure conversation in different stakeholder groups. At this traditional HCI conference, web 2.0 and search technology have also entered the scene, with both keynotes and papers on these subjects. I am therefore looking forward to many interesting presentations and discussions about the effects enterprise 2.0 tools have on usability. So stay tuned for reports from the conference…

Find People with Spock

Today, Google is the main source for finding information on the web, regardless of the kind of information you’re looking for. Be it company information, diseases or people, Google is used for finding everything. While Google does a great job of finding relevant information, it can be worth exploring alternatives that concentrate on a more specific target.

In the previous post, Karl blogged about alternatives to Google that provide a different user interface. Earlier, Caroline told us about search engines that lead to new ways of using search. Today I am going to continue on these tracks and tell you a bit about a new challenger, Spock, and my first impressions of using it.

Spock, released last week in a beta version, is a search engine for finding people. Interest in finding people, both celebrities and ordinary people, has risen over the past few years; just look at the popularity of social networking sites such as LinkedIn and Facebook. With a search engine dedicated to finding people, you get more relevant hits and more information in each hit. Spock crawls the above-mentioned sites, as well as a number of others, to gather information about the people you want to find.

When you begin to use Spock, you instantly see the difference in search results compared to Google. Searching for “Java developer Seattle” in Spock returns a huge list of Java developers based in Seattle. With Google, you get a bunch of job postings. Searching for a famous person like Steve Jobs with Google, you find yourself with thousands of pages about the CEO of Apple. Using Spock, you will learn that there are a lot of other people around the world who are also named Steve Jobs. With each hit, you find more information such as pictures, related people, links to pages that mention the person, etc.

In true Web 2.0 fashion, Spock uses tags to place people into categories. By exploring these tags, you will find even more people that might be of interest. Users can even register on Spock to add and edit tags and information about people.

Overall, Spock seems like a great search engine to me. The fact that users can contribute to the content, which is what has made Wikipedia what it is today, combined with good relevancy and a clean interface, gives it a promising future. It also shows that it is possible to compete with Google and the other giants in the search market by focusing on a specific target and delivering an excellent search experience in that particular area.

Using Search for Web and Enterprise 2.0? Plan for the Future!

Buzzwords such as ‘the long tail’, ‘user generated content’ and ‘web 2.0’ have been around for some time now, but does that automatically mean that everyone understands where technology is heading? And what happens with search?

If you haven’t seen the rather old but brilliant video The machine is us/ing us on YouTube, you should. If you have, take a look at the updated version.

When working with search on a daily basis, one tries to get behind the fuzzy words to see how blogs, wikis, RSS, mash-ups and social tagging, among other things, will affect the way we interact and do business in the future. Linking these words to Wikipedia is only one example of knowledge sharing that wasn’t possible a few years ago.

The tools that the new web 2.0 development provides help us create and gather more information than ever. The amount of information is increasing rapidly: according to Gartner, an average company doubles (!) its information every 6-18 months. Efficient search solutions therefore become crucial in order to handle the vast amounts of data.

All search vendors claim that they will be able to provide effective search for these purposes. As a customer you should ask yourself: what are the future needs of my business? Do I need a search solution that supports basic functionality such as spellchecking and static relevance adjustments? Is there a need for more advanced functionality that increases cross-functional sharing in the organisation, such as dynamic navigators and common workspaces? Do I want to use search to increase knowledge sharing powered by web 2.0 tools?

An interesting and short debate presentation can be found here. In conclusion: different stages of maturity require different approaches to achieve different outcomes.

These questions may seem to look too far ahead. But I can say for sure that by asking the right questions from the beginning you can save yourself a lot of time and the company a lot of money (and use your solutions for present as well as future needs).

By knowing your users, your organization and its future, you can build search solutions that help enable knowledge discovery, sharing and connection, which in the end is what web 2.0 and enterprise 2.0 are all about.