A web search engine is designed to search for information on the World Wide Web. The search results are generally presented in a list, often referred to as search engine results pages (SERPs). The results may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. During the early development of the web, there was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One historical snapshot from 1992 remains. As more webservers went online, the central list could not keep up. On the NCSA site, new servers were announced under the title "What's New!". The very first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites, since the amount of data was so limited that it could be readily searched manually.
A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are often plain text or hypertext with optional search directives (such as "and"/"or", with "-" to exclude). They vary greatly from standard query languages, which are governed by strict syntax rules as command languages with keyword or positional parameters. There are three broad categories that cover most web search queries. Informational queries cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results. Navigational queries seek a single website or web page of a single entity (e.g., youtube or delta air lines). Transactional queries reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver. Search engines often support a fourth type of query that is used far less frequently: connectivity queries, which report on the connectivity of the indexed web graph (e.g., Which links point to this URL?, and How many pages are indexed from this domain name?). Most commercial web search engines do not disclose their search logs, so information about what users are searching for on the Web is difficult to come by. Nevertheless, a 2001 study that analyzed queries from the Excite search engine showed some interesting characteristics of web search, among them that the average length of a search query was 2.4 terms.
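The search directives mentioned above can be illustrated with a minimal sketch. It assumes a deliberately simplified syntax in which terms are implicitly joined by "and" and a leading "-" excludes a term; the function names `parse_query` and `matches` are hypothetical, and real engines support far richer syntax.

```python
def parse_query(query):
    """Split a plain-text query into included and excluded terms.

    Assumption: terms are implicitly ANDed and a leading "-" excludes a term.
    """
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])
        elif token != "and":  # an explicit "and" is the same as the default
            include.append(token)
    return include, exclude


def matches(text, include, exclude):
    """Return True if text contains every included term and no excluded term."""
    words = set(text.lower().split())
    has_all = all(term in words for term in include)
    has_excluded = any(term in words for term in exclude)
    return has_all and not has_excluded


include, exclude = parse_query("jaguar -car speed")
print(include, exclude)
print(matches("Jaguar top speed", include, exclude))
```

Matching on whitespace-separated words keeps the sketch short; a production system would tokenize and normalize text far more carefully.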
Meta elements are the HTML or XHTML <meta … > elements used to provide structured metadata about a Web page. Multiple elements are often used on the same page: the element is the same, but its attributes are different. Meta elements can be used to specify a page description, keywords and any other metadata not provided through the other head elements and attributes. The meta element has two uses: either to emulate the use of an HTTP response header, or to embed additional metadata within the HTML document. With HTML up to and including HTML 4.01 and XHTML, there were four valid attributes: content, http-equiv, name and scheme. Under HTML5 there are five valid attributes, charset having been added. The http-equiv attribute is used to emulate an HTTP header, and name is used to embed metadata. The value of the statement, in either case, is contained in the content attribute, which is the only required attribute unless charset is given. The charset attribute indicates the character set of the document. Such elements must be placed as tags in the head section of an HTML or XHTML document. In one form, meta elements can specify HTTP headers which should be sent before the actual content when the HTML page is served from Web server to client. For example, such an element can specify that the page should be served with an HTTP header called 'Content-Type' that has a value 'text/html'. In the general form, a meta element specifies name and associated content attributes describing aspects of the HTML page.
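As an illustration of the attribute roles described above, the following sketch collects meta elements from a page using Python's standard `html.parser` module. The class name `MetaCollector`, the helper `collect_meta`, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> elements, grouped by the attribute that defines them."""

    def __init__(self):
        super().__init__()
        self.http_equiv = {}  # emulated HTTP headers
        self.named = {}       # embedded metadata (description, keywords, ...)
        self.charset = None   # HTML5 charset declaration, if present

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if "charset" in attributes:
            self.charset = attributes["charset"]
        elif "http-equiv" in attributes:
            self.http_equiv[attributes["http-equiv"]] = content
        elif "name" in attributes:
            self.named[attributes["name"]] = content


def collect_meta(html):
    """Parse an HTML string and return the populated collector."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical sample page used only for illustration.
SAMPLE = """<html><head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html">
<meta name="keywords" content="search, engine">
</head><body></body></html>"""

meta = collect_meta(SAMPLE)
print(meta.charset, meta.http_equiv, meta.named)
```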
Modern web search engines are complex software systems using technology that has evolved over the years. There are several categories of search engine software: Web search engines (example: Lucene), database or structured data search engines (example: Dieselpoint), and mixed search engines or enterprise search (example: Google Search Appliance). The largest web search engines, such as Google and Yahoo!, utilize tens or hundreds of thousands of computers to process billions of web pages and return results for thousands of searches per second. The high volume of queries and text processing requires the software to run in a highly distributed environment with a high degree of redundancy. Search engines for web pages, documents and images are designed to allow searching through these largely unstructured units of content. They are built to follow a multi-stage process: crawling the pages or documents to discover their contents, indexing their content in a structured form (database or other), and finally resolving user queries to return results and links to the documents or pages from the index. In the case of full-text web search, the first step in preparing web pages for search is to find and index them. In the past, search engines started with a small list of URLs as a seed list, fetched the content, and parsed it for links; they then fetched the web pages pointed to by those links, which provided new links, and the cycle continued until enough pages were found.
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing. Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio, and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time. The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
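The contrast between an index lookup and a sequential scan can be sketched as follows. This is a minimal illustration using an inverted index (one common index structure, assumed here; the text does not name a specific design), and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_index(index, query):
    """Answer a query by intersecting the posting sets of its terms."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def search_scan(documents, query):
    """Answer the same query by reading every word of every document."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set()
    for doc_id, text in documents.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            hits.add(doc_id)
    return hits


# Hypothetical three-document corpus.
DOCS = {1: "web search engine", 2: "web crawler", 3: "search index"}
INDEX = build_index(DOCS)
print(search_index(INDEX, "web search"), search_scan(DOCS, "web search"))
```

Both functions return the same documents; the difference is that the indexed lookup touches only the entries for the query terms, while the scan reads the whole corpus on every query.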
Momondo.com is a Copenhagen-based travel search engine that allows consumers to compare prices on flights, hotels and car rental. The search engine aggregates results from more than 700 travel websites simultaneously to give an overview of the best offers found. Momondo does not sell tickets; instead, it shows the consumer where to buy at the best prices and links to the supplier. Momondo is free of charge to use and receives commission from sponsored links and advertising. In 2007 NBC Today’s Travel recommended that, when it comes to finding the best offers on flights, the consumer should go to sites like Kayak, Mobissimo, SideStep and Momondo instead of buying tickets from third-party sites that actually sell travel and are dealing directly with the airlines. In addition to price comparisons, Momondo also offers city guides written by the site's users and by bloggers based in different cities. In November 2010 Momondo co-hosted and sponsored the first European TBEX conference on travel blogging, held in Copenhagen. Momondo was launched in September 2006 as a flight search engine only. In September 2009 the website was relaunched, now also offering city guides and travel content. Since then it has expanded to also offer price comparisons on trains (integrated in the flight search), and it has become a multilingual site that supports English, German, French, Italian, Spanish, Portuguese, Swedish, Norwegian, Danish, Turkish, Russian and Dutch.
In computing, spamdexing (also known as search spam, search engine spam, web spam or search engine poisoning) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system. Some consider it to be a part of search engine optimization, though there are many search engine optimization methods that improve the quality and appearance of the content of web sites and serve content useful to many users. Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the META keywords tag, others whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, people working for a search-engine organization can quickly block the results-listing from entire websites that use spamdexing, perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful. Common spamdexing techniques can be classified into two broad classes: content spam (or term spam) and link spam.
nofollow is a value that can be assigned to the rel attribute of an HTML a element to instruct some search engines that a hyperlink should not influence the link target's ranking in the search engine's index. It is intended to reduce the effectiveness of certain types of search engine spam, thereby improving the quality of search engine results and discouraging spamdexing. The nofollow value was originally suggested to stop comment spam in blogs. Believing that comment spam affected the entire blogging community, in early 2005 Google’s Matt Cutts and Blogger’s Jason Shellen proposed the value to address the problem. The specification for nofollow is copyrighted 2005-2007 by the authors and subject to a royalty-free patent policy, e.g. per the W3C Patent Policy 20040205, and IETF RFC 3667 & RFC 3668. The authors intend to submit this specification to a standards body with a liberal copyright/licensing policy such as the GMPG, IETF, and/or W3C. Google announced in early 2005 that hyperlinks with rel="nofollow" would not influence the link target's PageRank. In addition, the Yahoo and Bing search engines also respect this attribute value. On June 15, 2009, Matt Cutts, a well-known software engineer at Google, announced on his blog that GoogleBot would no longer treat nofollowed links in the same way, in order to prevent webmasters from using nofollow for PageRank sculpting. As a result of this change, the use of nofollow leads to evaporation of PageRank.
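To show how an indexer might recognize the value, the sketch below separates the links on a page into those marked rel="nofollow" and all others, using Python's standard `html.parser` module. The class name `LinkClassifier` and the example.com URLs are assumptions for illustration; this does not describe how any particular search engine processes links.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Separate hyperlinks marked rel="nofollow" from all other hyperlinks."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel holds a space-separated list of values, so check membership.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


# Hypothetical page fragment; the URLs are placeholders.
PAGE = (
    '<p><a href="http://www.example.com/article">Article</a> '
    '<a href="http://www.example.com/comment" rel="nofollow">Comment link</a></p>'
)

classifier = LinkClassifier()
classifier.feed(PAGE)
classifier.close()
print(classifier.followed, classifier.nofollowed)
```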
PolyCola, previously known as GahooYoogle, is a metasearch engine that searches multiple search engines at once. It was created by Arbel Hakopian and started under the domain www.GahooYoogle.com in 2005. When it first became known to the public, it was discussed on BBC radio, was chosen as a Hot Site by USA Today, and was featured on Fox News Channel. However, the site was shut down due to legal problems: after the matter had been at issue for a couple of years, a court ordered that GahooYoogle.com be transferred to Yahoo!. After the shutdown, Hakopian decided to expand his original idea and came up with PolyCola.com, which currently operates at www.polycola.com. PolyCola.com lets searchers reduce the time and effort involved in using multiple search engines. A metasearch engine is a tool that lets you submit a word or phrase in a search box and then sends your search concurrently to other individual search engines, each of which queries its own database. Within a couple of seconds, you receive results from several search engines. A metasearch engine only sends your search terms to the databases of individual search engines; it does not have its own database of web pages. Search engines are made up of three main parts. First, the search engine follows links on the web in order to request pages that are either not yet cataloged or have been updated.
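The fan-out behaviour of a metasearch engine can be sketched as below. The two engine functions are hypothetical stand-ins that return fixed lists, since the real services and their interfaces are not described here; the merging rule (keep the first occurrence of each URL) is likewise an assumption, not PolyCola's actual method.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, engines):
    """Send one query to every engine concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(search, query)
                   for name, search in engines.items()}
        per_engine = {name: future.result() for name, future in futures.items()}

    # Merge in engine order, dropping URLs already returned by an earlier engine.
    merged, seen = [], set()
    for results in per_engine.values():
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical engines: each maps a query to a list of result URLs.
def engine_one(query):
    return ["http://example.com/1", "http://example.com/2"]


def engine_two(query):
    return ["http://example.com/2", "http://example.com/3"]


ENGINES = {"one": engine_one, "two": engine_two}
print(metasearch("web crawler", ENGINES))
```

Note that the metasearch function stores no pages of its own; it only forwards the query and combines what the underlying engines return.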
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or (especially in the FOAF community) Web scutters. This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam). A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. The large volume of the Web implies that the crawler can only download a limited number of Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages might already have been updated or even deleted by the time the crawler reaches them.
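The seed and frontier mechanism can be sketched as follows. To stay self-contained, the example crawls a small in-memory link graph instead of the real Web; the function `crawl`, the `fetch_links` callback, and the `TOY_WEB` graph are assumptions for illustration. The first-in, first-out visiting order is only one possible policy, and the `max_pages` limit stands in for the download budget mentioned above.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds.

    fetch_links(url) must return the hyperlinks found on that page.
    """
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for the Web.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

A real crawler would replace `fetch_links` with code that downloads a page and extracts its hyperlinks, and would replace the simple queue with a prioritized frontier.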
The history of the Internet began with the development of computers in the 1950s. This began with point-to-point communication between mainframe computers and terminals, expanded to point-to-point connections between computers and then early research into packet switching. Packet switched networks such as ARPANET, Mark I at NPL in the UK, CYCLADES, Merit Network, Tymnet, and Telenet, were developed in the late 1960s and early 1970s using a variety of protocols. The ARPANET in particular led to the development of protocols for internetworking, where multiple separate networks could be joined together into a network of networks. In 1982 the Internet Protocol Suite (TCP/IP) was standardized and the concept of a world-wide network of fully interconnected TCP/IP networks called the Internet was introduced. Access to the ARPANET was expanded in 1981 when the National Science Foundation (NSF) developed the Computer Science Network (CSNET) and again in 1986 when NSFNET provided access to supercomputer sites in the United States from research and education organizations. Commercial internet service providers (ISPs) began to emerge in the late 1980s and 1990s. The ARPANET was decommissioned in 1990. The Internet was commercialized in 1995 when NSFNET was decommissioned, removing the last restrictions on the use of the Internet to carry commercial traffic.