11/6/2022 0 Comments

How does the Google search engine work?

Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five companies in the American information technology industry, along with Amazon, Apple, Meta (Facebook), and Microsoft. (Refer fig.: then-CEO and former Chairman of Google Eric Schmidt with co-founders Sergey Brin and Larry Page, left to right, in 2008.)

You may like to read Introduction to search engines before we begin with this post. A search engine works on a simple iterative algorithm, and this algorithm differs from one search engine to another as well as with the kind of query.

The architecture of the Google search engine: (Refer fig.)

In the Google search engine, web crawling is done by several distributed crawlers. There is a URL server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver, which compresses and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.

The indexing function is performed by the indexer and the sorter. The indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of barrels, creating a partially sorted forward index.

The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URLresolver reads the anchors file and converts relative URLs into absolute URLs, and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRank for all the documents.

The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place, so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon, together with the inverted index and the PageRanks, to answer queries.

Running a web crawler is a challenging task. Crawling is the most fragile part of the system, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. To scale to millions of web pages, Google has a fast distributed crawling system: a single URLserver serves lists of URLs to a number of crawlers, and each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace, and it makes the crawler a complex component of the system. (Refer fig.) Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet, which means running a crawler that connects to more than half a million servers and generates tens of millions of log entries.

Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Combining all of this information into a rank is difficult.

First, consider the simplest case: a single-word query. In order to rank a document with a single-word query, Google looks at that document's hit list for that word. Google counts the number of hits of each type in the hit list, then computes an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.

For a multi-word search, the situation is more complicated. Multiple hit lists must be scanned through at once, so that hits occurring close together in a document are weighted higher than hits occurring far apart. For every matched set of hits, a proximity is computed, and counts are computed not only for every type of hit but for every type and proximity.
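To make the crawl pipeline concrete, here is a miniature, single-threaded sketch of the URL-server/crawler/storeserver flow. All the names here (`Repository`, `crawl`, `fetch`) are illustrative inventions, not Google's actual code, and a real crawler keeps hundreds of connections open rather than fetching one URL at a time.

```python
import zlib
from collections import deque

class Repository:
    """Stands in for the storeserver: assigns docIDs and stores compressed pages."""
    def __init__(self):
        self.docs = {}      # docID -> compressed page
        self.doc_ids = {}   # URL -> docID, assigned when a URL is first seen
    def store(self, url, html):
        doc_id = self.doc_ids.setdefault(url, len(self.doc_ids))
        self.docs[doc_id] = zlib.compress(html.encode())
        return doc_id

def crawl(seed_urls, fetch):
    """Tiny stand-in for the URLserver plus crawlers.
    `fetch(url)` must return (html, outgoing_links)."""
    repo = Repository()
    queue, seen = deque(seed_urls), set(seed_urls)
    while queue:
        url = queue.popleft()
        html, links = fetch(url)
        repo.store(url, html)
        for link in links:          # enqueue newly parsed URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repo
```

Running `crawl` over a two-page fake web stores both pages and hands out docIDs 0 and 1 in discovery order.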
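The indexer/sorter step described above can also be sketched as a toy. This version records only word positions as hits (the real hits also carry font size and capitalization, and are spread across barrels), and the `invert` function plays the sorter's role of regrouping docID-ordered hits into wordID order.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Toy indexer: convert each document into hits of (wordID, position)
    and build the lexicon (word -> wordID) along the way."""
    lexicon = {}
    forward = {}                      # docID -> list of (wordID, position)
    for doc_id, text in docs.items():
        hits = []
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
        forward[doc_id] = hits
    return lexicon, forward

def invert(forward):
    """Toy sorter: regroup hits by wordID, producing the inverted index
    wordID -> [(docID, position), ...] sorted by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, pos in forward[doc_id]:
            inverted[word_id].append((doc_id, pos))
    return dict(inverted)
```

Given `{0: "the quick fox", 1: "the fox"}`, the forward index is grouped per document, while the inverted index lists, for each word, every document and position where it occurs.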
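The links database of docID pairs is exactly what the PageRank computation consumes. A minimal power-iteration sketch is below, using the 0.85 damping factor from the original PageRank paper; it assumes every document has at least one outgoing link and skips the dangling-page handling a real implementation needs.

```python
def pagerank(links, n_docs, damping=0.85, iterations=50):
    """Iteratively compute PageRank from a links database of
    (from_docID, to_docID) pairs. Assumes out_degree > 0 for every doc."""
    out_degree = [0] * n_docs
    for src, _ in links:
        out_degree[src] += 1
    ranks = [1.0 / n_docs] * n_docs
    for _ in range(iterations):
        incoming = [0.0] * n_docs
        for src, dst in links:
            # each page splits its rank evenly among its outgoing links
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = [(1 - damping) / n_docs + damping * r for r in incoming]
    return ranks
```

On a three-page graph where pages 1 and 2 both link to page 0, page 0 ends up with the highest rank, and the ranks sum to 1.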
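Finally, the single-word ranking step (count hits by type, compute an IR score, then fold in PageRank) can be illustrated as follows. Google's actual type-weights and its function for combining the IR score with PageRank have never been published, so the weights and the linear blend here are purely illustrative assumptions.

```python
# Illustrative type-weights; Google's real weights are not public.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "plain": 1.0}

def ir_score(hit_list):
    """Single-word IR score: count the hits of each type in the hit list
    and take a type-weighted sum."""
    counts = {}
    for hit_type in hit_list:
        counts[hit_type] = counts.get(hit_type, 0) + 1
    return sum(TYPE_WEIGHTS[t] * c for t, c in counts.items())

def final_rank(hit_list, pagerank, alpha=0.5):
    """Combine the IR score with PageRank. A simple linear blend is an
    assumption; the real combination function is unpublished."""
    return alpha * ir_score(hit_list) + (1 - alpha) * pagerank
```

For a multi-word query, the same idea extends by also counting matched hit sets per proximity bucket, as the post describes, rather than per type alone.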