Combining geographic entity resolution with geographic information retrieval (GIR) opens a new vista for information access and discovery. MetaCarta GTS is such a system. As in all information retrieval systems, relevance and speed are crucial factors in GIR.
GIR Relevance
GIR systems generate result sets in response to user queries. MetaCarta GTS computes several numerical scores for each of the document-location pairs returned in a result set. The total relevance score combines these scores into a single number, which allows the system to present the most useful results first. The total relevance is an estimate of the probability that a particular document-location result will be interesting to the user. This estimate is based on the information input by the user in the query request. Higher relevance results are generally more valuable to users.
MetaCarta GTS combines textual relevance and geographic relevance into the total relevance score. The textual relevance indicates how well the document responds to the particular keywords entered by the user. The geographic relevance indicates how well the document responds to the area indicated by the user's query. The geographic relevance combines many factors related to the location references in the text, including the proximity of terms in the document, the proximity of the places mentioned, text emphasis clues, and the geographic confidence (geoconfidence).
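To make the idea of a combined score concrete, here is a minimal sketch of ranking document-location pairs by a single number. The formula GTS actually uses is not given here, so the weighted geometric mean, the function name `total_relevance`, the `text_weight` parameter, and the sample data are all assumptions chosen for illustration.

```python
def total_relevance(text_rel: float, geo_rel: float, text_weight: float = 0.5) -> float:
    """Combine two scores in [0, 1] into one score in [0, 1].

    Hypothetical formula: a weighted geometric mean, so a result must do
    reasonably well on BOTH the keywords and the map area to rank high.
    """
    for value in (text_rel, geo_rel, text_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return (text_rel ** text_weight) * (geo_rel ** (1.0 - text_weight))


# Illustrative document-location pairs: (document, location, textual, geographic).
results = [
    ("doc-a", "Paris, France", 0.9, 0.4),
    ("doc-b", "Paris, Texas", 0.6, 0.8),
    ("doc-c", "Lyon", 0.2, 0.9),
]

# Present the most useful results first.
ranked = sorted(results, key=lambda r: total_relevance(r[2], r[3]), reverse=True)
for doc, place, text_rel, geo_rel in ranked:
    print(f"{doc:6} {place:14} {total_relevance(text_rel, geo_rel):.3f}")
```

A multiplicative combination is only one possible design choice. Its effect is that a pair scoring near zero on either component sinks to the bottom, whereas a weighted sum would let a strong keyword match compensate for a poor geographic match.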
GeoRelevance of a document-location pair is based in part on the geographic confidence generated by the geoparser.
GeoRelevance also uses the position in the document and the prominence of the references to this location. The latter is a function of whether it is in the title or header, whether it is emphasized or rendered in a large font, and other clues related to the nature and formatting of a document. This is similar to term relevance heuristics in information retrieval, but the pattern of emphasis of geographic references is somewhat different.
GeoRelevance also utilizes the frequency of references to the location in the document and in the larger corpus of documents; this is similar to standard information retrieval techniques.
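The three ingredients described above (geoparser confidence, position and prominence, and frequency in the document versus the corpus) can be sketched as a single scoring function. This is an illustration only: the class `LocationReference`, the weights inside `prominence`, the assumption that earlier references count for more, and the IDF-style corpus factor are all hypothetical and are not taken from GTS.

```python
import math
from dataclasses import dataclass


@dataclass
class LocationReference:
    geoconfidence: float  # geoparser's confidence in [0, 1]
    in_title: bool        # reference appears in the title or a header
    emphasized: bool      # reference is bold, large font, etc.
    position: float       # 0.0 = start of document, 1.0 = end


def prominence(ref: LocationReference) -> float:
    """Hypothetical weights: title and emphasis boost a reference, and
    (by assumption) earlier references count slightly more."""
    score = 1.0
    if ref.in_title:
        score += 1.0
    if ref.emphasized:
        score += 0.5
    score += 0.5 * (1.0 - ref.position)
    return score


def geo_relevance(refs: list[LocationReference],
                  docs_with_location: int,
                  corpus_size: int) -> float:
    """Score one document-location pair in [0, 1)."""
    if not refs:
        return 0.0
    # Locations mentioned in few documents are more distinctive (IDF-like).
    rarity = math.log((1 + corpus_size) / (1 + docs_with_location)) + 1.0
    # Summing over references rewards locations mentioned repeatedly.
    raw = sum(r.geoconfidence * prominence(r) for r in refs) * rarity
    return raw / (1.0 + raw)  # squash into [0, 1)


title_ref = [LocationReference(0.9, in_title=True, emphasized=False, position=0.0)]
body_ref = [LocationReference(0.9, in_title=False, emphasized=False, position=0.9)]
print(geo_relevance(title_ref, docs_with_location=10, corpus_size=1000))
print(geo_relevance(body_ref, docs_with_location=10, corpus_size=1000))
```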
Speed Performance of GIR
Within GTS, answering a user's query entails searching the index for geographic coordinates within a specified multi-dimensional range (the boundaries of the current map view). Most queries entail searching for keywords as well. Keywords and spatial coordinates are different types of data. Words and other string tokens are discrete variables, and spatial coordinates are continuous variables.
Normally, a search over different data types (discrete and continuous) takes time proportional to running the two searches separately, one per index, and then joining the results. Solutions that attempt such a dual-index strategy perform very slowly. Queries in such systems can take days or weeks, because they must consider all of the documents in the corpus at query time. The dual nature of such an approach prevents pre-indexing.
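The dual-index strategy can be sketched in a few lines. The toy corpus, the names `keyword_index`, `spatial_index`, and `search_with_join`, and the linear scan standing in for a real spatial index are assumptions for illustration. The point is only that each index produces its full candidate set before the join, regardless of how few results survive.

```python
from collections import defaultdict

# Toy corpus: doc_id -> (text, (latitude, longitude)).
docs = {
    1: ("flood warning issued", (48.85, 2.35)),
    2: ("flood damage report", (33.66, -95.55)),
    3: ("museum opening", (48.86, 2.34)),
}

# Index 1: discrete data, a keyword -> set of document ids.
keyword_index = defaultdict(set)
for doc_id, (text, _) in docs.items():
    for token in text.split():
        keyword_index[token].add(doc_id)

# Index 2: continuous data, a list of coordinates (scanned linearly here).
spatial_index = [(lat, lon, doc_id) for doc_id, (_, (lat, lon)) in docs.items()]


def search_with_join(keyword, south, west, north, east):
    """Query each index separately, then intersect the two candidate sets."""
    keyword_hits = keyword_index.get(keyword, set())  # every doc with the word
    spatial_hits = {                                   # every doc in the box
        doc_id
        for lat, lon, doc_id in spatial_index
        if south <= lat <= north and west <= lon <= east
    }
    return keyword_hits & spatial_hits                 # the query-time join


print(search_with_join("flood", 48.0, 2.0, 49.0, 3.0))
```

Here the keyword search returns two documents and the spatial search returns two documents, yet only one document satisfies both. On a large corpus each intermediate set can be enormous even when the final answer is small.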
To address this problem, as well as the issues surrounding geographic relevance, MetaCarta developed CartaTrees, a specialized index capable of handling continuous and discrete data elements in a single structure, eliminating the need for a join at query time. The result is very fast and scalable search performance, even over extremely large collections of documents.
In the language of computer science, the CartaTrees intersection algorithm runs in O(n) time ("big O of n"), where n is the number of document-location pairs returned to the user, not the size of the corpus.
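The internals of CartaTrees are not described here, so the following is not that structure. It is a deliberately simple stand-in, a grid-bucketed composite index, that illustrates the general idea of keying one structure on both the keyword and the location so that no join is needed at query time. The class `CompositeIndex`, the one-degree cell size, and the sample data are all assumptions.

```python
import math
from collections import defaultdict

CELL_DEGREES = 1.0  # hypothetical grid resolution


def cell_of(lat: float, lon: float) -> tuple[int, int]:
    """Map a coordinate to the grid cell that contains it."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


class CompositeIndex:
    """One structure keyed on (keyword, grid cell): no join at query time."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, doc_id, text, lat, lon):
        for token in set(text.split()):
            self._postings[(token, cell_of(lat, lon))].append((doc_id, lat, lon))

    def search(self, keyword, south, west, north, east):
        lo_row, lo_col = cell_of(south, west)
        hi_row, hi_col = cell_of(north, east)
        hits = []
        # Visit only the cells overlapping the map view, and within each
        # cell only the postings that already match the keyword.
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                for doc_id, lat, lon in self._postings.get((keyword, (row, col)), ()):
                    # Cells on the border may extend past the map view.
                    if south <= lat <= north and west <= lon <= east:
                        hits.append(doc_id)
        return hits


index = CompositeIndex()
index.add(1, "flood warning issued", 48.85, 2.35)
index.add(2, "flood damage report", 33.66, -95.55)
index.add(3, "museum opening", 48.86, 2.34)
print(index.search("flood", 48.0, 2.0, 49.0, 3.0))
```

The cost of this sketch grows with the number of grid cells visited plus the postings inspected, so it does not reproduce the O(n) bound quoted above. It only shows how a combined key avoids materializing the full keyword and spatial candidate sets.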
Users intuitively appreciate the value of high-speed, high-relevance GIR. Results update instantly as they pan and zoom the map. Surprising information discoveries emerge naturally in the process of exploring the map.