SCaNspect has a team of domain experts who, over 30 man-years, built an ontology of many thousands of linguistic elements (terminology, phraseology and grammatical forms) specific to each domain. The ontology is used to compile these elements into a highly condensed linguistic code. By locating the elements in a text, the code identifies how closely the document's content relates to specific domains (its relational level).
The linguistic code is produced by taking every linguistic element found in a given language and attributing it to an appropriate knowledge domain by explicitly labeling the element with a tag. Some linguistic elements can be attributed to more than one domain and therefore carry more than one tag. The resulting string of tags essentially forms the linguistic code.
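As a rough illustration of the tagging described above, the sketch below maps a few elements to domain tags, with one element carrying two tags. The lexicon entries, the domain names and the functions `tags_for` and `linguistic_code` are hypothetical and chosen only for illustration; they are not SCaNspect's actual data or format.

```python
# Hypothetical lexicon: entries and domain names are illustrative assumptions.
LEXICON = {
    "interest rate": {"FINANCE"},
    "bank": {"FINANCE", "GEOGRAPHY"},  # an element attributed to two domains
    "river": {"GEOGRAPHY"},
}


def tags_for(element):
    """Return the set of domain tags attached to an element (empty if unknown)."""
    return LEXICON.get(element.lower(), set())


def linguistic_code(elements):
    """Pair each element with its sorted tags; the tag sequence stands in for the code."""
    return [(e, tuple(sorted(tags_for(e)))) for e in elements]
```

For example, `linguistic_code(["river", "bank"])` pairs `river` with a single tag and `bank` with two.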
How is the relational level quantified? The entire text of a given document is tagged according to the linguistic code, which specifies which tag should be attributed to which linguistic element. In other words, every word, phrase and grammatical form in the document is given a domain attribute. Where it is ambiguous which domain an element belongs to, the algorithm looks progressively further outward, to the unambiguous elements surrounding it, in order to determine the correct context.
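The expanding search for context can be sketched as follows. This is a minimal sketch under stated assumptions, not SCaNspect's algorithm: the lexicon, the function name `disambiguate`, the `max_radius` limit and the majority-vote rule among unambiguous neighbours are all hypothetical.

```python
# Hypothetical lexicon: entries and domain names are illustrative assumptions.
LEXICON = {
    "bank": {"FINANCE", "GEOGRAPHY"},
    "river": {"GEOGRAPHY"},
    "loan": {"FINANCE"},
}


def disambiguate(tokens, lexicon=LEXICON, max_radius=5):
    """Assign one domain per token (None if unknown or unresolved).

    Ambiguous tokens are resolved by widening a window around them, one
    position at a time, and counting votes from unambiguous neighbours.
    """
    candidates = [lexicon.get(t.lower(), set()) for t in tokens]
    resolved = []
    for i, cands in enumerate(candidates):
        if len(cands) <= 1:
            # Unknown (no tag) or already unambiguous (exactly one tag).
            resolved.append(next(iter(cands), None))
            continue
        choice = None
        for radius in range(1, max_radius + 1):
            lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
            votes = {}
            for j in range(lo, hi):
                if j != i and len(candidates[j]) == 1:
                    (domain,) = candidates[j]
                    if domain in cands:
                        votes[domain] = votes.get(domain, 0) + 1
            ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
            # Accept only a clear winner; otherwise keep widening the window.
            if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
                choice = ranked[0][0]
                break
        resolved.append(choice)
    return resolved
```

With this toy lexicon, `bank` next to `river` resolves to the geography domain, while `bank` within a few tokens of `loan` resolves to finance. If the neighbours tie, the element is left unresolved rather than guessed.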
What are the algorithms? Each domain and element is tagged with a weighted variable. The unit weight is an arbitrary number. A phrase, for example, carries a higher weight than a word. Each weight has been tuned through extensive manual testing during the construction of the ontology to achieve the greatest accuracy in scanning documents.
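One way to picture the weighting is to sum a weight for every tagged element and compare the totals per domain. The sketch below assumes hypothetical weights (the text says only that a phrase outweighs a word and that the unit is arbitrary), and the normalisation to shares that add up to 1 is likewise an assumption made for readability, not a documented part of the product.

```python
from collections import defaultdict

# Hypothetical weights: only the ordering (phrase > word) comes from the text.
WEIGHTS = {"word": 1.0, "phrase": 3.0}


def domain_scores(tagged_elements):
    """Sum weights per domain and normalise so the scores add up to 1.

    `tagged_elements` is a list of (kind, domain) pairs, where kind is a
    key of WEIGHTS and domain is the tag assigned to that element.
    """
    totals = defaultdict(float)
    for kind, domain in tagged_elements:
        totals[domain] += WEIGHTS[kind]
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {domain: weight / grand_total for domain, weight in totals.items()}
```

Because the unit weight is arbitrary, only the ratios between weights affect the normalised scores: doubling every entry in `WEIGHTS` leaves the result unchanged.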
Copyright © SCaNspect, 2004