WO2001033419A2 - Access by content based computer system - Google Patents
- Publication number
- WO2001033419A2 (PCT/IB2000/001697)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- focuser
- words
- steps
- content
- focus
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Definitions
- This invention relates to content and context accessible computer systems.
- When recalling an event such as a dinner out, the restaurant, the piano bar and the spare ribs are the primary memories. One is not primarily concerned about the date of the event.
- This invention comprises a content-accessible method and system for operation of a computer.
- the three main parts of this invention include, first, a method for defining, classifying and indexing content; second, a method for designating all real numbers, including integers, such that they can be arranged easily in a monotonic fashion; and third, a fast, linear method of sorting through the content according to the associated monotonic real numbers, including integers, to access contents.
- Fully Optimized Content Using Computer System is designed as a linguistic package that is simple enough to be handled by the average user and fast enough to cope with network speeds. It is an indexing and searching system, which combines simplicity, speed, and efficiency and is based on users of the system knowing what information they want to get, as opposed to knowing an address where some information might be stored.
- Figure 1 shows the functional architecture of the content accessed computer system
- Figure 2A shows naming of the monotonic numbering sequence
- Figure 2B shows a finer detail for the naming of the monotonic numbering sequence
- Figure 3 illustrates part of the linear sorting technique
- Figure 4 shows the L-shaped repository files.
- This invention comprises a content-accessible method and system for operation of a computer.
- the main operating parts of this invention are shown in Figure 1 , including, first, a method for defining, classifying and indexing content (Focusers 101); second, a method for designating all real numbers, including integers, such that they can be arranged easily in a monotonic fashion (coding, 102 ); and third, a fast, linear method of sorting 103 through the content according to the associated monotonic real numbers, including integers.
- the system is designed to meet the user's twofold needs: (1) ease of use, being free to express the request as the user likes (Query 104); (2) getting pertinent and immediate information 105.
- the content accessible information functionally resides in a repository 106.
- Point 1 is mainly a matter of ergonomics.
- a simple dialogue box and a button asking for some "advanced" search solve this aspect.
- the possibility of searching by "themes” is added to this.
- a theme is defined by words, expressions or texts and is used to get pertinent information.
- Point 2 is achieved by systems using semantics: the meaning of words.
- Prior systems trying to utilize semantics ended up with a heavy computational overhead.
- although the processes described here fall into the field of semantics, they are very far from what is currently used in this field, which has not yet proved useful.
- the FOCUS system's ergonomics are simple.
- A one text field
- B one radio button "theme”
- C one button “advanced search”
- D one button "search by meaning”.
- the text field allows the use of a convention to "force” groups of words as opposed to a mere sequence of words (such as using quotes to define the group of words).
- Advanced search deals with Boolean logic applied to search components. Search by meaning is done through a file or a list of files corresponding to a given theme. These files could be lists of words.
- a "theme” is a list of simple words chosen by the user as that user's expression of a given concept, along with the frequency in the given texts. This list of simple words can then be sorted by pertinence using the frequency deviance from a reference.
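As a sketch of how such a theme list might be ranked (the function and the reference-frequency table are illustrative assumptions, not the patent's implementation):

```python
from collections import Counter

def theme_words(example_texts, reference_freq, top=10):
    """Rank a theme's candidate words by how much their frequency in the
    user's example texts deviates from a reference-corpus frequency."""
    counts = Counter(w for t in example_texts for w in t.lower().split())
    total = sum(counts.values())
    def deviance(word):
        return counts[word] / total - reference_freq.get(word, 0.0)
    # most deviant (most theme-specific) words first
    return sorted(counts, key=deviance, reverse=True)[:top]
```

Common words such as "the" score near zero because their observed frequency matches the reference, so genuinely thematic words float to the top of the list.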
- the "theme” radio button indicates that the word or expression to be searched for is to be considered as a theme.
- the themes are pre-computed, thus allowing very fast semantic search.
- Results can be displayed in three forms. The first is using the title of the document when it can be identified. The second is using some sort of automatic abstract (first sentences or first expressions, or first most deviant expressions in the most pertinent paragraphs in the file). The third is displaying the text itself, in which case the viewing program might highlight pertinent paragraphs and words and so allow an easier navigation through the hits.
- references themselves are to be recorded as contents. They are part of the items of the computer system that carry a meaning. For instance, for each filename, one will record the complete path along with all of its components: drive ID, directory names, directory extensions, file name, file extension and every word in these.
- Instant retrieval means building a repository with direct access to all its information contents. Having linear characteristics means that from the first step on, namely, the sorting routine that feeds the repository, every process must be linear on any set of data available at one time. Being independent of the amount of information to monitor allows one to avoid the use of multiple repository files as well as the copy of the repository to update it.
- Real-time processing means handling of file manipulation such as updating, renaming, moving and deleting as they occur without interruption or break in the time line.
- Intelligent filtering, as part of the FOCUS system, is not simply a matter of giving a list of wanted or unwanted terms. It is usually undertaken with techniques carrying a heavy computational burden that can require several minutes to decide the fate of a single sentence. The approach here is very different: the semantic configuration only requires the user to give examples of texts that are meant to be representative of the "theme" to be defined.
- the "feeding" of FOCUS can be done in a variety of ways. For instance, receiving data from a network can trigger the analysis of its content. Updating a database record can trigger the analysis as well. Updating a word processor file can also trigger the analysis.
The concept of Focusers
- a Focuser is defined as a set of the following elements: (1) its name; (2) a list of words or expressions that represents the Focuser; (3) its language description (French, English, ...), as an option but not required; (4) several parameters that control its existence in a paragraph, or its list of words or expressions: (a) the number of words or expressions in a paragraph, fewer of which found in a paragraph indicate that the paragraph is not relevant for the Focuser (this number is used for the detection of a Focuser); (b) the number of words or expressions in a paragraph below which the paragraph might be relevant and above which the paragraph is really relevant for the Focuser; (c) a threshold of pertinence, below which a word does not belong to the Focuser (this threshold is used to build a Focuser).
Building Focusers from a text
- A specific expressions that are very relevant;
- B synonyms (which can be automatically proposed by the system);
- C words or expressions, which should be excluded from the Focuser (forced zero pertinence value);
- D words or expressions, which discriminate the Focuser, i.e. they are automatically excluded from the Focuser (negative pertinence value);
- E a word that is accepted, excluding all expressions containing the word.
- the Focuser is then represented by some of the words that have the biggest value.
- the threshold can be chosen by the program or the user.
- a very simple example of a text for a Focuser is the following: "This is about horses, horse, horseback cavalry, @mounted_troupes. horse." Here "@mounted_troupes" is a very relevant expression.
- the Focuser itself would look like the list of words where non-pertinent words have been removed: cavalry, horse, horses, horseback, @mounted_troupes
- a Focuser does not mean anything per se but takes all its meaning when it is compared to a text that has paragraphs. So, a Focuser is said to be recognized when a certain number of words or expressions pertaining to this Focuser are recognized in the same paragraph. This number is a parameter of the Focuser, and its value is usually around 3. Expressions that have been manually entered in the Focuser have a value of 3 times the value of a single word in the Focuser.
Automatic routing/filtering using content of a text
- a "profile" is a set of positive and negative Focusers. Whenever a mail or an HTML text or any text goes through the filter, all defined Focusers are detected (see paragraph "Detection of a Focuser") and compared to the profile: if a negative Focuser has been recognized, the text is rejected; else, if there is at least one positive Focuser, the text is accepted. Any combination of positive and negative Focusers can be used.
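A minimal sketch of detection and profile filtering as described (the weight of 3 for expressions and the default threshold of 3 follow the text; the data layout and helper names are assumptions):

```python
def focuser_recognized(paragraph, words, expressions, threshold=3):
    """A Focuser is recognized when enough of its words/expressions occur
    in one paragraph; a manually entered expression counts 3x a word."""
    text = paragraph.lower()
    score = sum(1 for tok in text.split() if tok in words)
    score += sum(3 for e in expressions if e in text)
    return score >= threshold

def route(text, positive_focusers, negative_focusers):
    """Profile filter: reject on any negative Focuser, otherwise accept
    if at least one positive Focuser is recognized."""
    paragraphs = text.split("\n\n")
    def detected(f):
        return any(focuser_recognized(p, f["words"], f["expressions"])
                   for p in paragraphs)
    if any(detected(f) for f in negative_focusers):
        return "rejected"
    if any(detected(f) for f in positive_focusers):
        return "accepted"
    return "unclassified"   # the neither-case is not specified in the text
```

The "unclassified" outcome is an assumption: the text only specifies rejection on a negative Focuser and acceptance on a positive one.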
- the semantic analysis process qualifies every paragraph according to predefined Focusers. Instead of just recording that qualification along with the data one can use it to route (or stop) the incoming information when particular themes are associated to particular destinations. On the other hand, all this can be kept in a central repository, each user being configured to have an implicit Boolean AND with his themes. This procedure would allow some network manager to monitor what is routed (or stopped), and where it is routed (stopped).
- the algorithm is the following: get all paragraphs containing this word, then extract the concept from the concatenation of these paragraphs. It is very time consuming to do this for all words, so, for example, one can limit this kind of extraction to expressions that contain at least two meaningful words. So, the user can ask for a given expression as such or for this expression as a concept.
- the information stored in the automaton is only the address to the information which is, for example, stored in a file.
- the <separator> and <type of information> must be small (usually one letter) in order to be as compact as possible.
- semantic networks, thesauri, lexicons, etc. are represented this way provided they can be contained in memory.
- Basic routines of dictionary manipulation can be implemented in silicon, either on a general processor-based controller or on a dedicated RISC chip.
Automatic viewpoint of a text using deviance
- the Focuser is said to be a "viewpoint" on the text.
- a viewpoint simply gives the most deviant words or expressions that are both in the text and in the Focuser.
- Expressions do not need to be explicitly in the Focuser because the deviance of the expression is the compounded deviance of the words in the expression.
- a compounded deviance is the sum of the deviances of the single words contained in the expression, divided by their number. If no specific viewpoint is given, the viewpoint is simply the Focuser built from the text itself.
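The compounded-deviance rule can be written down directly (the deviance table here is hypothetical):

```python
def compounded_deviance(expression, word_deviance):
    """Compounded deviance of an expression = sum of its single-word
    deviances divided by the number of words, per the definition above."""
    words = expression.split()
    return sum(word_deviance.get(w, 0.0) for w in words) / len(words)
```

Because the value is derived from single-word deviances, an expression need not be stored explicitly in the Focuser to be scored.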
- Recognizing a particular language could be done by comparing all the terms of a sentence to all the dictionaries of the planet, coded in every possible character code page. It is easy to see that this method is too heavy. A much faster and roughly as accurate solution is to characterize languages by the statistical distribution of n-uplets.
- the elements to be sorted are simply the n-uplet starting on the first byte, the one starting on the second byte, and so on. Then one counts the duplicates and compares the result to a pre-built n-uplets database.
- This database is built by applying the same process to a text large enough to be representative. A sample text on the order of 1 Mbyte is considered sufficient to build the database.
- This database consists of n-uplets and their corresponding frequency for each combination language/code page.
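A toy sketch of this statistic (the profile construction and the distance measure are illustrative choices, not the patent's):

```python
from collections import Counter

def nuplet_profile(text: bytes, n: int = 3):
    """Relative frequency of every overlapping n-uplet in a byte string."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def closest_language(sample: bytes, profiles: dict, n: int = 3):
    """Compare the sample's n-uplet distribution to each pre-built
    language/code-page profile and return the nearest one."""
    sp = nuplet_profile(sample, n)
    def distance(ref):
        return sum(abs(sp.get(g, 0.0) - ref.get(g, 0.0))
                   for g in set(sp) | set(ref))
    return min(profiles, key=lambda lang: distance(profiles[lang]))
```

In practice the pre-built profiles would come from the ~1 Mbyte representative samples mentioned above, one per language/code-page combination.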
- a content accessible computer considers every content to have three parameters (1) a reference (2) the content itself (3) a position.
- a reference is a string of bytes that indicates to the hosting software how to get to the content. This can be a file name, an address in memory, a field in a database record, a file in an archive, etc. References are preceded by a code indicating which software will be asked for the content: the filing system for a file name, an archiving program for an archived reference, a database management system for a record, etc. The rest of the string is an argument that is passed to the host, such as an SQL request for a database manager. The coding used is host dependent.
- a third aspect is position. This is not a physical position in the computer's memory or the computer's storage area. This position refers to something meaningful to the user. In a text file, for example, this position can be the number of the chapter, the number of the section, the number of the paragraph, etc. In a database, this may be the number of the table, the number of a record, the number of a field, the number of a paragraph, the number of a sentence, and so on. In a medical scanner's image, this position could be XYZ coordinates and the red-green-blue (RGB) false color contents. Position is primarily recorded to provide proximity management.
- This technology provides a way to identify the content of a document before opening it, which can be a relatively lengthy operation on large documents, accessed via a network.
- the technology of glimpses uses the idea of an abstract of the document, but creates it automatically from the content of the whole document.
- since the linguistics of a FOCUS extracts groups of words, it is relatively easy to select the groups that are most characteristic of a document by comparing them to a statistical analysis of a given body of data.
- a Focuser is a text file giving a list of terms, synonyms and expressions of how the "Focuser owner" would speak of something when speaking to another person on a given subject. For any set of text that is considered pertaining to a given domain, this domain can be described in a Focuser and the result can be used as above to give a more accurate reference for this domain.
- a dynamic Focuser is a Focuser that is not compiled and created at filtering time but one which is analyzed and created in real time on one or several Focus Repositories.
- a Focuser is a text file giving a list of terms, synonyms and expressions of how the "Focuser owner" would speak of something when speaking to another person on a given subject.
- Computing a Focuser is not restricted to the time a document is analyzed. In fact, it can be done any time, even when the user asks a question.
- the format can be anything. The user may type a series of words and/or expressions, or type real sentences, or better, grab entire sentences and paragraphs from any document, in particular those which have been found through another query on their FOCUS.
- the first job of the system is to make a query of all the words in this text, then to sort them according to the documents and the positions they are in. This function is greatly helped if the numbers of the paragraphs are stored in the FOCUS' memory; if not, this technique reverts to what was known as "clustering", which, by the way, is linguistic nonsense. Allowing entire paragraphs to be used as a REAL query with an acceptable response time is a breakthrough in NATURAL LANGUAGE computing. Even more, these queries can be intermixed (Boolean) with "ordinary" queries on text, images, music, etc.
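A sketch of the query side of a dynamic Focuser (the inverted index of (document, paragraph) postings is assumed to be what the FOCUS repository provides; names are illustrative):

```python
from collections import Counter

def dynamic_focuser_hits(query_text, index, min_shared=3):
    """Look up every word of the query in an inverted index of
    (document, paragraph) postings and keep the paragraphs where
    enough distinct query words co-occur."""
    hits = Counter()
    for word in set(query_text.lower().split()):
        for posting in index.get(word, ()):
            hits[posting] += 1          # posting = (document, paragraph)
    return [(p, n) for p, n in hits.most_common() if n >= min_shared]
```

Because only the index is consulted, entire grabbed paragraphs can serve as the query without reopening any original document.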
- this operation is fast enough to be carried out interactively.
- this definition (the Focuser) and its results (the selected paragraphs) can be recorded by the FOCUS.
- This operation again, involves no access to the original documents and can be appropriate for a network environment (such as a Web portal).
- a request on the expressions taken in the glimpses of a given document can quickly provide a set of similar documents (textual, images, music, etc.). There are two ways of achieving this.
- variable length coding needs the length to be part of the code.
- Computer data are always a string of bits (or bytes).
- the upper bits of the first byte of the string indicate the length of the string in bytes.
- all remaining bits are used to indicate the value. This is hence optimal.
- the space taken is the absolute minimum and there is no limit to the numbers (when all 8 bits of the first byte are set, coding goes on into the second byte).
- the strings are binary monotonic, which allows sorting and order testing without decoding. Routines to code and decode are straightforward.
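A sketch of such a length-prefixed monotonic coding. The exact bit layout below is an assumption; the text only fixes the properties that the leading bits of the first byte give the length and that plain byte comparison must match numeric order:

```python
def mfpc_encode(n: int) -> bytes:
    """Encode a non-negative integer so that byte-wise comparison of the
    codes matches numeric order: (L-1) leading 1-bits then a 0 announce an
    L-byte code; the remaining bits hold the value, big-endian, offset so
    each length covers the next disjoint range."""
    offset = 0
    for length in range(1, 9):
        payload_bits = 8 * length - length        # bits left after the prefix
        span = 1 << payload_bits
        if n < offset + span:
            prefix = ((1 << (length - 1)) - 1) << 1   # 0, 10, 110, ...
            word = (prefix << payload_bits) | (n - offset)
            return word.to_bytes(length, "big")
        offset += span
    raise ValueError("sketch limited to 8-byte codes; the scheme itself is unlimited")

def mfpc_decode(b: bytes) -> int:
    """Invert mfpc_encode by counting the leading 1-bits of the first byte."""
    length, mask = 1, 0x80
    while b[0] & mask:
        length, mask = length + 1, mask >> 1
    payload_bits = 8 * length - length
    value = int.from_bytes(b[:length], "big") & ((1 << payload_bits) - 1)
    offset = sum(1 << (8 * k - k) for k in range(1, length))
    return offset + value
```

Because longer codes begin with more leading 1-bits, any longer code compares greater than any shorter one, so sorting the raw byte strings sorts the numbers without decoding.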
- This coding is independent of the processor's representation of numbers (big or little endian). It can be based on other units, such as 16-bit words, when computers agree on their 16-bit binary representation. If the numerical fields in a database comply with the MFPC coding, then all fields can be sorted as binary values, without dynamic collation for numerical fields.
- For numerical information, that is, real numbers including integers, FOCUS does not put any limitation either as to the number of digits in the mantissa or the value of the exponent.
- the real numbers can be divided according to the sign of the mantissa 201 and of the exponent 202. The overall value depends on these classifications.
- Figure 2B shows a further classification where, for example, infinity plus 204 or minus infinity 205 are separated out, plus one 206 and minus one 207 and zero 208 is also separated out, although these are not really required, but it doesn't hurt.
- In the first row, the mantissa 201 has a plus sign 209 and the exponent 202 also has a plus sign 210; this is coded as D 211 in this system.
- In the next case, the mantissa 201 is plus 212 and the exponent is minus 213, and the code is C 214.
- Intrinsically, C 214 is less than D 211 (equivalently, D 211 is greater than C 214).
- the next case has the mantissa 201 as minus 216, the exponent 202 as plus 217, and is coded as B 218.
- B 218 is less than C 214, however, it is greater than the code A 219, which has a mantissa 201 of minus 220 and an exponent of minus 221.
- Consider the ordering within these classes. In the first case the mantissa is positive but is raised to a negative power, that is, one over something that is getting very big, so the value shrinks toward zero. In the next-to-last row the mantissa is minus, so it is a negative number, and the exponent is positive, so its magnitude gets very big: a very big negative number. Thus, in the F 251 and B 218 areas, the larger the absolute value of the exponent, the smaller the value of the number. Therefore, one may want to complement the exponent to some base; for example, 12 in base 10 would become 99 minus 12 equals 87. The absolute value of the mantissa is then coded similarly, to some base.
- One way of multiplexing numbers is to write numbers in base N using the digits of base N+1, so that the digit N never appears inside a number. If one encodes in base 2, the digits are zeros and ones; written in base 3 (N = 2, N+1 = 3), a stream looks like 1001..., then a 2, then some other number 1001..., and so on, the digit 2 serving strictly as a separator. Smaller values of N waste less space, so N of 2 or 3 is preferred. It turns out that 2 is better for coding speed and 3 for decoding speed, so the actual choice is a toss-up.
Linear Sorting
- the value will be the pointer to the byte; if the entry already has a value, the end pointer will point to the new byte while the byte pointer of the previous end value will be made to point to the new byte.
- at the end of the scan, the values recorded in the byte pointers are read out in the collation order for that particular rank.
- Figure 3 shows the situation with an array of starting pointers 301, an array of ending pointers 302 and an array of pointers to the next 303.
- the first 3 ranks or letters XYZ 304 have been sorted and found alike and now the next letter in the XYZ... series, namely, A 305, is being examined.
- the routine processes the bytes of a given rank as a list.
- the next rank will only be processed when there is more than one item in a rank list. This can be known using any kind of flagging for end of input string.
- the next rank does not need to be R plus 1, but can be in any order rank that is desired.
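The rank-by-rank procedure corresponds to MSD radix sorting; a compact sketch, using Python lists per bucket instead of the pointer arrays of Figure 3:

```python
def rank_sort(items, rank=0):
    """Sort byte strings one rank (byte position) at a time; a rank list is
    only processed further when it holds more than one item, and exhausted
    strings (end of input string reached) sort before longer ones."""
    if len(items) <= 1:
        return list(items)
    finished, buckets = [], {}
    for s in items:
        if rank >= len(s):
            finished.append(s)                 # end-of-string "flag"
        else:
            buckets.setdefault(s[rank], []).append(s)
    out = finished
    for value in sorted(buckets):              # collation order for this rank
        out.extend(rank_sort(buckets[value], rank + 1))
    return out
```

Each byte of input is touched a bounded number of times, which is the linear behaviour the text requires, and duplicates keep their original order.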
Date/duration coding based on the MFPC
- A date or a duration can be expressed as a number of years, months, days, hours, minutes, seconds and fractions of a second, or any combination of these. Using MFPC on all these values allows one to represent any length of time, no matter how small or big. For dates, all values but years and fractions of a second have limited ranges and can thus be represented as simple binary integers. In a repository, it is advised to record the full date or duration along with the individual components, so that asking for "the month of May" will not end in a "token research".
Repository - Logical Structure
- An optimal solution providing the fastest access to any individual item utilizes "L" shaped strings.
- An L shaped string is an optional "horizontal string" and a "vertical string" of at least one byte, each byte of the vertical string being associated with a pointer to the next L string. For instance, to store "Albert"
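The L-string idea can be sketched as a byte trie (the field names and the split logic below are assumptions; the patent stores pointers and packed strings on disk, not Python objects):

```python
class LNode:
    """One 'L': a shared horizontal string plus vertical branching bytes,
    each branching byte pointing to the next L."""
    def __init__(self, prefix=b""):
        self.prefix = prefix      # horizontal part (may be empty)
        self.branches = {}        # vertical part: byte value -> next LNode
        self.terminal = False     # a stored string ends here

def insert(node, word):
    i = 0
    while i < len(node.prefix) and i < len(word) and node.prefix[i] == word[i]:
        i += 1
    if i < len(node.prefix):
        # split the horizontal string: its tail becomes a child L
        tail = LNode(node.prefix[i + 1:])
        tail.branches, tail.terminal = node.branches, node.terminal
        node.branches = {node.prefix[i]: tail}
        node.terminal = False
        node.prefix = node.prefix[:i]
    rest = word[i:]
    if not rest:
        node.terminal = True
    elif rest[0] in node.branches:
        insert(node.branches[rest[0]], rest[1:])
    else:
        child = LNode(rest[1:])
        child.terminal = True
        node.branches[rest[0]] = child

def contains(node, word):
    if not word.startswith(node.prefix):
        return False
    rest = word[len(node.prefix):]
    if not rest:
        return node.terminal
    nxt = node.branches.get(rest[0])
    return nxt is not None and contains(nxt, rest[1:])
```

Because common prefixes are stored once, no character is ever duplicated along a path, which is the property the text claims for L strings.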
- In order to record textual information uniformly, FOCUS uses the Unicode standard codification. This is a two-byte encoding scheme that covers all the glyphs used on the planet and is open ended. L strings never duplicate a given character. In order to handle badly accented, inconsistently spelled or inconsistently written words, more than one representation can be considered. For example, the number 12 can be written as a sequence of characters "twelve", as a sequence "12" or as a numerical value 12. There is nothing in FOCUS that prevents using the three representations at once. The same thing may be done with a compound word such as weekend. This can be stored as "weekend", "week-end", "week" and "end", all at the same address.
Automatic splitting of repository when disk access becomes critical
- the problem here is how to split a repository in the first place.
- One of the utilities used with a FOCUS is sequential reading of its repository. This program is used to build a new repository on the second disk up to the point where it is half the original. Then it goes on reading but writes the second half to another file. Then the original is erased. This is preferable to keeping the original as the second half and erasing the data stored in its first half, since further splits would then leave unnecessarily empty files.
- the number and names of repositories are stored in the "basic" repository (the one that starts with the lowest strings) as a special code (the lowest) which keeps the whole management inside the FOCUS.
- Intercepting operating system calls on file openings, file closings, writing to files, etc. allows one to build a chronological log of events, which in turn allows one to decide which file is to be declared deleted, moved, renamed or updated. Unless a smarter (OS-dependent) method can be used on updated files, an update is to be considered as a deletion followed by a creation. Partial updates will generally only stem from a new analysis (semantic, for instance) producing new parameters and canceling others. If the reference concerns a field in a database, it is only necessary to have automatic logging of editing on this database, the logging file containing all necessary parameters. An easy format for these logs is ODBC. More generally, changes and updates are better taken into account at their very generation.
- "Full text" engines are generally unable to record non-textual information. There are two ways of doing this: deciding on one or more "prefixes" and deciding on one or more "suffixes". For instance, if textual information is meant to be prefixed by "0", then numeric information can be prefixed by "1", dates by "2", MIDI by "3", etc. If textual information is a file content, one can decide on suffix "0". If this textual information is a parameter for the filing system, then the postfix is "1". If it is a filename, the "1" postfix can be appended by a "0", and so on. The choice of putting a prefix or a postfix is dependent upon the access that will be used for retrieval.
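The prefix/postfix convention can be sketched as key construction; the code assignments below just echo the examples in the text:

```python
TYPE_PREFIX = {"text": "0", "numeric": "1", "date": "2", "midi": "3"}

def repository_key(kind, payload, postfix=""):
    """Build a repository key: a type prefix keeps heterogeneous data in
    disjoint, separately-scannable regions of the sorted repository; an
    optional postfix tags the role of the entry (e.g. file content vs.
    filing-system parameter)."""
    return TYPE_PREFIX[kind] + payload + postfix

# all text entries share the "0" prefix, so a range scan isolates them
keys = sorted([
    repository_key("text", "horse", "0"),     # file content
    repository_key("text", "horse", "10"),    # filename component
    repository_key("numeric", "12"),
])
```

Since the repository is kept in sorted order, the prefix decides which queries are cheap range scans; that is why the pre/post choice depends on the access pattern.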
- Feeding FOCUS can be done in a variety of ways. For instance, receiving data from a network can trigger the analysis of its content. Updating a database record can trigger the analysis as well. Updating a word processor file can also trigger the analysis. Whichever process is used, all FOCUS wants is to get a list of references to be taken into account in a dedicated file. One can point out immediately that the analysis process can also decide not to store or keep the data, such as for network filtering. But it may also flag the data as being "unacceptable" and store it anyway, to allow the network manager to trace unacceptable information.
- the FOCUS sorting routine already uses the FOCUS repository format, so that users can access recent data very quickly. But multiplying these files on a particular physical disk drive means degrading the system's performance on queries and updates (particularly deletions). So the final step is to "merge" these temporary repositories into a single final one on every physical drive. Although this process is relatively fast, it can be carried out as a low-priority background task (FOCUS controls its own background priorities). Depending on the application, however, temporary repositories can also be kept instead of merged (if they each carry one day of incoming data and are to be erased after a given number of days, for instance).
- references are recorded in the repository and given an ID (or handle) number (unlimited integer).
- the cross-reference (from ID number to reference) is also recorded.
- When a reference is deleted, its cross-reference is first marked as deleted; deletion of its entries then occurs, and at the end of this deletion process the cross-reference is marked as free.
- FOCUS is not concerned with meanings associated to data; only users are.
- the current approach here is to process the first rank and store all classes that are guaranteed to fit in RAM for a final processing. This is based on the fact that a class must be smaller than the amount of RAM available, prorated between what has already been read and what is left to be read, assuming that all subsequent data will have the same proportion of that class.
- the next rank is used for classification and the process goes on.
- classes are read in the collating order and classification is thoroughly performed. This occurs only once.
- the first subclass is loaded along with the class before the processing is undertaken. All classified data are then flushed onto the disk, data is reorganized in RAM and the next subclass is loaded. And so on.
- RAM: random access memory
- the temporary files will use slightly more storage than the final result, but no more than the number of temporary files multiplied by the cluster size.
- the amount of RAM devoted to the process is not critical. Variations of only about 20% have been observed. Organizing the RAM for the sorting routine as described here is just one of the ways it could be implemented. At any given moment, there might be a maximum of three lists in RAM: the list of items itself, the list of files from previous loads, and the list of files being stored with the current load. Loading a buffer of data is thus done between the end of the first list and the beginning of its data structure (pointers).
- This third list is made up in the following way. Let us consider that one is in the first rank and that the byte value is B. Let us also consider that the size of the B class is OK to fit in a file which already exists (this is known by reading the second list, as it is sorted). Then all entries of the B class will be appended to the B file, and the B list is skipped. If the B file is created here, the first entry of the B class will be given a length of 1 byte and will point to the "B+1" (according to collation) class. When all items have been sent to their respective files, their sequence has been replaced by the second list. These file "surnames" are then moved to append the first list, and their pointers are moved before the first list's pointers. The two lists are then joined and sorted. Filenames are given in ascending order such as "0000", "0001", ...; their surnames are the letters they contain that have already been classified. These names can be appended to the surname, taking into account that the actual data is 4 bytes (in the example) longer than the string to be sorted. Reading a datum gives direct access to the corresponding filename. If doublets are to be skipped, this is done as soon as they are detected (the first occurrence points directly to the next item, bypassing the doublet).
- Swapping data between RAM and disk implies some clustering scheme.
- FOCUS will accept any cluster size above 4k bytes (a minimum based on an 8-bit data analysis and small repositories) which is convenient for the user and compatible with its longest item (roughly twice as much).
- For simplicity's sake, the implementation arbitrarily restricts the cluster size to be a power of 2, which is generally what operating systems' swapping schemes use anyway.
- This clustering scheme brings fragmentation of the repository so that L strings can point either to another L string in the same cluster or to another L string in another cluster, and this can be indicated by a flag bit.
- subclusters are a power-of-two fraction (1/2, 1/4, ...) of a cluster.
- the general address of an L string thus becomes: cluster, subcluster, and byte address.
- the composite structure of FOCUS strings implies two subsequent fields.
- the length of a string is the length of the data after which numerical values contain their own length (see Unlimited monotonous integer numbers of minimum length)
- if the feeding is done in such a way that the IDs are monotonic, and positions within IDs likewise, it is enough to sort on the content only, provided "duplicates" are kept in their original order.
- FOCUS is an operating system by itself, and as such has its own filing system.
- This filing system could be used by any operating system with great profit, but it also allows us to think of putting the system directly on chips aboard the disk controllers.
- This filing system uses no FAT (File Allocation Table) but can easily be made ultra-safe, and it allows all the usual optimizations to be carried out relatively simply. Instead of centralizing cluster allocation, clusters are chained in the clusters themselves.
- an empty file is a chain of clusters.
- the chain means that each cluster has a pointer to the previous and the next one. If no previous or no next cluster exists, the pointer is set to any arbitrary value that cannot be a valid logical/physical cluster's address (usually 0).
- This chain, called the "free-chain", can either be initialized over the whole file if its size is known at the time of creation of the file, or dynamically expanded by one or more clusters when the need arises.
- Allocating a cluster simply means taking it from the free-chain and putting it into another chain built on the same principle. De-allocating a cluster means giving it back to the free-chain.
- the first cluster (root) can be used to store the starting and ending points of these chains if there is a limited number of them; if not, this recording itself uses the chaining technique. All the usual functions of a filing system can be readily implemented. De-fragmentation, for instance, is done by copying the last clusters into free-chain ones and truncating the file once the free-chain is a continuous run of clusters at the end of the file.
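The FAT-less chained-cluster scheme above can be sketched as follows. The in-memory prev/next arrays stand in for link fields stored on disk inside each cluster, and reserving address 0 as the null pointer is the convention the text suggests:

```python
NULL = 0  # 0 cannot be a valid cluster address in this sketch

class ChainedFile:
    """Minimal sketch of the FAT-less scheme: every cluster stores its own
    prev/next links, and free clusters form the 'free-chain'."""

    def __init__(self, n_clusters: int):
        # index 0 is unused so that 0 can serve as the null pointer
        self.prev = [NULL] * (n_clusters + 1)
        self.next = [NULL] * (n_clusters + 1)
        # initialize the free-chain over the whole file at creation time
        for c in range(1, n_clusters + 1):
            self.prev[c] = c - 1 if c > 1 else NULL
            self.next[c] = c + 1 if c < n_clusters else NULL
        self.free_head = 1

    def allocate(self) -> int:
        """Take the first cluster off the free-chain."""
        c = self.free_head
        if c == NULL:
            raise MemoryError("free-chain exhausted")
        nxt = self.next[c]
        if nxt != NULL:
            self.prev[nxt] = NULL
        self.free_head = nxt
        self.prev[c] = self.next[c] = NULL
        return c

    def free(self, c: int) -> None:
        """Give the cluster back to the head of the free-chain."""
        self.next[c] = self.free_head
        self.prev[c] = NULL
        if self.free_head != NULL:
            self.prev[self.free_head] = c
        self.free_head = c

f = ChainedFile(4)
a, b = f.allocate(), f.allocate()
f.free(a)
assert (a, b, f.free_head) == (1, 2, 1)
```

Allocation and de-allocation are both O(1) pointer swaps, with no central table to corrupt, which is the safety argument the text makes.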
- Multiprocessing Data Bases: referencing a data base content
- the multi-processing technique makes no calls to the operating system and is thus OS-independent.
- Mail boxes between processors are implemented as dedicated file names that can even carry parameters. This not only allows synchronizing several processes in and out, but also allows sending data to them with a file-explorer-type program. Although the same control could be obtained by running the command anew with new parameters, this explorer-based technique gives a "control panel" of the process just by looking at the filenames in some directory. For instance, a debug level can be set by filenames like "debug.005".
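The filename-as-control-panel idea can be sketched as follows; the `key.value` naming convention is an assumption of this illustration, not mandated by the text:

```python
def control_settings(filenames):
    """Read process settings from mailbox-style filenames: e.g. a file
    named 'debug.005' sets the debug level to 5. Non-numeric suffixes
    (ordinary files like 'readme.txt') are ignored."""
    settings = {}
    for name in filenames:
        key, dot, value = name.partition(".")
        if dot and value.isdigit():
            settings[key] = int(value)
    return settings

assert control_settings(["debug.005", "readme.txt"]) == {"debug": 5}
```

A process would poll its mailbox directory and apply whatever settings the filenames encode, so any file manager doubles as its control panel.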
- FOCUS uses the Unicode standard codification. This is a two-byte encoding scheme that covers all the glyphs used on the planet and is open-ended. It nevertheless doubles the size of textual information. The way the information is stored in the repository optimizes this space: L strings never duplicate a given character.
- a "word” can have more than one representation.
- the number twelve can be written as the sequence of characters “twelve”, as the sequence "12" or as a numerical value 12.
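Indexing all representations of such a token can be sketched as follows; the small number-word table is invented for illustration:

```python
def representations(token: str):
    """Return the alternative forms under which a token could be stored:
    the spelled-out word, the digit string, and the numerical value.
    The number-word table is a tiny illustrative sample."""
    number_words = {"twelve": 12, "seven": 7}
    forms = {token}
    if token.isdigit():
        forms.add(int(token))
    elif token in number_words:
        forms.add(str(number_words[token]))
        forms.add(number_words[token])
    return forms

assert representations("twelve") == {"twelve", "12", 12}
assert representations("12") == {"12", 12}
```

Storing every form under the same entry lets a query phrased one way match documents that use any of the others.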
- a FOCUS will store the whole word along with its components, such as in the "weekend” example above.
- Grammatically based addressing systems: locating a word in a text can be done in several ways. Let us mention three of them: a/ the "physical" address, i.e.
- a co-occurrence is defined as a link between two words, the link being simply the fact that the two words appear together in the same paragraph. Studies have shown that simple word co-occurrences do not give interesting results because of their semantic ambiguities. On the contrary, groups of words or expressions are semantically rich enough to be used in a co-occurrence network. The basic function of co-occurrences is to provide a list of expressions from a given request: the expressions that appear most often with the requested word, with all the co-occurrences between those expressions given for any graphical representation. All co-occurrence calculations are based on the results of the given request.
- a request can be a word, an expression or any Boolean combination of words or expressions.
- Co-occurrence is an old idea in linguistics: there may be interesting information stemming from the words or expressions most frequently associated with a given term.
- If n is the number of terms, ordinary co-occurrence processes run in n to the power 3. This makes them impractical on large bases, even with large RAM memories, as the CPU time is tremendous. Strangely enough, most co-occurrence systems use what is known as "clustering": choosing an arbitrary window (usually around 2K bytes) within which co-occurrences will be computed. This value obviously has no linguistic meaning.
- a FOCUS can record the paragraph numbers of the terms it analyses. The computation of co-occurrences can then be done on a much more significant linguistic basis. Furthermore, if a FOCUS stores the cross-references of the paragraphs with their associated words, the whole process can be rendered linear.
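The linear computation can be sketched as follows, assuming the paragraph-to-words cross-reference described above; the sample paragraphs are invented:

```python
from collections import Counter, defaultdict

# cross-reference: paragraph id -> terms it contains (invented sample)
paragraphs = {
    1: ["piano", "bar", "ribs"],
    2: ["piano", "bar"],
    3: ["ribs", "restaurant"],
}

# inverse cross-reference: term -> paragraphs it appears in
term_to_paras = defaultdict(set)
for pid, terms in paragraphs.items():
    for term in terms:
        term_to_paras[term].add(pid)

def co_occurrences(term: str) -> Counter:
    """Count terms sharing a paragraph with `term`: one linear pass over
    the matched paragraphs, not an all-pairs computation over all terms."""
    counts = Counter()
    for pid in term_to_paras[term]:
        for other in paragraphs[pid]:
            if other != term:
                counts[other] += 1
    return counts

assert co_occurrences("piano") == Counter({"bar": 2, "ribs": 1})
```

Each lookup touches only the paragraphs containing the requested term, which is what keeps the cost proportional to the size of the result rather than cubic in the number of terms.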
- The window: when the user starts typing a word, the window echoes his typing to show that the FOCUS "understands" what is going on. Conversely, when the window stays empty, it shows that some of the characters typed have no echo in the FOCUS' knowledge.
- the suggested display in the right window is a sampling of words beginning with the letters before the cursor, but with a different next letter for each word displayed. For instance, if the user's cursor is after the letters "BE" and the FOCUS has stored the words BEE, BEEF, BENEATH and BENIGN, we will show only BEE and BENEATH, simply because screen space is limited and this is enough for the user to know that he won't find BEFORE, since no word beginning with BEF is displayed.
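The sampling rule can be sketched as follows, using the document's own word list:

```python
def sample_completions(words, prefix):
    """Show one stored word per distinct next letter after the prefix,
    so the user sees which continuations exist without flooding the screen."""
    seen_next = set()
    shown = []
    for word in sorted(words):
        if word.startswith(prefix) and len(word) > len(prefix):
            nxt = word[len(prefix)]
            if nxt not in seen_next:
                seen_next.add(nxt)
                shown.append(word)
    return shown

words = ["BEE", "BEEF", "BENEATH", "BENIGN"]
assert sample_completions(words, "BE") == ["BEE", "BENEATH"]
```

BEEF is suppressed because BEE already represents the continuation "E", and BENIGN because BENEATH already represents "N"; the absence of any BEF word tells the user BEFORE is not stored.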
- The substitution engine used to handle alphabetical ligatures can also be used for Optical Character Recognition errors. Given an engine that does not make too many mistakes, i.e. whose output is easily readable (our current reference is FineReader V4), common OCR mistakes such as the confusion between "rn" and "m" can be declared in the data for the substitution engine. Searching for "modern" will then also find misspellings like "modem". This feature, to be triggered by a menu option or any other suitable means, allows one to keep valid output from OCR engines without doing any manual corrections and yet be able to find pertinent data in them. Java applet + JavaScript structure
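Generating search variants from a declared confusion table can be sketched as follows; the two-entry table is a small illustrative sample, not the full substitution data:

```python
def ocr_variants(word, confusions=(("rn", "m"), ("m", "rn"))):
    """Generate spellings reachable by applying common OCR confusions,
    so a search for one form also matches its misread twins.
    The confusion table here is illustrative."""
    variants = {word}
    for src, dst in confusions:
        for v in list(variants):
            i = v.find(src)
            while i != -1:
                variants.add(v[:i] + dst + v[i + len(src):])
                i = v.find(src, i + 1)
    return variants

assert "modem" in ocr_variants("modern")
```

At query time the engine would search for every variant, so misrecognized OCR output is found without ever correcting it by hand.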
- the current CGI (HTML) mode of operation on the Web has longer response times than those of a FOCUS. It was thus necessary to design a new way to preserve as much interactivity as possible. Using a Java applet to handle the query dialog does this: every time a key is pressed, and assuming there is no pending query, only a few characters are sent over the network.
- the FOCUS' answer can be kept very small (2K, for instance), so that once again the load of the network is minimal.
- using a CGI script would require that over 10 times more characters be sent to redraw a whole page.
- Some products are able to extract image signatures.
- Accessing a musical sequence or a video sequence can be done through the industry-standard SMPTE (Society of Motion Picture and Television Engineers) time code.
- the value of that code at the beginning of the sequence is what one can use as a positioning parameter.
- all these types of data usually do not occur alone but are accompanied by textual information.
- For a movie, readily available textual information includes the script and dialogue (sub-titles).
- Musical information can be stored by laying a MIDI track of the melody along with the video sequence (any ordinarily skilled keyboardist can do so at once, so cost is not a big issue here).
- Running FOCUS on these textual data and using the SMPTE code as an access method, one can allow direct access to any given sequence using those textual and musical parameters. Searching through image signatures
- a domain is simply the subset that is the response to a query. The same query will obviously produce different domains as the database evolves and changes. So a domain is defined simply by the query itself and applies, from now on, to everything analyzed by the FOCUS.
- a domain definition can be done "dynamically" by a Focuser and then stored to be used as a reference. Domains are needed to describe the field of action of such things as Focusers and synonyms (thesauri). Their primary use is to play the role of "categories" or "taxonomies". Given a term, FOCUS knows which domain definitions the term is used in, allowing it to present these domains to the user so that he can restrict his query to them.
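A minimal sketch of a domain as a stored query, with substring matching standing in for the real engine's query evaluation:

```python
# A domain is just a stored query, re-evaluated as the repository evolves.
# Simple substring matching is a stand-in for real query evaluation.
def make_domain(query: str):
    """Return a callable that selects the current subset matching the query."""
    return lambda docs: {d for d in docs if query in d}

cooking = make_domain("recipe")  # hypothetical domain definition
docs_v1 = {"a recipe for ribs", "patent law"}
docs_v2 = docs_v1 | {"another recipe"}

assert cooking(docs_v1) == {"a recipe for ribs"}
assert cooking(docs_v2) == {"a recipe for ribs", "another recipe"}
```

The same stored query yields a different subset after the database changes, which is exactly why the domain is defined by the query rather than by an enumerated membership list.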
- Cross-references can be used for document names, sizes, languages, etc., for field names in databases, and for every item that is generic to a set of data. Using them is especially mandatory if the item can be changed. For fixed data, not using cross-references merely wastes space, but there may be applications that need that. Cross-references are also used in our implementation of co-occurrences to store which words are in which paragraph.
- the immediate answer (which would nevertheless handle the 24M molecules patented today) is to describe all possible representations of the network (i.e. starting with every node) in a hierarchical way, which provides only one possible representation for each node. If the nodes and links are represented by a numerical code, uniqueness can be guaranteed simply by ordering the different ways of navigating through the network with a numerical sort. Loops are simply special links (we suggest giving them the lowest values after the "end of string" value) which jump to a node described by the number of links backward to a common node and the number of links forward (in the description) to the looping node. Special groups of nodes and links can easily be represented as particular node values. Recorded networks are entered as described.
- Querying networks simply consists of choosing a non-ambiguous node to start with (the farthest from any ambiguous node is preferred), describing the network under all its possible forms (according to the groups of nodes it may contain), and running the corresponding request against the FOCUS. Because of the fuzzy definition of certain groups of nodes, not all the answers to the FOCUS query may be answers to the user's query; it is thus necessary to select the right ones.
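For very small networks, the uniqueness requirement can be illustrated by brute force: take the smallest relabelled edge list over all node permutations as the one canonical description. This is only a stand-in for the patent's ordered traversal scheme, practical at toy scale and assuming no self-loops:

```python
from itertools import permutations

def canonical_form(n, edges):
    """Smallest relabelled edge list over all node permutations: any two
    drawings of the same network yield the same canonical description."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(range(n)):
        relabelled = sorted(tuple(sorted((perm[a], perm[b])))
                            for a, b in edge_set)
        if best is None or relabelled < best:
            best = relabelled
    return tuple(best)

# Two drawings of the same triangle-with-tail get one canonical form:
g1 = canonical_form(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
g2 = canonical_form(4, [(3, 2), (2, 1), (1, 3), (1, 0)])
assert g1 == g2
```

Once each network has a unique ordered description, storing and querying it reduces to the ordinary string indexing the rest of the system already provides.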
- the 24M molecules already registered would involve a 90 GB repository, and access is, as usual, pertinent and no more than one hundredth of a second away.
- Embedding FOCUS in Existing Applications
- a FOCUS allows one to revisit all computer applications as they exist today. It is not simply a new function that can be called from an application as an external routine; it can really be EMBEDDED in, i.e. be part of, all existing programs.
- FOCUS can be used to replace the existing limited access to contents.
- FOCUS will give immediate access to all filenames containing that word. No more navigation through unending directory tree structures. Assuming a number of people are connected to a network, agreeing on some, even limited, conventions for directory names and/or architecture will enable them to focus at once on each other's directories.
- Targets are identified by a single letter and consist either of files or applications to which the selected items are sent just by typing the associated letter. Contextual menus and/or using the Ctrl or Alt keys along with the letter key allow the user to manage the "target": changing the name of the file, the target application, the various options associated with sending the data, etc. Basically, if a target is a filename, all information sent to that file is simply appended.
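The append-to-target behaviour can be sketched as follows; the letter-to-filename binding is a hypothetical user configuration:

```python
import os
import tempfile

def send_to_target(targets, letter, text):
    """Append the selected text to the file bound to a one-letter target.
    The letter->path binding is the user's own configuration."""
    with open(targets[letter], "a", encoding="utf-8") as f:
        f.write(text + "\n")

tmp = tempfile.mkdtemp()
targets = {"n": os.path.join(tmp, "notes.txt")}  # hypothetical binding
send_to_target(targets, "n", "spare ribs")
send_to_target(targets, "n", "piano bar")

with open(targets["n"], encoding="utf-8") as f:
    assert f.read() == "spare ribs\npiano bar\n"
```

Because the target file is opened in append mode, repeated sends accumulate rather than overwrite, matching the behaviour described above.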
- Update: the first solution is to use a "phantom table" with the same structure as the original one, this phantom table being periodically dumped, as for the initial analysis, with the same parameters.
- the FOCUS will handle the historicity of the records by deleting previous occurrences, as it does with all other documents.
- the second solution is more complex but also more efficient. It consists of reading the internal logging format of DBMSes to achieve the same result. It must be noted that the first solution implies a slight modification of all applications using the DBMS, but no knowledge of its internal coding, while the second does not imply any interference with the applications, but requires knowledge of its internal structure.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU11710/01A AU1171001A (en) | 1999-10-26 | 2000-10-26 | A reversed computer system based on access by content instead of access by address and its fully optimized implementation |
JP2001535842A JP2004500628A (en) | 1999-10-26 | 2000-10-26 | Reverse computer system based on content access instead of address access and optimal implementation thereof |
EP00973169A EP1252585A2 (en) | 1999-10-26 | 2000-10-26 | A reversed computer system based on access by content instead of access by address and its fully optimized implementation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16157999P | 1999-10-26 | 1999-10-26 | |
US60/161,579 | 1999-10-26 |
Publications (3)
Publication Number | Publication Date |
---|---|
WO2001033419A2 true WO2001033419A2 (en) | 2001-05-10 |
WO2001033419A9 WO2001033419A9 (en) | 2002-08-29 |
WO2001033419A3 WO2001033419A3 (en) | 2003-05-15 |
Family
ID=22581785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2000/001697 WO2001033419A2 (en) | 1999-10-26 | 2000-10-26 | Access by content based computer system |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1252585A2 (en) |
JP (1) | JP2004500628A (en) |
AU (1) | AU1171001A (en) |
WO (1) | WO2001033419A2 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1996023265A1 (en) * | 1995-01-23 | 1996-08-01 | British Telecommunications Public Limited Company | Methods and/or systems for accessing information |
WO1998009229A1 (en) * | 1996-08-30 | 1998-03-05 | Telexis Corporation | Real time structured summary search engine |
WO1998041934A1 (en) * | 1997-03-17 | 1998-09-24 | British Telecommunications Public Limited Company | Re-usable database system |
WO1999021108A1 (en) * | 1997-10-21 | 1999-04-29 | British Telecommunications Public Limited Company | Information management system |
WO1999048026A1 (en) * | 1998-03-16 | 1999-09-23 | Siemens Aktiengesellschaft | System and method for searching on inter-networked computers with stocks of information using software agents |
2000
- 2000-10-26 AU AU11710/01A patent/AU1171001A/en not_active Abandoned
- 2000-10-26 EP EP00973169A patent/EP1252585A2/en not_active Withdrawn
- 2000-10-26 WO PCT/IB2000/001697 patent/WO2001033419A2/en not_active Application Discontinuation
- 2000-10-26 JP JP2001535842A patent/JP2004500628A/en not_active Withdrawn
Non-Patent Citations (2)
Title |
---|
GIGER H P: "CONCEPT BASED RETRIEVAL IN CLASSICAL IR SYSTEMS" PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL. (SIGIR). GRENOBLE, JUNE 13 - 15, 1988, NEW YORK, ACM, US, vol. CONF. 11, 13 June 1988 (1988-06-13), pages 275-289, XP000295044 * |
LAWRENCE S ET AL: "Inquirus, the NECI meta search engine" COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH HOLLAND PUBLISHING. AMSTERDAM, NL, vol. 30, no. 1-7, 1 April 1998 (1998-04-01), pages 95-105, XP004121436 ISSN: 0169-7552 * |
Also Published As
Publication number | Publication date |
---|---|
AU1171001A (en) | 2001-05-14 |
WO2001033419A9 (en) | 2002-08-29 |
JP2004500628A (en) | 2004-01-08 |
EP1252585A2 (en) | 2002-10-30 |
WO2001033419A3 (en) | 2003-05-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
ENP | Entry into the national phase in: |
Ref country code: JP Ref document number: 2001 535842 Kind code of ref document: A Format of ref document f/p: F |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2000973169 Country of ref document: EP |
|
AK | Designated states |
Kind code of ref document: C2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: C2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/4-4/4, DRAWINGS, REPLACED BY NEW PAGES 1/3-3/3; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWP | Wipo information: published in national office |
Ref document number: 2000973169 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2000973169 Country of ref document: EP |