Delphi - What structure allows for SAVING inverted index type of information?

Delphi XE6. I'm looking to implement a limited style of search: an edit field where the user enters a business name to look up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for the business "First Bank of Kansas", the user should be able to enter "Fir Kan" and get a match. This means an inverted index type of structure: a list of each unique word, and for each word a list of references (document ID, primary key ID, etc., each an integer). I am struggling with WHAT type of structure to make this... I have approximately 250,000 business names, which contain 43,500 unique words. The number of occurrences per word varies from 1 to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial products use? Outlook and similar programs have search capabilities.

Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary search tree, because a lookup in a balanced tree normally takes O(log n) comparisons, which is quite fast. Because the tree keeps its words in order, it also covers your prefix requirement: find the first word that is not less than the prefix, then step through the following words until one no longer starts with it. If business names change at runtime, an AVL tree should do well, although it's quite some work to implement one yourself. There should be many ready-to-use binary tree units available on the internet.
For each entered word, search the tree and collect the ID lists of every word that matches it, giving one combined ID list per entered word.
As the last step, intersect those combined lists.
Only IDs that match all entered words are left. Those IDs reference the business names being searched for.
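The lookup-then-intersect steps above can be sketched in a few lines. This is only an illustration, written in Python rather than Delphi, and it swaps the binary tree for a sorted word list searched with binary search, which gives the same ordered prefix walk. The function names (build_index, ids_for_prefix, search, dump_index, load_index) and the JSON save format are assumptions made for this sketch, not part of the answer.

```python
import bisect
import json


def build_index(names):
    """Build (sorted_words, postings) from a {business_id: name} mapping."""
    postings = {}
    for business_id, name in names.items():
        for word in name.upper().split():
            postings.setdefault(word, set()).add(business_id)
    return sorted(postings), postings


def ids_for_prefix(words, postings, prefix):
    """Union of the ID sets of every indexed word that starts with prefix."""
    prefix = prefix.upper()
    found = set()
    # Binary search for the first word >= prefix, then walk forward
    # until a word no longer starts with the prefix.
    i = bisect.bisect_left(words, prefix)
    while i < len(words) and words[i].startswith(prefix):
        found |= postings[words[i]]
        i += 1
    return found


def search(words, postings, query):
    """IDs of businesses that match every partial word in the query."""
    parts = query.split()
    if not parts:
        return set()
    per_word = [ids_for_prefix(words, postings, part) for part in parts]
    return set.intersection(*per_word)


def dump_index(postings):
    """Serialize the postings to a JSON string (sets become sorted lists)."""
    return json.dumps({w: sorted(ids) for w, ids in postings.items()})


def load_index(text):
    """Rebuild (sorted_words, postings) from dump_index output."""
    postings = {w: set(ids) for w, ids in json.loads(text).items()}
    return sorted(postings), postings


names = {1: "First Bank of Kansas", 2: "Kansas Bankers Corp", 3: "First National"}
words, postings = build_index(names)
print(search(words, postings, "Fir Kan"))  # {1}
```

A sorted list is cheap to save and reload, which suits requirement 4, but inserting a word costs O(n); a balanced tree such as the AVL tree mentioned above is the better fit if names change often at runtime.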

Related

database design for dictionary of words

(my reason for asking this question is based on having read this answer, which made me rethink my current setup)
I am currently developing a Ruby on Rails application with many languages, each of which has a dictionary of base words attached to it, as well as a list of the words that map to each base word. The way I currently have it set up, there is a base_words table that contains the base_word as a string, along with the language_id as a foreign key. There is also a words table, each row of which contains a word string, along with the base_word_id as a foreign key. The words table also has an indexed language_id column, although I'm almost positive that this is superfluous due to the language_id on base_words, so I'm planning to take it off (although this could be a bad assumption on my part).
In sum, contrary to the answer I mentioned at the beginning, the tables are not separated by language, because I've reasoned that I can simply pull out a language's words programmatically when the time comes. However, my application will also have translation(s) associated with each base word (as did the answer I referenced), so I'm doubting my structure: each translation will itself be a base_word in the same table, which means a translation would just be the id of another base word in that table. This may be completely fine, or it may not be - I have no clue (this is my first ever programming project).
Is this ok? Do I need to separate my base_words into separate tables for each language, or can I leave it all in one table?
Another example: I also need to store many phrases for each language, along with their translations. Should I have one table where each row has the appropriate translation of the phrase, or one table where each row contains simply one phrase and a language_id, or multiple tables (one for each language)?
Regards,
Michael
As in the other scenario, you'll have a translations table. There is no technical reason it couldn't have multiple foreign keys to base_words (a source_word_id and target_word_id, perhaps). So yes, you can absolutely store all your words in one table. There are some minor side effects involved with translations being directional relationships: it becomes possible to have translations which only work one way, and there will be many pairs of entries with opposite source and target. Neither of these is much of a worry: the first is even potentially desirable in order to represent words with double meanings in one language but not the other, and as for the second, space is cheap and indexing is easy.
You are correct that you do not need words.language_id, so long as you always join base_words when you're querying words and the language matters. This obviously changes if you have a use case where it makes sense to leave base_words out, but that scenario sounds unlikely based on what you describe.
As for phrases: why should they be handled any differently than base_words?
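A single-table layout with a directional translations table might look like the sketch below. It uses Python's sqlite3 module purely for illustration instead of Rails migrations; base_words, language_id, source_word_id and target_word_id come from the discussion above, while the languages table, the exact column types, and the translations_of helper are assumptions.

```python
import sqlite3


def make_schema(conn):
    """Create one base_words table for all languages plus a translations table."""
    conn.executescript("""
        CREATE TABLE languages (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE base_words (
            id          INTEGER PRIMARY KEY,
            base_word   TEXT NOT NULL,
            language_id INTEGER NOT NULL REFERENCES languages(id)
        );
        -- Two foreign keys into the same table; each row is one direction.
        CREATE TABLE translations (
            source_word_id INTEGER NOT NULL REFERENCES base_words(id),
            target_word_id INTEGER NOT NULL REFERENCES base_words(id),
            PRIMARY KEY (source_word_id, target_word_id)
        );
    """)


def translations_of(conn, word, source_lang, target_lang):
    """Return the target-language base words recorded for a source word."""
    rows = conn.execute("""
        SELECT t.base_word
        FROM base_words s
        JOIN languages sl    ON sl.id = s.language_id
        JOIN translations tr ON tr.source_word_id = s.id
        JOIN base_words t    ON t.id = tr.target_word_id
        JOIN languages tl    ON tl.id = t.language_id
        WHERE s.base_word = ? AND sl.name = ? AND tl.name = ?
        ORDER BY t.base_word
    """, (word, source_lang, target_lang)).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
make_schema(conn)
conn.executemany("INSERT INTO languages VALUES (?, ?)", [(1, "en"), (2, "es")])
conn.executemany("INSERT INTO base_words VALUES (?, ?, ?)",
                 [(1, "house", 1), (2, "casa", 2)])
conn.execute("INSERT INTO translations VALUES (1, 2)")  # en -> es only
print(translations_of(conn, "house", "en", "es"))
```

Because only the en -> es row was inserted, the reverse lookup returns nothing, which is the one-way side effect described above; inserting the mirrored pair makes the translation work in both directions.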

how to create a replicable, unique code for a pre-ISBN book

I am putting my collection of some 13000 books in a MySQL database. Most of the copies I possess
can be identified uniquely by ISBN. I need to use this distinguishing code as a foreign key into
another database table.
However, quite a few of my books date from pre-ISBN ages. So for these, I am trying to devise a
scheme to uniquely assign a code, sort of like an SKU.
The code would be strictly for private use. It should have the important property that, when I
obtain a pre-ISBN publication, I could build the code from inspecting the work, and based on the
result search the database to see if I already have other copies in my possession.
Many years ago I think I saw a search scheme for some university(?) catalogue, where you could
perform a search of a title based on a concatenated string (or code) that was made up of let's
say 8 letters from the title, and 4 from the author, and maybe some other data. For example,
to search 'The Nature of Space and Time' by Stephen Hawking and Roger Penrose you might perform
a search on the string 'Nature SHawk', being comprised of 8 characters from the title (omitting
non-filing words and stopwords) and 4 from the author(s).
I haven't been able to find any information on such schemes, or whether or not such an approach
was standardized in any way.
Something along these lines could be made up of course, but I was wondering if people here have
heard of such schemes, or have ideas on how to come to a solution to this.
So keep in mind the important property of 'replicability': using the scheme, inspection of a pre-
ISBN dated work should --omitting very special or exclusive cases-- in general lead to a code
that can singly be used to subsequently determine if such a copy is already in the database.
Thank you for your time.
Just use the title (with author and publisher as options) and a series id to produce a fake ISBN. Take a look at fake_isbn.
NOTE: use the first digit as a series id but don't use 9!
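For comparison, the title-plus-author code described in the question can be sketched directly. This is a minimal illustration, not a standardized scheme: the stopword list, the punctuation stripping, and the book_code name are all assumptions. It does reproduce the 'Nature SHawk' example from the question.

```python
# Assumed list of non-filing words; extend it to suit the collection.
STOPWORDS = {"a", "an", "the", "of", "and"}


def book_code(title, author_surname, title_len=8, author_len=4):
    """Build a lookup code from the title and the first author's surname."""
    words = []
    for raw in title.split():
        # Drop punctuation so that "Time" and "Time," give the same code.
        word = "".join(ch for ch in raw if ch.isalnum())
        if word and word.lower() not in STOPWORDS:
            words.append(word)
    title_part = " ".join(words)[:title_len]
    author_part = "".join(ch for ch in author_surname if ch.isalnum())[:author_len]
    return title_part + author_part


print(book_code("The Nature of Space and Time", "Hawking"))  # Nature SHawk
```

Two different works can still produce the same code, so it is safer to treat the code as a search key that narrows the candidates than as a guaranteed unique identifier.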

Removing specific objects in arrays from references in dictionary

My question:
I need to know if what I'm doing is the best way, and if it's not, what is?
The situation:
I have "Contacts" objects in an array. These contacts must be ordered alphabetically and can have multiple phone numbers. I'm splitting that array into 27 arrays of contacts where each of them represents a letter of the alphabet. So I have all my "A" contacts, then "B" and so on.
Also, during the "splitting", I add a reference to each contact in a dictionary, where the object is the contact and the key is his phone number.
Because one contact can have X phone numbers, the same contact can appear X times under X different entries in the dictionary. I need that so I can find any contact with any number.
All of the above works like a charm.
Now I need to compare all those numbers against my online database (note: I'm using Parse), to see if some of these contacts are already users or not. If they are, they need to be put in a specific section of my tableview. (My tableview is just all the contacts, separated into letter sections, plus one "user" section.) A contact cannot appear in both the user section AND a letter section. If a contact is a user, he must be separated.
What I'm asking vs. what I'm doing:
Right now, I'm just re-looping over every array and comparing each element to all the users I've found online. This is a lot of looping and looks like a waste of time and resources.
What I would like to do: somehow clean my arrays of the users I've found, considering I have the reference to the contact object in my dictionary.
TL;DR:
My arrays :
users in the first section, then contacts alphabetically
[[user1, user2, user3, ...],[a1,a2,a3,...],[b1,b2,...],...]
My dictionary :
a1 - phone1
a1 - phone2
a1 - phone3
a2 - phone1
a3 - phone1
...
The ultimate question :
I can very easily find the contact object (since I have his number from my online db). If I interact with the a1 from the dictionary, will it also change the a1 in the array of arrays?
More specifically, can I somehow REMOVE IT from the array, considering I don't know which one he is in?
I also add a reference of each contact in a dictionary, where the object is the contact, and the key is his phone number.
You need to be very careful with this approach. It is likely to have collisions. While cell phone numbers are often unique, sometimes they're shared. Home and work numbers are often shared. Phone numbers get reassigned, so your database can wind up with duplicates that way, too. And sometimes people just enter mistaken data. You have to make sure your system behaves reasonably and consistently in those cases. Many behaviors are fine, but "pick one at random" is generally not.
You should think carefully here about what is your model and what is your presentation. One data structure should be the single source of truth. Usually that's going to be your big list of contacts. That has nothing at all to do with how it's displayed. We call this the model. Often it's kept track of as an NSSet since order doesn't matter.
Above that you should have a "view model" that deals with grouping and sorting issues. It is not "truth." You should be willing to throw it away anytime the underlying data changes extensively. Its data structures should point to the underlying model, and should be stored in a way that exactly matches what the table view wants. Keeping the model and the view model separate is one of the best ways to keep your complexity under control. Then you know that there is exactly one place that data can change (the model), and everything else just reacts to that.
Regarding your partitioning problem: so you have a list of contacts and you want to separate them into two groups based on whether any of their phone numbers appear in another list. If your total list is only a few dozen entries long, frankly it doesn't matter how you do this. Even O(n^2) is fine for small enough n. Focus on making it simple and reliable first, then profile with Instruments to see where the real bottlenecks are.
That said, usually the fastest way to determine set intersection is to sort both sets and walk through them both at the same time. So you'd create a "contacts" array of "phone number + contact pointer" and a "users" array of just phone numbers. Sort them both by phone number. Walk over them, comparing the current element of each list, and then incrementing the index of the smaller one. If no match, put the contact in one list. If a match, put it in the other.
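The sorted walk just described could look roughly like this. It is a sketch in Python rather than Objective-C, and the function name, the (phone, contact) pair layout, and the use of plain strings for contacts are assumptions made for illustration.

```python
def split_by_sorted_walk(contact_numbers, user_numbers):
    """Split contacts into (users, others) by walking two sorted lists.

    contact_numbers: list of (phone, contact) pairs, one pair per number.
    user_numbers: list of phone numbers known to belong to users.
    """
    pairs = sorted(contact_numbers)
    users_sorted = sorted(set(user_numbers))
    matched = set()
    i = j = 0
    while i < len(pairs) and j < len(users_sorted):
        phone, contact = pairs[i]
        if phone == users_sorted[j]:
            matched.add(contact)
            i += 1  # keep j: another contact may share this number
        elif phone < users_sorted[j]:
            i += 1  # this contact number is not a user number
        else:
            j += 1  # this user number is not in the contact list
    everyone = {contact for _, contact in pairs}
    return matched, everyone - matched


contact_numbers = [("555-0003", "a1"), ("555-0001", "a1"),
                   ("555-0002", "a2"), ("555-0009", "b1")]
users, others = split_by_sorted_walk(contact_numbers, ["555-0001", "555-0009"])
print(sorted(users), sorted(others))  # ['a1', 'b1'] ['a2']
```

A contact counts as a user as soon as any one of its numbers matches, which is why the result is collected into sets rather than appended once per number.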
But I'd probably just stick all the phone numbers in a set and use member: to look them up. It's just usually easier.
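The set-based version is shorter. The sketch below is again Python for illustration, with Python's set standing in for NSSet and the in operator playing the role of member:. The dictionary-of-contacts input and the grouping into letter sections are assumptions about how the view model might be rebuilt.

```python
def build_sections(contacts, user_numbers):
    """Return (user_section, letter_sections) for the table view.

    contacts: dict mapping a contact name to its list of phone numbers.
    user_numbers: iterable of phone numbers found in the online database.
    """
    user_set = set(user_numbers)
    user_section = []
    letter_sections = {}
    for name in sorted(contacts):
        # One hash lookup per phone number instead of a nested loop.
        if any(phone in user_set for phone in contacts[name]):
            user_section.append(name)
        else:
            letter_sections.setdefault(name[0].upper(), []).append(name)
    return user_section, letter_sections


contacts = {
    "Alice": ["555-0001", "555-0003"],
    "Adam": ["555-0002"],
    "Bob": ["555-0009"],
}
users, letters = build_sections(contacts, ["555-0003"])
print(users)    # ['Alice']
print(letters)  # {'A': ['Adam'], 'B': ['Bob']}
```

Rebuilding the sections from the single list of contacts each time the user list changes avoids having to remove objects from arrays whose position is unknown.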

How do database indices make search faster

I was reading through the Rails Tutorial (http://ruby.railstutorial.org/book/ruby-on-rails-tutorial#sidebar-database_indices) but am confused about its explanation of database indices. Basically, the author proposes that rather than searching through a list of emails (for login) in O(n) time, it's much faster to create an index, giving the following example:
To understand a database index, it’s helpful to consider the analogy
of a book index. In a book, to find all the occurrences of a given
string, say “foobar”, you would have to scan each page for “foobar”.
With a book index, on the other hand, you can just look up “foobar” in
the index to see all the pages containing “foobar”.
source:
http://ruby.railstutorial.org/chapters/modeling-users#sidebar:database_indices
So what I understand from that example is that words can be repeated in text, so the "index page" consists of unique entries. However, in the railstutorial site, the login is set such that each email address is unique to an account, so how does having an index make it faster when we can have at most one occurrence of each email?
Thanks
Indexing isn't (much) about duplicates. It's about order.
When you do a search, you want to have some kind of order that lets you (for example) do a binary search to find the data in logarithmic time instead of searching through every record to find the one(s) you care about (that's not the only type of index, but it's probably the most common).
Unfortunately, you can only arrange the records themselves in a single order.
An index contains just the data (or a subset of it) that you're going to use to search on, and pointers (of some sort) to the records containing the actual data. This allows you to (for example) do searches based on as many different fields as you care about, and still be able to do binary searching on all of them, because each index is arranged in order by that field.
Because the index in the DB, like the one in the given example, is sorted alphabetically. The raw table / book is not. Then think: how do you search an index knowing it is sorted? I guess you don't start reading at "A" and continue up to the point of interest. Instead you skip roughly to the point of interest and start searching from there. Basically, a DB can do the same with an index.
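As a toy illustration of that idea, the records below stay in their original order while each index is a separate sorted list of (value, position) pairs searched with binary search. Real databases typically use on-disk tree structures rather than in-memory lists; the record layout and function names here are assumptions.

```python
import bisect

# Records in arbitrary (insertion) order; the list position acts as the pointer.
records = [
    {"email": "zoe@example.com", "name": "Zoe"},
    {"email": "adam@example.com", "name": "Adam"},
    {"email": "mia@example.com", "name": "Mia"},
]


def build_index(rows, field):
    """Sorted list of (field value, position of the record in rows)."""
    return sorted((row[field], pos) for pos, row in enumerate(rows))


def lookup(rows, index, value):
    """Binary-search the index, then follow the pointers to the records."""
    # (value,) sorts just before every (value, pos) entry.
    i = bisect.bisect_left(index, (value,))
    matches = []
    while i < len(index) and index[i][0] == value:
        matches.append(rows[index[i][1]])
        i += 1
    return matches


# The same records can be searched by different fields, one index each.
email_index = build_index(records, "email")
name_index = build_index(records, "name")
print(lookup(records, email_index, "mia@example.com"))
print(lookup(records, name_index, "Adam"))
```

Even with unique emails the lookup touches only about log2(n) index entries instead of scanning every record, which is where the speed-up comes from.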
It is faster because the index contains only values from the column in question, so it is spread across a smaller number of pages than the full table. Also, index implementations usually add further optimizations, such as tree or hash structures, to limit the number of reads required.

drop-down combo box-like functionality for queries

Is it possible to provide the following type of functionality with Informix client tools?
As the user types the first two characters of a name, the drop-down list is empty. At the third character, the list fills with just the names beginning with those three characters. At the fourth character, MS-Access completes the first matching name (assuming the combo's AutoExpand is on). Once enough characters are typed to identify the customer, the user tabs to the next field.
The time taken to load the combo between keystrokes is minimal. This occurs once only for each entry, unless the user backspaces through the first three characters again.
If your list still contains too many records, you can reduce them by another order of magnitude by changing the value of constant conSuburbMin from 3 to 4.
This requires a combination of two things, only one of which is partially under the control of Informix the DBMS or Informix the Client API.
First of all, you need the gadget that is accepting user input to asynchronously generate a query which matches what the user has typed, fetches some of the results from the DBMS, and shows them. Secondly, you need the DBMS to respond rapidly to such queries. Part of the issue is 'what form does the query take'. But the basic functionality is:
SELECT TitleCaseName
FROM ReferenceTable
WHERE LowerCaseName[1,3] = 'abc';
You might or might not bother with 'first rows optimization'; you might or might not bother with an ORDER BY. Your code would only select the first N rows. You might do it with some prioritization information - most frequently used names, etc.
But this logic is basically the same for any DBMS - give or take the details such as the choice of technique for dealing with case-mapping (function call vs column) and notation for substrings vs LIKE 'abc%'.
The tricky stuff, though, is the asynchronous combination of user-input plus collecting data from the DBMS; that is best handled with multiple threads, one dealing with the user input, one dealing with the DBMS and (possibly) one dealing with the display (or that might also be the one dealing with user input). And that requires hooking into the UI API - not something that the Informix APIs do of their own accord. The UI can get at Informix (or any other DBMS) easily enough through ODBC or any other faintly similar API.
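The query side of this can be sketched as follows. SQLite (through Python's sqlite3 module) stands in for Informix here, so substr() replaces the Informix substring notation. The table and column names follow the query above, while the suggest function, the three-character minimum, and the row limit are assumptions. The asynchronous UI plumbing is deliberately left out.

```python
import sqlite3

MIN_CHARS = 3  # below this length the drop-down stays empty


def make_demo_db(names):
    """Create an in-memory ReferenceTable holding both spellings of each name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ReferenceTable (TitleCaseName TEXT, LowerCaseName TEXT)")
    conn.executemany("INSERT INTO ReferenceTable VALUES (?, ?)",
                     [(name, name.lower()) for name in names])
    return conn


def suggest(conn, typed, limit=10):
    """Return up to `limit` names whose lower-case form starts with `typed`."""
    prefix = typed.strip().lower()
    if len(prefix) < MIN_CHARS:
        return []
    rows = conn.execute(
        "SELECT TitleCaseName FROM ReferenceTable "
        "WHERE substr(LowerCaseName, 1, ?) = ? "
        "ORDER BY TitleCaseName LIMIT ?",
        (len(prefix), prefix, limit)).fetchall()
    return [row[0] for row in rows]


conn = make_demo_db(["Abcott", "Abcde Ltd", "Abdul", "Xyz"])
print(suggest(conn, "ab"))   # []
print(suggest(conn, "ABC"))  # ['Abcde Ltd', 'Abcott']
```

In a real client, suggest would be called from a worker thread on each keystroke so that typing never blocks on the database, as described above.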

Resources