Ello friends,
I feel awfully silly for asking this... but after struggling with this for some time I've decided that another pair of eyes may help illuminate the problem.
I'm trying to loop through two records and one map (I could possibly rewrite the map to be a record as well, but I have no need to) simultaneously, compare some entries, and change values if the entries match. What I have is similar to this:
EDIT: Here's an attempt to specifically describe what I'm doing. However, now that I think about it perhaps this isn't the best way to go about it.
I'm attempting to create a restaurant-selection inference engine for an AI course using Clojure. I have very little experience with Clojure, but I initially wanted to create a struct called "restaurant" so that I could create multiple instances of it. I read that structs in Clojure are deprecated and that records should be used instead. Both the restaurants that are read in from the text file and the user input are stored as 'restaurant' records.
I read in, from a previously sorted text file database, attributes of the restaurants in question (name, type of cuisine, rating, location, price, etc.) and then put them into a vector.
Each attribute has a weight associated with it so that when the user enters search criteria the restaurants can be sorted from most to least relevant, based on which attributes are likely to matter most (for example, the restaurant name is the most important item, followed by type of cuisine, then the rating, etc.). The record therefore also has a 'relevance' attribute.
(defrecord Restaurant [resturant-name cuisine
                       rating location
                       price relevance])
;stuff
;stuff
;stuff
(defn search
  [restaurants user-input]
  (def ordered-restaurants [])
  (doseq [restaurant restaurants]
    (let [restaurant-relevance-value 0]
      (doseq [input user-input
              attributes restaurant
              attribute-weight weights]
        (cond
          (= input (val attributes))
          (def restaurant-relevance-value (+ restaurant-relevance-value
                                             (val attribute-weight)))))
      (assoc restaurant :relevance restaurant-relevance-value)
      (def ordered-restaurants (conj ordered-restaurants restaurant))))
  (def ordered-restaurants (sort-by > (:relevance ordered-restaurants)))
  ordered-restaurants)
;more stuff
;more stuff
;more stuff
(defn -main
  [& args]
  (def restaurant-data (process-file "resources/ResturantDatabase.txt"))
  (let [input-values (Restaurant. "Italian" "1.0" "1.0" "$" "true"
                                  "Casual" "None" "1")]
    (println :resturant-name (nth (search restaurant-data input-values) 0))))
So the idea is that each restaurant is iterated through and its attribute values are compared to the user's input values. If there is a match, the associated weight value gets added to the local relevance value. After that, the restaurant is put into a vector, sorted, and returned. The results can then be displayed to the user. Ideally the example in -main would print the name of the most relevant restaurant.
As the comments explain, you have to get to grips with the immutable nature of the native Clojure data structures. You should not be using doseq - which works by side effects - for your fundamental algorithms.
The relevance attribute is not a property of Restaurant. What you want is to construct a map from restaurant to relevance, for some way of calculating relevance. You then want to sort this map by value. The standard sorted-map will not do this - it sorts by key. You could sort the map entries yourself, but there is a ready-made priority map that will give you the result you require automatically.
You also have to decide how to identify restaurants, as with a database. If resturant-names are unique, they will do. If not, you may be better off with an artificial identifier, such as a number. You can use these as the keys of a map of restaurants and of the relevance maps you construct for ordering.
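For example, a minimal sketch along those lines (the attribute keys and the weights map here are assumptions, not the asker's actual data) could be:

;; Hypothetical weights per attribute; higher means more important.
(def weights {:cuisine 5 :rating 3 :location 2 :price 1})

(defn relevance
  "Sum of the weights of the attributes on which a restaurant matches the input."
  [user-input restaurant]
  (reduce (fn [score [attr weight]]
            (if (= (get user-input attr) (get restaurant attr))
              (+ score weight)
              score))
          0
          weights))

(defn rank-restaurants
  "Order restaurants from most to least relevant, without any mutation."
  [restaurants user-input]
  (sort-by (partial relevance user-input) > restaurants))

With relevance computed as a pure function there is no need for the :relevance field on the record, or for any of the defs inside search; and if you want the map-from-restaurant-to-relevance view sorted by value, clojure.data.priority-map gives you that directly.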
Related
Delphi XE6. Looking to implement a limited style of search, specifically an edit field for the user to enter a business name which would get looked up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for the business "First Bank of Kansas", the user should be able to enter "Fir Kan" and it should return a match. This means an inverted-index type of structure. I have some type of list of each unique word, then an associated integer (document ID, primary key ID, etc.). I am struggling with WHAT type of structure to make this... I have approximately 250,000 business names, which contain 43,500 unique words. Word count will vary from 1 occurrence of a word to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure, because the search effort is normally O(log n), which is quite fast. Especially if business names change at runtime, an AVL tree should do well, although it is quite some work to implement one yourself. But there should be many ready-to-use units for binary trees all over the internet.
For each successful search for a word in your tree data structure, take its list of IDs and aggregate them, grouped by the entered word they matched.
As the last step, take all those aggregated lists of IDs and intersect them.
Only IDs that fit all the entered words should be left; those IDs reference the business names you searched for.
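As an illustration only (sketched in Clojure, the language of the main question above, with a sorted map standing in for the balanced tree; the data shapes are made up), the prefix lookup plus intersection could look like:

(require '[clojure.set :as set]
         '[clojure.string :as str])

;; Inverted index: word -> set of business IDs (toy data).
(def index (sorted-map "bank"   #{1 7}
                       "banker" #{3}
                       "first"  #{1}
                       "kansas" #{1 9}))

(defn ids-for-prefix
  "Walk the sorted index from the prefix onward, stop at the first word
   that no longer starts with it, and union the ID sets."
  [prefix]
  (->> (subseq index >= prefix)
       (take-while (fn [[word _]] (str/starts-with? word prefix)))
       (map val)
       (reduce set/union #{})))

(defn matching-business-ids
  "Intersect the ID sets of every entered prefix, e.g. \"fir kan\" -> #{1}."
  [query]
  (->> (str/split (str/lower-case query) #"\s+")
       (map ids-for-prefix)
       (reduce set/intersection)))

The same shape carries over to Delphi: a balanced tree (or sorted array) of words, each node holding its list of integer IDs, walked in order from the prefix until the prefix no longer matches.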
I'm programming in Objective-C, but a language-agnostic answer would work fine here. I've got a list of objects with many attributes, including a date of creation and a user GUID. I'm looking for a reasonably efficient way to filter this list to include only the most recent entry from each user ID. Is there a solution better than O(n^2)? I think I could check each element, and if it's an ID I have not yet processed, grab all the objects with the same ID, find the most recent, and store that value elsewhere, but this seems like a naive approach.
If you just want to beat O(n^2), you can sort by (ID, time descending) and then iterate through; the first time you see each ID, that element is the most recent one for that user, so append it to an answer list. This is O(n log n).
Alternatively, create a hash table and iterate through the list. For each item, check whether the map already has an entry for its ID; if it does, replace it with the current item when the stored one is less recent. With a good hash function this is O(n).
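A rough sketch of the hash-map variant, written in Clojure (the language used in the first question on this page); the field names :user-id and :created-at are assumptions:

(defn latest-per-user
  "Keep only the most recent entry per :user-id in a single pass (expected O(n))."
  [entries]
  (vals
   (reduce (fn [m {:keys [user-id created-at] :as entry}]
             (if-let [best (get m user-id)]
               (if (pos? (compare created-at (:created-at best)))
                 (assoc m user-id entry)  ; current entry is more recent
                 m)
               (assoc m user-id entry)))  ; first time this ID is seen
           {}
           entries)))

The same reduce-into-a-dictionary shape translates directly to an NSMutableDictionary keyed by the user GUID.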
My question:
I need to know if what I'm doing is the best way, and if it's not, what is?
The situation:
I have "Contacts" objects in an array. These contacts must be ordered alphabetically and can have multiple phone numbers. I'm splitting that array into 27 arrays of contacts, where each of them represents a letter of the alphabet. So I have all my "A" contacts, then "B", and so on.
Also, during the "splitting", I add a reference to each contact in a dictionary, where the object is the contact and the key is his phone number.
Because one contact can have X phone numbers, the same contact can appear in X different entries of the dictionary. I need that so I can find any contact from any of its numbers.
All of the above works like a charm.
Now I need to compare all those numbers against my online database (note: I'm using Parse), to see whether some of these contacts are already users or not. If they are, they need to be put in a specific section of my table view. (My table view is just all the contacts, separated into letter sections, plus one "user" section.) A contact cannot appear in both the user section AND a letter section: if a contact is a user, he must be separated out.
What I'm asking vs. what I'm doing:
Right now, I'm just re-looping over every array and comparing each element to all the users I've found online. This is a lot of looping and looks like a waste of time and resources.
What I would like to do: somehow clean my arrays of the users I've found, given that I have a reference to the contact object in my dictionary.
TL;DR:
My arrays:
users in the first section, then contacts alphabetically
[[user1, user2, user3, ...],[a1,a2,a3,...],[b1,b2,...],...]
My dictionary:
a1 - phone1
a1 - phone2
a1 - phone3
a2 - phone1
a3 - phone1
...
The ultimate question:
I can very easily find the contact object (since I have his number from my online DB). If I interact with the a1 from the dictionary, will it also change the a1 in the array of arrays?
More specifically, can I somehow REMOVE IT from the arrays, considering I don't know which one he is in?
"I add a reference to each contact in a dictionary, where the object is the contact and the key is his phone number."
You need to be very careful with this approach. It is likely to have collisions. While cell phone numbers are often unique, sometimes they're shared. Home and work numbers are often shared. Phone numbers get reassigned, so your database can wind up with duplicates that way, too. And sometimes people just enter mistaken data. You have to make sure your system behaves reasonably and consistently in those cases. Many behaviors are fine, but "pick one at random" is generally not.
You should think carefully here about what is your model and what is your presentation. One data structure should be the single source of truth. Usually that's going to be your big list of contacts. That has nothing at all to do with how it's displayed. We call this the model. Often it's kept track of as an NSSet since order doesn't matter.
Above that you should have a "view model" that deals with grouping and sorting issues. It is not "truth." You should be willing to throw it away anytime the underlying data changes extensively. Its data structures should point to the underlying model, and should be stored in a way that exactly matches what the table view wants. Keeping the model and the view model separate is one of the best ways to keep your complexity under control. Then you know that there is exactly one place that data can change (the model), and everything else just reacts to that.
Regarding your partitioning problem: so you have a list of contacts and you want to separate them into two groups based on whether any of their phone numbers appear in another list. If your total list is only a few dozen entries long, frankly it doesn't matter how you do this. Even O(n^2) is fine for small enough n. Focus on making it simple and reliable first, then profile with Instruments to see where the real bottlenecks are.
That said, usually the fastest way to determine set intersection is to sort both sets and walk through them both at the same time. So you'd create a "contacts" array of "phone number + contact pointer" and a "users" array of just phone numbers. Sort them both by phone number. Walk over them, comparing the current element of each list, and then incrementing the index of the smaller one. If no match, put the contact in one list. If a match, put it in the other.
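For what it's worth, a rough sketch of that merge-style walk, in Clojure (the language of this page's first question) rather than Objective-C; the input shapes are assumptions:

(defn split-by-user-numbers
  "Walk two sorted collections in parallel: [phone contact] pairs and known
   user phone numbers. Returns [users non-users]. A contact with several
   numbers appears once per number, as in the description above."
  [contact-phone-pairs user-numbers]
  (loop [pairs     (sort-by first contact-phone-pairs)
         users     (sort user-numbers)
         matched   []
         unmatched []]
    (cond
      (empty? pairs) [matched unmatched]
      (empty? users) [matched (into unmatched (map second pairs))]
      :else
      (let [[phone contact] (first pairs)
            c (compare phone (first users))]
        (cond
          (zero? c) (recur (rest pairs) users (conj matched contact) unmatched)
          (neg? c)  (recur (rest pairs) users matched (conj unmatched contact))
          :else     (recur pairs (rest users) matched unmatched))))))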
But I'd probably just stick all the phone numbers in a set and use member: to look them up. It's just usually easier.
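In the same spirit, the set-based version (the analogue of looking numbers up with NSSet's member: or containsObject:) is only a few lines; the :phones field name is an assumption about the contact shape:

(defn partition-contacts
  "Split contacts into [users others]: a contact counts as a user if any of
   its phone numbers appears in the set of known user numbers."
  [contacts user-phone-numbers]
  (let [user? (fn [contact] (some user-phone-numbers (:phones contact)))]
    [(filter user? contacts)
     (remove user? contacts)]))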
I was reading through the Rails tutorial (http://ruby.railstutorial.org/book/ruby-on-rails-tutorial#sidebar-database_indices) but was confused by the explanation of database indices. Basically the author proposes that rather than searching in O(n) time through a list of emails (for login), it's much faster to create an index, giving the following example:
To understand a database index, it’s helpful to consider the analogy
of a book index. In a book, to find all the occurrences of a given
string, say “foobar”, you would have to scan each page for “foobar”.
With a book index, on the other hand, you can just look up “foobar” in
the index to see all the pages containing “foobar”.
source: http://ruby.railstutorial.org/chapters/modeling-users#sidebar:database_indices
So what I understand from that example is that words can be repeated in text, so the "index page" consists of unique entries. However, on the railstutorial site, login is set up such that each email address is unique to an account, so how does having an index make it faster when we can have at most one occurrence of each email?
Thanks
Indexing isn't (much) about duplicates. It's about order.
When you do a search, you want to have some kind of order that lets you (for example) do a binary search to find the data in logarithmic time instead of searching through every record to find the one(s) you care about (that's not the only type of index, but it's probably the most common).
Unfortunately, you can only arrange the records themselves in a single order.
An index contains just the data (or a subset of it) that you're going to search on, plus pointers (of some sort) to the records containing the actual data. This allows you to (for example) do searches based on as many different fields as you care about, and still be able to do a binary search on all of them, because each index is arranged in order by that field.
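As a toy illustration (in Clojure, matching the first question on this page; the data is made up): an index is just the search key plus a row pointer kept in sorted order, so lookup can be binary instead of a full scan.

(def rows [{:email "zoe@example.com" :name "Zoe"}
           {:email "amy@example.com" :name "Amy"}
           {:email "bob@example.com" :name "Bob"}])

;; "Index": [email row-position] pairs sorted by email, independent of row order.
(def email-index
  (vec (sort-by first (map-indexed (fn [i row] [(:email row) i]) rows))))

(defn lookup-by-email
  "Binary search the index (O(log n)) instead of scanning every row (O(n))."
  [email]
  (loop [lo 0, hi (dec (count email-index))]
    (when (<= lo hi)
      (let [mid (quot (+ lo hi) 2)
            [k row-pos] (email-index mid)
            c (compare email k)]
        (cond
          (zero? c) (rows row-pos)
          (neg? c)  (recur lo (dec mid))
          :else     (recur (inc mid) hi))))))

A unique column benefits just as much as a repeated one: the win comes from the ordering, not from collapsing duplicates.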
Because the index in the DB, as in the given example, is sorted alphabetically, while the raw table / book is not. Then think: how do you search an index knowing it is sorted? You don't start reading at "A" and continue up to the point of your interest. Instead you skip roughly to the point of interest and start searching from there. A DB can basically do the same with an index.
It is faster because the index contains only values from the column in question, so it is spread across a smaller number of pages than the full table. Also, indexes usually include additional optimizations such as hash tables to limit the number of reads required.
In my quest to understand Mnesia, I still struggle with thinking in relational terms. So I will put my struggles up here and ask for the best way to solve them.
one-to-many-relations
Say I have a bunch of people,
-record(contact, {name, phone}).
Now, I know that I can define phone to always be saved as a list, so people can have multiple phone numbers, and I suppose that's the way to do it (is it? How would I then look this up the other way around, say, find the name for a given number?).
many-to-many-relations
Now let's suppose I have multiple groups I can put people in. The group names don't have any significance, they are just names; the concept is "unix system groups" or "labels". Naively, I would model this membership as a proplist, like
{groups, [{friends, bool()}, {family, bool()}, {work, bool()}]} %% and so on...
as a field within the "contact" record from above, for example. What is the best way to model this within Mnesia if I want to be able to look up all members of a group based on the group name quickly, and also want to be able to look up all groups an individual is registered in? I could also just model this as a list containing only the group identifiers, of course. For use with Mnesia, what is the best way to model this?
I apologize if this question is dumb. There's plenty of documentation on Mnesia, but it's lacking (IMO) some good examples of overall use.
For the first example, consider this record:
-record(contact, {name, phone}).  %% phone holds a list: [phonenumber, phonenumber, ...]
contact is a record with two fields, name and phone, where phone is a list of phone numbers. As user425720 said, it could make sense to store these as something other than strings if, for example, you have extreme requirements for a small storage footprint.
Now here comes the part that is hard to "get" with key-value stores: you need to also store the inverse relationship. In other words, you need something similar to the following:
-record(phone, {phonenumber, contactname}).
If you have a layer in your application to abstract away database handling, you could make it always add/change the phone records when adding/changing a contact.
--
For the second example, consider these two records:
-record(contact, {uuid, name, groups}).   %% groups:   [group_id, group_id, ...]
-record(group,   {uuid, name, contacts}). %% contacts: [contact_id, contact_id, ...]
The easiest way is to just store ids pointing to the related records. As Mnesia has no concept of referential integrity, this can get out of sync if you, for example, delete a group without removing that group from all users.
If you need to store the type of group on the contact record, you could use the following:
-record(contact, {name, groups}).  %% groups: [{family, [group_id, group_id]}, {work, [...]}]
--
Your second problem could also be solved by using an intermediate record, which you can think of as a "membership".
-record(contact, {uuid, name, ...}).
-record(group, {uuid, name, ...}).
-record(membership, {contact_uuid, group_uuid}). %% must use the 'bag' table type
There can be any number of "membership" records: one record for each group a user belongs to.
First of all, you are asking about key-value store design patterns. Perfectly fine.
Before I try to answer your question, let's make clear what Mnesia is. It is a key-value DB included in OTP. Because it is native, it is very comfortable to use from Erlang. But be careful: it is an old database with some very ancient assumptions (e.g. data distribution with linear hashing). So go ahead, learn and play with it, but for production take your time and browse the NoSQL landscape to find the best fit for your needs.
Regarding the telephone example: do not store this stuff as strings (list()) - it is very heavy on the GC. I would make a couple of fields like phone_1 :: binary(), phone_2 :: binary(), phone_extra :: [binary()] and build an index on the most frequently queried field. Also, Mnesia indices are tricky - when a node crashes and comes back up, they need to rebuild themselves (which can take an awfully long time).
Regarding the family example: it is quite hard with a flat namespace. You may play with more complex keys. Maybe create a separate table for the group and keep the identifiers of its members? Or each member could hold the ids of the groups he belongs to (hard to maintain...). If you want to recognize friends, I would enforce some sort of contract before presenting the data (A is B's friend iff B is A's friend) - this approach copes with eventual consistency and conflicts in the data.