I'm asking about the question found here http://www.geeksforgeeks.org/flipkart-interview-set-2-sde-2/
"(1) There is a stream of characters and at any time we need to find and remove (means set occurrence = 0) character which has maximum occurrence till now. Design data structure and algorithm for same. (I used standard Heap and Hash table setup, then was asked if we can replace lg(n) Heap operations with some efficient operation in practical scenario. I came up with doubly linked list and moving character to front on basis of its occurrences)."
I can't begin to understand the question. Any character occurring for the first time has the maximum occurrence count so far (1 > 0) so every character should get removed every time. Does anyone have a clue what the person might have actually meant?
find-first-non-repeating-character-stream-characters
Define data structure LinkedHashMap<charecter,count of occurances>
Traverse along stream of characters if entry exists increment the
count by one else add a new entry with count as one.
After traversal is over first character with count 1 will be first
non-repeating character
Related
I'm scanning through the order list using the standard OrderSelect() function. Since there is a great function to get the current _Symbol for an order, I expected to find the equivalent for finding the timeframe (_Period). However, there is no such function.
Here's my code snippet.
...
for (int i=orderCount()-1; i>=0; i--) {
if (OrderSelect(i, SELECT_BY_POS, MODE_TRADES)) {
if (OrderMagicNumber()==magic && OrderSymbol()==_Symbol ) j++;
// Get the timeframe here
}
}
...
Q: How can I get the open order's timeframe given it's ticket number?
In other words, how can I roll my own OrderPeriod() or something like it?
There is no such function. Two approaches might be helpful here.
First and most reasonable is to have a unique magic number for each timeframe. This usually helps to avoid some unexpected behavior and errors. You can update the input magic number so that the timeframe is automatically added to it, if your input magic is 123 and timeframe is M5, the new magic number will be 1235 or something similar, and you will use this new magic when sending orders and checking whether a particular order is from your timeframe. Or both input magic and timeframe-dependent, if you need that.
Second approach is to create a comment for each order, and that comment should include data of the timeframe, e.g. "myRobot_5", and you parse the OrderComment() in order to get timeframe value. I doubt it makes sense as you'll have to do useless parsing of string many times per tick. Another problem here is that the comment can be usually changed by the broker, e.g. if stop loss or take profit is executed (and you need to analyze history), and if an order was partially closed.
One more way is to have instances of some structure of a class inherited from CObject and have CArrayObj or array of such instances. You will be able to add as much data as needed into such structures, and even change the timeframe when needed (e.g., you opened a deal at M5, you trail it at M5, it performs fine so you close part and virtually change the timeframe of such deale to M15 and trail it at M15 chart). That is probably the most convenient for complex systems, even though it requires to do some coding (do not forget to write down the list of existing deals into a file or deserialize somehow in OnDeinit() and then serialize back in OnInit() functions).
Let's set the context/limitations:
A linked-list consists of Node objects.
Nodes only have a reference to their next node.
A reference to the list is only a reference to the head Node object.
No preprocessing or indexing has been done on the linked-list other than construction (there are no other references to internal nodes or statistics collected, i.e. length).
The last node in the list has a null reference for its next node.
Below is some code for my proposed solution.
Node cursor = head;
Node middle = head;
while (cursor != null) {
cursor = cursor.next;
if (cursor != null) {
cursor = cursor.next;
middle = middle.next;
}
}
return middle;
Without changing the linked-list architecture (not switching to a doubly-linked list or storing a length variable), is there a more efficient way to find the middle element of singly-linked list?
Note: When this method finds the middle of an even number of nodes, it always finds the left middle. This is ideal as it gives you access to both, but if a more efficient method will always find the right middle, that's fine, too.
No, there is no more efficient way, given the information you have available to you.
Think about it in terms of transitions from one node to the next. You have to perform N transitions to work out the list length. Then you have to perform N/2 transitions to find the middle.
Whether you do this as a full scan followed by a half scan based on the discovered length, or whether you run the cursor (at twice speed) and middle (at normal speed) pointers in parallel is not relevant here, the total number of transitions remains the same.
The only way to make this faster would be to introduce extra information to the data structure which you've discounted but, for the sake of completeness, I'll include it here. Examples would be:
making it a doubly-linked list with head and tail pointers, so you could find it in N transitions by "squeezing" in from both ends to the middle. That doubles the storage requirements for pointers however so may not be suitable.
having a skip list with each node pointing to both it's "child" and its "grandchild". This would speed up the cursor transitions resulting in only about N in total (that's N/2 for each of cursor and middle). Like the previous point, there's an extra pointer per node required for this.
maintaining the length of the list separately so you could find the middle in N/2 transitions.
same as the previous point but caching the middle node for added speed under certain circumstances.
That last point bears some extra examination. Like many optimisations, you can trade space for time and the caching shows one way to do it.
First, maintain the length of the list and a pointer to the middle node. The length is initially zero and the middle pointer is initially set to null.
If you're ever asked for the middle node when the length is zero, just return null. That makes sense because the list is empty.
Otherwise, if you're asked for the middle node and the pointer is null, it must be because you haven't cached the value yet.
In that case, calculate it using the length (N/2 transitions) and then store that pointer for later, before returning it.
As an aside, there's a special case here when adding to the end of the list, something that's common enough to warrant special code.
When adding to the end when the length is going from an even number to an odd number, just set middle to middle->next rather than setting it back to null.
This will save a recalculation and works because you (a) have the next pointers and (b) you can work out how the middle "index" (one-based and selecting the left of a pair as per your original question) changes given the length:
Length Middle(one-based)
------ -----------------
0 none
1 1
2 1
3 2
4 2
5 3
: :
This caching means, provided the list doesn't change (or only changes at the end), the next time you need the middle element, it will be near instantaneous.
If you ever delete a node from the list (or insert somewhere other than the end), set the middle pointer back to null. It will then be recalculated (and re-cached) the next time it's needed.
So, for a minimal extra storage requirement, you can gain quite a bit of speed, especially in situations where the middle element is needed more often than the list is changed.
Sorry, I made a mistake in my earlier question. Because of that I didn't get the answer I wanted.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n. However, we may run into a case with LinkedList where we may be easily misled. For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n. Logically, we may think that the run time would be log n, but it's not. Why is that? And how do you determine that?
Do we need to absolutely have splitting to get a run-time of log n? I don't think it makes any logical sense to say the run-time of n when the maximum run-time of the loop is n/2.
I think some concepts need a bit of refining here, because the time complexity is only related to algorithm, not to the size of the data structure you're operating on.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n.
Now, traversing an array, like
for (int i = 0; i < array.size; i++) {
variable = array[i];
}
runs in O(n): the time needed to perform such an operation varies linearly with the size of the array. You will have O(log n) for operations like a binary search on an array, but you cannot generalize this concept to all array operations, and especially not to those who need to iterate over the array.
Now, this sentence
For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n.
leads me to believe that you think that the n as used in big O and what you call the "nth element" are directly related. They aren't. On a linked list your only option to go to element n is to go to the start of the list and follow the links down the element you're looking for (or in the case of a double linked list, go to the start or end depending on the position of the element you're looking for), so this operation has a time complexity of O(n), ie linearly related to the length of the collection.
This question was asked in one of the big software company. I have come up with a simple solution and I want to know what others feel about the solution.
You are supposed to design an API and a backend for a system that can
allot phone numbers to people living in a city. The phone numbers will
start from 111-111-1111 and end at 999-999-9999. The API should enable
the clients (people in the city) to do the following:
When a client requests for a phone number, it allots one of the available numbers to them.
Some clients may want fancy numbers, so they can specifically ask for a number to be alloted to them. If the requested number is
available then the system allots it to them, otherwise the system
allots any available number.
The system need not have to know which number is alloted to which
client. The same client may make successive requests and get multiple
phone numbers for himself, but the system is not bothered. At any
point of time, the system only knows which phone numbers are alloted
and which phone numbers are free.
The numbers from 111-111-1111 to 999-999-9999 roughly corresponds to 8 billion numbers. Assuming that memory is not a constraint, I can think of the following two approaches (which are almost similar).
Maintain a huge boolean array of length 8 billion and have a next pointer that points to an array index (next is initialized to zero). If the value pointed by next is not free, then forward next until a free number is found. When fancy numbers are requested, just check whether the corresponding index position is free and return the number. The downside of this approach is, when allocating numbers in a regular way, if there is a huge chunk (say 1 billion) numbers in the middle that was allocated by fancy allocation, then the next pointer has to be moved 1 billion times.
To overcome the difficulty mentioned in the previos design, we can use some sort of a linked hashmap. We maintain a doubly linked list (this replaces the array in the previous design) and another array of the same length as the list where each element of the array points to a corresponding element in the list. So when allocating numbers in regular method, we advance a pointer in the linked list and mark nodes as and when we allocate (same as the previous method). When allocating fancy numbers, we can directly find the node in the list that corresponds to the special number requested by first indexing into the array and the following the pointer. Once the node is identified, short circuit the previous node and the next node so that we do not have to skip the used numbers one by one (which was the problem with the previous approach) when doing a regular allocation.
Let me know whether I am on the right track. Please enlighten me with any important details that I am missing.
You can do significantly better in the anser to this question.
First you should design you API. The one recommended by Icarus3 is perfectly good:
string acquireNextAvailableNumber();
boolean acquireRequestedNumber(string special);
The second one returns true (and reserves the number) if it is available, otherwise returns false.
The question doesn't specify how you allocate phone numbers, so allocate them to suit yourself. Make the first 'next available' request return "111-111-1111", the next "111-111-1112" etc. This means you can record all the numbers allocated through 'next' by just remembering the last one allocated. (You'll need to ask whether '111-111-1119" is followed by "111-111-1120" or 111-111-1121", but that's the sort of thing you should be asking anyway. In any case, the important thing is you can work out what is the next number knowing the last allocated one.)
Special requests you will need to store individually. A hash table work, but so does a BST or simply an ordered list. It depends on what tradeoffs you want between space and speed, and how often special numbers are likely to be requested. I'll use a BST (ordered by the number) in the rest of this, for reasons I'll come to.
So, how do you code this? For the next allocated number:
Look at the last allocated number, and find the next in sequence.
Check that number hasn't been allocated as a special number. You can do this very quickly with a BST because if it's there, it will be the lowest entry in the BST.
If the number was in the 'special numbers' database, increment the 'allocated numbers' value (to include that number) and remove the entry from the special numbers. Then repeat this process until you get a number that isn't in the special numbers.
Note that this process ensures that all 'special numbers' lower than the last one allocated by 'next' do not appear in the special numbers database. As the 'last normal number allocated' increases, it absorbs any special numbers allocated that were less than that, removing them from the table. This is what ensures that when we ask whether the next number in sequence is in the special numbers database, we only have to look at the lowest entry.
Checking for a special number is easy. If it is lower than the last 'normal' number allocated it isn't available. Otherwise you check to see if it exists in the BST. If it doesn't, you add it to the BST.
You can optimize this process by storing not just single numbers in the BST, but storing ranges of numbers. If the allocated special numbers are dense, then it reduces the amount of space in the tree and the number of accesses to find if one is in there. During the test to find if the 'next' number discovers a rnage of size n, then you can immediately increment the highest normal number by n, instead of having to go round the loop n times.
First, you did not prototype your APIs. For example, if I have to design these APIs I will publish 2 APIs.
string acquireNextAvailableNumber();
string acquireRequestedNumber(string special);
Second, you need to decide how you are going to implement it. code driven or data driven ?
You can maintain hash for all these numbers ( it will consume memory ) and quickly query the availability of the number. Or
you could maintain single list to store only distributed numbers ( less memory ). So, whenever request comes, you start searching 1 to n numbers in that list ( increased time-complexity ). if any first (or requested) number isn't there then you allocate it to client and add that entry in the list.
As, there are billion numbers, you will need to consider the trade-off between space and time.
You could also take the advantage of the database.
To enhance previous answers, any BST may not be good enough as insertions or deletions can make it unbalanced. A balanced BST, e.g. Red-Black Tree, should be a good choice.
So, a Red-Black Tree can be created and filled in the beginning to represent available numbers, and each allocation should remove an element from it.
init(from, to) - can be done in O(n) time, a straightforward implementation would be O(n log n). But that is a one-time initialization on your server's start
acquireNextAvailableNumber() - should remove smallest element, time cost O(1)
acquireRequestedNumber(special) - should make a search and remove element if found, guaranteed time cost O(log n)
In Java, a TreeSet<String> or TreeSet<Integer> could be used since it is implemented with Red-Black Tree.
The next question would probably have been that several request-processing threads would access your API, so since Java's TreeSet is not thread-safe, you should have wrapped it at initialization like so:
TreeSet numbers = init(...);
SortedSet availableNumbers = Collections.synchronizedSortedSet(numbers);
I have a text file sized 300MB, I want to count the occurrences of each 10,000 substrings in the file. I want to know how to do it fast.
Now, I use the following code:
content = IO.read("path/to/mytextfile")
Word.each do |w|
w.occurrence = content.scan(w.name).size
w.save
end
Word is an ActiveRecord class.
It took me almost 1 day to finish the counting. Is there anyway to do it faster? Thanks.
Edit1:
Thank you again. I am running rails 2.3.9. The name filed of words table contains what I am searching for, and it contains only unique values. Instead of using Word.each, I use batch(1000 rows a time) load. It should help.
I rewrited the whole code with the idea from bpaulon. Now it only took a few hours to finish the counting.
I profiled the new version code, now the largest time costing methods are utf8 encode supported string truncating code
def truncate(n)
self.slice(/\A.{0,#{n}}/m)
end
and characters counting code
def utf8_length
self.unpack('U*').size
end
Any other faster methods to replace them?
Your use of scan creates an array, counts the size of it, then throws it away. If you have a lot of occurrences of the substring inside a big file, you will create a big array temporarily, potentially burning up CPU time with memory management, but that should still run pretty quickly, even with 300MB.
Because Word is an ActiveRecord class, it is dependent on the schema and any indexes in your database, plus any issues your database server might be having. If the database is not optimized or is responding slowly or the query used to retrieve the data is not efficient, then the iteration will be slow. You might find it a lot faster to grab groups of Word so they are in RAM, then iterate over them.
And, if the database and your code are running on the same machine, you could be suffering from resource constraints like having only one drive, not enough RAM, etc.
Without knowing more about your environment and hardware it's hard to say.
EDIT:
I can grab the substrings into an array/hash first, then add the count results to the array or hash, and write the results back to database after all the counting is done. You think it be faster, right?
No, I doubt that will help a lot, and, without knowing where the problem lies all you might do is make the problem worse because you'll have to load 10,000 records as objects from the database, then build a 10,000 element hash or array which will also be in memory along with the DB records, then write them out.
Ruby will only use a single core, currently, but you can gain speed by using Ruby 1.9+. I'd recommend installing RVM and letting it manage your Ruby. Be sure to read the instructions on that page, then run rvm notes and follow those directions.
What is your Word model and the underlying schema and indexes look like? Is the database on the same machine?
EDIT: From looking at your table schema, you have no indexes except for id which really won't help much for normal look-ups. I'd recommend presenting your schema on Stack Overflow's sibling site https://dba.stackexchange.com/ and explain what you want to do. At a minimum I'd add a key to the text fields to help avoid full table scans for any searches you do.
What might help more is to read: Retrieving Multiple Objects in Batches from "Active Record Query Interface".
Also, look at the SQL being emitted when your Word.each is running. Is it something like "select * from word"? If so, Rails is pulling in 10,000 records to iterate over them one by one. If it is something like "select * from word where id=1" then for every record you have a database read followed by a write when you update the count. That is the scenario that the "Retrieving Multiple Objects in Batches" link will help fix.
Also, I am guessing that content is the text you are searching for, but I can't tell for sure. Is it possible you have duplicated text values causing you to do scans more than once for the same text? If so, select your records using a unique condition on that field and then update your counts for all matching records at one time.
Have you profiled your code to see if Ruby itself can help you pinpoint the problem? Modify your code a little to process 100 or 1000 records. Start the app with the -r profile flag. When the app exits profiler will output a table showing where time was spent.
What version of Rails are you running?
I think you could approach this problem differently
You do not need to scan the file this many times, you could create a db, like in mongo or mysql, and for each word you find, you fetch the db for it and then adds on some "counter" field.
You could ask me "but then I will have to scan my database a lot and it could take a lot more". Well, sure you wouldn't ask this, but it won't take more time because databases are focused in IO, besides you could always index it.
EDIT: There is no way to delimit at all?? Let's say that where you have the a Word.name string you really holds a (not simple) regex. Could the regex contain the \n? Well, if the regex can contain any value, you should estimate the maximum size of string the regex can fetch, double it, and scan the file by that ammount of chars but moving the cursor by that number.
Lets say your estimate of the maximum your regex could fetch it is like 20 chars nad your file has from 0 to 30000 chars. You pass each regex you have from 0 to 40 chars, then again from 20 to 60, from 40 to 80, etc...
You should also hold the position you found of your smaller regex so it wouldn't repeat it.
Finally, this solution seems to be not worth the effort, your problem may have a greater solution based on what that regexes are, but it will be faster than invoke scan Words.count times your your 300Mb string.
You could load your entire "Word" table into a Trie, then do back-tracking since you said there are no delimiters in the text.
So for each character in the text, go down the Trie of words. If you hit a word, increment its count. "Going down the trie" involves three cases:
There's no node at this character. (If you're mid-search, pop the back-tracking stack)
There's a node at this character. (But it's not a Word)
There's a node at this character. (It's a Word - increment and "dirty")
Back-tracking is just keeping track of places you want to go after you've exhausted this "search" of the Trie, which is when you run out of nodes to visit. This will probably be each character you visit that is a root of the Trie.
After you've done this, you can then visit all the nodes you changed and just update the records they represent.
This will take some time to implement, but will surely be faster than each & scan.