Which is faster for finding a random string: random line order or sorted? - grep

We want to find a random string, e.g. "ASDF555", in a very BIG file of unique lines, one of which contains that string. Which is faster (in time, with a simple grep command) to find the mentioned string, if the "BIG file" is:
sorted
or in random order?
Of course, ASDF555 could be anything!
We suspect that it's faster to have the lines in random order, since the string to search for could be random too. But we cannot prove this idea.

grep does not "know" your file is sorted, so it needs to go over it line by line; the fact that it's sorted is therefore inconsequential to grep. To put it the other way around: sortedness cannot hurt your search speed, since you can always still scan the file line by line until you find the desired string.
However, if the file is indeed sorted, you may implement a better search algorithm (e.g., binary search) instead of using grep.
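As a rough illustration (a sketch, not part of the original answer), assuming the file is sorted in plain byte order (e.g. LC_ALL=C sort) and the string you look for sits at the start of a line, a binary search needs only O(log n) comparisons where grep scans linearly. The file name below is hypothetical:

import bisect

def contains_line(path, query):
    # For brevity this sketch loads the sorted lines into memory and bisects them;
    # a file too big for that would instead be binary searched by seeking to byte
    # offsets (the Unix look(1) utility does exactly that on a sorted file).
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    i = bisect.bisect_left(lines, query)
    return i < len(lines) and lines[i].startswith(query)

print(contains_line("big_sorted_file.txt", "ASDF555"))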

Related

How to write a hashmap to a file in a memory efficient format?

I am writing a Huffman coding/decoding algorithm and I am running into the problem that storing the Huffman tree is taking up way too much room. Currently I am converting the tree into a hashMap, as in hashMap<Character(s),Huffman Code>, and then storing that hash map. The issue is that, while the string itself compresses nicely, adding the Huffman tree data stored in the hash map adds so much overhead that the result actually ends up bigger than the original. Currently I am just naively writing [data, value] pairs to the file, but I imagine there must be some trickier way to do that. Any ideas?
You do not need the tree in order to encode. All you need is the bit lengths for each symbol and a way to order the symbols. See Canonical Huffman Code.
In fact, all you need is the coded symbols ordered by bit length (and within each bit length sorted by symbol), plus the number of codes of each length. With just those two things you can encode.
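For illustration only (not code from the answer), here is a minimal sketch of the canonical-code construction: given only a symbol-to-bit-length table (the lengths below are a made-up example), both encoder and decoder can rebuild identical codes.

def canonical_codes(lengths):
    # Order symbols by (bit length, symbol); this ordering is the only
    # convention the encoder and decoder need to share.
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in symbols:
        code <<= (lengths[s] - prev_len)   # widen the code when the bit length grows
        codes[s] = format(code, "0{}b".format(lengths[s]))
        code += 1
        prev_len = lengths[s]
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}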

Open and extract information from large text file (Geonames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the geonames "allCountries.txt" file it won't open in Notepad, Notepad++ or Sublime. I've tried opening it in Excel (including the data-modelling function), but the file has more than a million rows, so that won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate them in Excel and/or some other software? I am only after place name, lat, long, country name and continent.
dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will let you filter by country or any other column, i.e. you can adapt this solution to filter by language, region in the UK, population, etc., or apply it to the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below says: find all rows whose 9th tab-separated column is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters i.e. an optional column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.
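For completeness, here is a rough Python equivalent of the grep command (an illustrative sketch, not part of the original answer), keeping only a few of the requested columns; the column numbers follow the geonames readme (name is the 2nd field, latitude the 5th, longitude the 6th, country code the 9th):

with open("allCountries.txt", encoding="utf-8") as src, \
     open("UK.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 8 and cols[8] == "GB":   # 9th column, zero-based index 8
            # keep just name, latitude, longitude and country code
            dst.write("\t".join([cols[1], cols[4], cols[5], cols[8]]) + "\n")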

How to match efficiently against keys in a table in Lua?

Available in my Lua 5.1 environment are obviously the default Lua pattern matching, but also a reasonably recent version of PCRE and LPEG. I don't honestly care which of these is used; as long as my problem is tackled in an efficient manner I'm happy. (My personal knowledge of LPEG especially is next to non-existent, but I hear it has some very good qualities.)
I have a table with certain string patterns as keys; the accompanying values are to be used once the key matches, which means they aren't really important for this matter.
Suppose you have:
tbl = { ["aaa"] = 12, ["aab"] = 452, ["aba"] = -2 }
Now my goal is to find out which one of these matches first in a particular string like "accaccaacaadacaabacdaaba".
In reality, the keys are more numerous and the string to match against is considerably longer. This means that simply matching against all keys one by one and comparing the position where each match begins is a very inefficient solution that is not viable for me.
Parts of the match strings can have considerable overlaps, too. From the theory, I know one state machine per key pattern would be ideal in this regard; just go through the motions on every pattern and the moment you have a complete match on one of them you are done.
But I would be crazy to code something like that myself when there are so many pattern-matching libraries in my environment. The only one I know to be technically capable of this is PCRE: just join the keys like "aaa|aab|aba" and you'll get the first feasible match.
But there's a problem there too. For one, I am unsure how intelligently it compiles such a pattern. (I think it first tries 'aaa', unwinds completely once it fails, then tries 'aab' from scratch, but I haven't tested.) That wouldn't be very efficient compared to matching it like "a(a[ab]|ba)", where the shared prefixes get resolved faster.
Additionally, I'd like the flexibility to put in some wildcards ("a.ad", where the second character doesn't matter, or matches a number... basic stuff like that). With a pattern like that in such an additive approach, I don't see a way to recover which original pattern matched, so that I can use the value that goes with it.
(Worst case, I could just generate a lot of entries in the table to match every possible wildcard variation and do away with the pattern requirement, but I honestly don't want to.)
Which library is the right tool for the job, and to boot, how to best use said library to achieve above-stated goals without reinventing the wheel?
A comment on your question mentioned the Aho–Corasick algorithm.
If your environment has access to os.execute or io.popen, you can call fgrep -o -f patterns filename, where patterns is the name of a file that contains patterns separated with newlines, and filename is the name of your input. -o means that only matches will be output, one per line. You can replace filename with - so that fgrep reads from standard input: echo "String to match" | fgrep -o -f patterns.
fgrep implements the Aho–Corasick algorithm.
However, remember that the Aho–Corasick algorithm works on fixed strings and does not recognise metacharacters.
As Alexander Mashin's answer says, the Aho–Corasick algorithm is an efficient algorithm that will solve your problem. In Lua land, cloudflare/lua-aho-corasick is an implementation for LuaJIT using FFI. There's also a pure-Lua implementation, jgrahamc/aho-corasick-lua, which might be slower.
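To make the idea concrete, here is a small language-agnostic sketch of Aho–Corasick in Python (not Lua, and not taken from either library above), run against the table from the question. Note that it handles literal keys only, with no metacharacters:

from collections import deque

def build_automaton(patterns):
    # Node layout: outgoing edges, failure link, patterns that end at this node.
    nodes = [{"next": {}, "fail": 0, "out": []}]   # node 0 is the root
    for pat in patterns:
        cur = 0
        for ch in pat:
            nxt = nodes[cur]["next"].get(ch)
            if nxt is None:
                nodes.append({"next": {}, "fail": 0, "out": []})
                nxt = len(nodes) - 1
                nodes[cur]["next"][ch] = nxt
            cur = nxt
        nodes[cur]["out"].append(pat)
    # Breadth-first pass to set failure links (longest proper suffix of the
    # current path that is also a path from the root).
    queue = deque(nodes[0]["next"].values())
    while queue:
        cur = queue.popleft()
        for ch, nxt in nodes[cur]["next"].items():
            queue.append(nxt)
            f = nodes[cur]["fail"]
            while f and ch not in nodes[f]["next"]:
                f = nodes[f]["fail"]
            nodes[nxt]["fail"] = nodes[f]["next"].get(ch, 0)
            nodes[nxt]["out"] += nodes[nodes[nxt]["fail"]]["out"]
    return nodes

def first_match(nodes, text):
    # Returns (pattern, start index) of the earliest-ending match, or None.
    cur = 0
    for i, ch in enumerate(text):
        while cur and ch not in nodes[cur]["next"]:
            cur = nodes[cur]["fail"]
        cur = nodes[cur]["next"].get(ch, 0)
        if nodes[cur]["out"]:
            pat = nodes[cur]["out"][0]
            return pat, i - len(pat) + 1
    return None

tbl = {"aaa": 12, "aab": 452, "aba": -2}
auto = build_automaton(list(tbl))
match = first_match(auto, "accaccaacaadacaabacdaaba")
print(match, tbl[match[0]] if match else None)   # ('aab', 14) 452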

How do I efficiently search through an ordered list?

I have a function that predicts the word being typed and returns the possibilities in an array. Unfortunately those aren't sorted by frequency of use. I also have a list of 10K words ordered from most to least frequent. What would be an efficient way to compare the words in the array against the ordered list and return the most frequent one (i.e. the one encountered first in the list)?
I was tipped off by a friend to use a binary search tree, but I really don't see how that helps me. From what I understood from the following website, only numerical values can be used. Am I wrong in thinking so? Is there a better way of doing the aforementioned task?
Thanks in advance
You could create a dictionary with words as keys and frequencies as values. Then iterate over your result array, use the dictionary to obtain the frequency value for each item, and predict the item with the highest frequency.
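A sketch of that idea (in Python rather than Swift, purely for illustration; the word lists are placeholders). Because the 10K-word list is already ordered, its index can stand in for the frequency value:

ranked_words = ["the", "be", "to", "of", "and"]            # placeholder for the 10K-word list, most frequent first
candidates = ["and", "of", "banana"]                       # placeholder for the predictor's output array

rank = {word: i for i, word in enumerate(ranked_words)}    # word -> position, lower = more frequent
known = [w for w in candidates if w in rank]
best = min(known, key=rank.get) if known else None
print(best)                                                # "of", the most frequent candidate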
I wouldn't use a vanilla binary search tree here. It would be possible - as Taylor Kirkpatrick says, you could just create a tree with words as keys and frequencies as values and use that to find the frequency for each result word, in much the same way as the dictionary solution.
The problem is that you cannot guarantee that a simple binary tree will be balanced. From the sound of it your data would probably be OK, since your words are in frequency order. The worst case would be if the words were in alphabetical order: then your binary tree would end up identical to a linked list. It would never branch, since every node would attach to the right of the previous one, so the computational complexity of a search would be the same as iterating over the array of words, O(n) instead of O(log2 n) (the best case for binary trees).
Of course, you could guard against this by randomising the list of words before doing the inserts. But to my mind it's just easier to use a dictionary. I don't know what the actual implementation of Swift dictionaries is (and we won't until they open-source it in a couple of months), but you can take it as read that it will outperform a vanilla BST for value retrieval.
I don't know what the background to this problem is - if you are learning CS it might be worth implementing the BST just for intellectual growth - in this case, with only 10,000 items you might find the performance differences are ultimately quite small. But if you are a working programmer trying to solve a problem, go with the dictionary approach.
You put all your words into a dictionary or a set. That's it. Dictionary if you have data associated with the words, set if you have no data and just want to know if the word is in the list or not.
You might want to use a Trie.
Put your word list into it. For every character entered, you traverse the Trie as deeply as you can and then show all paths to leaf nodes as possible completions.
Since the word list you have is likely static, you can precompute the Trie and load it from disk/network/whatever at startup if performance is a concern.
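A small sketch of the Trie idea (again in Python rather than Swift, with a placeholder word list), showing insertion, walking the typed prefix, and collecting the completions below it:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def completions(root, prefix):
    node = root
    for ch in prefix:                      # traverse as deeply as the prefix allows
        node = node.children.get(ch)
        if node is None:
            return []
    results = []
    def collect(n, path):                  # every word under this node is a completion
        if n.is_word:
            results.append(prefix + path)
        for ch, child in n.children.items():
            collect(child, path + ch)
    collect(node, "")
    return results

root = TrieNode()
for w in ["car", "card", "care", "cat", "dog"]:
    insert(root, w)
print(completions(root, "car"))            # ['car', 'card', 'care'] (order may vary)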
You can use a binary search tree with anything as the actual value. To actually make use of the tree, use the frequency of the words as the numerical value. This is actually a pretty good solution to your problem. Each node of the tree will contain this word and a numeric value that represents the frequency of the word.
Hope that helps.

Validate words against an English dictionary in Rails?

I've done some Google searching but couldn't find what I was looking for.
I'm developing a Scrabble-type word game in Rails, and was wondering if there is a simple way to validate that what the player inputs in the game is actually a word. They'd be typing the word out.
Is validating against some sort of English-language dictionary database loaded within the app the best way to solve this problem? If so, are there any libraries that offer this kind of functionality? If not, what would you suggest?
Thanks for your help!
You need two things:
a word list
some code
The word list is the tricky part. On most Unix systems there's a word list at /usr/share/dict/words or /usr/dict/words -- see http://en.wikipedia.org/wiki/Words_(Unix) for more details. The one on my Mac has 234,936 words in it. But they're not all valid Scrabble words. So you'd have to somehow acquire a Scrabble dictionary, make sure you have the right license to use it, and process it so it's a text file.
(Update: The word list for LetterPress is now open source, and available on GitHub.)
The code is no problem in the simple case. Here's a script I whipped up just now:
words = {}
File.open("/usr/share/dict/words") do |file|
  file.each do |line|
    words[line.strip] = true
  end
end
p words["magic"]
p words["saldkaj"]
This will output
true
nil
I leave it as an exercise for the reader to make it into a proper Words object. (Technically it's not a Dictionary since it has no definitions.) Or to use a DAWG instead of a hash, even though a hash is probably fine for your needs.
A piece of language-agnostic advice here: if you only care about the existence of a word (which in this case you do), and you are planning to load the entire word list into the application (which your question suggests you're considering), then a DAWG will let you check existence in O(n) time, where n is the length of the word; the dictionary size has no effect, so overall a lookup is essentially O(1). It is also a relatively compact structure in memory (indeed, some insertions actually reduce the size of the structure: a DAWG for "top, tap, taps, tops" has fewer nodes than one for "tops, tap").
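To make the node-sharing point concrete, here is a hand-built sketch (illustrative Python, not a general DAWG builder) of a DAWG for "top", "tops", "tap", "taps". A lookup only ever does one edge step per letter, and the 'o'/'a' branches reuse the same suffix nodes:

# Each node is (accepting?, {letter: index of next node})
dawg = [
    (False, {"t": 1}),          # 0: start
    (False, {"o": 2, "a": 2}),  # 1: after 't'; both branches reuse node 2
    (False, {"p": 3}),          # 2: shared continuation
    (True,  {"s": 4}),          # 3: "top" / "tap" accepted here
    (True,  {}),                # 4: "tops" / "taps" accepted here
]

def contains(word):
    node = 0
    for ch in word:
        edges = dawg[node][1]
        if ch not in edges:
            return False
        node = edges[ch]
    return dawg[node][0]        # did we stop on an accepting node?

for w in ["top", "taps", "to", "tip"]:
    print(w, contains(w))       # True, True, False, False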
