I normally start my working sessions by moving to the root of my code tree and issuing "atom .".
However, I work with moderately big trees (in excess of 100,000 files), and I noticed that Atom needs to re-index the entire project at each launch.
Is there a way to tell Atom to save the index for the following session (and possibly to reindex based on a diff)?
I am developing an iOS app and I have a text file with one city name per line. The file contains about 3 million cities. To be able to perform searches and other operations on it I am using a B-Tree, but this tree takes a long time to create. Making the user wait for this every time they use the app is bad for the user experience. All this without using Core Data!
Any tips on how I can speed up this process?
Thank you
My recommendation is that you use SQLite with an index on the fields you want to query (or some other type of permanent, indexed storage) so that the user only has to wait the first time the app is opened, and then you can query the database, which will be much faster. I am also fairly certain that you can install a SQLite database from a pre-generated file, so you might be able to generate this index offline, bundle it with your application, and then the user has no wait time at all. I'm not 100% sure about this option though, so you should investigate.
Either way, there is no magic solution here. If the data you want is on line 2 million of the file, you will have to read 2 million lines of text in order to get to that line. I would recommend finding a way to make the UX of your app acceptable so that the user feels better about waiting for the data to load. If you display some sort of pretty screen with a progress bar while the data indexes, the user will be more forgiving of this wait.
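To make the SQLite suggestion concrete, here is a minimal sketch in Python using the standard sqlite3 module (on iOS you would use the SQLite C API or a wrapper instead). The table name cities, the column name, the index name and both function names are assumptions made up for this illustration. It loads the names once, indexes them, and then answers prefix searches from the index rather than by scanning the text file.

```python
import sqlite3


def build_city_db(db_path, city_names):
    """Create (or reopen) a SQLite database with an indexed city-name column.

    The schema here is hypothetical: one table `cities` with one column `name`.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute("CREATE TABLE IF NOT EXISTS cities (name TEXT NOT NULL)")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_cities_name ON cities (name)")
        # Only pay the loading cost when the table is still empty (first launch).
        (count,) = conn.execute("SELECT COUNT(*) FROM cities").fetchone()
        if count == 0:
            conn.executemany(
                "INSERT INTO cities (name) VALUES (?)",
                ((name,) for name in city_names),
            )
    return conn


def find_by_prefix(conn, prefix, limit=20):
    """Return up to `limit` city names starting with `prefix`, in sorted order.

    A range condition on the indexed column lets SQLite use the index.
    """
    upper_bound = prefix + "\U0010FFFF"  # highest code point, sorts after any suffix
    rows = conn.execute(
        "SELECT name FROM cities WHERE name >= ? AND name < ? ORDER BY name LIMIT ?",
        (prefix, upper_bound, limit),
    )
    return [row[0] for row in rows]
```

If the database file is generated offline with something like build_city_db and bundled with the app, the first-launch loading step disappears as well.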
The B-Tree will obviously take some time to be created. If you don't want to use a database but stick with your own B-Tree implementation you could dump the tree data to a separate file and load that when the program starts instead of recreating it every time. However, you will have to update the cached tree every time the source data is modified.
In Python the pickle module can help you, but most programming languages will have a serialisation module.
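As a rough sketch of that caching idea: the function below rebuilds the index only when the cache file is missing or the source file's modification time has changed. A plain sorted list stands in for the B-Tree here, and the function name and cache layout are assumptions for illustration only. Note that pickle files should only be loaded from sources you trust.

```python
import os
import pickle


def load_index(source_path, cache_path):
    """Return a sorted list of city names, reusing a pickled cache when it is fresh."""
    source_mtime = os.path.getmtime(source_path)

    # Fast path: the cache exists and was built from the current source file.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        if cached["source_mtime"] == source_mtime:
            return cached["index"]

    # Slow path: cache missing or stale, so rebuild from the text file.
    with open(source_path, encoding="utf-8") as f:
        index = sorted(line.strip() for line in f if line.strip())

    # Store the index together with the mtime it was built from.
    with open(cache_path, "wb") as f:
        pickle.dump(
            {"source_mtime": source_mtime, "index": index},
            f,
            protocol=pickle.HIGHEST_PROTOCOL,
        )
    return index
```

Comparing modification times is the simplest staleness check; a content hash would be more robust at the cost of reading the whole source file on every start.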
Does this file come with the application? If it does, you could process the file into an SQLite database before you ship the app, and include the database in the app. You can then use SELECT statements to search the data using indexed fields (like cityname).
If the file changes, still ship with a database and just send the amendments as a file, which would be applied to the database to bring it up to date. You may need to add a command to each line of that file, like REPLACE, NEW or DELETE.
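One way the amendments file could be applied is sketched below. The line format is purely hypothetical (a command, then tab-separated arguments: NEW and DELETE take one name, REPLACE takes the old name and the new one), and the table cities with a name column is likewise an assumption.

```python
import sqlite3


def apply_amendments(conn, lines):
    """Apply amendment lines to a `cities (name)` table in a single transaction.

    Hypothetical line format, tab-separated:
        NEW<TAB>name
        DELETE<TAB>name
        REPLACE<TAB>old name<TAB>new name
    """
    with conn:  # all amendments are committed together, or none on error
        for raw in lines:
            line = raw.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            command, *args = line.split("\t")
            if command == "NEW" and len(args) == 1:
                conn.execute("INSERT INTO cities (name) VALUES (?)", args)
            elif command == "DELETE" and len(args) == 1:
                conn.execute("DELETE FROM cities WHERE name = ?", args)
            elif command == "REPLACE" and len(args) == 2:
                old_name, new_name = args
                conn.execute(
                    "UPDATE cities SET name = ? WHERE name = ?",
                    (new_name, old_name),
                )
            else:
                raise ValueError(f"unrecognised amendment: {line!r}")
```

Because an open file yields its lines when iterated, the function can be called directly with a file object for the downloaded amendments file.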
I have a simple jquery which calls a servlet via get and then Neo4j is used to return data in JSON format.
The system is workable after the FIRST query, but the very first time it is used the system is unbelievably slow. This looks like some kind of initialisation issue. I am using Heroku web hosting.
The code is fairly long so I am not posting now, but are there any known issues regarding the first invocation of Neo4j?
I have done limited testing so far for performance as I had a lot of JSON problems anyway and they only just got resolved.
Summary:
JQuery(LINUX)<--> get (JSON) <---> Neo4j
First Query - response is 10-20 secs
Second Query - time is 2-3 secs
More queries - 2/3 secs.
This is not a one-off; I tested this a few times and always the same pattern comes up.
This is a normal behaviour of Neo4j where store files are mapped into memory lazily for parts of the files that become hot, and becoming hot requires perhaps thousands of requests to such a part. This is a behaviour that has big stores in mind, whereas for smaller stores it merely gets in the way (why not map the whole thing if it fits in memory?).
Then on top of that is an "object" cache that further optimizes access and gets populated lazily for requested entities.
Using an SSD instead of spinning media will usually speed up the initial non-memory-mapped random access quite a bit, but in your scenario I recognize that's not viable.
There are thoughts on being more sensitive to hot parts of the store (i.e. memory map even if not as hot) at the start of a database lifecycle, or more precisely having the heat sensitivity be a function of how much is currently memory mapped versus how much can be mapped at maximum. This has been shown to make initial requests much more responsive.
If I have a large number of files (n x 100K individual files), what would be the most efficient way to store them in the iOS file system (from the point of view of speed of access to a file by path)? Should I dump them all in a single folder or break them up into a multilevel folder hierarchy?
Basically this breaks down into three questions:
Does file access time depend on the number of "sibling" files? (I think the answer is yes. If I am correct, file names are organized into a B-tree, so it should be O(log n).)
How expensive is traversing from one folder to another along the path? (Is it something like m * O(log nm), where m is the number of components in the path and nm is the number of "siblings" at each path component?)
What gets cached at the file system level that would make the above assumptions incorrect?
It would be great if someone had direct experience with this kind of problem and could share some real-life results.
Your comments will be highly appreciated.
This seems like it might provide relevant, hard data:
File System vs Core Data: the image cache test
http://biasedbit.com/blog/filesystem-vs-coredata-image-cache
Conclusion:
File system cache is, as expected, faster. Core Data falls shortly behind when storing (marginally slower) but load times are way higher when performing single random accesses.
For such a simple case Core Data functionality really doesn't pay up, so stick to the file system version.
I think you should store everything in one folder and create a hash table containing key (file name) and value (source path) pairs. With a hash table the lookup complexity will be constant, O(1), and this will speed up your process as well.
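A minimal sketch of such a table, written in Python for brevity (on iOS an NSDictionary or a Swift Dictionary would play the same role). The function name is made up for this example, and it assumes file names are unique across folders; if they are not, the last file visited wins.

```python
import os


def build_path_table(root):
    """Walk `root` once and map each file name to its full path.

    Assumes file names are unique; a duplicate name overwrites the earlier entry.
    """
    table = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            table[name] = os.path.join(dirpath, name)
    return table
```

Looking up table[name] is O(1) on average, but it only saves the search for the path: opening the file afterwards still goes through the file system's own directory lookup.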
The file system is not an optimal database. With that many thousands of files, you should consider using Core Data or another database instead to store the name and contents of each file.
I'm working on an application that works like a search engine, and all the time it has workers in the background searching the web and adding results to the Results table.
While everything works perfectly, lately I started getting huge response times while trying to browse, edit or delete the results. My guess is that the Results table is being constantly locked by the workers who keep adding new data, which means web requests must wait until the table is freed.
However, I can't figure out a way to lower that load on the Results table and get faster response times for my web requests. Has anyone had to deal with something like that?
The search bots are constantly reading and adding new stuff; they add new results as they find them. I was wondering whether adding the results to the database in bulk only after the search has finished would help, or whether it would make things worse since each write would take longer.
Anyway, I'm at a loss here and would appreciate any help or ideas.
I'm using RoR 2.3.8 and hosting my app on Heroku with PostgreSQL
PostgreSQL doesn't lock tables in a way that blocks ordinary reads or writes: thanks to MVCC, readers and writers don't block each other. Start logging your queries and try to find out what is going on. Guessing doesn't help; you have to dig into it.
To check the current activity:
SELECT * FROM pg_stat_activity;
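Try the NOWAIT option (for example SELECT ... FOR UPDATE NOWAIT), which makes a statement fail immediately instead of waiting when it cannot get a lock. Since you're only adding new stuff with your background workers, I'd assume there would be no lock conflicts when browsing/editing/deleting.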
You might want to put a cache in front of the database. On Heroku you can use memcached as a cache store very easily.
This'll take some load off your db reads. You could even have your search bots update the cache when they add new stuff so that you can use a very long expiration time and your frontend Rails app will very rarely (if ever) hit the database directly for simple reads.
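The pattern described here, where the bots refresh the cache whenever they write, is often called a write-through cache. Below is a toy Python sketch of the idea, not Rails or memcached code: plain dicts stand in for PostgreSQL and memcached, and the class and method names are invented for the illustration.

```python
class ResultStore:
    """Toy write-through cache: dicts stand in for the database and memcached."""

    def __init__(self, database, cache):
        self.database = database   # stand-in for the PostgreSQL Results table
        self.cache = cache         # stand-in for memcached
        self.database_reads = 0    # counts how often a read had to hit the database

    def add_result(self, key, value):
        """Called by the search bots: write to the database, then refresh the cache."""
        self.database[key] = value
        self.cache[key] = value

    def get_result(self, key):
        """Called by the web frontend: try the cache first, fall back to the database."""
        if key in self.cache:
            return self.cache[key]
        self.database_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for the next reader
        return value
```

With this arrangement, anything a bot has written is served from the cache, so the frontend only reaches the database for entries that were evicted or never cached.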
I've been reading up on the Sphinx search engine and the Thinking Sphinx gem. In the TS docs it says...
Sphinx has one major limitation when compared to a lot of other search services: you cannot update the fields [of] a single document in an index, but have to re-process all the data for that index.
If I understand correctly, that means when a user adds or edits something, the change is not reflected in the index. So if they add a record it won't come up in searches until the entire index is rebuilt. Or if they delete a record, it will come up in searches, and then cause some kind of error or frustrating behavior.
Moreover, while rebuilding the index Sphinx is shut down. So, your app's search functionality goes off line regularly (once an hour, once every few hours), and anyone who tries to do a search then will get an error or a "try later" message.
OK, clearly none of that is acceptable in a real-world app. So you pretty much have to use delta indexing.
But apparently you still need to regularly shut down your search engine and do a full indexing...
Turning on delta indexing does not remove the need for regularly running a full re-index, as otherwise the delta index itself will grow to become just as large as the core indexes, and this removes the advantage of keeping it separate. It also slows down your requests to your server that make changes to the model records.
I don't really understand what the docs are saying here. Maybe someone can help me out. I thought the whole point of delta indexing was that you don't need to regularly rebuild the index. It's updated instantly whenever the data changes.
Because rebuilding the index every hour or every anything would be totally messed up, right?
If I understand correctly, that means when a user adds or edits something, the change is not reflected in the index. So if they add a record it won't come up in searches until the entire index is rebuilt. Or if they delete a record, it will come up in searches, and then cause some kind of error or frustrating behavior.
Moreover, while rebuilding the index Sphinx is shut down. ...
You don't need to rebuild your indexes - just reindex them. Which means - there's no need to stop the daemon. Rebuilding is only needed after changing the structure of the index - and that is not the case here.
And for the second part - again, you don't rebuild the index, ergo stopping the daemon isn't necessary. When using delta indexing there are actually two indexes that are used for searching - the main index (which should be reindexed once in a while) and the delta index (which is refreshed after each relevant operation on the record). If I understand it correctly, when reindexing the main index (e.g. via a cron task), the delta index is simply merged into the main index, so it won't take up much space and will stay fast.
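The main-plus-delta arrangement described above can be pictured with a toy model. This Python sketch only illustrates the idea as this answer describes it; it is not how Sphinx or Thinking Sphinx are implemented, and every name in it is invented.

```python
class MainPlusDeltaIndex:
    """Toy model of a large main index plus a small delta index."""

    def __init__(self, documents):
        self.main = dict(documents)  # id -> text; refreshed only occasionally
        self.delta = {}              # id -> text; records changed since the last merge

    def record_changed(self, doc_id, text):
        """Called after each create or update: only the small delta is touched."""
        self.delta[doc_id] = text

    def search(self, word):
        """Search both indexes; a delta entry overrides the main entry for the same id."""
        combined = {**self.main, **self.delta}
        return sorted(
            doc_id for doc_id, text in combined.items() if word in text.split()
        )

    def merge(self):
        """The periodic job: fold the delta into the main index and empty it."""
        self.main.update(self.delta)
        self.delta.clear()
```

Searches stay available throughout: record_changed keeps new data visible immediately, and merge just keeps the delta from growing as large as the main index.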