Loading data for Machine learning

I have a dataset with more than 100,000 data points. Every time a certain condition is met, I create an ML model and plots for a subset of the data.
Is it better to load the data once before the for loop, or to load it every time inside the for loop?
In the first case the loop runs faster because I am not reloading the data on every iteration, but memory is allocated for the full dataset the entire time.
import pandas as pd

data = pd.read_csv("sample.csv")
data = data.drop(columns=['column2', 'column3'])  # drop() returns a new frame unless inplace=True
for i in range(0, 10):
    subset = data[data['column1'] == i]  # keep only the rows for this value of column1
    # fit the machine learning model and create the plots for this subset
In the second case I load the dataset on every iteration, but only the subset of the data remains in memory after I drop the columns and filter the rows.
for i in range(0, 10):
    data = pd.read_csv("sample.csv")
    data = data.drop(columns=['column2', 'column3'])
    subset = data[data['column1'] == i]
    # fit the machine learning model and create the plots for this subset
Which is a better approach?
I have tried both, but want to know which is correct.

I think that in the 1st approach you read the data once and then loop over it, filtering according to the condition.
In the 2nd approach, each iteration has to reload the file and drop the columns again, which will take a lot of time.
My suggestion is to go with the 1st approach, because the runtime is lower and it is the cleaner way to structure the code.
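If memory is the concern, a middle ground is to load the file once but read only the columns you actually need. A minimal sketch, assuming column1 plus two feature columns are all that the model uses (the feature names in usecols are illustrative only):

import pandas as pd

# Read the file once, keeping only the columns that are actually used.
data = pd.read_csv("sample.csv", usecols=['column1', 'feature_a', 'feature_b'])

for i in range(0, 10):
    subset = data[data['column1'] == i]
    # fit the model and create the plots for this subset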
Hope this helps.

Related

Dask Data Lake: is this the right approach?

So I am using Dask to store large amounts of data. We get about 50 million new rows of data a day, not many columns wide. I currently store the data with ddf.to_parquet(long_term_storage_directory). As new data arrives I append it to the long_term_storage_directory directory. Everything works, but it is slow.
The index being used is time. I was hoping that as I add data it would simply be added to the long list of parquet files in long_term_storage_directory (which is also indexed by the same time field). I am worried that the approach I am taking is flawed in some way. Maybe I need to use Spark or something else to store the data?
Note:
ddf_new_data is indexed with the same index used in ddf_long_term_storage_directory. I was hoping that, since the new data coming in has the same index as what is already in long_term_storage_directory, adding it to the long-term data store would be faster.
import dask.dataframe as dd

ddf_long_term_storage_directory = dd.read_parquet(path=long_term_storage_directory, engine='pyarrow')
ddf_new_data = dd.read_parquet(path=directory_to_add_to_long_term_storage, engine='pyarrow')
ddf_new_data = ddf_new_data.set_index(index_name, sorted=False, drop=True)
ddf = dd.concat([ddf_long_term_storage_directory, ddf_new_data], axis=0)
ddf = ddf.repartition(partition_size='200MB')  # ??? Do I need to do this every time I add new data?
ddf.to_parquet(long_term_storage_directory)
The simplest answer would be to not load the old data, concat, and repartition at all. That will indeed get slower as more data accumulates. Instead, just write the incoming data to a new, sequentially numbered file in the same directory.
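A minimal sketch of that idea, reusing the variable names from the question. It relies on the append=True option of Dask's to_parquet, which adds new files to an existing dataset instead of rewriting it (the schema and index must match what is already on disk):

import dask.dataframe as dd

# Read only the incoming batch; the long-term store itself is never loaded.
ddf_new_data = dd.read_parquet(path=directory_to_add_to_long_term_storage, engine='pyarrow')
ddf_new_data = ddf_new_data.set_index(index_name, sorted=False, drop=True)

# Append the new partitions as extra files in the existing directory instead of rewriting it.
# If the new index range overlaps what is already on disk, ignore_divisions=True may also be needed.
ddf_new_data.to_parquet(long_term_storage_directory, engine='pyarrow', append=True)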

Is it bad practise to save calculated data into a db as opposed to inputs for the calculation? (Rails)

Is it bad practise to save calculated data into a database record, as opposed to just inputs for the calculation?
Example:
If we're saving the results of language tests as a db record, and the test has 3 parts which need to be saved in separate columns: listening_score, speaking_score, writing_score.
Is it ok to have a fourth column called overall_score, equal to
(listening_score + speaking_score + writing_score) / 3?
Or should overall_score be recalculated each time current_user wants to look at historical results?
My thinking is that storing it would cause unnecessary duplication of data in the db, but it would make extracting the data simpler.
Is there a general rule for this?
It's not bad, but it's not good. There's no best practice here, because the answer is different in each situation. There are trade-offs to persisting the calculated attribute instead of calculating it as needed. The big factors in deciding whether to calculate on demand or persist are:
Complexity of calculation
Frequency of changes to dependent fields
Whether the calculated field will be used as a search criterion
Volume of calculated data
Usage of calculated fields (eg: operational/viewing one record at a time vs. big data style reporting)
Impact to other processes during calculation
Frequency that calculated fields will be viewed.
There are a lot of opinions on this matter, and each situation is different. You have to determine whether the overhead of persisting your attributes and maintaining their values is worth the extra effort compared to just calculating them as needed.
Using the factors above, my preference for persisting a calculated attribute increases as
Complexity of calculation goes up
Frequency of changes to dependent fields goes down
Use of the calculated field as a search criterion goes up
Calculated fields are used for complicated reporting
Frequency that calculated fields will be viewed goes up.
The factors I omitted from the second list are dependent on external factors, and are subject to even more variability.
Storing the calculated total could be thought of as caching. Caching calculations like this means you have to start dealing with keeping the calculation up to date and worrying about when it isn't. In the long run, that pattern can result in a lot of work. On the flip side, always calculating the total means you will always have a fresh calculation to work with.
I've seen folks store calculations to address performance issues, when calculating takes a long time due to its complexity or the complexity of the query it's based on. That's a good reason to start thinking about caching results like this.
I've also seen folks store this value to make queries easier. That's a lower return on investment, but can still be worth it if the columns used in your calculations aren't changing frequently.
My default is to calculate, and I want to see good justification for storing the value of the calculation in another column.
(It may also be worth noting that if you are using the same calculation multiple times in a particular function call, you can memoize the result to increase performance without storing the result in the database.)
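For illustration only, here is what that memoization idea looks like outside Rails, as a minimal Python sketch; the class and attribute names simply mirror the example columns above, and nothing is persisted:

from functools import cached_property

class TestResult:
    def __init__(self, listening_score, speaking_score, writing_score):
        self.listening_score = listening_score
        self.speaking_score = speaking_score
        self.writing_score = writing_score

    @cached_property
    def overall_score(self):
        # Computed the first time it is accessed, then reused for this object.
        return (self.listening_score + self.speaking_score + self.writing_score) / 3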

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say every minute). The data is sent to a process which stores it in an ETS table, and every so often a timer sends that process a message telling it to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in number of words. But there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms you can use it, and with a timestamp as you described, it is a way to go. For size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: Create a server responsible for
receiving all the data storage messages. These messages should be time-stamped by the client process (so it doesn't matter if they wait a little in the message queue). The server then stores them in the ETS table, configured as ordered_set and using the timestamp, converted to an integer, as the key (if the timestamps are delivered by erlang:now in one single VM they will be different; if you are using several nodes, you will need to add some information such as the node name to guarantee uniqueness).
receiving a tick (using for example timer:send_interval) and then processing the messages received in the last N µsec (using Key = current time - N), walking forward with ets:next(Table, Key) and continuing to the last message. Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use the next function (for example, if the keys are {TimeStamp:int(), Node:atom()} you can compare to {Time:int(), 0}, since a number is smaller than any atom).
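The two rules from the question (a hard size cap plus an age cutoff) can also be sketched in a few lines; this is a language-agnostic illustration in Python rather than ETS, with names that are illustrative only and keys assumed unique:

import time
from collections import OrderedDict

class BoundedBuffer:
    """Keeps at most max_size entries and drops anything older than max_age seconds."""

    def __init__(self, max_size, max_age):
        self.max_size = max_size
        self.max_age = max_age
        self.entries = OrderedDict()  # insertion order doubles as time order

    def insert(self, key, value):
        self.entries[key] = (time.monotonic(), value)
        while len(self.entries) > self.max_size:
            self.entries.popitem(last=False)   # size cap: drop the oldest entry

    def trim(self):
        cutoff = time.monotonic() - self.max_age
        while self.entries:
            oldest_key = next(iter(self.entries))
            if self.entries[oldest_key][0] >= cutoff:
                break
            del self.entries[oldest_key]       # age cutoff: drop expired entries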

How to display large list in the view without using database slicing?

I have a service that generates a large map through multiple iterations and calculations over multiple tables. My problem is that I cannot use a pagination offset to slice the data, because the data comes from multiple tables and different modifications happen to it along the way. To display this on the screen, I have to send the map with 10,000-20,000 records to the view, and that is problematic with a dataset this large.
At this time I have on-page pagination, but this is very slow and inefficient.
One thing I considered is dumping it into a table and querying that each time, but then I have to deal with concurrent users.
My question is: what is the best approach to display this list when I cannot use database slicing (offset, max)?
I am using
grails 1.0.3
datatables and jquery
Maybe SlickGrid is an option for you. One of their examples works with 50,000 rows and it seems to be fast.
Christian
I ended up writing the result of the map to a table and using database slicing on that table for pagination. It takes some time to save the data, but at least I don't have to worry about performance with the large dataset. I use a timestamp to differentiate between requests; each request's results are saved and retrieved with its own timestamp.

Best way to store time series in Rails

I have a table in my database that stores event totals, something like:
event1_count
event2_count
event3_count
I would like to transition from these simple aggregates to more of a time series, so the user can see on which days these events actually happened (like how Stack Overflow shows daily reputation gains).
Elsewhere in my system I already did this by creating a separate table with one record for each daily value - then, in order to collect a time series you end up with a huge database table and the need to query 10s or 100s of records. It works but I'm not convinced that it's the best way.
What is the best way of storing these individual events along with their dates so I can do a daily plot for any of my users?
When building tables like this, the real key is having effective indexes. Test your queries with the EXPLAIN statement or the equivalent in your database of choice.
If you want summary tables you can query, either build a view that represents the query or roll the daily results into a new table on a regular schedule. Often summary tables are the best way to go, as they are quick to query.
The best way to implement this is to use Redis. If you haven't worked with Redis before, I suggest you start. You will be surprised how fast this can get :). The way I would do such a thing is to use the Hash data structure Redis provides. Just assign every user their own Hash (making a unique key for every user, like "user:23:counters"). Inside this Hash you can store a daily timestamp such as "05/06/2011" as the field and increment its counter every time an event happens, or whatever else you want to do with it!
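For illustration, a minimal sketch of that pattern using the redis-py client from Python (the key layout and date format are taken from the answer above; the connection details are assumptions):

import redis

r = redis.Redis()  # assumes a local Redis instance on the default port

# Record one occurrence of an event for user 23 on a given day.
r.hincrby("user:23:counters", "05/06/2011", 1)

# Fetch that user's whole daily time series in one call.
daily_counts = r.hgetall("user:23:counters")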
A good start would be this thread: Time Series Starter. It has a simple, beginner-level solution. If you are ok with Rails models, this is a way it could work for what is called an "irregular" time series: an event here and there, not at a regular interval, like a sensor that sends data whenever your door is opened.
The other thing, and that is what I was looking for in this thread, is a regular time series db: values come at an interval, say 60/minute, i.e. 1 per second, for example a temperature sensor. This all boils down to datasets with "buckets", as you are suspecting: a time series table gets long, indexes degrade at some point, etc. Here is one "bucket" approach using Postgres arrays that would be a feasible idea.
It's not available as "plug and play", as far as I could find on the web.
