I'm wondering how I can handle this situation with an unsupervised, semi-supervised, or supervised approach in real time (for each record, online):
I have a year of normal data from a company. It is Big Data (I mention this because it may help in proposing a solution; I'm going to work on Spark). The data is labeled only as normal. But there are some problems:
1) The distribution of the normal class can vary. Feature values can change after the model is built because the environment is dynamic and sensors may be added or removed (the data comes from sensors, and adding a sensor affects the values of the other sensors).
2) I want to find a solution (build a model) that separates normal data from the abnormal data that will probably appear in the future. The abnormal data may fall into different types (classes) depending on its effect on the system.
Now, how can I build a model (find a solution) to handle this situation? I think that, given the huge amount of data I have, there should be a way to distinguish the normal data from the abnormal. I hope that's true!
Thank you in advance.
Related
I am building an iOS application that will randomly generate sentences (think Mad Libs) where the data used for generation is spread across multiple tables. This will be used to generate scenarios for training lifeguards. Each table contains an item name, the words that will be used when the item is selected, and different values that determine what can go together.
Using two of the 10 tables as an example, the application may pick a location of Deep Water. Then it needs to pick an activity appropriate for being in the water, such as Breath holding, but not Running.
I have been looking at Core Data for storage, but that seems to be aimed at data the user changes often, and my users would never change the stored data. I do want to be able to update the tables myself fairly easily. What would be the optimal solution here? The options I can think of are:
Some kind of SQL database, though again my tables aren't changing and aren't really relational.
2-D arrays written into the source code. Not pretty to work with or read, but my knowledge of regex makes converting from TSV to array fairly easy.
TSV files bundled with the project. Better organized in itself, but it will take some research to figure out how to access them (see the sketch after this list).
Some other method Apple has that I do not know about.
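If the TSV route sounds appealing, reading a tab-separated file bundled with the app takes only a few lines. Here is a minimal sketch in Swift, assuming a hypothetical file named activities.tsv whose first column is the item name and whose remaining columns are the associated values (the file name and column layout are made up for illustration):

    import Foundation

    // One row of a bundled TSV table: an item name plus its remaining values.
    // (The column layout here is hypothetical; adjust it to match your tables.)
    struct TableRow {
        let name: String
        let values: [String]
    }

    // Loads a TSV file from the app bundle and parses it into rows.
    // Returns an empty array if the file is missing or unreadable.
    func loadTable(named resource: String) -> [TableRow] {
        guard let url = Bundle.main.url(forResource: resource, withExtension: "tsv"),
              let contents = try? String(contentsOf: url, encoding: .utf8) else {
            return []
        }
        return contents
            .split(separator: "\n")          // one line per row; blank lines are skipped
            .map { line in
                line.split(separator: "\t", omittingEmptySubsequences: false).map(String.init)
            }
            .map { columns in
                TableRow(name: columns.first ?? "", values: Array(columns.dropFirst()))
            }
    }

    // Usage: "activities" is a placeholder for one of the 10 table files.
    let activities = loadTable(named: "activities")
    print("Loaded \(activities.count) activity rows")

Parsing once at launch and keeping the resulting arrays in memory should be plenty for tables of this size.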
I'm wondering about the performance differences between two different methods of data filtering. Here's what I'm working with:
A set of core data objects
A UISegmentedControl that represents a boolean filter (learned vs. not learned)
A UITableView that displays the filtered data set
As I see it, there are two possible approaches here:
Pull the entire core data set in viewDidLoad. Filter the array of data using a predicate when the segmented control value changes. Reload the tableview.
Initially pull the core data set with a predicate. When the segmented control value changes, re-pull the core data set with the updated predicate. Reload the tableview.
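For concreteness, the second approach might look roughly like the following sketch; the entity name Card, the boolean attribute learned, and the helper function are all made-up placeholders rather than the real names in my project:

    import CoreData

    // Rough sketch of approach 2: re-fetch with a predicate every time the
    // segmented control changes. "Card" and "learned" are hypothetical names.
    func fetchItems(learned: Bool, in context: NSManagedObjectContext) -> [NSManagedObject] {
        let request = NSFetchRequest<NSManagedObject>(entityName: "Card")
        request.predicate = NSPredicate(format: "learned == %@", NSNumber(value: learned))
        return (try? context.fetch(request)) ?? []
    }

    // In the segmented control's value-changed handler (wiring omitted):
    //   items = fetchItems(learned: segmentedControl.selectedSegmentIndex == 0,
    //                      in: managedObjectContext)
    //   tableView.reloadData()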
I know there are factors that influence the answer (how large the data set is, how often the segmented control will be used); I'm just wondering if there is an overall best practice between the two.
There are trade-offs between the two approaches, and the best choice depends on how important the differences are to you. No one approach is the best practice for every situation.
Loading everything up front in one array will likely have:
Slower startup time (because of doing a large fetch right off)
Higher memory use (since you're fetching everything rather than just a subset)
Faster switching between filter options (since you already have the data)
Doing a new fetch every time will likely have:
Faster startup time
Lower memory use (since you only ever have a subset of the total collection)
Slower switching between filter options, at least at first (Core Data's internal row cache will speed things up on subsequent switches).
How significant the factors are depends on your data and your app. If you have lots of data, then the memory use may be significant (fetching every instance of an entity type is an easy way to blow out your memory use). Speed concerns depend on what else your app is doing at the same time and, frankly, on whether either option is slow enough to cause a noticeable delay. If your data set is small, it probably doesn't make much difference which approach you use.
I don't expect there will be any user-noticeable speed difference.
Therefore, these are the best practices that I think are relevant here:
Avoid premature optimization.
You're constrained by memory more often than speed.
Design in advance.
From this I deduce three points of advice that apply to the current problem:
Go with the method that is easiest to maintain.
Don't pull more objects from Core Data than necessary.
Have some strategy about updating the data in the tableview.
To combine those points into a single piece of advice: it's best to use NSFetchedResultsController for displaying Core Data in tables, as it's specifically designed for this purpose:
Encapsulates the idea of "the chunk of data I'm currently displaying".
Saves you memory by not pulling things you don't need.
Helps with updating the data in the tableview.
You can play with examples of it by creating a new Core Data-based project in Xcode (4.4 or later). It's closer to the second of your approaches.
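A minimal sketch of how that might look in a table view controller, written in Swift purely for illustration (this answer predates Swift) and reusing the hypothetical Card/learned names from the question's sketch above; the name attribute used for sorting is also made up:

    import CoreData
    import UIKit

    class CardListViewController: UITableViewController, NSFetchedResultsControllerDelegate {
        // Assumed to be handed in from the app delegate or wherever you build your Core Data stack.
        var managedObjectContext: NSManagedObjectContext!

        lazy var fetchedResultsController: NSFetchedResultsController<NSManagedObject> = {
            let request = NSFetchRequest<NSManagedObject>(entityName: "Card")
            // A fetched results controller requires at least one sort descriptor.
            request.sortDescriptors = [NSSortDescriptor(key: "name", ascending: true)]
            // The predicate plays the role of the segmented-control filter.
            request.predicate = NSPredicate(format: "learned == %@", NSNumber(value: true))
            let controller = NSFetchedResultsController(fetchRequest: request,
                                                        managedObjectContext: managedObjectContext,
                                                        sectionNameKeyPath: nil,
                                                        cacheName: nil)
            controller.delegate = self    // get told when the underlying data changes
            return controller
        }()

        override func viewDidLoad() {
            super.viewDidLoad()
            do {
                try fetchedResultsController.performFetch()
            } catch {
                print("Fetch failed: \(error)")
            }
        }

        override func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
            return fetchedResultsController.sections?[section].numberOfObjects ?? 0
        }

        // cellForRowAt is omitted here; configure each cell from
        // fetchedResultsController.object(at: indexPath).
    }

Switching the filter then becomes a matter of swapping the fetch request's predicate, calling performFetch() again, and reloading the table, which keeps memory use close to your second approach.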
I've got an idea for a new web app which will involve the following:
1.) lots of raw inputs (text values) that will be stored in a db - some of which contribute as signals to a ranking algorithm
2.) data crunching & analysis - a series of scripts will be written which together form an algorithm that will take said raw inputs from 1.) and then store a series of ranking values for these inputs.
Events 1.) and 2.) are independent of each other. Event 2 will probably happen once or twice a day. Event 1 will happen on an ongoing basis.
I initially dabbled with the idea of writing the whole thing in node.js on top of MongoDB, as I was curious to try out something new. While I think node.js would be perfect for event 1), I don't think it will work well for event 2) outlined above.
I'd also rather keep everything in one domain rather than mixing node.js with something else for step 2.
Does anyone have any recommendations for what stacks work well for computational type web apps?
Should I stick with PHP or Rails/Mysql (which I already have good experience with)?
Is MongoDB/nosql constrained when it comes to computational analysis?
Thanks for your advice,
Ed
There is no reason why node.js wouldn't work.
You would just write two node applications: one that takes input, stores it in the database, and renders output, and another that crunches numbers in its own process and is run once or twice per day.
Of course, if you're doing real number crunching and you need performance, you wouldn't do number 2 in Node/Ruby/PHP. You would do it in Fortran (or maybe C).
I have a spreadsheet, approximately 1500 rows x 1500 columns. The labels along the top and side are the same, and the data in the cells is a quantified similarity score for the two inputs. I'd like to make a Rails app allowing a user to enter the row and column values and retrieve the similarity score. The similarity scores were derived empirically, and can't be mathematically produced by the controller.
Some considerations: with every cell full, over half of the data is redundant; e.g., (row 34, column 985) holds the same value as (row 985, column 34). And row x will always be perfectly similar to column x. The data is static, and won't change for years.
Can this be done with one db table? Is there a better way? Can I skip the relational db entirely and somehow query the file directly?
All assistance and advice is much appreciated!
A database is always a safe place to store it, and a relational database is straightforward and a good idea. However, there are alternatives to consider. How often will this data be accessed, rarely or very frequently? If it's accessed only rarely, just put it in the database and let your code take care of searching and presenting; you can optimize it with database indexes, etc.
A flat file is a good idea, but reading and searching it at run time for every request is going to be too slow.
You could read all the data (from the DB or file) at server startup, keep it in memory, and ensure that your servers don't restart too often. It means each of your servers will sit with the entire grid in memory, but lookups will be really fast. If you use REE and calibrate the garbage collection settings, you can also minimize the server's startup time to a large extent.
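To make the in-memory idea concrete, and to show how the symmetry lets you keep only about half the grid, here is a minimal sketch. It's written in Swift only to match the other examples in this document (the question is about Rails), and every name and value in it is made up; the same approach translates directly to a Ruby hash keyed on the sorted pair of labels:

    import Foundation

    // In-memory store for a symmetric similarity grid. Only one copy of each
    // (a, b) pair is kept; lookups canonicalize the key, so score("x", "y")
    // and score("y", "x") hit the same entry.
    struct SimilarityGrid {
        private var scores: [String: Double] = [:]

        private func key(_ a: String, _ b: String) -> String {
            // Canonical ordering drops the redundant half of the grid.
            return a <= b ? "\(a)\t\(b)" : "\(b)\t\(a)"
        }

        mutating func insert(_ a: String, _ b: String, score: Double) {
            scores[key(a, b)] = score
        }

        func score(_ a: String, _ b: String) -> Double? {
            // A label is always perfectly similar to itself
            // (assuming the similarity scale tops out at 1.0).
            if a == b { return 1.0 }
            return scores[key(a, b)]
        }
    }

    // Load once at startup (e.g. from the spreadsheet exported as TSV),
    // then answer every request from memory.
    var grid = SimilarityGrid()
    grid.insert("label34", "label985", score: 0.82)   // hypothetical labels and value
    if let s = grid.score("label985", "label34") {
        print(s)   // 0.82, via the same entry as ("label34", "label985")
    }

In Ruby the equivalent would be a Hash keyed on [a, b].sort, populated once when the process boots.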
Here's my final suggestion: just build your app in the simplest way you know. Once you know how often and how heavily your app is going to be used, start optimizing. You are fundamentally working with about 1,125,000 cells (roughly half of the 1500 x 1500 grid, once the redundant half is dropped). That is not an unreasonably large dataset for a database to handle. And since your dataset will not change, you can get a long way with conventional caching techniques.
I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored procedure that returns order history is painfully slow due to the amount of data and the numerous joins that must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized, and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut. Perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers, perhaps). I've heard of Lucene, but it seems geared more toward text search and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, then identify your facts and the dimensions associated with those facts. Then you divide the dimensions into tables so that the dimension tables grow only slowly over time. The choice of dimensions is entirely practical and based on how the data behaves.
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to repopulate a reporting database from scratch several times a day (no history, just repopulating from the 3NF source into a different model of the same data). There are also real-time data warehousing techniques that apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for, combined with fast access to aggregated data. Since you didn't mention any specifics about your platform (server specs, number of rows, number of hits per second, etc.), I can't really help much more than that.
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema just enough to serve your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.