Write native SQL in Core Data - iOS

I need to write a native SQL query while I'm using Core Data in my project. I really need to do that, since I'm using NSPredicate right now and it's not efficient enough (in just one single case). I just need to write a couple of subqueries and joins to fetch a large number of rows and sort them by a specific field. In particular, I need to sort them by the sum of values of their child entities. Right now I'm fetching everything using NSPredicate and then sorting my result (array) manually, but this just takes too long since there are many thousands of results.
Please correct me if I'm wrong, but I'm pretty sure this can't be a huge challenge, since there's a way of using SQLite in iOS applications.
It would be awesome if someone could point me in the right direction.
Thanks in advance.
EDIT:
Let me explain what I'm doing.
Here's my Core Data model:
And here's how my result looks on the iPad:
I'm showing a table with one row per customer, where every customer has an amount of sales he made from January to June 2012 (Last) AND 2013 (Curr). Next to the Curr there's the variance between those two values. The same thing for gross margin and coverage ratio.
Every customer is saved in the Kunde table and every Kunde has a couple of PbsRows. PbsRow actually holds the sum of sales amounts per month.
So what I'm doing in order to show these results, is to fetch all the PbsRows between January and June 2013 and then do this:
self.kunden = [NSMutableOrderedSet orderedSetWithArray:[pbsRows valueForKeyPath:@"kunde"]];
Now I have all customers (Kunde) which have records between January and June 2013.
Then I'm using a for loop to calculate the sum for each single customer.
The idea is to get the amounts of sales of the current year and compare them to the last year.
The bad thing is that there are a lot of customers and the for-loop just takes very long :-(

This is a bit of a hack, but... The SQLite library is capable of opening more than one database file at a given time. It would be quite feasible to open the Core Data DB file (read-only usage) directly with SQLite and open a second file in conjunction with this (reporting/temporary tables). One could then execute direct SQL queries on the data in the Core Data DB and persist the results into the second file (if persistence is needed).
I have done this sort of thing a few times. There are features available in the SQLite library (example: full-text search engine) that are not exposed through Core Data.
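For illustration, here's a rough Objective-C sketch of that approach using the raw sqlite3 C API. Everything model-specific is an assumption: storeURL is wherever your NSPersistentStore's .sqlite file lives, and ZPBSROW/ZKUNDE/ZAMOUNT are placeholders for whatever Core Data actually generated (inspect the file with the sqlite3 command-line tool to find the real names).

#import <sqlite3.h>

// Open the Core Data store file read-only and run an aggregate query on it.
// Table/column names below are hypothetical; Core Data prefixes its own with "Z".
sqlite3 *db = NULL;
const char *path = [[storeURL path] fileSystemRepresentation];
if (sqlite3_open_v2(path, &db, SQLITE_OPEN_READONLY, NULL) == SQLITE_OK) {
    // A second reporting/temp database could be attached here, e.g.
    // "ATTACH DATABASE '/path/to/report.sqlite' AS report;"
    const char *sql = "SELECT ZKUNDE, SUM(ZAMOUNT) AS total "
                      "FROM ZPBSROW GROUP BY ZKUNDE ORDER BY total DESC";
    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) == SQLITE_OK) {
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            long long kundePK = sqlite3_column_int64(stmt, 0);
            double total      = sqlite3_column_double(stmt, 1);
            NSLog(@"customer pk %lld -> total %f", kundePK, total);
        }
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
}

Keep in mind this bypasses Core Data entirely, so treat the file strictly as read-only and expect the schema to be an implementation detail that may change between OS releases.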

If you want to use Core Data, there is no supported way to run a SQL query. You can fetch specific values and use [NSExpression expressionForFunction:arguments:] with a sum: function.
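For the sum-of-child-rows case from the question, that supported route looks roughly like the sketch below. It's a hedged example, not tested against the actual model: the attribute names "amount" and "date", the relationship "kunde", and the date range variables are assumptions.

// Let the SQLite store compute one sum per customer instead of looping in memory.
NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"PbsRow"];
request.resultType = NSDictionaryResultType;
request.predicate  = [NSPredicate predicateWithFormat:@"date >= %@ AND date <= %@",
                                                       startDate, endDate];

NSExpression *amount = [NSExpression expressionForKeyPath:@"amount"];
NSExpressionDescription *sumDesc = [[NSExpressionDescription alloc] init];
sumDesc.name = @"totalAmount";
sumDesc.expression = [NSExpression expressionForFunction:@"sum:" arguments:@[amount]];
sumDesc.expressionResultType = NSDoubleAttributeType;

// Group by the to-one relationship so we get one row per Kunde
// (if grouping by a relationship doesn't suit your model, group by an attribute instead).
NSEntityDescription *entity = [NSEntityDescription entityForName:@"PbsRow"
                                           inManagedObjectContext:context];
NSPropertyDescription *kundeProperty = entity.propertiesByName[@"kunde"];
request.propertiesToFetch   = @[kundeProperty, sumDesc];
request.propertiesToGroupBy = @[kundeProperty];

NSError *error = nil;
NSArray *rows = [context executeFetchRequest:request error:&error];
// Each row is a dictionary: @{ @"kunde" : <NSManagedObjectID>, @"totalAmount" : @(...) }.
// Sorting one row per customer in memory is cheap compared to summing thousands of rows:
NSArray *sorted = [rows sortedArrayUsingDescriptors:
    @[[NSSortDescriptor sortDescriptorWithKey:@"totalAmount" ascending:NO]]];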
To see what SQL commands Core Data executes, add -com.apple.CoreData.SQLDebug 1 to "Arguments Passed on Launch". Note that this should not tempt you to use the SQL commands yourself; it's just for debugging purposes.

Short answer: you can't do this.
Long answer: Core Data is not a database per se - it's not guaranteed to have anything relational backing it, let alone a specific version of SQLite that you can query against. Furthermore, going mucking around in Core Data's persistent store files is a recipe for disaster, especially if Apple decides to change the format of that file in some way. You should instead try to find better ways to optimize your usage of NSPredicate or start caching the values you care about yourself.
Have you considered using the KVC collection operators? For example, if you have an entity Foo each with a bunch of children Bar, and those Bars have a Baz integer value, I think you can get the sum of those for each Foo by doing something like:
foo.bars.@sum.baz
Not sure if these are applicable to predicates, but it's worth looking into.
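As a quick illustration in terms of the question's model (the relationship name pbsRows and the attribute name amount are assumptions), the operator version would be:

// Evaluated in memory by KVC, so every related PbsRow still gets faulted in;
// fine for one object, expensive across thousands of customers.
NSNumber *total = [kunde valueForKeyPath:@"pbsRows.@sum.amount"];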

Related

Find changes quickly in a larger SQL database?

There is a Java Swing application which uses an Informix database. I have user rights granted for the Swing application (i.e. no source code), and read-only access to a mirror of the database.
Sometimes I need to find the database column which is backing a GUI element (TextBox, TableField, Label...). What would be the best approach to find out which database column and table hold the data shown, e.g., in a TextBox?
My general approach is to capture the state of the database. Commit a change using the GUI and then capture the state of the database again. Then I need to examine the difference. I've already tried:
Use the nrows field of systables: didn't work, because the number in nrows does not seem to be a real-time representation of the row count.
Create a script with SELECT COUNT(*) ... for all tables: didn't work because there are too many tables (> 5000). I also tried to optimize by removing empty tables, but there are still too many left.
Is there a simple solution that I'm missing?
Please look at the Change Data Capture API and check if it suits your needs.
There probably isn't a simple solution.
You probably need to build yourself a map of the database, or a data dictionary for it. It sounds as though you can eliminate many of the tables from consideration since they're empty — at least for a preliminary pass. If you're dealing with information in a text box, the chances are it is some sort of character data; you can analyze which (non-empty) tables which contain longer character strings, and they'd be the primary targets of your searches. If the schema is badly designed with lots of VARCHAR(255) columns even though the columns normally only hold short strings, life is more difficult. Over time, you can begin to classify tables and columns so that you end up knowing where to look for parts of the application.
One problem to beware of: the tabid in informix.systables isn't necessarily as stable as you'd like. Your data dictionary needs to record its own dd_tabid for the table it describes, and can store the last known tabid from informix.systables, but it needs to be ready to find a new tabid value on occasion. You should probably only mark data in your dictionary for logical deletion.
To some extent, this assumes you can create a database in which to record this information. If you can't create an Informix database, you may have to use something else (MySQL, or SQLite, perhaps) to store the data dictionary. Alternatively, go to your DBA team and ask them for the information. Unless you're trying something self-evidently untoward, they're likely to help (but politics can get in the way — I've no idea how collegial your teams are).

How to improve the performance of a big table join?

Please help me out with this big data problem.
I have a very large table (500 GB) that stores cookie information collected from one website, and I try to provide a service to many other clients. Each client has their own cookies, so in the end I need to query across 500 GB + 300 GB (client_data).
Since some queries use both my cookie data and the client's cookie data, I may need to do a join between my table and their table, and then the performance is bad. To solve this problem, I put the entire 800 GB of data into one giant table. Since there is no table join, the performance is good. But when I expand my service to multiple clients, it takes too much storage.
Currently I am using Vertica as my data source, and I use bitmaps to store my information.
Any suggestion that can maintain my current performance but also support something like 40 clients? My storage is about 12 TB and each client in the current solution takes 1.5 TB.
What I want is either a replacement for Vertica which can support bitmap operations and quick table joins, or a better way to represent my data.
My storage is about 12 TB and each client in the current solution takes 1.5 TB.
If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.
You have a few ways forward, depending on the specifics of your case. Points 1 and 3 are the easiest to deal with:
1. You can properly set projections, to make sure that both tables are identically segmented: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm
2. You can set up pre-join projections, where the join cost is paid during data load, not during data retrieval; see https://my.vertica.com/docs/6.1.x/HTML/index.htm#1299.htm
3. Make sure that your data types are the best possible. Matching on ints is faster than matching on strings, and matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are well set, Vertica can actually apply filters before decompression, which speeds up your query considerably and uses a lot less memory.

Riak MapReduce: Group items by field + sum another field

Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why running over an entire bucket is slow if you only have one bucket in the entire system; either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product generate? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused about what use MapReduce is if you already have to know all the keys (you have to have searched for them somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct, MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case.
Use more than one bucket. Instead of just a Sales bucket you could break them down by product, region, or month so the data is already split by one of your common reporting criteria.
Consider adding a secondary index to each document for each field. Your month query could then be a range query on the created_at index.
If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
You may also consider breaking each document into a series of keys. Instead of just storing an id key with a JSON document for a value, store a key for each field, like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from disk and held in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their Solr-like search, but there are probably ways to restructure your data so that it can handle it. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.

Disappointing iOS search times with CoreData

I have a Core Data database running on an iPad with iOS 5.1.1. My database has around 50,000 companies in it. I have created an index on the company name attribute. I am searching on this attribute and on occasion get several thousand records returned from a fetch request.
When several thousand records are returned, it can take a couple of seconds for the fetch to come back. This makes type-ahead searching pretty clunky.
I anticipate having much larger databases in the future. What are my options for implementing a really fast search function?
Thanks.
I recommend watching the Core Data performance videos from the last few WWDCs. They often talk about strategies for improving this kind of bottleneck. Some suggestions from the videos:
De-normalise the name field into a separate "case and diacritic insensitive" searchString field and search on that field instead, using <, <= or BEGINSWITH (see the sketch after this list). Avoid MATCHES and wildcards.
Limit the number of results returned by the NSFetchRequest using fetchLimit and fetchBatchSize
If your company object is large you can extract some of the key data items off onto a separate smaller "header" object that is used just for the search interface. Then add a relationship back to the main object when the user makes a selection.
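A minimal sketch of the first two suggestions combined, assuming the entity is called Company and has been given a pre-normalized, indexed searchString attribute (lowercased, diacritics stripped at save time); typedText and context are whatever your search UI provides:

// Fold the user's input the same way searchString was folded when it was saved.
NSString *needle = [typedText stringByFoldingWithOptions:
                        (NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch)
                                                  locale:[NSLocale currentLocale]];

NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Company"];
// BEGINSWITH on a plain, indexed attribute lets SQLite use the index;
// CONTAINS/MATCHES with leading wildcards cannot.
request.predicate = [NSPredicate predicateWithFormat:@"searchString BEGINSWITH %@", needle];
request.sortDescriptors = @[[NSSortDescriptor sortDescriptorWithKey:@"searchString"
                                                          ascending:YES]];
request.fetchLimit = 50;      // no more rows than the type-ahead UI can show
request.fetchBatchSize = 20;  // fault rows in lazily as the table scrolls

NSError *error = nil;
NSArray *matches = [context executeFetchRequest:request error:&error];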
Some pointers to a couple of videos (there are more from other years also):
WWDC 2012: Session 214 - Core Data Best Practices : 45:00
WWDC 2010: Session 137 - Optimizing Core Data Performance on iPhone OS: 34:00
While Core Data is the correct tool in many cases, it's not a silver bullet by any means.
Check out this post, which discusses a few optimization strategies and also cases where an SQL database is a better choice than Core Data:
http://inessential.com/2010/02/26/on_switching_away_from_core_data
You may honestly be better off using an SQL database instead of Core Data in this case, because when you access attribute values on Core Data entities, it typically fires a fault and pulls the object into active memory... this can definitely have performance and speed costs. With a true database - Core Data is not a true database, see http://cocoawithlove.com/2010/02/differences-between-core-data-and.html - you can query the data without creating objects from it.

Is a full list returned first and then filtered when using LINQ to SQL to filter data from a database, or just the filtered list?

This is probably a very simple question that I am working through in an MVC project. Here's an example of what I am talking about.
I have an rdml file linked to a database with a table called Users that has 500,000 rows. But I only want to find the Users who were entered on 5/7/2010. So let's say I do this in my UserRepository:
from u in db.GetUsers() where u.CreatedDate == new DateTime(2010, 5, 7) select u
(doing this from memory so don't kill me if my syntax is a little off, it's the concept I am looking for)
Does this statement first return all 500,000 rows and then filter it or does it only bring back the filtered list?
It filters in the database, since you're building your expression on top of an ITable, which returns an IQueryable<T> data source.
Linq to SQL translates your query into SQL before sending it to the database, so only the filtered list is returned.
When the query is executed it will create SQL to return the filtered set only.
One thing to be aware of is that if you do nothing with the results of that query nothing will be queried at all.
The query will be deferred until you enumerate the result set.
These folks are right and one recommendation I would have is to monitor the queries that LinqToSql is creating. LinqToSql is a great tool but it's not perfect. I've noticed a number of little inefficiencies by monitoring the queries that it creates and tweaking it a bit where needed.
The DataContext has a "Log" property that you can work with to view the queries created. I created a simple HttpModule that outputs the DataContext's Log (formatted for sweetness) to my output window. That way I can see the SQL it used and adjust if need be. It's been worth its weight in gold.
Side note - I don't mean to be negative about the SQL that LinqToSql creates as it's very good and efficient almost every time. Another good side effect of monitoring the queries is you can show your friends that are die-hard ADO.NET - Stored Proc people how efficient LinqToSql really is.
