Do I have to use UNION instead of JOIN?

An article about optimizing SQL queries suggests using UNION instead of OR, because:
Utilize Union instead of OR
Indexes lose their speed advantage when used in OR situations, in MySQL at least. Hence, this will not be useful even though indexes are being applied:
SELECT * FROM TABLE WHERE COLUMN_A = 'value' OR COLUMN_B = 'value'
On the other hand, using Union such as this will utilize indexes:
SELECT * FROM TABLE WHERE COLUMN_A = 'value'
UNION
SELECT * FROM TABLE WHERE COLUMN_B = 'value'
How true is this suggestion? Should I turn my OR queries into UNION queries?

I do not recommend this as a general rule. I don't have MySQL installed here so can't check the generated execution plan, but certainly in Oracle the plans are much different and the 'OR' plan is more efficient than the 'UNION' plan. (Basically, the 'UNION' has to perform two SELECT's and then merge the results - the 'OR' only has to do a single SELECT). Even a 'UNION ALL' plan has a higher cost than the 'OR' plan. Stick with the 'OR'.
I very strongly recommend writing the clearest, simplest code you possibly can. To me, that means you should use the 'OR' rather than the 'UNION'. I've found over the years that attempting to "pre-optimize" something, or in other words to guess where I'll encounter performance problems and then to try and code around those problems, is a waste of time. Code tends to spend a lot of time in the darndest places, and you'll only find them once you've got your code running. A long time ago I learned three rules that I've found to be useful in this context:
Make it run.
Make it run right.
Make it run right fast.
Optimization is literally the last thing you should be doing.
Share and enjoy.
Followup: I hadn't noticed that the 'OR' was looking at different columns (my bad), but my advice regarding "keep it simple" still holds.

It helps to think of indexes like names in a phone book. A phone book, you could say, has a naturally ordered index by name, meaning that if you want to find everyone named John Smith, it takes little to no time. You simply open the phone book to the S section and look up Smith.
Now what if I told you to look for entries in the phone book with name John Smith or phone number 863-2253. Not as quick to do, eh? To provide a precise answer, you'd need a phone book to look up John Smith and another one sorted by phone numbers in order to find a name by his or her phone number.
Perhaps a more sophisticated engine could see the need for this separation and do it automatically, but apparently MySQL does not. So while it might seem a hassle to have to do it this way, I assure you the difference in tables with high record counts is noticeable.
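To see the two query shapes side by side, here is a minimal sketch in Python using the standard-library sqlite3 module. It assumes SQLite rather than MySQL, so the printed plans only illustrate how to inspect what an engine does; the `phone_book` table, its columns, and the sample rows are made up for the example.

```python
import sqlite3

# Hypothetical phone-book table with one index per searchable column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phone_book (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE INDEX idx_name ON phone_book(name);
    CREATE INDEX idx_phone ON phone_book(phone);
    INSERT INTO phone_book (name, phone) VALUES
        ('John Smith', '555-0100'),
        ('Jane Doe',   '863-2253'),
        ('John Smith', '863-2253'),
        ('Ann Lee',    '555-0199');
""")

or_query = "SELECT id, name, phone FROM phone_book WHERE name = ? OR phone = ?"
union_query = """
    SELECT id, name, phone FROM phone_book WHERE name = ?
    UNION
    SELECT id, name, phone FROM phone_book WHERE phone = ?
"""
params = ("John Smith", "863-2253")

# Both forms should return the same set of rows.
or_rows = sorted(conn.execute(or_query, params).fetchall())
union_rows = sorted(conn.execute(union_query, params).fetchall())

# Print whatever plan this engine chooses for each form.
for label, sql in (("OR", or_query), ("UNION", union_query)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql, params):
        print("   ", step)
```

Note that the row matching both conditions appears once in either result only because the select list includes the unique `id`. UNION removes duplicate rows, so without a unique column it can collapse rows that the OR query would keep. Measuring both forms on your own data and engine is the only reliable way to decide between them.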

Related

Fuzzy neo4j relationships

I want to do something in neo4j that I hope will work ok: I want to make "fuzzy" path matches; the links will sometimes count as a relationship, and sometimes not, depending on the query.
Here's an example: let's say I have a (p:Person)-[:HAS]->(n:Name). A search has found a Person (say, by phone number). I want to go from this Person to other Persons with similar names, to get their phone numbers. Also, I want the similarity to be adjustable, so the user might ask to match very similar names, or not very similar names.
I could get the first person's name, and then do a search against other names with some lucene patterns - this is easy enough, but it means doing a full lucene search on the Name values, which in my use case is not ideal as I think it might be a bit slow (there are very many names - let's say a billion, remembering this is just an example). I hope there is a better way.
One approach I can imagine is having a "similarity" relationship between Names. Whenever a new Name node is added, we check for similar names and link them (creating these relationships would be slow, but we could push it onto a batch process, and it's ok if it takes some minutes). We would only link names that were fairly similar (so the number of links would hopefully not get too large). I suppose we could then craft a query on this, matching similarities greater than my threshold. Something like this:
MATCH (p1:Person {phone:"555-234234"})-->(n1:Name)-[s:SIMILAR]->(n2:Name)-->(p2:Person)
WHERE s.matchLevel >=2
RETURN p2.phone;
Is this approach better or worse than just doing the lucene search? Has anyone else wanted to do something like this?
Also, based on the suggestion at http://graphaware.com/neo4j/2013/10/24/neo4j-qualifying-relationships.html, I believe I'll be better off having many relationships (SIMILAR_1, SIMILAR_2 ..) instead of using a "match level" attribute on my relationship.
BTW, I know there are many similar questions to this (eg. Neo4j 2 Cypher fuzzy search), but afaik this exact question isn't on stackoverflow (and I have looked).

Should the "count" measure be stored in the fact table?

I have a fact table that includes "wait times in hours" for certain services. I have a lot of dimensions that could describe the wait-times based on different slices; however, I am also interested in knowing how many people (counts) came for services through the filters of the same dimensions.
Given the dimensions for both the wait-times in hours and the number of people who got services are exactly the same, I think it's best practice to keep it in the same fact table. My question is:
Should there be a different fact table for the count measure mentioned?
How would I include this measure? Do I just put 1 in every single row? Because regardless of the wait-time, they've gotten the service only once (you cannot go above/below 1 in my scenario).
1) Think about the grain of your existing fact table. It sounds like it's probably "an occasion on which a person received a service." If that's the same thing you're trying to count, then yes - the waiting time and the count are the same grain.
However, while they may well be the same grain, there might be no need to add anything to the table. Read point 2 for an explanation.
2) You could put a 1 in a column on every row, but I'm not sure what you'd gain from it. You've not said what tools will be consuming this data, but you should be able to do a count/distinct count of some kind.
Working on the basis that you've tagged SSIS so are likely using Microsoft's BI stack:
TSQL has count(), and you can do count(distinct [column]).
SSAS has both counts and distinct counts as aggregation types.
MDX offers several different types of count.
SSRS has Count, CountDistinct, and CountRows.
Whether you do a normal count or a distinct count will depend on whether you're trying to ask "How many people used this service?" or "How many different people used this service?"
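As a rough illustration of why a stored 1 is not needed, the sketch below runs both kinds of count against a tiny fact table. It uses Python's sqlite3 module instead of the Microsoft stack, and the table name, columns, and rows are assumptions invented for the example.

```python
import sqlite3

# Hypothetical fact table: one row per occasion a person received a service.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_service_visit (
        person_id  INTEGER,
        service    TEXT,
        wait_hours REAL
    );
    INSERT INTO fact_service_visit VALUES
        (1, 'clinic', 2.0),
        (1, 'clinic', 0.5),
        (2, 'clinic', 1.5),
        (3, 'lab',    3.0);
""")

# COUNT(*) answers "how many visits?", COUNT(DISTINCT ...) answers
# "how many different people?". Neither needs a column of 1s.
rows = conn.execute("""
    SELECT service,
           COUNT(*)                  AS visits,
           COUNT(DISTINCT person_id) AS people,
           SUM(wait_hours)           AS total_wait_hours
    FROM fact_service_visit
    GROUP BY service
    ORDER BY service
""").fetchall()

for row in rows:
    print(row)
```

Here person 1 visited the clinic twice, so the clinic shows three visits but only two people, while the wait-time measure is aggregated from the same rows.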

Create Unique Relationship is taking much amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have nearly 1,80,000 name nodes. I ran the above query more than 100 times, changing the target each time, to create the unique relationships, and it is taking far too much time. How can I resolve this?
I build the query in Java and run it in a loop. I am using Neo4j 2.0.0.5 and Java 1.7.
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT (names)-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH (names:Name) will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.

SSIS Merge Join component wrote 0 rows

First of all, thanks to the community for the amount of information on the site; it has helped me a lot with C# and SSIS. Second, I'm not very good with English, so please be patient. If you don't understand something, please ask and I'll try to make it clearer.
I have two OLE DB sources from different databases, and both tables have an ID column that I use as the join key. In RUT CRUZADOS the ID is a float, while in the other source (CTACTE AÑO PAS) I don't know the data type (I can't open the database with SQL Server; I can only run SELECT operations).
When I combine them in the Merge Join, it doesn't report any error, but when I run the package, this happens:
[SSIS.Pipeline] Information: "component "CARGOS ABONOS" (239)" wrote 0 rows.
In Microsoft Access, the inner join returns about 4 million rows. I think the problem is the metadata, but I don't know how to use the Data Conversion component. Can someone help me, please?
Thank you all
You can view the data types, at least as far as SSIS is concerned by double clicking on connector lines. In the Data Flow Path Editor that pops up, the Metadata tab will describe the column types.
That said, it doesn't matter because the Merge Join transformation is only going to allow you to merge data of the same type.
A Merge Join requires the source system data to be sorted. This is accomplished by either adding sort components into the stream (not recommended as this is an asynchronous transform that eats all your memory and kills your performance) or by explicitly sorting in your source systems and then marking them as sorted in the Advanced tab.
Since I don't see a Sort, that leads me to believe the sort is done in the source systems. Or, the sorts are not done there but someone has marked the output as sorted. There must be explicit ORDER BY clauses in those source queries. Sometimes, SQL Server will return data in the same order but unless there is an ORDER BY, it cannot be guaranteed. (I wish I could use the flash tag to emphasize the last point).
Future readers, if you have a sort in both systems and they are both sorted on the same column, then you need to examine collations. Case Insensitive is a different beast than Case Sensitive, and a sort on an ASCII-based system yields a different order than one using EBCDIC for mixed alphanumeric data, like I once had...
As the source data type appears to be floats, then sorting is not the likely culprit. The realization is dawning on me, instead of sort issues, you likely have an uglier and more insidious comparison issue. Floating point numbers are approximations. 1=1 but 1.00000000000(etc) may or may not be equal to 1.0000000000(etc)1
Do you actually need the decimal places to make the match? If not, casting to an integer in both (and sorting on the CAST'ed value) systems should make these matches work. If there are decimal places that matter, then you're going to need to cast that into an exact numeric type (and pray that they both convert in the same way). The fact that Access does it leads me to believe Integer data type will be your salvation.
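The comparison problem is easy to reproduce outside SSIS. The sketch below is plain Python with made-up key values, not anything taken from the poster's data; it shows two floats that print almost identically but do not compare equal, and two ways of normalizing them before a join. The helper name `to_exact` and the two-decimal precision are assumptions for the example.

```python
from decimal import Decimal

# Two "keys" that should both be 1 but arrived through different arithmetic.
left_key = sum([0.1] * 10)   # accumulates rounding error
right_key = 1.0

keys_equal_as_floats = (left_key == right_key)

# Truncating casts can still disagree when one value sits just below the
# integer, so rounding to the nearest integer first is the safer conversion.
truncated_match = (int(left_key) == int(right_key))
rounded_match = (round(left_key) == round(right_key))

def to_exact(value, places="0.01"):
    """Convert a float to an exact decimal with a fixed number of places."""
    return Decimal(repr(value)).quantize(Decimal(places))

# If decimal places matter, compare exact decimals instead of floats.
exact_match = (to_exact(left_key) == to_exact(right_key))

print(keys_equal_as_floats, truncated_match, rounded_match, exact_match)
```

Whichever conversion you pick, apply the same one on both sides of the join and sort on the converted value, since the Merge Join depends on both inputs being ordered by the key it actually compares.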

How to get a search ranking based on multiple factors in sphinx?

Hello stackoverflow folks,
We got a Rails project which is growing and growing and we now get first performance problems on the search, because we don't know how to utilize sphinx properly for our needs.
We have search queries like "Java PHP Software developer". Our problem is now the ranking should work with multiple things.
As search fields we have tag list, description and title.
If one of the terms is inside one of the fields, it should get, for example, 2 points. It should get more points if the term is in more fields, but not multiple points if it is in the same field more than once.
The next problem is that I have a big file of synonyms which should also be checked. It looks like this:
Java > Java
Java-EE > Java
...
So if Java-EE is found it should get some points too but with a penalty for being a synonym.
Maximum amount of points would be 5 as in 5 stars which get displayed.
Any speedy solution would be nice, because at the moment it's done in plain ruby and it is getting slow, since we can't rank properly in sphinx.
If there is a solution with another search engine that would also be very nice, as it could be changed.
Thanks in advance for all efforts. All spelling corrections and questions to clear the question are welcome.
Most of the performance issues can be solved by changing the way you use sphinx. First you need to address how you index the data in sphinx. Doing some processing during indexing will make the search quicker and the results more relevant. Second, tackle the search terms and last but not least, decide on the ranking algorithm to use.
I am going to use the "title" field as an example, but the logic can be replicated for all fields.
Indexing
Add two fields to sphinx ("title" and "title_synonyms"). For each record in the database do the following :-
Perform a DISTINCT on the words to remove duplicates ("Ruby Developer / Java Developer" will become "Ruby Developer / Java"). This will stop records from getting two scores for duplicates when searching. This goes into "title"
Take the DISTINCT title from above and REPLACE all the words with their expanded synonym equivalents. I would suggest putting the synonyms in the DB to make the expansion easier. The text would then become "Ruby Developer / Java-EE". Each word must be replaced with all the synonyms. If Java has two synonyms, they both must be in the field. This goes into "title_synonyms"
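The two indexing steps can be sketched as a small preprocessing routine. This is an illustration in Python rather than the poster's Ruby, and the function names are hypothetical. It assumes the synonyms file uses the "Synonym > Canonical" lines shown in the question, and it assumes that words without synonyms are carried over unchanged, as in the "Ruby Developer / Java-EE" example.

```python
def parse_synonyms(lines):
    """Map each canonical term to its synonyms, from 'Synonym > Canonical' lines."""
    table = {}
    for line in lines:
        if ">" not in line:
            continue  # skip blank or malformed lines
        synonym, canonical = (part.strip() for part in line.split(">", 1))
        if synonym != canonical:  # 'Java > Java' adds nothing new
            table.setdefault(canonical, []).append(synonym)
    return table


def distinct_words(text):
    """Drop repeated words (case-insensitively), keeping first-seen order."""
    seen = set()
    words = []
    for word in text.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            words.append(word)
    return words


def build_fields(title, synonyms):
    """Return the (title, title_synonyms) texts to feed into the index."""
    words = distinct_words(title)
    expanded = []
    for word in words:
        # A word with synonyms is replaced by all of them; others stay as-is.
        expanded.extend(synonyms.get(word, [word]))
    return " ".join(words), " ".join(expanded)


synonyms = parse_synonyms(["Java > Java", "Java-EE > Java"])
title, title_synonyms = build_fields("Ruby Developer / Java Developer", synonyms)
print(title)           # Ruby Developer / Java
print(title_synonyms)  # Ruby Developer / Java-EE
```

Running this once per record at indexing time keeps the duplicate handling and synonym lookup out of the search path.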
Searching
Because there are now two fields in sphinx we can give them each a different weight; "title" can get a weight of "10" and "title_synonyms" a weight of "3". That means a record has to match 4 synonyms before it ranks higher than one with the original title. You can play around with the weights to suit your needs.
Let's assume a user was searching for "Java Developer". For the search phrase do the following :-
Remove duplicate words
Get synonyms for each word in the search phrase
Set Matching Mode in Sphinx to SPH_MATCH_EXTENDED
The above rules will mean the search in sphinx looks like this :-
@title "Java Developer" | @title_synonyms "Java-EE"
If you want to rank exact matches higher than lexemes, the search query would look like this :-
@title ("Java Developer" | "=Java =Developer") | @title_synonyms ("Java-EE" | "=Java-EE")
You will need to use SPH_RANK_PROXIMITY_BM25 or SPH_RANK_SPH04 to make this work properly though.
Ranking
You can try any of the built in ranking algorithms to see what the results look like. I recommend SPH_RANK_MATCHANY or SPH_RANK_WORDCOUNT as a start.
For Proximity and exact match ranking use SPH_RANK_PROXIMITY_BM25, SPH_RANK_SPH04 or SPH_RANK_EXPR where you can use your own algorithm.
Conclusion
You should now have a search that is both fast and accurate. Very little work has to be done by your Ruby application and most of the work is done inside sphinx (where it should be).
Hope this helps...
This performance problem is an algorithm problem.
If you cannot express the problem in a way that lets a backend tool, like sphinx or the database engine, do the work, then you end up doing the processing in ruby, where it's easy to run into performance problems.
First, do as much as you can with sphinx (or whatever other search engine) and the database. The more pre-digested the data coming into ruby, the less you have to do in ruby code, and that will likely be faster, since databases have been highly optimized over the last half century.
So, for example, run sphinx on the key words. Also run sphinx on the synonyms. Limit all the answers to the top results, and merge the results. That way your ruby code will be limited to the likely high results instead of having to consider the whole database of entries.
Once in ruby, the most important thing is to avoid high order algorithms, that is, make sure you are using a low order algorithm.
As you process your raw data, if you hold your top results in an array and try to sort or scan the array, you are going to have an N-squared order. That is, your order will be the product of the number of raw entries and the number of elements you keep in your array.
The best algorithms for your problem are a priority queue implemented by a heap-like container, or a b-tree. Both have N-log-N order (N times the log of N), or more precisely, the number of raw data records times the log of the number of items you keep in your container.
A heap is a binary tree, where each node in the tree (not just the leaves but each node) has a rated record. The nodes below each record all have lower ranks. This is called the heap condition.
There are algorithms for adding elements, taking the top ranked element out, and replacing the lowest ranked element which maintain the heap condition. Look up binary heap in the wikipedia.
Let's say your site will display the top 100 ranked results. Maintain a heap where the root is the lowest ranked. Populate the heap by adding the first 100 raw records you are processing.
Now for record 101 and after, compare its rank with the root. If the new record is ranked higher, use the delete algorithm to reduce your heap to 99 nodes (which will remove the lowest ranked record in the heap) and add your new record to the heap.
Once you have gone through all your records, you will have the top 100 ranked results. The heap delete algorithm will pull them out in reverse order.
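The bounded-heap procedure described above can be sketched with Python's standard heapq module, which keeps the smallest item at the root. This is an illustration in Python rather than Ruby; the function name `top_k` and the (score, record) tuple layout are assumptions for the example.

```python
import heapq


def top_k(scored_records, k):
    """Return the k highest-scored (score, record) pairs, best first."""
    heap = []  # min-heap: heap[0] is the lowest-ranked record we are keeping
    for score, record in scored_records:
        if len(heap) < k:
            # Still filling the heap with the first k records.
            heapq.heappush(heap, (score, record))
        elif score > heap[0][0]:
            # New record beats the current lowest: drop the root, add the new one.
            heapq.heapreplace(heap, (score, record))
    # Popping yields lowest first, so reverse to get best first.
    ordered = [heapq.heappop(heap) for _ in range(len(heap))]
    ordered.reverse()
    return ordered


records = [(3, "a"), (5, "b"), (1, "c"), (4, "d"), (2, "e")]
best = top_k(records, 3)
print(best)
```

Each record costs at most one heap operation on a heap of size k, which is where the "records times log of items kept" cost comes from. In Python specifically, `heapq.nlargest(k, records)` gives the same result in one call; the explicit loop is shown only to mirror the steps in the text.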

Resources