Generating a unique numeric id in neo4j

I am aware of the UUID module, but as far as I know it does not allow you to use only numeric characters. We expect our database to have millions of records, and a numeric search is faster than a character search.
Is there a better way of generating a unique ID for each node?
If you tell me to use UUIDs: when traversing a graph database with millions, and possibly billions, of nodes, how badly would performance suffer?

Whether you use numbers or UUID strings makes no difference for a single lookup: it remains an O(1) operation (provided you back the property with a unique constraint).
On another note, the uuid module recently gained a sequential ID generator that you can choose instead of the default UUID generator:
https://github.com/graphaware/neo4j-uuid/blob/master/README.md#specifying-the-generator-through-configuration
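As an illustration of the unique-constraint-backed lookup, here is a minimal Python sketch; the connection details and the Person label/uuid property are placeholders, and the constraint syntax assumes Neo4j 4.4 or later:
from neo4j import GraphDatabase

# Placeholder connection details; the :Person label and uuid property are
# illustrative only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # The unique constraint also creates an index, so single lookups stay O(1)
    # whether the value is numeric or a UUID string.
    session.run("CREATE CONSTRAINT person_uuid IF NOT EXISTS "
                "FOR (p:Person) REQUIRE p.uuid IS UNIQUE")
    record = session.run("MATCH (p:Person {uuid: $uuid}) RETURN p.name AS name",
                         uuid="123e4567-e89b-12d3-a456-426614174000").single()
driver.close()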

Related

Multi Tenant dynamic key value store

I have to implement a system where a tenant can store multiple key-value stores. One key-value store can have a million records, and there will be multiple columns in one store.
[Edited] I have to store tabular data (lists with multiple columns), like Excel, where column headers are unique and there is no defined schema.
This will be largely static data (updated occasionally).
We will provide a UI to handle those updates.
Every tenant wants to store multiple sets of table-structured data, which they need to reference from different applications; the contract will be JSON only.
For example, an organization/tenant wants to store their Employees list or a Country-State list, plus some custom lists specific to the product, and this data runs into millions of records.
A simple solution is to use SQL, but here the schema is not defined; it is user-defined. I have handled this in SQL, but there are performance issues, so I want to choose a NoSQL DB that better suits this requirement.
Design Constraints:
GET API latency should be minimal.
We can assume the Pareto rule (80:20): 80% read calls and 20% writes, so it is a read-heavy application.
Users can update a single record or a single column.
Users can query by some column value, so we need to implement indexes on multiple columns.
It is schema-less, so we can assume NoSQL. SQL also supports JSON, but it is very hard to update a single row, and we cannot define indexes on dynamic columns.
I want to segregate key-value stores per tenant; no list will be shared between tenants.
One key-value store:
Another key value store example: https://datahub.io/core/country-list
I am thinking of Cassandra or another wide-column database; we could also consider a document database (MongoDB), where every collection could be a key-value store, or Amazon DynamoDB.
Cassandra allows you to partition data by partition key, but in my use case I may want to fetch data by different columns, and in Cassandra that means querying all partitions, which will be expensive.
Your example data shows duplicate items, which is not something NoSQL databases can store.
DynamoDB can handle this scenario quite efficiently; it's well suited for high read activity and delivers consistent single-digit millisecond latency at any scale. One caveat of DynamoDB compared to the others you mention is the 400 KB item size limit.
In order to get top performance from DynamoDB, you have to utilize the partition key as much as possible, because it provides you with hash-based access (super fast).
Obviously a unique identifier for the user (username?) should be present in the PK, but if there is another field that you always have at request time, like the country for example, you should include it in the PK as well.
Like so:
PK                                      SK
Username#S2#Country#US#State#Georgia    Address#A1
It might be worth storing a mapping for the countries alone so you can retrieve them before executing the heavy query. You can't have more than 20 global secondary indexes, so keep that in mind and reuse/overload indexes and keys as much as possible.
Stick to a single-table design to get the most out of this.
As mentioned by Lee Hannigan, duplicated items are not supported; all keys (including those of the indexes) must be unique pairs.
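As an illustration of the PK/SK layout above, here is a rough Python (boto3) sketch; the table name, key values, and payload are made up:
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table and attribute names, for illustration only.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("TenantKeyValueStore")

# The partition key packs the fields you always have at request time
# (user, country, state); the sort key identifies the record in the partition.
table.put_item(Item={
    "PK": "Username#S2#Country#US#State#Georgia",
    "SK": "Address#A1",
    "payload": {"street": "Main St", "zip": "30301"},
})

# Hash-based access: one query fetches everything under the composite PK.
response = table.query(
    KeyConditionExpression=Key("PK").eq("Username#S2#Country#US#State#Georgia")
)
for item in response["Items"]:
    print(item["SK"], item["payload"])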

What is the difference between randomUUID and GraphAware UUID in Neo4j?

I am currently using the GraphAware UUID library to generate UUIDs in Neo4j. Later I found out that there is a randomUUID() function to generate UUIDs. Which one should be used? Will randomUUID() create a unique id on the server?
They both call java.util.UUID#randomUUID(); that is where the similarity between them ends.
The built-in Cypher randomUUID() is a function, which you have to call manually in each Cypher query where you want to assign a UUID.
The neo4j-uuid module is a set of extensions to Neo4j that transparently assign UUIDs (or other types of ids, depending on the configured id generator) to nodes and relationships and ensure the ids cannot be changed or deleted. It also maintains an explicit index for the nodes/relationships. See the readme for the full feature set.
If your use case is simply to assign a UUID to (some) nodes or relationships, then use the built-in function. If you can take advantage of the other features of the neo4j-uuid module, use that.
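For the manual case, here is a minimal sketch using the Python driver; the connection details and the Person label are placeholders, not part of the question:
from neo4j import GraphDatabase

# Placeholder connection details and label, for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Cypher's built-in randomUUID() is evaluated on the server, but you have
    # to call it yourself in every query that should assign an id.
    record = session.run(
        "CREATE (p:Person {uuid: randomUUID(), name: $name}) RETURN p.uuid AS uuid",
        name="Alice",
    ).single()
    print(record["uuid"])
driver.close()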
For manual use cases, creating the UUID yourself in a Cypher query, they're functionally identical (GraphAware implemented this first, I think; we got to it later). Yes, the ids will be created on the server and will be unique in both cases.
I believe GraphAware's UUID module covers more than just this, automatically assigning UUIDs to newly created nodes and relationships and adding extra validation on top of that.

Delphi - What Structure allows for SAVING inverted index type of information?

Delphi XE6. I am looking to implement a limited style of search: specifically, an edit field where the user enters a business name, which then gets looked up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for the business "First Bank of Kansas", the user should be able to enter "Fir Kan" and get a match. This means an inverted-index type of structure: some kind of list of each unique word, each pointing to a document ID / primary key ID (an integer). I am struggling with WHAT type of structure to use. I have approximately 250,000 business names, which contain 43,500 unique words. Word counts vary from a single occurrence to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN and return BANK, BANKER, etc. This means that whatever structure I use, I have to be able to find BAN, move to the next alphabetic entry, and keep moving until I hit a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. A HASH is the fastest, but I can't use one, correct? See requirement 1.
3). Each entry in this structure needs to hold a list of integers. If I end up going with a linked list, then each element has to hold a list of integers.
4). I need to be able to save and load this structure. I don't want to have to rebuild it each time I use it.
Whatever I end up with, it appears it has to be a NESTED structure: a higher-level list (linked list?) whose nodes each hold an integer list.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure, because the search effort is normally O(log n), which is quite fast. Especially if business names change at runtime, an AVL tree should do well, although it is quite some work to implement one yourself. But there should be many ready-to-use units for binary trees all over the internet.
For each entered word that is found in your tree, take its list of IDs and collect those lists, grouped by the entered word they matched.
As the last step, intersect all of those collected ID lists.
Only the IDs that fit all entered words remain; those IDs reference the searched business names.
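Not Delphi, but here is a small Python sketch of the prefix-scan-and-intersect idea, using a sorted word list with binary search in place of an AVL tree; the sample data is made up:
import bisect
from functools import reduce

# word -> set of business IDs (hypothetical sample data)
index = {"bank": {1, 7}, "banker": {3}, "first": {1}, "kansas": {1, 9}}
words = sorted(index)  # sorted keys allow a binary-search prefix scan

def ids_for_prefix(prefix):
    """Collect the IDs of every word starting with the given prefix."""
    start = bisect.bisect_left(words, prefix)
    result = set()
    for word in words[start:]:
        if not word.startswith(prefix):
            break              # sorted order: no later word can match
        result |= index[word]
    return result

def search(query):
    """Intersect the ID sets of all entered prefixes ("Fir Kan" -> {1})."""
    sets = [ids_for_prefix(p.lower()) for p in query.split()]
    return reduce(set.intersection, sets) if sets else set()

print(search("Fir Kan"))   # -> {1}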

Mongodb: Is it a good idea to create a unique index on web URLs?

My document looks like:
{"url": "http://some-random-url.com/path/to/article"
"likes": 10
}
The url needs to be unique. Is it a good idea to have a unique index on the url? The URL can be long, resulting in a larger index, a bigger memory footprint, and slower overall performance. Would it be a good idea to generate a hash from the url (I am thinking about using murmur3) and create a unique index on that instead? I am assuming that the chances of collision are pretty low, as described here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Does anyone see any drawbacks to this approach? The new document would look like this (with a unique index on u_hash instead of url):
{"url": "http://some-random-url.com/path/to/article",
 "likes": 10,
 "u_hash": "<murmur3 hash of url>"
}
UPDATE
I will not be doing regex queries on the url, only complete URL lookups. I am more concerned about the performance of this lookup, as I believe the index will also be used internally by MongoDB to maintain uniqueness, hence affecting write performance as well (plus a longer index). Additionally, my understanding is that MongoDB doesn't perform well with long text indexes, as it wasn't designed for that purpose. I may be wrong, though, and it could simply depend on whether or not the index fits into RAM. Any pointers?
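For reference, here is a rough sketch of the scheme being proposed, in Python, assuming the mmh3 package for murmur3 and pymongo; database and collection names are placeholders:
import mmh3                              # murmur3 implementation (assumed)
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["demo"]["articles"]

# Unique index on the hash instead of on the (potentially long) URL.
coll.create_index([("u_hash", ASCENDING)], unique=True)

url = "http://some-random-url.com/path/to/article"
u_hash = f"{mmh3.hash128(url, signed=False):032x}"   # 128-bit murmur3, stored as a hex string
coll.insert_one({"url": url, "likes": 10, "u_hash": u_hash})

# Every lookup has to go through the hash to use the index.
doc = coll.find_one({"u_hash": f"{mmh3.hash128(url, signed=False):032x}"})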
I'd like to expand on the answer of @AlexRyan. While he is right in general, there are some things to take into consideration for this use case.
First of all, we have to differentiate between a unique index and the _id field.
When the URL needs to be unique in your use case, there has to be a unique index. What we have to decide is whether to use the URL itself or a hashed value of it. The hashing itself would not help with the search, as the hash sum saved in a field is treated as a string by MongoDB. It may save space (hash values can be shorter than long URLs), thereby reducing the memory needed for the index. However, doing so takes away the possibility of searching for parts of the URL in the index, for example with
db.collection.find({url:{$regex:/stackoverflow/}})
With a unique index on url, this query would use an index, which will be quite fast. Without such a (unique) index, this query results in a comparably slow collection scan.
Plus, creating the hash each and every time before querying, updating or inserting doesn't make these operations faster.
This leaves us with the fact that creating a hash sum and a unique index on it may save some RAM at the cost of making queries on the actual field slower by orders of magnitude. And it introduces the need to create a hash sum each and every time. Having an index on both the URL and its hashed value would not make sense at all.
Now to the question of whether it is a good idea to use the URL as _id one way or the other. Since URLs are usually distinct by nature (they are supposed to return the same content) and the likes are tied to that uniqueness, I would tend to use the URL as the id. Since you need the unique index on _id anyway, it serves two purposes here: you have your id for the document, you ensure uniqueness of the URL, and, if you use the natural representation of the URL, it is even queryable in an efficient way.
Use a unique index on url
db.interwebs.ensureIndex({ "url" : 1}, { "unique" : 1 })
and not a hashed index. Hashed indexes in MongoDB are meant to be used for hashed shard keys and not for unique constraints. From the hashed index docs,
Hashed indexes support sharding a collection using a hashed shard key. Using a hashed shard key to shard a collection ensures a more even distribution of data.
and
You may not create compound indexes that have hashed index fields or specify a unique constraint on a hashed index
If url needs to be unique and you will use it to look up documents, it's absolutely worth having a unique index on url. If you want to use url as the primary key for documents, you can store the url value in the _id field. This field is normally a driver-generated ObjectId but it can be any value you like. There's always a unique index on _id in a MongoDB collection so you get the unique index "for free".
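A minimal pymongo sketch of that variant (the collection name is taken from the example above; connection details are placeholders):
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017")
interwebs = client["demo"]["interwebs"]

# Every collection already has a unique index on _id, so storing the URL as
# _id gives you the uniqueness guarantee "for free".
try:
    interwebs.insert_one({"_id": "http://some-random-url.com/path/to/article",
                          "likes": 10})
except DuplicateKeyError:
    pass  # that URL is already stored

doc = interwebs.find_one({"_id": "http://some-random-url.com/path/to/article"})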
I think the answer is "it depends".
Choosing keys that have no real-world meaning embedded in them may save you pain in the future. This is especially true if you decide you need to change them but have a lot of foreign keys referencing them.
Most database management systems offer you a way to generate unique IDs.
In Oracle, you might use a sequence.
In MySQL you might use AUTO_INCREMENT when you define the table itself.
The way MongoDB assigns unique ids to documents is different from relational databases: it uses ObjectIds for this purpose.
One of the interesting things about ObjectIDs is that they are generated by the driver.
Because of the algorithm that is used to generate them, they are guaranteed to be unique even if you have a large cluster of app and database servers.
You can learn more about them here:
http://docs.mongodb.org/manual/reference/object-id/
A lot of engineering work has gone into ensuring that ObjectIds are unique.
I use them by default unless there is a really good reason not to.
So far, I have not found a really good reason to not use them.
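A small illustration with pymongo's bson package:
from bson import ObjectId   # ships with pymongo

# ObjectIds are generated client-side by the driver, with no database round
# trip; they combine a timestamp with machine/process-unique parts and a counter.
oid = ObjectId()
print(oid)                   # 24-character hex string
print(oid.generation_time)   # creation timestamp embedded in the id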

Unique Identifiers that are User-Friendly and Hard to Guess

My team is working on an application with a legacy database that uses two different values as unique identifiers for a Group object: Id is an auto-incrementing identity column whose value is determined by the database upon insertion; GroupCode is determined by the application after insertion and is "Group" + theGroup.Id.
What we need is an algorithm to generate GroupCodes that:
Are unique.
Are reasonably easy for a user to type in correctly.
Are difficult for a hacker to guess.
Are either created by the database upon insertion, or are created by the app before the insertion (i.e. not dependent on the identity column).
The existing solution meets the first two criteria, but not the last two. Does anyone know of a good solution to meet all of the above criteria?
One more note: even though this code is used externally by users, and even though Id would make a better identifier for other tables to link their foreign keys to, GroupCode is what other tables use to refer to a specific Group.
Thanks in advance.
Would it be possible to add a new column? It could consist of the Identity and a random 32-bit number.
That 64-bit number could then be translated to a «Memorable Random String». It wouldn't be perfect security-wise, but could be good enough.
Here's an example using Ruby and the Koremutake gem.
require 'koremu'
# http://pastie.org/96316 adds Array.chunk
identity=104711
r=rand(2**32)<<32 # in this example 5946631977955229696
ka = KoremuFixnum.new(r+identity).to_ka.chunk(3)
ka.each {|arr| print KoremuArray.new(arr).to_ks + " "}
Result:
TUSADA REGRUMI LEBADE
Also check out Phonetically Memorable Password Generation Algorithms.
Have you looked into Base32/Base36 content encoding? A Base32 representation of an identity seed column will be unique and easy to enter, but definitely not secure. However, most non-programmers will have no idea how the string value is generated.
Also, using Base32/36 you can maintain normal integer-based database primary keys.
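A small sketch of the idea in Python (plain Base36 of the identity value; as noted above, this is compact and typeable but not hard to guess):
import string

ALPHABET = string.digits + string.ascii_uppercase   # 0-9, A-Z -> base 36

def to_base36(n):
    """Encode a non-negative integer (e.g. an identity value) in base 36."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

print(to_base36(104711))   # -> "28SN": short and typeable, but easy to guess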
