Keeping elasticsearch and database in sync - ruby-on-rails

I am trying to figure out a way to keep my MySQL DB and Elasticsearch in sync. I have set up a JDBC river using the jprante/elasticsearch-river-jdbc plugin for Elasticsearch. When I execute the request below:
curl -XPUT 'localhost:9200/_river/my_jdbc_river/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://localhost:3306/MY-DATABASE",
        "user" : "root",
        "password" : "password",
        "sql" : "select * from users",
        "poll" : "1m"
    },
    "index" : {
        "index" : "test_index",
        "type" : "user"
    }
}'
the river starts indexing data, but for some records I get an org.elasticsearch.index.mapper.MapperParsingException. There is a discussion related to this issue here, but I want to know a way around it.
Is it possible to permanently fix this by creating an explicit mapping for all 'fields' of the 'type' that I am trying to index or is there a better way to solve this issue?
Another question I have: when the jdbc-river polls the database again, it seems to re-index the entire data set (given in the SQL query) into ES. I am not sure, but is this done because Elasticsearch wants to add fresh data as well as update any changes in the existing data? Is it possible to index only the fresh data, if the table's data is static?

Did you look at default mapping?
http://www.elasticsearch.org/guide/reference/mapping/dynamic-mapping.html
I think it can help you here.
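For the mapping question, a minimal sketch of creating the index with an explicit mapping before starting the river (the field names here are assumptions; list whatever columns your select actually returns):
curl -XPUT 'localhost:9200/test_index' -d '{
    "mappings" : {
        "user" : {
            "properties" : {
                "id" : { "type" : "integer" },
                "name" : { "type" : "string" },
                "created_at" : { "type" : "date" }
            }
        }
    }
}'
With an explicit mapping in place, dynamic mapping no longer has to guess field types from the first record it sees, which is the usual source of MapperParsingException on later records.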
If you have an insertion date field in your table, you can use it to filter what you have to index.
See https://github.com/jprante/elasticsearch-river-jdbc#time-based-selecting
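For example, a hedged sketch of restricting the river to fresh rows, assuming a created_at timestamp column on users (the time-based-selecting link above documents the plugin's parameterized variant, which is more robust than this window arithmetic):
"sql" : "select * from users where created_at > now() - interval 1 minute",
"poll" : "1m"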
HTH
David

Elasticsearch has dropped the river concept altogether. It is not a recommended path, because it usually doesn't make sense to keep the same normalized SQL table structure in a document store like Elasticsearch.
Say you have Product as an entity with some attributes, and Reviews as a child table of Product, since there can be multiple reviews for the same product:
Products(id, name, status, ... etc)
Product_reviews(product_id, review_id)
Reviews(id, note, rating, ... etc)
In the document store you may want to create a single index, say product, that embeds both: Product{attribute1, attribute2, ..., reviews: [review1, review2, ...]}
Here is an approach to syncing in such a setup.
Assumptions:
SQL database (the true source of record)
Elasticsearch or any other NoSQL document store
Solution:
As soon as an update happens in the SQL database, publish an event to JMS/AMQP/a database queue/a file-system queue/Amazon SQS etc., carrying either the full Product or just the primary object ID (I would recommend just the ID).
The queue consumer should then call a web service to get the full object (if only the primary ID was pushed to the queue), or just take the object itself, and send the respective changes to Elasticsearch/the NoSQL database.
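A hedged sketch of that pipeline in Rails terms; Sidekiq as the queue and the elasticsearch-ruby client are my assumptions, and any queue/consumer pair would work the same way:
class User < ActiveRecord::Base
  # Publish only the primary ID after every committed change.
  after_commit { IndexUserJob.perform_async(id) }
end

class IndexUserJob
  include Sidekiq::Worker

  # The consumer: load the full object by ID and push it to Elasticsearch.
  def perform(user_id)
    user = User.find(user_id)
    client = Elasticsearch::Client.new(url: 'http://localhost:9200')
    client.index(index: 'test_index', type: 'user', id: user.id, body: user.as_json)
  end
end
Note that a deletion would need its own event carrying the ID, since by then the consumer can no longer load the row.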


Why is TypeORM returning no records in this simple query?

I'm trying to get all the users on my system that match a complex where conditional with TypeORM. My end query would look something like this:
connection.createQueryBuilder()
    .select().from(User, "user")
    .where("email IS NULL OR email NOT LIKE '%@example.com'")
    .getMany()
If I run that query with getCount() it works and tells me how many I have, but getMany() returns []
In fact, I simplified it to this:
console.log(
    await connection.createQueryBuilder()
        .select().from(User, "user")
        .getManyAndCount())
I get this surprising result (with logging enabled):
query: SELECT * FROM "user" "user"
query: SELECT COUNT(DISTINCT("user"."id")) as "cnt" FROM "user" "user"
[ [], 14 ]
Any ideas why I would get no users when the count shows 14? I run the query manually and it obviously shows the users... what's going on here?
The code that Carlo offered in one of the answers:
await connection.getRepository(User).findAndCount()
works, but that won't let me have my where clause (as far as I know, I'm still new to TypeORM). I'm just sharing this to show that the User model seems to be working fine, except when I use it with the query builder to select a bunch of users (counting and deleting works).
Keep your code syntax as simple as possible, since TypeORM docs (for now) aren't perfect.
Try using Find Options since I can't find any getManyAndCount() method for QueryBuilder:
const users = await connection
    .getRepository(User)
    .findAndCount();
EDIT:
Of course you can have (complex) where clause with find.
You can chain multiple where clauses (OR) with a really simple syntax. Check out all options here.
Example that maps your "raw" query:
// IsNull, Not and Like come from the typeorm package:
import { IsNull, Not, Like } from "typeorm";

const users = await connection
    .getRepository(User)
    .findAndCount({
        where: [
            {
                email: IsNull(),
            },
            {
                email: Not(Like('%@example.com')),
            },
        ],
    });
Hope it helps :)
Let me lay that out as a step-by-step approach:
instead of getManyAndCount(), use getRawMany()
you will get the data for sure by using the first point
now look at the keys you are getting in that data
use the same keys in your select query and you're done
if you still didn't get records using getRawMany():
I. try select *, i.e. a bare select()
II. check the column names and correct them
Additionally, it can also depend on your entity structure. Carefully check that too.
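A hedged sketch of the entity-mapped alternative (assuming the User entity from the question): building the query through the repository registers the entity metadata for the alias, which is what lets getMany()/getManyAndCount() hydrate User instances instead of returning []:
const [users, count] = await connection
    .getRepository(User)
    .createQueryBuilder("user")
    .where("user.email IS NULL OR user.email NOT LIKE :pattern", { pattern: "%@example.com" })
    .getManyAndCount();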

Multiple nesting in Falcor query

I am trying to query a multiply nested object with Falcor. I have a user which, among other values, has follower, which itself has properties like name.
I want to query the name of the user and the first 10 followers.
My Falcor server side can be seen on GitHub; there are my router and resolver.
I query the user with user["KordonDev"]["name", "stars"], and the followers with user["KordonDev"].follower[0..10]["name", "stars"].
The route for followers is user[{keys:logins}].follower[{integers:indexes}], but this doesn't catch the following query.
I tried to add it as a string query: user["KordonDev"]["name", "stars", "follower[0..10].name"] doesn't work.
The second try was to query with an array of keys: ["user", "KordonDev", "follower", {"from":0, "to":10}, "name"], but here I don't know how to query the name of the user.
As far as I know, and from looking at the path parser, there is no way to do nested queries.
What you want to do is batch the query and send two paths:
user["KordonDev"]["name", "stars"]
user["KordonDev"]["follower"][0..10].name
It seems that Falcor does not support this; there is even a somewhat old issue discussing how people keep trying to do nested queries:
to the point about the current syntax leading people to try this:
['lolomo', 0, 0, ['summary', ['item', 'summary']]]
I can see folks trying to do the same thing with the new syntax:
"lolomo[0][0]['summary', 'item.summary']"
As soon as they know they can do:
"lolomo[0][0]['summary', 'evidence']"
So it seems deeply nested queries are not supported.

Calculating the Count of a related Collection

I have two models Professionals and Projects
Professionals hasMany Projects
Projects belongsTo Professionals
On the Professionals index page I need to show the number of projects each Professional has.
Right now I am doing the following query to get all the Professionals.
How can I fetch the count of the Projects for each of the Professionals as well?
@pros = Professionals.all.asc(:name)
I would add projects_count to Professional
Then
class Project
  belongs_to :professional, counter_cache: true
end
And Rails will handle the count every time a project is added to or removed from a professional. Then you can just call .projects_count on each professional.
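The counter cache needs an integer column to exist on professionals; a hedged migration sketch, assuming ActiveRecord (the question's .asc(:name) hints at Mongoid, where the column would instead be a field on the model):
class AddProjectsCountToProfessionals < ActiveRecord::Migration
  def change
    add_column :professionals, :projects_count, :integer, default: 0
    # Backfill so existing professionals start with a correct count.
    Professional.reset_column_information
    Professional.find_each { |pro| Professional.reset_counters(pro.id, :projects) }
  end
end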
Edit:
If you actually want the additional data:
@pros = Professionals.includes(:projects).order(:name)
Then
@pros.each do |pro|
  pro.name
  pro.projects.each do |project|
    project.name
  end
end
I am just abstracting here because the Rails thing really isn't my bag, so the code is really just "pseudo-code", but it should be close to what is wanted. Let's talk about schema and things to look for.
Consider "just" how MongoDB is going to store the data, given that you presumably have multiple collections. I am not saying that is or is not the best model; I am just dealing with it as it stands.
Let us assume we have this data for "Projects"
{
    "_id" : ObjectId("53202e1d78166396592cf805"),
    "name" : "Project1",
    "desc" : "Building Project"
},
{
    "_id" : ObjectId("532197fb423c37c0edbd4a52"),
    "name" : "Project2",
    "desc" : "Renovation Project"
}
And that for "Professionals" we might have something like this:
{
    "_id" : ObjectId("531e22b7ba53b9dd07756bc8"),
    "name" : "Steve",
    "projects" : [
        ObjectId("53202e1d78166396592cf805"),
        ObjectId("532197fb423c37c0edbd4a52")
    ]
}
Right. So now we see that the "Professional" has to have some kind of concept that there are related items in another collection and what those related items are.
Now I presume (and it's not my bag) that there is a way to get down to the lower level of the driver implementation in Mongoid (I believe that is Moped, off the top of my head) and that it is likely (from memory) invoked in a similar way to the following (assuming "Professionals" as the model class name):
Professionals.collection.aggregate([
    { "$unwind": "$projects" },
    { "$group": {
        "_id": "$_id",
        "count": { "$sum": 1 }
    }}
])
Or in some similar form that is more or less the analog to what you would do in the native MongoDB shell. The point being, with something like this you just made the server do the work, rather than pulling all the results to your client and looping through them.
Suggesting that you use native code to iterate results from your data store is counterproductive and counterintuitive to using any kind of back-end database store. Whether it be a SQL database or a NoSQL database, the general preference is: as long as the database has methods to do the aggregation work, use them.
If you are writing code that essentially pulls every record from your store and then cycles through to get the result, then you are doing something wrong.
Use the database methods. Otherwise you might as well just use a text file and be done with it.

Rails, Soulmate, Redis remove record

I use Soulmate to autocomplete search results; however, I want to be able to delete records after a while so they don't show up in the search field again. Reloading the whole list into Soulmate seems a bit hacky and unnecessary.
I loaded the data as JSON, and each record has a unique "id":
{"id":1547,"term":"Foo Baar, Baaz","score":85}
How can I delete that record from Redis so it won't show up in the search results again?
It is not trivial to do directly in Redis with redis-cli commands.
Looking at the Soulmate code, the data structures are as follows:
a soulmate-index:[type] set containing all the prefixes
a soulmate-data:[type] hash object containing the association between the id and the json object.
per prefix, a soulmate-index:[type]:[prefix] sorted set (with score and id)
So to delete an item, you need to:
Retrieve the JSON object from its id (you already did) -> id 1547
HDEL soulmate-data:[type] 1547
Generate all the possible prefixes of "Foo Baar, Baaz"
For each prefix:
ZREM soulmate-index:[type]:[prefix] 1547
SREM soulmate-index:[type] [prefix] (only if the sorted set for that prefix is now empty, since other items may still share it)
It would probably be easier to call the remove method provided by the Soulmate::Loader class directly from a Ruby script, since it automates everything for you:
https://github.com/seatgeek/soulmate/blob/master/lib/soulmate/loader.rb
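For example, a minimal sketch (assuming the records were loaded under a type named "venues" and Redis runs on the default localhost; only the "id" key is needed to remove a record):
require 'soulmate'

# Removes the record, its data hash entry, and its prefix index entries in one call.
Soulmate::Loader.new('venues').remove('id' => 1547)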

django.db.utils.IntegrityError: (1062, "Duplicate entry '22-add_' for key 'content_type_id'")

I am using Django's multiple-database router concept, with multiple sites on different databases. Users from the base database log in on all the other sub-sites.
When I run syncdb on the base site it works properly (every time), but running syncdb on the other sites works only the first time; on subsequent runs it throws an integrity error like the one below:
django.db.utils.IntegrityError: (1062, "Duplicate entry
'22-add_somesame' for key 'content_type_id'")
Once I remove the multiple-DB router settings from the project, syncdb works properly (every time).
So is this related to the multiple-DB router, or something else?
Please advise, thanks.
The problem here is with the DB router and Django system objects. I've experienced the same issue with multiple DBs and routers. As I remember, the problem is with the auth.permission content types, which get mixed up between databases. The syncdb script tries to create these in all databases, and then it creates a permission content type for some object whose id is already reserved for a local model.
I have the following
BASE_DB_TYPES = (
    'auth.user',
    'auth.group',
    'auth.permission',
    'sessions.session',
)
and then in the db router:
def db_for_read(self, model, **hints):
    if hasattr(model, '_meta') and str(model._meta) in BASE_DB_TYPES:
        return 'base_db'  # the alias of the base db that will store users
    return None  # the default database, or some custom mapping
EDIT:
Also, the exception might say that you're declaring a permission add_somesame for your model somesame, while Django automatically creates add_, change_ and delete_ permissions for every model.
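A hedged sketch of the router piece that keeps syncdb from recreating those shared objects in the per-site databases (allow_syncdb is the pre-Django-1.7 hook; newer versions use allow_migrate, and 'base_db' is the alias from above):
def allow_syncdb(self, db, model):
    # Shared auth/session/contenttype objects live only in the base database.
    if str(model._meta) in BASE_DB_TYPES or model._meta.app_label == 'contenttypes':
        return db == 'base_db'
    return None  # everything else follows the default rules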
