I am new to Neo4j and am trying to load the Yelp dataset into it. I am interested in three of the JSON files they provide:
user.json
{
"user_id": "-lGwMGHMC_XihFJNKCJNRg",
"name": "Gabe",
"review_count": 277,
"yelping_since": "2014-10-31",
"friends": ["Oa84FFGBw1axX8O6uDkmqg", "SRcWERSl4rhm-Bz9zN_J8g", "VMVGukgapRtx3MIydAibkQ", "8sLNQ3dAV35VBCnPaMh1Lw", "87LhHHXbQYWr5wlo5W7_QQ"],
"useful": 45,
"funny": 4,
"cool": 55,
"fans": 17,
"elite": [],
"average_stars": 4.72,
"compliment_hot": 5,
"compliment_more": 1,
"compliment_profile": 0,
"compliment_cute": 1,
"compliment_list": 0,
"compliment_note": 11,
"compliment_plain": 20,
"compliment_cool": 15,
"compliment_funny": 15,
"compliment_writer": 1,
"compliment_photos": 8
}
I have omitted several entries from the friends array to keep the output readable.
business.json
{
"business_id": "YDf95gJZaq05wvo7hTQbbQ",
"name": "Richmond Town Square",
"neighborhood": "",
"address": "691 Richmond Rd",
"city": "Richmond Heights",
"state": "OH",
"postal_code": "44143",
"latitude": 41.5417162,
"longitude": -81.4931165,
"stars": 2.0,
"review_count": 17,
"is_open": 1,
"attributes": {
"RestaurantsPriceRange2": 2,
"BusinessParking": {
"garage": false,
"street": false,
"validated": false,
"lot": true,
"valet": false
},
"BikeParking": true,
"WheelchairAccessible": true
},
"categories": ["Shopping", "Shopping Centers"],
"hours": {
"Monday": "10:00-21:00",
"Tuesday": "10:00-21:00",
"Friday": "10:00-21:00",
"Wednesday": "10:00-21:00",
"Thursday": "10:00-21:00",
"Sunday": "11:00-18:00",
"Saturday": "10:00-21:00"
}
}
review.json
{
"review_id": "VfBHSwC5Vz_pbFluy07i9Q",
"user_id": "-lGwMGHMC_XihFJNKCJNRg",
"business_id": "YDf95gJZaq05wvo7hTQbbQ",
"stars": 5,
"date": "2016-07-12",
"text": "My girlfriend and I stayed here for 3 nights and loved it.",
"useful": 0,
"funny": 0,
"cool": 0
}
As the sample files show, the relationship between a user and a business exists only through review.json. How can I create a relationship edge between user and business using the review.json file?
I have also seen Mark Needham's tutorial in which he populates StackOverflow data, but in that case a relationship file was already provided with the sample data. Do I need to build a similar file? If yes, how should I approach this? Or is there another way to build the relationship between user and business?
It very much depends on your model as to what you want to do, but you could do 3 imports:
//Create Users - does assume the data is unique
CALL apoc.load.json('file:///c://temp//SO//user.json') YIELD value AS user
CREATE (u:User)
SET u = user
then add the businesses:
CALL apoc.load.json('file:///c://temp//SO//business.json') YIELD value AS business
CREATE (b:Business {
business_id : business.business_id,
name : business.name,
neighborhood : business.neighborhood,
address : business.address,
city : business.city,
state : business.state,
postal_code : business.postal_code,
latitude : business.latitude,
longitude : business.longitude,
stars : business.stars,
review_count : business.review_count,
is_open : business.is_open,
categories : business.categories
})
For the businesses, we can't simply do SET b = business, because the JSON has nested maps (attributes, hours) and Neo4j properties cannot hold maps. So you need to decide whether you want that data, and if you do, you might have to go down a different route.
Lastly, the reviews, which is where we join it all up.
CALL apoc.load.json('file:///c://temp//SO//review.json') YIELD value AS review
CREATE (r:Review)
SET r = review
WITH r
//Match user to a review
MATCH (u:User {user_id: r.user_id})
CREATE (u)-[:HAS_REVIEW]->(r)
WITH r, u
//Match business to a review, and a user to a business
MATCH (b:Business {business_id: r.business_id})
//Merge here in case of multiple reviews
MERGE (u)-[:HAS_REVIEWED]->(b)
CREATE (b)-[:HAS_REVIEW]->(r)
Obviously, change the labels and relationship types to whatever suits your model. It may also need tuning depending on the size of the data, so you might need apoc.periodic.iterate to run it in batches.
APOC is here if you need it (and you should use it!)
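On the "do I need to build a relationship file?" part: with the import above you don't, but if you prefer the CSV-style import from the tutorial, such a file can be derived from review.json. A minimal Python sketch, assuming the file holds one JSON object per line; the function names, the CSV column names and the second sample review are made up for illustration.

```python
import csv
import io
import json


def review_edges(lines):
    """Yield unique (user_id, business_id) pairs from review records.

    Assumes one JSON object per line; blank lines are skipped.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        edge = (review["user_id"], review["business_id"])
        if edge not in seen:
            seen.add(edge)
            yield edge


def write_edges_csv(lines, out):
    """Write the pairs as CSV with a header row (column names are hypothetical)."""
    writer = csv.writer(out)
    writer.writerow(["user_id", "business_id"])
    writer.writerows(review_edges(lines))


# Two reviews by the same user for the same business collapse into one edge.
sample = [
    '{"review_id": "VfBHSwC5Vz_pbFluy07i9Q", "user_id": "-lGwMGHMC_XihFJNKCJNRg", "business_id": "YDf95gJZaq05wvo7hTQbbQ", "stars": 5}',
    '{"review_id": "made-up-second-review", "user_id": "-lGwMGHMC_XihFJNKCJNRg", "business_id": "YDf95gJZaq05wvo7hTQbbQ", "stars": 4}',
]
buffer = io.StringIO()
write_edges_csv(sample, buffer)
print(buffer.getvalue())
```

The de-duplication plays the same role as the MERGE in the Cypher above: one HAS_REVIEWED edge per user/business pair, however many reviews exist.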
I have a compound index as follows.
index({ account_id: 1, is_private: 1, visible_in_list: 1, sent_at: -1, user_id: 1, status: 1, type: 1, 'tracking.last_opened_at' => -1 }, {name: 'email_page_index'})
Then I have a query with these exact fields,
selector:
{"account_id"=>BSON::ObjectId('id'), "is_private"=>false, "visible_in_list"=>{:$in=>[true, false]}, "status"=>{:$in=>["ok", "queued", "processing", "failed"]}, "sent_at"=>{"$lte"=>2021-03-22 15:29:18 UTC}, "tracking.last_opened_at"=>{"$gt"=>1921-03-22 15:29:18 UTC}, "user_id"=>BSON::ObjectId('id')}
options: {:sort=>{"tracking.last_opened_at"=>-1}}
The winningPlan is the following
"inputStage": {
"stage": "SORT_KEY_GENERATOR",
"inputStage": {
"stage": "FETCH",
"filter": {
"$and": [
{
"account_id": {
"$eq": {
"$oid": "objectid"
}
}
},
{
"is_private": {
"$eq": false
}
},
{
"sent_at": {
"$lte": "2021-03-22T14:06:10.000Z"
}
},
{
"tracking.last_opened_at": {
"$gt": "1921-03-22T14:06:10.716Z"
}
},
{
"status": {
"$in": [
"failed",
"ok",
"processing",
"queued"
]
}
},
{
"visible_in_list": {
"$in": [
false,
true
]
}
}
]
},
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"user_id": 1
},
"indexName": "user_id_1",
"isMultiKey": false,
"multiKeyPaths": {
"user_id": []
},.....
And the rejected plan has the compound index and forms as follows
"rejectedPlans": [
{
"stage": "FETCH",
"inputStage": {
"stage": "SORT",
"sortPattern": {
"tracking.last_opened_at": -1
},
"inputStage": {
"stage": "SORT_KEY_GENERATOR",
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"account_id": 1,
"is_private": 1,
"visible_in_list": 1,
"sent_at": -1,
"user_id": 1,
"status": 1,
"type": 1,
"tracking.last_opened_at": -1
},
"indexName": "email_page_index",
"isMultiKey": false,
"multiKeyPaths": {
"account_id": [],
"is_private": [],
"visible_in_list": [],
"sent_at": [],
"user_id": [],
"status": [],
"type": [],
"tracking.last_opened_at": []
},
"isUnique": false,
The problem is that the winningPlan is slow. Wouldn't it be better if Mongoid chose the compound index? Is there a way to force it?
Also, how can I see the execution time for each separate STAGE?
I am posting some information that can help resolve the issue of performance and use an appropriate index. Please note this may not be the solution (and the issue is open to discussion).
...Also, how can I see the execution time for each separate STAGE?
For this, generate the query plan using the explain with the executionStats verbosity mode.
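As a rough illustration of reading that output: with executionStats, each stage in executionStats.executionStages carries an executionTimeMillisEstimate and nests its child under inputStage (or inputStages). The sketch below walks such a tree in Python. The helper name and the numbers are made up, and as far as I know each estimate includes the time spent in the stages beneath it.

```python
def stage_times(stage):
    """Return [(stage name, executionTimeMillisEstimate)] from the root stage down."""
    rows = [(stage.get("stage"), stage.get("executionTimeMillisEstimate"))]
    children = list(stage.get("inputStages", []))
    if "inputStage" in stage:
        children.append(stage["inputStage"])
    for child in children:
        rows.extend(stage_times(child))
    return rows


# Made-up numbers, shaped like the "executionStages" document of an
# explain run with the executionStats verbosity.
example = {
    "stage": "SORT",
    "executionTimeMillisEstimate": 120,
    "inputStage": {
        "stage": "FETCH",
        "executionTimeMillisEstimate": 95,
        "inputStage": {"stage": "IXSCAN", "executionTimeMillisEstimate": 10},
    },
}

for name, millis in stage_times(example):
    print(name, millis)
```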
The problem is that the winningPlan is slow. Wouldn't it be better if Mongoid chose the compound index? Is there a way to force it?
As posted, the plans show a "stage": "SORT_KEY_GENERATOR", which implies that the sort is being performed in memory (that is, without using an index for the sort). That is likely one of the main reasons for the slow performance. So, how can the query and the sort both use the index?
A single compound index can be used for a query with a filter+sort operations. That would be an efficient index and query. But, it requires that the compound index be defined in a certain way - some rules need to be followed. See this topic on Sort and Non-prefix Subset of an Index - as is the case in this post. I quote the example from the documentation for illustration:
Suppose there is a compound index: { a: 1, b: 1, c: 1, d: 1 }
And, all the fields are used in a query with filter+sort. The ideal query is, to have a filter+sort as follows:
db.test.find( { a: "val1", b: "val2", c: 1949 } ).sort( { d: 1 })
Note that the query filter has three fields with equality conditions (there are no $gt, $lt, etc.), and the sort is on d, the last field of the index. This is the ideal situation, where the index is used for the query's filter as well as for the sort.
In your case, this cannot be applied from the posted query. So, to work towards a solution you may have to define a new index so as to take advantage of the rule Sort and Non-prefix Subset of an Index.
Is it possible? It depends upon your application and the use case. Here is an idea that may help. Create a compound index like the following and see how it works:
account_id: 1,
is_private: 1,
visible_in_list: 1,
status: 1,
user_id: 1,
'tracking.last_opened_at': -1
I think having the condition "tracking.last_opened_at"=>{"$gt"=>1921-03-22 15:29:18 UTC} in the query's filter may not help the index being used.
Also, include some details such as the MongoDB server version, the size of the collection and some platform details. In general, query performance depends upon many factors, including indexes, RAM, the size and type of the data, and the kind of operations on the data.
The ESR Rule:
When using a compound index for a query with multiple filter conditions and a sort, the Equality Sort Range rule is sometimes useful for optimizing the query: put the equality fields first in the index, then the sort fields, then the range fields. See the following post with such a scenario: MongoDB - Index not being used when sorting and limiting on ranged query
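To make the Equality, Sort, Range ordering concrete, here is a small Python sketch that sorts the fields of a filter and sort document into that order. The helper name and the placeholder values are mine. It treats $in as an equality match, which is a simplification that may not hold for large $in lists, and it says nothing about what the planner will actually pick.

```python
RANGE_OPERATORS = {"$gt", "$gte", "$lt", "$lte", "$ne", "$nin", "$regex"}


def esr_order(filter_doc, sort_doc):
    """Order candidate index keys as Equality, then Sort, then Range."""
    equality, ranges = [], []
    for field, condition in filter_doc.items():
        if field in sort_doc:
            continue  # a sort field is listed once, in the sort group
        is_range = isinstance(condition, dict) and any(
            op in RANGE_OPERATORS for op in condition
        )
        (ranges if is_range else equality).append(field)
    keys = [(field, 1) for field in equality]
    keys += list(sort_doc.items())
    keys += [(field, 1) for field in ranges]
    return keys


# Filter and sort from the question, with placeholder values.
query_filter = {
    "account_id": "<ObjectId>",
    "is_private": False,
    "visible_in_list": {"$in": [True, False]},
    "status": {"$in": ["ok", "queued", "processing", "failed"]},
    "sent_at": {"$lte": "2021-03-22T15:29:18Z"},
    "tracking.last_opened_at": {"$gt": "1921-03-22T15:29:18Z"},
    "user_id": "<ObjectId>",
}
query_sort = {"tracking.last_opened_at": -1}

for field, direction in esr_order(query_filter, query_sort):
    print(field, direction)
```

For this query the sketch puts sent_at (the only pure range field) after tracking.last_opened_at, whereas email_page_index has sent_at in the middle of the key pattern, ahead of the sort field.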
I am trying to get only the matched data from a nested array in Elasticsearch, but I am not able to: the whole nested array is returned as output.
This is my query:
QueryBuilders.nestedQuery("questions",
QueryBuilders.boolQuery()
.must(QueryBuilders.matchQuery("questions.questionTypeId", quesTypeId)), ScoreMode.None)
.innerHit(new InnerHitBuilder());
I am using QueryBuilders to get data from the nested class. It works, but I am not able to get only the matched data.
Request Body :
{
"questionTypeId" : "MCMC"
}
when questionTypeId = "MCMC"
This is the output I am getting. I want to exclude the entries for which questionTypeId = "SCMC".
output :
{
"id": "46",
"subjectId": 1,
"topicId": 1,
"subtopicId": 1,
"languageId": 1,
"difficultyId": 4,
"isConceptual": false,
"examCatId": 3,
"examId": 1,
"usedIn": 1,
"questions": [
{
"id": "46_31",
"pid": 31,
"questionId": "QID41336691",
"childId": "CID1",
"questionTypeId": "MCMC",
"instruction": "This is a single correct multiple choice question.",
"question": "Who holds the most english premier league titles?",
"solution": "Manchester United",
"status": 1000,
"questionTranslation": []
},
{
"id": "46_33",
"pid": 33,
"questionId": "QID41336677",
"childId": "CID1",
"questionTypeId": "SCMC",
"instruction": "This is a single correct multiple choice question.",
"question": "Who holds the most english premier league titles?",
"solution": "Manchester United",
"status": 1000,
"questionTranslation": []
}
]
}
As you have tagged this with spring-data-elasticsearch:
Support to return inner hits was recently added to version 4.1.M1 and so will be included in the next released version. Then in a SearchHit you will get the complete top level document, but in the innerHits property only the matching inner hits will be returned.
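To show where the matching entries end up, here is a sketch in Python that reads them out of a raw search response of the shape the Elasticsearch REST API returns when inner hits are requested. It does not use the Spring Data SearchHit type. The helper name and the trimmed sample response are made up, and it assumes the inner hits are named after the nested path ("questions"), which is the default when no name is set.

```python
def matched_nested(response, path):
    """Collect the _source of every inner hit under `path`, keyed by parent _id."""
    result = {}
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {})
        inner_hits = inner.get("hits", {}).get("hits", [])
        result[hit["_id"]] = [entry["_source"] for entry in inner_hits]
    return result


# Trimmed, made-up response: the parent still carries both questions in
# _source, but only the MCMC one appears under inner_hits.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "46",
                "_source": {
                    "id": "46",
                    "questions": [
                        {"id": "46_31", "questionTypeId": "MCMC"},
                        {"id": "46_33", "questionTypeId": "SCMC"},
                    ],
                },
                "inner_hits": {
                    "questions": {
                        "hits": {
                            "hits": [
                                {"_source": {"id": "46_31", "questionTypeId": "MCMC"}}
                            ]
                        }
                    }
                },
            }
        ]
    }
}

print(matched_nested(sample_response, "questions"))
```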
I need some pointers here.
I'm talking to an API that returns data based on specific parameters. I have been taking that response, flattening and editing it to fit my model, and then saving it into the database. Everything was working great until today, when I started testing the live endpoint (no dummy data), and it turns out that the format of the response changes.
For example, if a data set does not have a record, rather than including the value as nil, some responses are not including that key at all. This is breaking my logic to flatten and edit since now I'd need to check that every single field exists before I do anything.
Here are 2 snippets of what it can look like
Sample 1 - (No shared)
{
"request_info": {
"city_id": 76211,
"currency": "usd",
"req_type": "geom"
},
"data": {
"rental_counts": {
"counts": {
"private": {
"1": 17,
"2": 3,
"all": 20
},
"entire": {
"0": 2,
"1": 8,
"2": 11,
"3": 16,
"4": 14,
"5": 6,
"all": 57
}
}
}
}
}
Sample 2 (includes Shared)
{
"request_info": {
"city_id": 76211,
"currency": "usd",
"req_type": "geom"
},
"data": {
"rental_counts": {
"counts": {
"private": {
"1": 17,
"2": 3,
"all": 20
},
"entire": {
"0": 2,
"1": 8,
"2": 11,
"3": 16,
"4": 14,
"5": 6,
"all": 57
},
"shared": {
"0": 2,
"1": 8,
"all": 10
}
}
}
}
}
The changes I believe can happen at any level and for any key (parent or child). I'm sure I'm not the first one to run into something like this. What is the best way to manage it? Is there some method or gem that would help with parsing json and getting it into a standardized model whether the data keys are there or not?
I had been looking at Roar but still don't quite understand how it works very well. Is this something Roar could handle or would the json object need to be pre-defined and not dynamic?
I found a simpler solution than Roar or deserializers. Ruby's Hash#slice lets you select only predefined keys and ignore all others. I call it after flattening my hash but before using Active Record to import.
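The flatten-then-select idea can be sketched as follows. This is a Python illustration of the approach, not the Ruby code I actually run, and the helper names and the dotted key paths are made up. One difference from slice: slice simply leaves out keys that are missing, while this version fills them with a default so every record has the same shape.

```python
def flatten(nested, parent=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in nested.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def pick(flat, keys, default=None):
    """Keep only the expected keys; a missing key gets the default value."""
    return {key: flat.get(key, default) for key in keys}


EXPECTED_KEYS = (
    "request_info.city_id",
    "data.rental_counts.counts.private.all",
    "data.rental_counts.counts.entire.all",
    "data.rental_counts.counts.shared.all",
)

# Trimmed version of Sample 1, which has no "shared" block at all.
response = {
    "request_info": {"city_id": 76211, "currency": "usd", "req_type": "geom"},
    "data": {
        "rental_counts": {
            "counts": {
                "private": {"1": 17, "2": 3, "all": 20},
                "entire": {"0": 2, "1": 8, "all": 57},
            }
        }
    },
}

record = pick(flatten(response), EXPECTED_KEYS)
print(record)
```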
I have a spreadsheet with Apple Podcasts URLs.
What I want to do is to get newest podcast date.
So basically content of "sort-value" tag of first grid cell element:
<td role="gridcell" sort-value="2017/12/22" class="release-date">
Can it be done with IMPORTHTML function?
Example podcast URL: https://itunes.apple.com/us/podcast/modern-sewciety-podcast/id742393907
How about this?
=INDEX(IMPORTXML(A1, "//td[@class='release-date']//span[@class='text']"), 1)
https://itunes.apple.com/us/podcast/modern-sewciety-podcast/id742393907 is in cell "A1".
The XPath is //td[@class='release-date']//span[@class='text'].
The top value is retrieved using INDEX.
Edit :
The cell shows 43091 because of the cell format: the date is being displayed as its serial number. To fix this, please try one of the following 2 patterns.
1. Modify the format of the cell showing 43091.
On the spreadsheet, select "Format" -> "Number" -> "Date".
If you want another date format, customize it.
2. Use this XPath.
=INDEX(IMPORTXML(A1, "//td[@class='release-date']/@sort-value"), 1)
With this XPath, the date format differs from that of //td[@class='release-date']//span[@class='text'].
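For reference, 43091 is the date itself expressed as a serial number, that is, a count of days from 1899-12-30, the day Google Sheets treats as day 0. A short Python check of that arithmetic (the function name is made up):

```python
from datetime import date, timedelta

SHEETS_EPOCH = date(1899, 12, 30)  # day 0 of the spreadsheet serial-date scale


def serial_to_date(serial):
    """Convert a whole-number spreadsheet date serial to a calendar date."""
    return SHEETS_EPOCH + timedelta(days=serial)


print(serial_to_date(43091))  # 2017-12-22
```

This matches the sort-value="2017/12/22" attribute in the question, so the retrieved value is right and only the display format needs changing.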
I can't give the specifics for Google Sheets, but I can tell you that you can more quickly get this data from using the Apple Media Lookup API.
Using the ID number, 742393907, make a call to this API endpoint: https://itunes.apple.com/lookup?id=742393907
That will return the following JSON:
{
"resultCount": 1,
"results": [
{
"wrapperType": "track",
"kind": "podcast",
"collectionId": 742393907,
"trackId": 742393907,
"artistName": "Stephanie Kendron: Modern Creative blogger and podcaster",
"collectionName": "Modern Sewciety Podcast",
"trackName": "Modern Sewciety Podcast",
"collectionCensoredName": "Modern Sewciety Podcast",
"trackCensoredName": "Modern Sewciety Podcast",
"collectionViewUrl": "https://podcasts.apple.com/us/podcast/modern-sewciety-podcast/id742393907?uo=4",
"feedUrl": "http://www.modernsewciety.com/feed/podcast",
"trackViewUrl": "https://podcasts.apple.com/us/podcast/modern-sewciety-podcast/id742393907?uo=4",
"artworkUrl30": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/db/f3/1d/dbf31d7f-3fae-84a6-c4b6-91adebc2394d/mza_4725542195606703711.jpg/30x30bb.jpg",
"artworkUrl60": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/db/f3/1d/dbf31d7f-3fae-84a6-c4b6-91adebc2394d/mza_4725542195606703711.jpg/60x60bb.jpg",
"artworkUrl100": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/db/f3/1d/dbf31d7f-3fae-84a6-c4b6-91adebc2394d/mza_4725542195606703711.jpg/100x100bb.jpg",
"collectionPrice": 0.00,
"trackPrice": 0.00,
"trackRentalPrice": 0,
"collectionHdPrice": 0,
"trackHdPrice": 0,
"trackHdRentalPrice": 0,
"releaseDate": "2020-03-08T16:39:00Z",
"collectionExplicitness": "cleaned",
"trackExplicitness": "cleaned",
"trackCount": 50,
"country": "USA",
"currency": "USD",
"primaryGenreName": "Design",
"contentAdvisoryRating": "Clean",
"artworkUrl600": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/db/f3/1d/dbf31d7f-3fae-84a6-c4b6-91adebc2394d/mza_4725542195606703711.jpg/600x600bb.jpg",
"genreIds": [
"1402",
"26",
"1301"
],
"genres": [
"Design",
"Podcasts",
"Arts"
]
}
]
}
You would want the releaseDate field, which is pulled from the podcast's latest episode.
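Outside of Sheets, pulling that field out of the response takes only a few lines. A minimal Python sketch that parses a trimmed copy of the JSON above rather than calling the endpoint; the function name is made up, and it assumes releaseDate keeps the UTC "Z" format shown.

```python
import json
from datetime import datetime, timezone


def latest_release(lookup_json):
    """Return releaseDate of the first result as an aware datetime, or None."""
    payload = json.loads(lookup_json)
    results = payload.get("results") or []
    if not results:
        return None
    stamp = results[0]["releaseDate"]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Trimmed copy of the lookup response shown above.
sample = (
    '{"resultCount": 1, "results": [{"collectionId": 742393907, '
    '"collectionName": "Modern Sewciety Podcast", '
    '"releaseDate": "2020-03-08T16:39:00Z"}]}'
)

print(latest_release(sample).date().isoformat())  # 2020-03-08
```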
I am getting JSON data from a web service and trying to store it in Core Data with Magical Record. I read the great post (and only documentation?) "Importing data made easy" by Saul Mora, but I still do not really understand what I need to do to get all the data into my entities.
Here is the JSON the web service returns:
{
"ApiVersion": 4,
"AvailableFileSystemLibraries": [
{
"Id": 10,
"Name": "Movie Shares",
"Version": "0.5.4.0"
},
{
"Id": 11,
"Name": "Picture Shares",
"Version": "0.5.4.0"
},
{
"Id": 5,
"Name": "Shares",
"Version": "0.5.4.0"
},
{
"Id": 9,
"Name": "Music Shares",
"Version": "0.5.4.0"
}
],
"AvailableMovieLibraries": [
{
"Id": 3,
"Name": "Moving Pictures",
"Version": "0.5.4.0"
},
{
"Id": 7,
"Name": "MyVideo",
"Version": "0.5.4.0"
}
],
"AvailableMusicLibraries": [
{
"Id": 4,
"Name": "MyMusic",
"Version": "0.5.4.0"
}
],
"AvailablePictureLibraries": [
{
"Id": 8,
"Name": "Picture Shares",
"Version": "0.5.4.0"
}
],
"AvailableTvShowLibraries": [
{
"Id": 6,
"Name": "MP-TVSeries",
"Version": "0.5.4.0"
}
],
"DefaultFileSystemLibrary": 5,
"DefaultMovieLibrary": 3,
"DefaultMusicLibrary": 4,
"DefaultPictureLibrary": 0,
"DefaultTvShowLibrary": 6,
"ServiceVersion": "0.5.4"
}
The entities I want to store that data in look like this:
There is also a Server entity with a 1:1 relationship to ServerInfo.
What I want to do:
Store basic data (ApiVersion, ...) in ServerInfo. This I already got to work.
Store each object in AvailableXYLibraries in BackendLibrary (1:n relationship from ServerInfo).
Set type based on the XY part of AvailableXYLibraries, for example "movie" for AvailableMovieLibraries.
Set defaultLibrary to true if this library is referenced by DefaultXYLibrary.
Set providerId to servername + LibraryId as there are multiple servers that can have BackendLibraries with the same numeric ID.
Is this possible with Magical Record? I guess I need to implement some of the import hooks and set some user info keys, but everything I read doesn't really tell me where to set what user info key or implement which method where and how.
I hope this made sense and that you can give me some hints :) Thanks!
The structure of this data is quite a bit different from your Core Data model. What you'll most likely have to do is iterate a bit on the dictionary. That is, there are various collections of library data, e.g. AvailableFileSystemLibraries, AvailableMovieLibraries, etc. You'll have to get the array out of each of those keys, and then map your entities as I described in the article. In order to launch the process, you'll have to call
[BackendLibrary importFromArray:arrayFromDownloadedDictionary];
where the arrayFromDownloadedDictionary is each array in the example dictionary you've posted. Once you give the array to MagicalRecord, and provided the proper field mapping, MagicalRecord will then import and create all the entities for you at that point.
Make sure you map "Id" to BackendLibrary.id, "Name" to BackendLibrary.name, and "Version" to BackendLibrary.version.
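The "iterate a bit on the dictionary" step is language-independent, so here it is sketched in Python rather than Objective-C. It derives type, defaultLibrary and providerId for each library before anything is handed to an importer. The function name, the server name "htpc" and the "name-id" format for providerId are assumptions for illustration, not anything MagicalRecord prescribes.

```python
import re

LIBRARY_KEY = re.compile(r"^Available(\w+)Libraries$")


def backend_libraries(server_info, server_name):
    """Build one flat record per library found under the Available...Libraries keys."""
    records = []
    for key, libraries in server_info.items():
        match = LIBRARY_KEY.match(key)
        if not match:
            continue
        kind = match.group(1)  # e.g. "Movie" from "AvailableMovieLibraries"
        default_id = server_info.get(f"Default{kind}Library")
        for library in libraries:
            records.append({
                "id": library["Id"],
                "name": library["Name"],
                "version": library["Version"],
                "type": kind.lower(),
                "defaultLibrary": library["Id"] == default_id,
                # Hypothetical format: server name plus the numeric library id.
                "providerId": f"{server_name}-{library['Id']}",
            })
    return records


# Trimmed copy of the JSON from the question.
server_info = {
    "ApiVersion": 4,
    "AvailableMovieLibraries": [
        {"Id": 3, "Name": "Moving Pictures", "Version": "0.5.4.0"},
        {"Id": 7, "Name": "MyVideo", "Version": "0.5.4.0"},
    ],
    "AvailablePictureLibraries": [
        {"Id": 8, "Name": "Picture Shares", "Version": "0.5.4.0"},
    ],
    "DefaultMovieLibrary": 3,
    "DefaultPictureLibrary": 0,
}

for record in backend_libraries(server_info, "htpc"):
    print(record)
```

Each resulting list of records corresponds to one arrayFromDownloadedDictionary in the call above, with the extra fields already filled in.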