Iterating through nodes in Neo4j using py2neo - py2neo

I am working on using Neo4j with py2neo to analyze Twitter data. I'm new to all of this, so the question might be pretty basic, but I could not find the answer in any of the documentation.
I have two csv files, one with 100 followers, the other with about 22000 tweets.
For each tweet I have information such as whether it is a reply to another tweet and which other users are mentioned in it.
I want to add followers and tweets as nodes, then use the reply_to and mentions_user fields of the tweets to add relationships between tweets (reply_to) and between tweets and users (mentions).
Adding the nodes works well with batch. However, when I iterate through all Tweets using py2neo to add the relationships, I get OutOfMemoryError: Java heap space.
I'm trying to iterate through the tweets like this:
for tweet in graph.find("Tweet"):
My questions are now:
a) Is there another way in py2neo to iterate through (a lot of) nodes?
b) A little broader: I read in the py2neo documentation it is better to use cypher transactions than batch. Should I do that and could that also help for a)?
Thanks in advance for any help!
KMM

There are certainly ways to load bulk data efficiently, but this particular method (finding all items of a particular "type") does not take advantage of the graph structure of the database and therefore won't scale well.
You can of course increase the Java heap size if this is a one-off and you may get away with it. But your best bet is probably to look into the LOAD CSV operation: http://neo4j.com/docs/stable/query-load-csv.html
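One way to avoid iterating over every Tweet node is to read the relationships straight from the tweets CSV and send them to the database in small batches. The following is only a sketch: the column names (id, reply_to), the Tweet label, the id property and the REPLY_TO relationship type are all assumptions, and the $rows parameter syntax applies to newer Neo4j versions.

```python
import csv
import io
from itertools import islice

# Parameterized Cypher. Label, property and relationship names are assumptions.
REPLY_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (t:Tweet {id: row.tweet_id}) "
    "MATCH (r:Tweet {id: row.reply_to}) "
    "MERGE (t)-[:REPLY_TO]->(r)"
)


def reply_batches(csv_file, batch_size=1000):
    """Yield lists of {tweet_id, reply_to} dicts, skipping tweets that are not replies."""
    reader = csv.DictReader(csv_file)
    rows = (
        {"tweet_id": r["id"], "reply_to": r["reply_to"]}
        for r in reader
        if r.get("reply_to")
    )
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for the tweets CSV (hypothetical columns).
sample = io.StringIO("id,reply_to\n1,\n2,1\n3,2\n")
for batch in reply_batches(sample, batch_size=2):
    print(len(batch), "rows ready to send with REPLY_QUERY")
```

With a recent py2neo release, each batch could then be sent with something like graph.run(REPLY_QUERY, rows=batch). Older py2neo and Neo4j versions use a different call and the {rows} parameter syntax, so check the versions in use. Only one batch is held in memory at a time, and the database looks up tweets by id instead of streaming all of them back to the client.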

Related

Graph API: Get all versions of all OneDrive items in single query

Is there any way to get all the versions of all OneDrive items in my Drive using Graph API? I want a single query to do this.
DriveItemVersion resource type doesn't seem to support this (https://learn.microsoft.com/en-us/graph/api/resources/driveitemversion?view=graph-rest-1.0). It looks like we need a separate query to get versions of each OneDrive item. This is not a very efficient way.
Let me know if there is any workaround/fix for this problem.
This isn't possible and would be extremely inefficient. Just as an example, I have ~100k DriveItems in my OneDrive. Attempting to retrieve all of the items and each of their versions would take an exceedingly long time.
It is far more efficient to retrieve the minimum DriveItem properties you need using a Delta query. You can then process individual DriveItems in batches. Once complete, you can retrieve another Delta and process any files that have changed in the meantime.
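The delta loop described above can be sketched roughly as follows. A delta response carries a value array plus either an @odata.nextLink (more pages) or an @odata.deltaLink (end of this round). The fetch_json callable is a hypothetical stand-in for an authenticated HTTP GET that returns parsed JSON, and the fake pages below exist only to make the sketch runnable.

```python
def walk_delta(fetch_json, start_url):
    """Follow @odata.nextLink pages and return (items, delta_link).

    fetch_json is a caller-supplied function (hypothetical here) that
    takes a URL and returns the parsed JSON body as a dict.
    """
    items = []
    url = start_url
    while url:
        page = fetch_json(url)
        items.extend(page.get("value", []))
        if "@odata.deltaLink" in page:
            # Last page of this round: keep the link for the next sync.
            return items, page["@odata.deltaLink"]
        url = page.get("@odata.nextLink")
    return items, None


# Fake two-page response, standing in for real Graph calls.
fake_pages = {
    "page1": {"value": [{"id": "a"}, {"id": "b"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": "c"}], "@odata.deltaLink": "next-round"},
}
items, delta_link = walk_delta(fake_pages.__getitem__, "page1")
print([item["id"] for item in items], delta_link)
```

In real use the start URL would be a delta endpoint such as https://graph.microsoft.com/v1.0/me/drive/root/delta, and versions would still have to be requested per item (for example via /me/drive/items/{item-id}/versions), ideally only for items that changed since the saved delta link.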
I would also suggest taking another look at your requirements. There are very few scenarios where it makes sense to query every file in a Drive. You shouldn't attempt to apply the same patterns used for local/networked storage to cloud storage solutions (be it OneDrive, Google Drive, DropBox, etc.). They are much more akin to a database of binaries than a file system.

Searching for users using Parse

What would be the better approach to let a user search for other users who use the app (using Parse.com as the backend):
Import all the data in the _User table, then filter it in the app when using the UISearchBar
Query Parse for the search term and load the results into the table view
There is no "right" answer. It depends entirely on how you define "better."
Option 1 likely produces superficially the best user experience, in the sense that filtering a list on the fly looks a lot more responsive. But you have to schedule downloading the user list for when the user isn't already trying to search.
Option 2 is likely more efficient. Less bandwidth, less storage. But the user has to be online to search and you probably can't do a "real time" filter unless you're on a fast network.
There may be other factors also. I didn't want to expose a list of users, for example, so I went for option 2.
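For comparison, the on-device filtering in option 1 amounts to something like the sketch below. It is written in Python purely for illustration (the app itself would do this in its own language), and the function name and sample usernames are made up.

```python
def filter_users(cached_usernames, search_term):
    """Return cached usernames containing the search term, ignoring case."""
    term = search_term.strip().casefold()
    if not term:
        # An empty search box shows the full cached list.
        return list(cached_usernames)
    return [name for name in cached_usernames if term in name.casefold()]


cached = ["Alice", "alina", "Bob", "Malik"]  # hypothetical cached user list
print(filter_users(cached, "ali"))
```

This runs on every keystroke without a network round trip, which is why option 1 feels more responsive, but it only works once the whole user list has been downloaded and kept up to date.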

Visualization of a highly linked graph with neo4j

I'm using Neo4j for a research project and am struggling with a small problem.
The underlying data is a highly linked graph and I'm not able to visualize it well. As you can see in the screenshot, the relationships overlap and I can only ever click the top one for further information. I have already tried two approaches: hiding relationships in the visualized result (Neo4j Browser with Cypher queries) and looking for alternatives to Neo4j's built-in visualization.
My preferred approach would be to just hide relationships from the visualized result. But even queries such as MATCH (a)-[t]->(b) WHERE t.probability > 0.1 RETURN a,b,t return fewer nodes and still display all relationships between these few nodes.
Does anybody know how to hide specific relationships in the result? Or, if that is not possible with Neo4j's built-in solution, a recommendation for an open source or at least free visualization tool would be highly appreciated.
Some info about my graph: it displays a transition map (a Bayesian network) of 10 zones and the probabilities of moving from one zone to another. There are a couple of relationships between each pair of nodes, representing different time intervals. For example: 'Moving from A to B in less than an hour has a probability of 42%'
Neo4j server does an extra query for relationships after it has retrieved the nodes. I'm not aware of an easy way to prevent that.
You could use something that uses a different approach to visualization.
E.g. like my demo app here that uses alchemy.js for visualization:
http://jexp.github.io/cy2neo/
Zonic,
If you click on a node or relationship, you will get a pop-up that has an option to view the graph stylesheet. From the dialog that pops up, you can download the contents, then modify the relationships that you don't wish to see to make the lines and text white. Drag and drop the modified, downloaded .grass file back into the stylesheet dialog, and see if that helps.
You could also try the gephi application and see what that does for you. It's free, and it is focused on visualization.
Grace and peace,
Jim
Maybe you would like to try external applications, as stated in this answer:
neo4j, Sorry! Too many neighbours
Do you mean basic filtering of the relationships, like this...
MATCH (a:Person)-[t:IS_RELATED_TO]->(b:Person) WHERE t.probability > 0.1 RETURN a,b,t
You can hide the extra relationships by turning off auto-complete with the switch in the bottom-right corner. By default Neo4j also fetches and displays relationships between returned nodes, even if they were not part of your query. With auto-complete turned off, Neo4j will only display the relationships returned by the actual query.

iOS Core Data search speed improvement

I am struggling to improve the search speed of my iOS app, which uses Core Data. Can anyone help or suggest alternative solutions to improve my search speed? I've listed the details of my situation below.
Project Details
I am currently creating a data reference app which uses Core Data with a preloaded SQLite database. I want to be able to search on one or more attributes of an entity which could contain over 100000 records and return results quickly.
The best results I have achieved so far (searching is still quite slow, though) come from loading a view with a search display controller and setting the fetch limit (currently 100) on the fetch request of the fetchResultController. I've also used search scopes to simplify the predicates. I do use the 'contains' keyword in my predicates, but I am not sure how to implement the suggestion in session 137 of WWDC 2010, what keywords I should be storing, or how many I should store.
Here is a link to one of my classes,
http://pastebin.com/cHHicc1s
Thank you for your time and help.
Regards
Jing Jing Tao
You may want to normalize an existing attribute into a new attribute, then index it. Remove the "CONTAINS" from your predicate and instead use range comparisons such as >= and <. Also, normalize the search text so that both sides of the comparison match. Apple documents all this in the 'Derived Property' example and in WWDC 2010 session # 118 video.
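The reason >= and < help is that a prefix search over a sorted, normalized attribute becomes a range lookup, which an index can answer without scanning every row. Here is a rough illustration in Python (not Core Data code): the normalize function, the sample names and the use of "\uffff" as an upper bound are all assumptions made for this sketch.

```python
import bisect
import unicodedata


def normalize(text):
    """Strip diacritics and case so stored values and search text compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def prefix_range(sorted_keys, prefix):
    """Return keys k with lower <= k < upper, i.e. all keys starting with prefix."""
    lower = normalize(prefix)
    upper = lower + "\uffff"  # sorts after any ordinary continuation of the prefix
    lo = bisect.bisect_left(sorted_keys, lower)
    hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


names = ["Émile", "emma", "Eric", "Zoë", "Emil"]  # made-up sample data
index = sorted(normalize(n) for n in names)       # plays the role of the indexed attribute
print(prefix_range(index, "EM"))
```

The two binary searches touch only a handful of entries regardless of how many records exist, whereas a CONTAINS predicate has to inspect every value. The trade-off is that this matches prefixes only, not arbitrary substrings.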
If you are doing large searches on attributes, you should create indexes. You can do this in Xcode when you define the model. Click on the entity, and right under where you specify the entity name, you can create additional indexes.
Note, however, that you will incur additional file size overhead, and inserts/deletes will also take a bit more time. But, your searches will be very fast.

how to collect millions of tweets?

I was browsing through fflick, a nicely made app built on top of Twitter. How do they
collect millions of tweets?
accurately (mostly) categorize tweets into positive or negative sentiments?
They probably collect millions of tweets by crawling Twitter with its API, most likely by searching the Streaming API for keywords related to films, or just by searching their own timeline for what their followers have to say about films.
Don't know. Probably using some natural language processing techniques from good old AI textbooks. :-)
2) look for smileys - ;), :), :D, :(
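The smiley idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not how fflick actually works: it only counts the emoticons listed above and ignores everything else in the text.

```python
# Emoticon lists taken from the answer above; the split into
# positive and negative is the obvious reading, not a tested lexicon.
POSITIVE = (";)", ":)", ":D")
NEGATIVE = (":(",)


def emoticon_sentiment(text):
    """Label a tweet by comparing counts of positive and negative emoticons."""
    pos = sum(text.count(e) for e in POSITIVE)
    neg = sum(text.count(e) for e in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"


print(emoticon_sentiment("Loved the film :D :)"))
```

Tweets without emoticons, or with an equal number of each kind, come back as "unknown", which is where the natural language processing techniques mentioned above would have to take over.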
A few places provide the latter as a service now. Check out ViralHeat and Evri:
http://www.viralheat.com/home/features
http://www.readwriteweb.com/archives/sentiment_analysis_is_ramping_up_in_2009.php
