Queries make Twitter-stream application too slow in saving data

I have an application that streams Twitter data into a Neo4j database. The data I store concerns tweets, users, hashtags, and their relationships (a user posts a tweet, a tweet tags hashtags, a user retweets a tweet).
Now, each time I get a new tweet what I do is:
Check if the database already contains the tweet: if so, I update it with the new information (retweet count, like count); otherwise I save it
Check if the database already contains the user: if so, I update it with the new info; otherwise I save it
Check if the database already contains the hashtag: if it doesn't, I add it
And so on, same process for saving the relationships.
Here are the queries:
static String cqlAddTweet = "merge (n:Tweet{tweet_id: {2}}) on create set n.text={1}, n.location={3}, n.likecount={4}, n.retweetcount={5}, n.topic={6}, n.created_at={7} on match set n.likecount={4}, n.retweetcount={5}";
static String cqlAddHT = "merge (n:Hashtag{text:{1}})";
static String cqlHTToTweet = "match (n:Tweet),(m:Hashtag) where n.tweet_id={1} and m.text={2} merge (n)-[:TAGS]->(m)";
static String cqlAddUser = "merge (n:User{user_id:{3}}) on create set n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6} on match set n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6}";
static String cqlUserToTweet = "match (n:User),(m:Tweet) where m.tweet_id={2} and n.user_id={1} merge (n)-[:POSTS]->(m)";
static String cqlUserRetweets = "match (n:Tweet{tweet_id:{1}}), (u:User{user_id:{2}}) create (u)-[:RETWEETS]->(n)";
Since saving the data is very slow, I suppose this system could perform better if I didn't run all these queries, which scan the data each time.
Do you have any suggestions for improving my application?
Thank you, and excuse me in advance if this seems silly.

Make sure you have indexes (or uniqueness constraints, if appropriate) on the following label/property pairs. That will allow your queries to avoid scanning through all nodes with the same label (when starting a query).
:Tweet(tweet_id)
:Hashtag(text)
:User(user_id)
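For example, here is a minimal sketch of the corresponding uniqueness constraints, in the same static-string style as your queries (assuming the Neo4j 3.x CREATE CONSTRAINT syntax; a uniqueness constraint also creates the backing index). Run each once, e.g. at application startup:
static String cqlTweetConstraint = "CREATE CONSTRAINT ON (t:Tweet) ASSERT t.tweet_id IS UNIQUE";
static String cqlHashtagConstraint = "CREATE CONSTRAINT ON (h:Hashtag) ASSERT h.text IS UNIQUE";
static String cqlUserConstraint = "CREATE CONSTRAINT ON (u:User) ASSERT u.user_id IS UNIQUE";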
By the way, a couple of your queries can be simplified (but this should not affect the performance):
static String cqlAddTweet = "MERGE (n:Tweet{tweet_id: {2}}) ON CREATE SET n.text={1}, n.location={3}, n.topic={6}, n.created_at={7} SET n.likecount={4}, n.retweetcount={5}";
static String cqlAddUser = "MERGE (n:User{user_id:{3}}) SET n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6}";

Related

Neo4j cypher query fails with unknown syntax error

I have the following paramObj and dbQuery
paramObj = {
    email: newUser.email,
    mobilenumber: newUser.telephone,
    password: newUser.password,
    category: newUser.category,
    name: newUser.name,
    confirmuid: verificationHash,
    confirmexpire: expiryDate.valueOf(),
    rewardPoints: 0,
    emailconfirmed: 'false',
    paramVehicles: makeVehicleArray,
    paramVehicleProps: vehiclePropsArray
}
dbQuery = `CREATE (user:Person:Owner {email:$email})
SET user += apoc.map.clean(paramObj,
['email','paramVehicles','paramVehiclesProps'],[])
WITH user, $paramVehicles AS vehicles
UNWIND vehicles AS vehicle
MATCH(v:Vehicles {name:vehicle})
CREATE UNIQUE (user)-[r:OWNS {since: timestamp()}]->(v)
RETURN user,r,v`;
Then I tried to execute
commons.session
.run(dbQuery, paramObj)
.then(newUser => {
commons.session.close();
if (!newUser.records[0]) {........
I am getting
Error: {"code":"Neo.ClientError.Statement.SyntaxError","name":"Neo4jError"}
which doesn't direct me anywhere. Can anyone tell me what I am doing wrong here?
This is actually the first time I am using the query format .run(dbQuery, paramObj), but this format is critical to my use case. I am using Neo4j 3.4.5 Community with the APOC plugin installed.
OK, so I followed @inversFalcon's suggestion to test in the browser and came up with the following parameters and query, which closely match the ones above:
:params paramObj:[{ email:"xyz123@abc.com", mobilenumber:"8711231234",password:"password1", category:"Owner",name:"Michaell",vehicles:["Toyota","BMW","Nissan"],vehicleProps: [] }]
and query
PROFILE
CREATE (user:Person:Owner {email:$email})
SET user += apoc.map.clean($paramObj, ["email","vehicles","vehicleProps"],[])
WITH user, $vehicles AS vehicles
UNWIND vehicles AS vehicle
MATCH(v:Vehicles {name:vehicle})
MERGE (user)-[r:OWNS {since: timestamp()}]->(v)
RETURN user,r,v;
Now I get
Neo.ClientError.Statement.TypeError: Can't coerce `List{Map{name -> String("Michaell"), vehicles -> List{String("Toyota"), String("BMW"), String("Nissan")},.......
I also reverted to neo4j 3.2 (re: an earlier post by Mark Needham) and got the same error.
You should try doing an EXPLAIN of the query using the browser to troubleshoot it.
A few of the things I'm seeing here:
You're referring to paramObj, but it's not a parameter (rather, it's the map of parameters you're passing in; it itself is not a parameter you can reference in the query). If you need to reference the entire set of parameters being passed in, then you need to use nested maps: make paramObj a key in the map that you pass as the parameter map, and when you use it in the query, reference it as $paramObj.
CREATE UNIQUE is deprecated; you should use MERGE instead, though be aware that it behaves in a different manner (see the MERGE documentation as well as our knowledge base article explaining some of the easy-to-miss details of how MERGE works).
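To illustrate the nesting, here is a minimal, hypothetical sketch using the Neo4j Java driver for consistency with the other Java examples on this page (the JS driver's session.run(query, params) accepts the same shape):
import org.neo4j.driver.v1.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class NestedParamsSketch {
    public static void main(String[] args) {
        Map<String, Object> paramObj = new HashMap<>(); // the original flat map
        paramObj.put("email", "xyz123@abc.com");        // sample value from the question
        // Nest it under the key "paramObj" so Cypher can reference it as $paramObj
        Map<String, Object> params = Collections.<String, Object>singletonMap("paramObj", paramObj);
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687"); // hypothetical URI
             Session session = driver.session()) {
            session.run("CREATE (u:Person {email: $paramObj.email}) RETURN u", params);
        }
    }
}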
I am not sure what caused the coercion error to disappear, but it did with the same query; I then got an "expected parameter" error, which was fixed by using $paramObj.email, etc. The final query looks like this:
CREATE (user:Person:Owner {email: $paramObj.email})
SET user += apoc.map.clean($paramObj, ["email","vehicles","vehicleProps"],[])
WITH user, $paramObj.vehicles AS vehicles
UNWIND vehicles AS vehicle
MATCH(v:Vehicles {name:vehicle})
MERGE (user)-[r:OWNS {since: timestamp()}]->(v)
RETURN user,r,v;
which fixed my original problem of how to remove properties from a map when using SET += map.

"EntityNotFoundException: Unable to load RELATIONSHIP with id" when saving RelationshipEntity (with huge generated cypher query)

I am using spring-data-neo4j 4.2.0.M1 and neo4j-ogm 2.0.4 with neo4j 3.1.0-M04.
The application is generally working, except for one case where I try to save a collection of modified RelationshipEntities.
The code is something like this:
List<Relationship> updatedRelationships = new ArrayList<>();
for (Relationship relationship : modifiedRelationships)
{
    relationship = relationshipRepository.load(relationship);
    relationship.setValue("value");
    updatedRelationships.add(relationship);
}
relationshipRepository.save(updatedRelationships);
The RelationshipEntity is annotated with @RelationshipEntity and has a few properties in addition to the @StartNode and @EndNode. Only the property mentioned above is changed, though. The RelationshipEntity is loaded inside the loop because I previously noticed lost information (namely the values of other properties) when executing this.
Note that the above-mentioned code is executed for many RelationshipEntities in succession. Each relationship (probably) occurs only once, but start and end nodes probably occur several times. To my knowledge, though, no relationship is deleted.
The exception I get is:
Caused by: org.neo4j.kernel.api.exceptions.EntityNotFoundException: Unable to load RELATIONSHIP with id 20683203.
at org.neo4j.kernel.impl.api.store.DiskLayer.relationshipVisit(DiskLayer.java:432)
at org.neo4j.kernel.impl.api.store.CacheLayer.relationshipVisit(CacheLayer.java:326)
at org.neo4j.kernel.impl.api.StateHandlingStatementOperations.relationshipVisit(StateHandlingStatementOperations.java:1409)
at org.neo4j.kernel.impl.api.ConstraintEnforcingEntityOperations.relationshipVisit(ConstraintEnforcingEntityOperations.java:416)
at org.neo4j.kernel.impl.api.OperationsFacade.relationshipVisit(OperationsFacade.java:493)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.getRelationshipById(GraphDatabaseFacade.java:300)
... 104 common frames omitted
The query that is executed beforehand (which is probably the "save" query) is huge and exceeds the character limit here (something like 200k characters). Apparently the query touches far more relationships than necessary (from a business-logic point of view), since only about 30 entities are actually saved. I would expect the resulting query (or queries, if updates are done per entity) to be rather brief.
2016-08-28 20:16:33,007 I [pool-4-thread-1 ] (EmbeddedRequest.java:155) Request: START r=rel({relIds}) FOREACH (row in filter(row in {rows} where row.relId = id(r)) | SET r += row.props) RETURN ID(r) as ref, ID(r) as id, {type} as type with params {relIds=[13744338, 19099951, 12570789, 12570785, 13744377, ... several hundred more ids omitted ..., 19257048], rows=[omitted]
I've tried to load a relationship with that id directly, but none exists. The same code executes fine for other RelationshipEntities but repeatedly fails for either this or one of a handful of other ones.
Any ideas as to what could cause this or how this can be better debugged?
I think I somehow solved this with the following steps:
Replaced saving the RelationshipEntity with saving the modified NodeEntity
Made such modifying operations sequential (previously they could happen in parallel)
Encapsulated the modifying operation in a transaction
Fixed a bug where the same entity was saved twice in the same transaction (without being changed again in the meantime)
Fetched the entity again at the beginning of the transaction in order to have the latest state available
Since I was pretty much in the dark about this topic until it finally worked, I am not sure whether all of these steps actually helped solve it; it may have been only a subset.
What I can see now, though, is that the huge update queries are smaller (albeit still quite big) and actually seem to contain "real" updates instead of mostly null properties. I assume that previously they didn't really contain an update and were instead overwriting properties with null. That this now works is probably related to the entity being updated before I begin to modify it, and to no other modifying operation being able to run in parallel.
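For reference, here is a minimal sketch of steps 2-5 combined (assuming the relationshipRepository API from the question and a hypothetical Spring service; an illustration, not the poster's actual fix):
import java.util.ArrayList;
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class RelationshipUpdateService {

    private final RelationshipRepository relationshipRepository;

    public RelationshipUpdateService(RelationshipRepository relationshipRepository) {
        this.relationshipRepository = relationshipRepository;
    }

    // One transaction per batch: reload for the latest state, modify, save exactly once.
    // Callers must not run this concurrently for overlapping entities.
    @Transactional
    public void updateValues(List<Relationship> modifiedRelationships) {
        List<Relationship> updatedRelationships = new ArrayList<>();
        for (Relationship relationship : modifiedRelationships) {
            relationship = relationshipRepository.load(relationship); // fresh state
            relationship.setValue("value");
            updatedRelationships.add(relationship);
        }
        relationshipRepository.save(updatedRelationships); // saved once per transaction
    }
}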
I had the same problem. For me it was simply the neo4j-ogm-embedded-driver version I had to include in my pom: the one I defined overrode the one defined in spring-data-neo4j.
If you only need to save the RelationshipEntity, you can use the following snippet:
List<Relationship> updatedRelationships = new ArrayList<>();
for (Relationship relationship : modifiedRelationships)
{
    relationship = relationshipRepository.load(relationship);
    relationship.setValue("value");
    updatedRelationships.add(relationship);
}
relationshipRepository.save(updatedRelationships, 0);
This saves the properties on the RelationshipEntity while ignoring any related entities (save depth 0).

Groovy named queries

I have a domain class like this:
ZZPartAndTeam
String part
String team
A part may have many teams. For example:
part: part1, team: 10
part: part1, team: 20
part: part2, team: 30
How can I query the domain to get all parts which have more than one team?
Result:
part: part1, team: 10
part: part1, team: 20
Thanks.
The HAVING clause is not supported by Hibernate Criteria. A way around this is to use DetachedCriteria.
import org.hibernate.criterion.DetachedCriteria as HDetachedCriteria
import org.hibernate.criterion.Restrictions as HRestrictions
import org.hibernate.criterion.Projections
import org.hibernate.criterion.Property
import org.hibernate.criterion.Subqueries

query: { builder, params ->
    // This query counts the number of teams per part
    HDetachedCriteria innerQry = HDetachedCriteria.forClass(ZZPartAndTeam.class)
    innerQry.setProjection(Projections.projectionList()
        .add(Projections.count('team').as('teamCount'))
    )
    innerQry.add(HRestrictions.eqProperty('part', 'outer.part'))
    // Using innerQry, this criteria returns the parts having more than one team.
    HDetachedCriteria outerQry = HDetachedCriteria.forClass(ZZPartAndTeam.class, 'outer')
    outerQry.setProjection(Projections.projectionList()
        .add(Projections.distinct(Projections.property('part').as('part')))
    )
    // Subqueries.lt(1, innerQry) renders as "1 < (subquery)", i.e. team count > 1
    outerQry.add(Subqueries.lt(1, innerQry))
    builder.addToCriteria(Property.forName('part').in(outerQry))
}

Returning Updated Results from DbSet.SqlQuery

I want to use the following method to flag people in the Person table so that they can be processed. These people must be flagged as "In Process" so that other threads do not operate on the same rows.
In SQL Management Studio the query works as expected. When I call the method in my application I receive the row for the person but with the old status.
Status is one of many navigation properties off of Person and when this query returns it is the only property returned as a proxy object.
// This is how I'm calling it (obvious, I know)
var result = PersonLogic.GetPeopleWaitingInLine(100);
// And here is my method.
public IList<Person> GetPeopleWaitingInLine(int count)
{
    const string query =
        @"UPDATE top(@count) PERSON
        SET PERSON_STATUS_ID = @inProcessStatusId
        OUTPUT INSERTED.PERSON_ID,
               INSERTED.STATUS_ID
        FROM PERSON
        WHERE PERSON_STATUS_ID = @queuedStatusId";
    var queuedStatusId = StatusLogic.GetStatus("Queued").Id;
    var inProcessStatusId = StatusLogic.GetStatus("In Process").Id;
    return Context.People.SqlQuery(query,
        new SqlParameter("count", count),
        new SqlParameter("queuedStatusId", queuedStatusId),
        new SqlParameter("inProcessStatusId", inProcessStatusId)).ToList();
}
// Update: if I refresh the result set then I get the correct results,
// but I'm not sure about this solution since it requires 2 DB calls
Context.ObjectContext().Refresh(RefreshMode.StoreWins, result);
I know it is an old question, but this could help somebody.
It seems you are using a global context for your query. EF is designed to retain cache info, so if you always need fresh data you must use a fresh context to retrieve it, like this:
using (var tmpContext = new Context())
{
    // your query here
}
This creates the context and then disposes of it, so no cache is retained and the next call gets fresh data from the database rather than from the cache.

How to implement pagination when using Amazon DynamoDB in Rails

I want to use Amazon DynamoDB with Rails, but I have not found a way to implement pagination.
I will use AWS::Record::HashModel as the ORM.
This ORM supports limits like this:
People.limit(10).each {|person| ... }
But I could not figure out how to implement the following MySQL query in DynamoDB.
SELECT *
FROM `People`
LIMIT 1 , 30
You issue queries using LIMIT. If the subset returned does not contain the full table, a LastEvaluatedKey value is returned. You use this value as the ExclusiveStartKey in the next query. And so on...
From the DynamoDB Developer Guide.
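For illustration, here is a minimal sketch of that loop (shown with the AWS SDK for Java rather than the Rails ORM from the question; the table name and page size are assumptions):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import java.util.Map;

public class PageThroughPeople {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        Map<String, AttributeValue> startKey = null; // null on the first page
        do {
            ScanResult page = client.scan(new ScanRequest()
                    .withTableName("People")           // hypothetical table
                    .withLimit(30)                     // the "LIMIT" part
                    .withExclusiveStartKey(startKey)); // resume where we left off
            page.getItems().forEach(System.out::println);
            startKey = page.getLastEvaluatedKey();     // null once the table is exhausted
        } while (startKey != null);
    }
}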
You can provide 'page-size' in your query to set the result set size.
The response from DynamoDB contains 'LastEvaluatedKey', which indicates the last key as per the page size. If the response doesn't contain 'LastEvaluatedKey', it means there are no results left to fetch.
Use the 'LastEvaluatedKey' as the 'ExclusiveStartKey' when fetching the next time.
I hope this helps.
DynamoDB Pagination
Here's a simple copy-paste-run proof of concept (Node.js) for stateless forward/reverse navigation with DynamoDB. In summary: each response includes the navigation history, allowing the user to explicitly and consistently request either the next or previous page (while next/prev params exist):
GET /accounts -> first page
GET /accounts?next=A3r0ijKJ8 -> next page
GET /accounts?prev=R4tY69kUI -> previous page
Considerations:
If your ids are large and/or users might do a lot of navigation, the next/prev params might become too large.
Yes, you do have to store the entire reverse path: if you only store the previous page marker (per some other answers), you will only be able to go back one page.
It won't handle changing pageSize midway; consider baking pageSize into the next/prev value.
Base64-encode the next/prev values; you could also encrypt them.
Scans are inefficient; while this suited my current requirement, it won't suit all!
// demo.js
const mockTable = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

const getPagedItems = (pageSize = 5, cursor = {}) => {
    // Parse cursor
    const keys = cursor.next || cursor.prev || [] // fwd first
    let key = keys[keys.length-1] || null // eg ddb's PK

    // Mock query (mimic dynamodb response)
    const Items = mockTable.slice(parseInt(key) || 0, pageSize+key)
    const LastEvaluatedKey = Items[Items.length-1] < mockTable.length
        ? Items[Items.length-1] : null

    // Build response
    const res = {items:Items}
    if (keys.length > 0) // add reverse nav keys (if any)
        res.prev = keys.slice(0, keys.length-1)
    if (LastEvaluatedKey) // add forward nav keys (if any)
        res.next = [...keys, LastEvaluatedKey]
    return res
}

// Run test ------------------------------------
const runTest = () => {
    const PAGE_SIZE = 6
    let x = {}, i = 0

    // Page to end
    while (i == 0 || x.next) {
        x = getPagedItems(PAGE_SIZE, {next:x.next})
        console.log(`Page ${++i}: `, x.items)
    }

    // Page back to start
    while (x.prev) {
        x = getPagedItems(PAGE_SIZE, {prev:x.prev})
        console.log(`Page ${--i}: `, x.items)
    }
}
runTest()
I faced a similar problem.
The generic pagination approach is to use a "start index" or "start page" and a "page length". The "ExclusiveStartKey" and "LastEvaluatedKey" based approach is very DynamoDB-specific, and I feel this DynamoDB-specific implementation of pagination should be hidden from the API client/UI.
Also, if the application is serverless, using a service like Lambda, it will not be possible to maintain state on the server. The other side is that the client implementation becomes very complex.
I came up with a different approach, which I think is generic (and not specific to DynamoDB); a code sketch follows at the end of this answer:
When the API client specifies the start index, fetch all the keys from the table and store them in an array.
Find the key in the array for the start index specified by the client.
Use it as the ExclusiveStartKey and fetch the number of records specified by the page length.
If the start index parameter is not present, the above steps are not needed; we don't need to specify the ExclusiveStartKey in the scan operation.
This solution has some drawbacks:
We will need to fetch all the keys when the user requests pagination with a start index.
We will need additional memory to store the ids and the indexes.
There will be additional database scan operations (one or more, to fetch the keys).
But I feel this will be a very easy approach for the clients using our APIs. The backward scan will work seamlessly, and if the user wants to see the "nth" page, this will be possible.
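A minimal sketch of this approach (AWS SDK for Java; the table and key attribute names are hypothetical, and it assumes the scan order is stable between the key scan and the page fetch):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IndexedPagination {
    public static List<Map<String, AttributeValue>> page(
            AmazonDynamoDB client, String table, String keyAttr,
            int startIndex, int pageLength) {
        Map<String, AttributeValue> startKey = null;
        if (startIndex > 0) {
            // Step 1: fetch all keys (key-only projection); bounds checking omitted.
            List<Map<String, AttributeValue>> keys = new ArrayList<>();
            Map<String, AttributeValue> cursor = null;
            do {
                ScanResult r = client.scan(new ScanRequest()
                        .withTableName(table)
                        .withProjectionExpression(keyAttr)
                        .withExclusiveStartKey(cursor));
                keys.addAll(r.getItems());
                cursor = r.getLastEvaluatedKey();
            } while (cursor != null);
            // Step 2: the key just before the start index becomes the ExclusiveStartKey.
            startKey = keys.get(startIndex - 1);
        }
        // Step 3: fetch pageLength records from there.
        return client.scan(new ScanRequest()
                .withTableName(table)
                .withLimit(pageLength)
                .withExclusiveStartKey(startKey)).getItems();
    }
}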
In fact, I faced the same problem and noticed that LastEvaluatedKey and ExclusiveStartKey were not working well, especially when using Scan, so I solved it like this:
GET /?page_no=1&page_size=10 =====> first page
The response will contain the count of records and the first 10 records.
Retry and increase the page number until all records have come back.
The code is below.
PS: I am using Python.
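# 'response' below is assumed to hold the result of a prior full scan,
# e.g. response = dynamodb_client.scan(TableName='People') with boto3.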
first_index = ((page_no-1)*page_size)
second_index = (page_no*page_size)
if second_index > len(response['Items']):
    second_index = len(response['Items'])
return {
    'statusCode': 200,
    'count': response['Count'],
    'response': response['Items'][first_index:second_index]
}
