I am attempting to pull tweets for several users and #hashtags from the Twitter search API. The complexity of the query is causing a 403 error, and I'm wondering what the recommended workaround for this is.
My thought is to query each term individually. So if I have 40 users I want to get tweets for, I will make 40 queries, cache each one, and then pull the data from the cache and display it as one feed.
If there is an alternative method or suggestion, I would greatly appreciate any insight.
I think Twitter tweaks its search complexity algorithm, and the exact details may change over time. I believe the rule of thumb is about 10 logical operations per query; each additional term counts as an OR operation. So you should be okay with around 10 terms per query. You may want to experiment to see how many you can include before hitting the 403, then back off a bit to leave room in case the complexity score on your query changes.
Joe
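A minimal sketch of the batching-and-caching idea in Python, assuming a hypothetical fetch_tweets() helper standing in for whatever Twitter client you use (the stub, term list, and sleep interval are all placeholders, not part of any real Twitter API):

```python
import time

# Hypothetical stub: replace with a real call through your Twitter client
# (e.g. the search endpoint's query parameter). Returns a list of tweet dicts.
def fetch_tweets(query):
    return []

def batched(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

terms = ["#python", "#neo4j", "@someuser"]      # ... up to 40 terms
cache = {}                                      # query string -> tweets

for batch in batched(terms, 10):                # ~10 OR'd terms per query
    query = " OR ".join(batch)
    cache[query] = fetch_tweets(query)
    time.sleep(2)                               # be gentle with rate limits

# Merge the cached batches into one feed, newest first
# (assumes each tweet dict carries a sortable created_at value).
feed = sorted((t for tweets in cache.values() for t in tweets),
              key=lambda t: t["created_at"], reverse=True)
```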
As the title suggests, I'd like to know if API.AI can sustain an exponentially increasing number of requests.
Let's say there are 10k requests per second; will it be able to handle that?
Besides that, is there a good way to store the ever-increasing number of intents, so that the console stays neat and uncluttered?
From what I can see, the only approach is to create multiple agents for different categories so that intents can be easily classified and managed, but how can I choose which agent to use in my bot code?
I'd appreciate it if anyone could share some thoughts on these three questions.
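On the last point, one way to route between several agents from bot code is to keep a map from category to each agent's client access token and pick the token before calling the API. The sketch below assumes API.AI's v1 /query REST endpoint and uses a purely placeholder keyword-based router; the tokens and routing logic are assumptions to be replaced with your own:

```python
import uuid
import requests

# One client access token per agent (values are placeholders).
AGENT_TOKENS = {
    "weather": "WEATHER_AGENT_CLIENT_TOKEN",
    "booking": "BOOKING_AGENT_CLIENT_TOKEN",
}

def pick_agent(user_text):
    """Naive routing stub: pick an agent by keyword (placeholder logic)."""
    return "booking" if "book" in user_text.lower() else "weather"

def ask_api_ai(user_text, session_id):
    agent = pick_agent(user_text)
    resp = requests.post(
        "https://api.api.ai/v1/query",
        params={"v": "20150910"},                       # protocol version (assumed)
        headers={"Authorization": "Bearer " + AGENT_TOKENS[agent]},
        json={"query": user_text, "lang": "en", "sessionId": session_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

print(ask_api_ai("book me a room", str(uuid.uuid4())))
```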
I would like to modify the way Cypher processes the queries sent to it for pattern matching. I have read about execution plans and how Cypher chooses the plan with the fewest operations, which is all good. However, I am looking into implementing a similarity-search feature that lets you specify a query graph to be matched approximately (i.e. close, or similar) rather than exactly. I have seen a few examples of this in theory, and I would like to implement something of this sort for Neo4j, which I am guessing would require changing how the query engine deals with the queries sent to it. Or worse :)
Here are some links that demonstrate the idea
http://www.cs.cmu.edu/~dchau/graphite/graphite.pdf
http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper72.pdf
I am looking for ideas; anything at all related to the topic would be helpful. Thanks in advance.
(:I)<-[:NEEDING_HELP_FROM]-(:YOU)
From my point of view, the better option for you is to create unmanaged extensions, because that lets you add your own custom functionality to the Neo4j server.
You are not able to extend the Cypher language itself without maintaining your own fork of the source code.
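For illustration, this is roughly how a client would call such an extension once it is deployed. The extension itself has to be written in Java and registered with the server; the /similarity/match mount point and the payload shape below are hypothetical, not a real Neo4j API:

```python
import requests

# Hypothetical endpoint exposed by a custom unmanaged extension; the mount
# point and request/response shape are assumptions for this sketch.
NEO4J_EXTENSION_URL = "http://localhost:7474/similarity/match"

query_graph = {
    "nodes": [{"id": "a", "labels": ["Person"]},
              {"id": "b", "labels": ["Company"]}],
    "edges": [{"from": "a", "to": "b", "type": "WORKS_AT"}],
    "max_edit_distance": 2,        # how "close" a match is allowed to be
}

resp = requests.post(NEO4J_EXTENSION_URL, json=query_graph,
                     auth=("neo4j", "password"), timeout=30)
resp.raise_for_status()
for match in resp.json().get("matches", []):
    print(match)
```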
I'm trying to build a vertical (meta) search engine for a particular industry, something similar to indeed.com (a job search engine) and hotelscombined.com (a hotel search engine). I would like to know how these two search engines build up their search results.
1) Is it by using the APIs of the other websites they serve results from? (This seems odd to me, since some results come from small and primitive sites.)
2) Do the other websites push updates to these search engines? (This also seems odd, for the same reason.)
3) Do they internally understand and maintain a map of each website they serve results from? (If so, they would need to constantly monitor the structure of those sites for changes, which seems error-prone to me.)
4) Any other possibilities?
I don't even know where to start, so any pointers in the right direction are much appreciated (books, tutorials, hints, ideas...).
Thanks
It is mostly a mix of 1 and 3. Ideally, the site will have some sort of API that it exposes and documents. If not, you have to do data scraping: basically, you reverse-engineer their pages. If they fetch results asynchronously via an undocumented API, you can use that API as well (until they make a breaking change). Otherwise, it's simply a matter of pulling the text straight out of the HTML.
I don't know of any more advanced techniques, since I don't do this myself, but several of my acquaintances have gone on to work on mobile apps that need to do this sort of thing with sports scores and such (not for searching, but with the same requirement: get someone else's data into their database). The low-tech "pull it from the HTML until they change the HTML and break everything" approach is standard practice where they work.
2 is possible, but to do it you have to either make business arrangements with every source of data you want to use, or gain enough market presence that everyone wants to upload their data to you.
Also, you don't do this while actually searching (unless you have other constraints, as Charles Duffy points out in his comment). You run a process that regularly goes out, gets all the data it can find, and inserts it into your own database, which you then search. This lets you decouple data gathering from data searching: your search page doesn't have to know about (and handle errors from) the scraper, and the scraper only has to "get all the data" from each source rather than translating queries from your site into searches against every source.
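A rough illustration of that "scheduled scraper feeding your own database" idea, assuming BeautifulSoup for the HTML parsing and an entirely made-up listing page (listings.example.com and the div.job-listing selector are placeholders, since every source's markup differs):

```python
import sqlite3

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

DB = sqlite3.connect("listings.db")
DB.execute("CREATE TABLE IF NOT EXISTS jobs (title TEXT, url TEXT, source TEXT)")

def scrape_example_source():
    """Pull listings straight out of the HTML of one (hypothetical) source."""
    html = requests.get("https://listings.example.com/jobs", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.select("div.job-listing"):   # selector is site-specific
        title = item.select_one("h2").get_text(strip=True)
        url = item.select_one("a")["href"]
        yield title, url

def run_scraper():
    # Run this from cron or a scheduler, completely separate from the search
    # page: gather everything, store it, and let search hit the database.
    for title, url in scrape_example_source():
        DB.execute("INSERT INTO jobs VALUES (?, ?, ?)",
                   (title, url, "listings.example.com"))
    DB.commit()

if __name__ == "__main__":
    run_scraper()
```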
I've created a graph model for a social network and need some concrete advice about the design with regard to scaling. Pardon the n00bness of these questions, but I'm not finding many clear examples out there...
NOTE: the status update and activity nodes/relationships are linked lists, with the newest entries constantly being placed at the top of the list.
Linked lists allow for news-feed generation, but there could be hundreds of records per user. I presume a LIMIT clause isn't sufficient, even though the data is in descending order by date. Do I need a separate linked list that holds only the 10 most recent status/activity updates (constantly replacing the head of that list) to generate the activity feed efficiently, or will one properly sorted list with a LIMIT clause do the job?
These nodes all have properties (JSON data with content, IDs, etc.). How do "global" indexes come into play here, so that I can find, for example, users who like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index; I'm just wondering whether I'm missing part of the picture here.
Security: logins and passwords. I presume a graph database could store them, but I also presume that's a security risk at this point. Would it be better to keep these in Postgres, etc.?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this..
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write Cypher or Gremlin queries that do what you want. Remember that you can traverse edges both forwards and backwards. Given a user, it should always take roughly constant time to pull up the last ten things they did.
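For example, with the official Python driver and a linked-list layout like the one described in the question, the "last ten things a user did" query might look roughly like this; the label, relationship, and property names (User, LATEST_POST, NEXT, StatusUpdate) are assumptions standing in for your actual model:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed model: the user points at the newest update via LATEST_POST, and
# each update points at the previous one via NEXT.
LAST_TEN = """
MATCH path = (u:User {username: $username})-[:LATEST_POST|NEXT*1..10]->(s:StatusUpdate)
RETURN s
ORDER BY length(path)
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(LAST_TEN, username="alice"):
        print(record["s"])
```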
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full-text search for your graph database.
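As a concrete illustration (again with assumed labels, relationship types, and property names), in recent Neo4j versions you would create a schema index on the band's name and then traverse inwards from the band node:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Index the attribute you look bands up by (Band/name are assumptions).
    session.run("CREATE INDEX IF NOT EXISTS FOR (b:Band) ON (b.name)")

    # Find the band node via the index, then traverse to the users who like it.
    result = session.run(
        "MATCH (u:User)-[:LIKES]->(b:Band {name: $name}) RETURN u.username AS name",
        name="Depeche Mode",
    )
    print([record["name"] for record in result])
```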
Part 3.
Learn more about security. The only thing you should be storing is a properly hashed version of the user's password. At that point you'd be fine using any graph database along with good security practices.
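For instance, using the bcrypt library (one common choice; any well-reviewed password-hashing scheme works), you would only ever store the hash, e.g. as a property on the user node:

```python
import bcrypt  # pip install bcrypt

def hash_password(plain: str) -> bytes:
    # Store only this value (e.g. as a property on the User node).
    return bcrypt.hashpw(plain.encode("utf-8"), bcrypt.gensalt())

def check_password(plain: str, stored_hash: bytes) -> bool:
    return bcrypt.checkpw(plain.encode("utf-8"), stored_hash)

stored = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", stored))  # True
print(check_password("wrong guess", stored))                   # False
```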
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.
We run an affiliate program. Users who sign up can earn points when they successfully recruit other users. However, spammers are abusing this program by automatically signing up large numbers of accounts. We want to prevent this by closing down clearly machine-generated accounts. My idea is to write a program that identifies machine-generated account names, or at least selects a subset for manual inspection.
So far, we have found that there are two types of abnormal ids:
The first is that some ids look very similar to others, such as:
wss12345
wss12346
wss12347
test1
test2
...
The second is that some ids look randomly generated, without any rules, such as:
MiDjiSxxiDekiE
NiMjKhJixLy
DAFDAB7643
...
For the first type, I use the Levenshtein (edit) distance. This method can find the kind of ids illustrated in type 1. (I have done this and get good performance.)
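For reference, a minimal version of that first check might look like the sketch below: a plain dynamic-programming edit distance, flagging ids that sit very close to an already-seen id (the threshold of 2 is arbitrary):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def flag_similar(ids, threshold=2):
    """Yield pairs of ids whose edit distance is suspiciously small."""
    seen = []
    for uid in ids:
        for other in seen:
            if levenshtein(uid, other) <= threshold:
                yield uid, other
        seen.append(uid)

print(list(flag_similar(["wss12345", "wss12346", "test1", "alice"])))
```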
For the second type, I can calculate a probability for each id, for example:
id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)
So I can use the probability to filter out the abnormal ids. (Just an idea; I haven't tried it out.)
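A sketch of that character-bigram idea, trained on a list of ids believed to be human (the training list, add-one smoothing, and any threshold you pick are placeholders to be tuned on real data):

```python
import math
from collections import Counter

def train_bigram_model(good_ids):
    """Estimate p(next_char | prev_char) from known-human ids, with add-one smoothing."""
    pair_counts = Counter()
    char_counts = Counter()
    vocab = set()
    for uid in good_ids:
        uid = uid.lower()
        vocab.update(uid)
        for prev, nxt in zip(uid, uid[1:]):
            pair_counts[(prev, nxt)] += 1
            char_counts[prev] += 1
    V = max(len(vocab), 1)

    def log_prob(uid):
        uid = uid.lower()
        score = 0.0
        for prev, nxt in zip(uid, uid[1:]):
            p = (pair_counts[(prev, nxt)] + 1) / (char_counts[prev] + V)
            score += math.log(p)
        # Normalise by length so longer ids are not penalised automatically.
        return score / max(len(uid) - 1, 1)

    return log_prob

log_prob = train_bigram_model(["john_smith", "mary.jones", "dave1987"])
for uid in ["peter_brown", "MiDjiSxxiDekiE"]:
    print(uid, round(log_prob(uid), 2))   # lower score = less human-like
```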
Can anyone give me other suggestions about this topic? How else could I approach this problem? Can you see flaws or omissions in my attempts?
Assuming that these new accounts refer back to the recruiter's ID, I'd look at the rate and/or sheer number of new accounts associated with a given recruiter.
Some analysis of IP addresses or similar may also indicate whether multiple accounts are coming from the same computer.
I'd use a dictionary of words and essentially do the reverse of detecting poor passwords: human user names tend to contain dictionary words or personal names, lack punctuation, avoid repeated characters, be mostly lower case, and so on (see the sketch below).
Sort of going back to 1. above: if a recruiter has an anomalously tight cluster of IDs, judged using the features you've already identified, that would be a good flag. I think this is essentially what #larsmans suggests in the comment directly under the question.
I'd be curious to know whether re-purposing password-checking algorithms (item 3) provides any benefit.
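Something like item 3 can be prototyped in a few lines; the word list, thresholds, and scoring rule below are placeholders and would need tuning against your real data:

```python
import re

# Placeholder word list; in practice load a real dictionary / name list.
COMMON_WORDS = {"test", "john", "mary", "dave", "smith", "happy", "blue", "cat"}

def looks_machine_generated(user_id: str) -> bool:
    uid = user_id.lower()
    has_word = any(w in uid for w in COMMON_WORDS if len(w) >= 3)
    digit_ratio = sum(c.isdigit() for c in user_id) / len(user_id)
    mixed_case_runs = len(re.findall(r"[a-z][A-Z]", user_id))
    # Heuristic "reverse password check": no dictionary word, lots of digits,
    # or erratic case switching all push the id towards "machine-generated".
    score = (not has_word) + (digit_ratio > 0.3) + (mixed_case_runs >= 3)
    return score >= 2

for uid in ["mary.smith82", "MiDjiSxxiDekiE", "DAFDAB7643"]:
    print(uid, looks_machine_generated(uid))
```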
You're not telling us what sort of site you're running, so this is a bit speculative; but consider Stack Overflow as a prime example of successfully promoting good behavior through a user reputation system and weeding out many kinds of unwanted behavior.
A quick, hackish fix might be to progressively deduct from the score as the number of dormant recruit accounts grows, but a more rewarding and compelling fix is to award higher reputation scores for actually contributing to the site's content. However, this depends on the type of site you have; a stock-market tips site, say, obviously works quite differently from a technical discussion forum.