Can API.AI scale well for a large enterprise? - machine-learning

As the title suggests, I'd like to know if API.AI can sustain an exponentially increasing number of requests.
Let's say there are 10k requests per second; will it be able to handle that?
Besides that, is there a good way to store the ever-increasing number of intents, so that the console browser stays neat and uncluttered?
From what I see now, the only approach is to create multiple agents for different categories so that intents can be easily classified and managed, but how can I specify which agent to use in my bot code?
I'd appreciate it if anyone could share some thoughts on these 3 questions.

Related

Can Server broadcast the max number of examples to every client in a train cycle in FL? Is this action an invasion of privacy?

I am training an FL model. I select 5 clients every cycle. I want each client to know the gap between its own number of examples and that of the client with the most examples. Can the server broadcast the maximum number of examples among the 5 clients to the others during this cycle? Is it legal?
In TFF it is definitely possible to broadcast additional information. The API is tff.federated_broadcast, and if you're looking to extend a Federated Averaging algorithm by re-using the simple_fedavg implementation, it could probably be added near here.
Regarding whether something is an invasion of privacy, it may be useful to ask "what do other participants learn about each other?" and "should the information learned be considered sensitive?". A very strict interpretation of privacy might be "other participants learn nothing" and "all information is sensitive".
We can imagine a scenario where the server picks a maximum number of examples to process, not based on any data from the clients, and broadcasts this number to each client. It seems unlikely that the server or other participants would be able to learn anything about an individual participant, or anything sensitive, since the number is not derived from client data.
Alternatively, the server might first learn how many examples each client has, and then broadcast that number back to all the clients. This is definitely sharing something about one client with all other participants. It might also be sensitive, in particular if each client has a different number of examples and that number might now be used to uniquely identify a client.
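To make the second scenario concrete, here is a minimal plain-Python simulation of it. It deliberately does not use TFF; the function name and client ids are hypothetical, and it only illustrates what each party ends up knowing.

```python
def broadcast_max_and_gaps(client_example_counts):
    """Simulate one round: the server learns every count, broadcasts the
    maximum, and each client computes its own gap to that maximum."""
    # Server side: this step requires the server to see each client's count.
    max_examples = max(client_example_counts.values())
    # Client side: each client only needs its own count plus the broadcast value.
    gaps = {cid: max_examples - n for cid, n in client_example_counts.items()}
    return max_examples, gaps


# Hypothetical counts for the 5 clients selected in one cycle.
counts = {"client_a": 120, "client_b": 80, "client_c": 200,
          "client_d": 200, "client_e": 35}
max_examples, gaps = broadcast_max_and_gaps(counts)
```

Note that a client whose gap is 0 learns that it holds the largest dataset in the cycle, and every other client learns a lower bound on someone else's data size, which is exactly the kind of leakage described above.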

how to find the abnormal id from so many ids

We run an affiliate program. Users who sign up can gain points when they successfully recruit other users. However, spammers are abusing this program, and automatically signing up large numbers of accounts. We want to prevent this from happening by closing down clearly machine-generated accounts. My idea for this is to write a program to identify machine-generated account names, or at least select a subset for manual inspection.
So far, we have found that there are two types of abnormal ids:
The first is that some ids look very similar to others, such as:
wss12345
wss12346
wss12347
test1
test2
...
The second is that some ids look randomly generated, without any rules, such as:
MiDjiSxxiDekiE
NiMjKhJixLy
DAFDAB7643
...
For the first type, I use the Levenshtein (edit) distance. This method finds ids like those illustrated in type 1. (I have done this, and it performs well.)
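As an illustration of the edit-distance approach, here is a minimal sketch. The helper near_duplicate_ids, its max_distance threshold and the sample list are assumptions made for the example, not the asker's actual code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_duplicate_ids(ids, max_distance=1):
    """Return every id that is within max_distance edits of another id.
    Compares all pairs, so it is O(n^2) and only suitable for small batches."""
    flagged = set()
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if levenshtein(x, y) <= max_distance:
                flagged.update((x, y))
    return flagged


sample = ["wss12345", "wss12346", "wss12347", "test1", "test2",
          "alice", "bob_smith"]
flagged = near_duplicate_ids(sample)
```

For a large user base the all-pairs comparison becomes expensive; sorting ids or grouping them by a shared prefix before comparing is one way to cut the number of pairs.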
For the second type, I can calculate the probability of an id, like so:
id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)
So I can use the probability to filter out the abnormal ids. (Just an idea; I haven't tried it out.)
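The probability idea can be sketched as a character bigram model trained on ids believed to be human-chosen. The training list, the start marker, the add-one smoothing, the lowercasing and the length normalisation below are all assumptions added for illustration.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-id marker, assumed not to occur in ids


def train_bigram_model(ids):
    """Count character transitions over a list of known-good ids."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = {START}
    for s in ids:
        chars = START + s.lower()
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    # One extra slot in the vocabulary stands in for characters never seen.
    return pair_counts, context_counts, len(alphabet) + 1


def avg_log_prob(s, model):
    """Average log p(cur | prev) over the id, with add-one smoothing.
    Lower (more negative) values mean the id looks less like the training data."""
    pair_counts, context_counts, vocab_size = model
    chars = START + s.lower()
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log(p)
    return total / max(len(chars) - 1, 1)


# Tiny hypothetical training set; a real one would need many more ids.
model = train_bigram_model(["john", "johnny", "anna", "joanna", "nathan"])
human_like = avg_log_prob("john", model)
random_like = avg_log_prob("xqzvk", model)
```

Summing logs is equivalent to multiplying the conditional probabilities but avoids underflow, and dividing by the length keeps long ids from being penalised merely for being long. The cut-off below which an id is flagged would have to be tuned on real data.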
Can anyone give me other suggestions about this topic? How else could I approach this problem? Can you see flaws or omissions in my attempts?
1. Assuming that these new accounts refer back to the recruiter's ID, I'd look at the rate and/or sheer number of new accounts associated with a given recruiter.
2. Some analysis of IP addresses or similar may also indicate whether multiple users are coming from the same computer.
3. I'd use a dictionary of words, and kind of do the reverse of detecting poor passwords -- human user names should have dictionary words, personal names, lack punctuation, not include repeated characters, be mostly lower case etc.
4. Sort of going back to 1. above -- a recruiter with an anomalously tight cluster of IDs, measured using the features you've already identified, would be a good flag. I think that this might be, essentially, @larsmans' comment directly under the question.
I'd be curious to know if re-purposing password checking algorithms (item 3) provides any benefit.
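The suggestion above about looking at the rate of new accounts per recruiter could be sketched as follows. The function name, the (recruiter_id, timestamp) input format and both thresholds are hypothetical choices for the example.

```python
from collections import defaultdict


def flag_bursty_recruiters(signups, window_seconds=3600, max_signups=5):
    """signups: iterable of (recruiter_id, unix_timestamp) pairs.
    Flags recruiters with more than max_signups referrals inside any
    window of window_seconds."""
    by_recruiter = defaultdict(list)
    for recruiter_id, ts in signups:
        by_recruiter[recruiter_id].append(ts)

    flagged = set()
    for recruiter_id, times in by_recruiter.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Shrink the window from the left until it spans at most window_seconds.
            while t - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_signups:
                flagged.add(recruiter_id)
                break
    return flagged


# r1 recruits 10 accounts in 90 seconds; r2 recruits one per day.
signups = [("r1", i * 10) for i in range(10)] + \
          [("r2", i * 86400) for i in range(10)]
bursty = flag_bursty_recruiters(signups)
```

Flagged recruiters are candidates for manual inspection rather than automatic closure, since a legitimate campaign can also produce a burst of signups.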
You're not telling us what sort of site you are running, so this is a bit on the speculative side; but consider Stack Overflow as a prime example of successfully promoting good behavior through the use of a user reputation system, and weeding out many kinds of unwanted behaviors.
A quick, hackish fix might be to progressively deduct from the score as the number of dormant recruited accounts grows, but a more rewarding and compelling fix is to award higher reputation scores for actually contributing to the site's content. However, this depends on the type of site you have; a stock market tips site, say, obviously works quite differently from a technical discussion forum.

Best db engine for building a web app with ranking algorithms

I've got an idea for a new web app which will involve the following:
1.) lots of raw inputs (text values) that will be stored in a db - some of which contribute as signals to a ranking algorithm
2.) data crunching & analysis - a series of scripts will be written which together form an algorithm that will take said raw inputs from 1.) and then store a series of ranking values for these inputs.
Events 1.) and 2.) are independent of each other. Event 2 will probably happen once or twice a day. Event 1 will happen on an ongoing basis.
I initially dabbled with the idea of writing the whole thing in node.js sitting on top of mongodb, as I was curious to try out something new. While I think node.js would be perfect for event 1.), I don't think it will work well for event 2.) outlined above.
I'd also rather keep everything in one domain rather than mixing node.js with something else for step 2.
Does anyone have any recommendations for what stacks work well for computational type web apps?
Should I stick with PHP or Rails/Mysql (which I already have good experience with)?
Is MongoDB/nosql constrained when it comes to computational analysis?
Thanks for your advice,
Ed
There is no reason why node.js wouldn't work.
You would just write two node applications.
One takes input, stores it in the database and renders output,
and the other crunches numbers in its own process and is run once or twice per day.
Of course, if you're doing real number crunching and you need performance, you wouldn't do no. 2 in node/ruby/php. You would do it in Fortran (or maybe C).
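The split into an ingesting application and a separate batch job does not depend on the language. As a minimal sketch of the idea, here it is in Python with an in-memory SQLite database; the table names, the function names and the "sum of signals" ranking are placeholders, not a recommendation for the actual algorithm.

```python
import sqlite3


def init_db(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS inputs "
                 "(id INTEGER PRIMARY KEY, item TEXT, signal REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS rankings "
                 "(item TEXT PRIMARY KEY, score REAL)")


def record_input(conn, item, signal):
    """Event 1: called by the web-facing app whenever a raw input arrives."""
    conn.execute("INSERT INTO inputs (item, signal) VALUES (?, ?)",
                 (item, signal))
    conn.commit()


def recompute_rankings(conn):
    """Event 2: run once or twice a day (e.g. from cron) in its own process."""
    conn.execute("DELETE FROM rankings")
    conn.execute("INSERT INTO rankings (item, score) "
                 "SELECT item, SUM(signal) FROM inputs GROUP BY item")
    conn.commit()


conn = sqlite3.connect(":memory:")
init_db(conn)
for item, signal in [("a", 1.0), ("b", 3.0), ("a", 0.5)]:
    record_input(conn, item, signal)
recompute_rankings(conn)
top = conn.execute(
    "SELECT item, score FROM rankings ORDER BY score DESC").fetchall()
```

Because the two parts only share the database, the batch job can later be rewritten in a faster language without touching the web-facing part.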

How do image hosting sites enforce content policies?

I'm trying to figure out how to best implement a public data hosting service.
How do websites that let users upload pictures enforce their terms of service regarding obscene pictures? Do they use image processing algorithms to flag potential violations (too many skin-colored pixels)? I think Imageshack looks at the websites that their pictures are hotlinked on, and checks for keywords. If it detects anything porn related, then it removes the picture and bans the account. Are there other methods?
Is enforcement largely automated or is it based more on user reports?
I suppose it depends on the scale of your "public data hosting service".
If it's something small, with maybe a couple of hundred pictures per day flowing in, you can moderate them on your own.
If it's a couple of hundred thousand, you'll need a number of human beings sorting the weeds out. That's either a moderator team or the users themselves submitting abuse reports.
Which way to go can depend on your budget and the financial success of your service, as well as on the type of service. If it's something simple like Rapidshare, where one user does not see what another does, the chances that users will see each other's content, notice unacceptable content and report it are small. If it's something very social like Flickr, you can bet that reports will be flowing in.
I suppose you could automate something, but it's an almost impossible task. You can't automatically detect porn. You can't automatically detect images violating copyrights: making fingerprints of copyrighted material in order to compare them with the uploaded content is a real challenge even for companies with resources, like Rapidshare, Youtube and others. For now this kind of work can effectively be done only by humans.
There are also legal issues to it. In some countries the service owner is not liable for what users contribute (well, if he's cooperative enough to delete certain content at request), in others he will get the charges himself for not having premoderated all the incoming content. Also think of this with regard to whatever and wherever you are going to launch.
I don't have links, but while it's certainly a difficult task prone to errors, software to detect improper content does exist. Or at least that's what the Security Manager at NASA told me; whether it was just a means to scare me, I don't know ;-)

How to prevent hackers from scraping our database? [duplicate]

Possible Duplicate:
How do you stop scripters from slamming your website hundreds of times a second?
I am building a web application in RubyOnRails, which is based on a large body of data. The application makes for powerful navigation and intersection of the data, as well as a community model for adding more data.
In that respect one could compare it with StackOverflow.com: a big bunch of data, structured in a fairly simple way.
I intend to offer the content under a CreativeCommons license, but if the site "hits it off", I need to discourage copycats. My biggest fear is screen scraping scripters, not only leeching away the raw data, but also incurring huge usage peaks on my servers.
I wonder if RubyOnRails offers any way to throttle (obviously automated) requests, e.g. to reduce their response-time at the benefit of regular users. Perhaps this requires Apache or Phusion Passenger settings?
EDIT: My target is not to recognize user types, but to reduce responsiveness to overly active users, e.g. to cap the number of requests handled per IP address per unit of time (?)
My suggestion would be to limit any easy iterative navigation of your website, which is the primary way I have seen harvesting programs work. Simple encryption of the id numbers used as GET variables would make strip-mining your info more difficult. You can only try to make getting your information onerous. You won't be able to prevent it completely.
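One lightweight way to keep sequential ids out of URLs is a reversible integer scramble. The sketch below is obfuscation rather than real encryption, and the constants and function names are hypothetical; anyone who learns the constants can reverse it.

```python
MODULUS = 2 ** 32
MULTIPLIER = 1580030173   # hypothetical secret; must be odd to be invertible
XOR_MASK = 0x5BD1E995     # hypothetical secret 32-bit mask
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse, Python 3.8+


def encode_id(n):
    """Map a database id to a scrambled 32-bit token for use in URLs."""
    if not 0 <= n < MODULUS:
        raise ValueError("id out of range")
    return ((n * MULTIPLIER) % MODULUS) ^ XOR_MASK


def decode_id(token):
    """Recover the original database id from a token."""
    return ((token ^ XOR_MASK) * INVERSE) % MODULUS


tokens = [encode_id(n) for n in range(1000)]
round_trip = [decode_id(t) for t in tokens]
```

Consecutive ids no longer map to consecutive URL values, so a scraper cannot simply increment a counter, although this alone does nothing against a crawler that follows links.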
You could present a captcha to the "overly active users", just like SO does when you edit too fast. That should effectively hinder automatic spider like scraping.
You might also want to look into using some Rack middleware to do rate limiting, like this recent article covered for doing API limiting (such as what you'd want at Twitter or similar).
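The per-IP cap that such middleware applies can be sketched as a sliding-window counter. This is a framework-independent Python illustration, not the Rack middleware itself; the class name and parameters are hypothetical, and the state lives in process memory, so it would not be shared between multiple server processes.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock            # injectable so the limiter can be tested
        self._hits = defaultdict(deque)

    def allow(self, ip):
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False              # caller would answer 429 or show a captcha
        hits.append(now)
        return True


# Usage with a fake clock: 3 requests per 10 seconds.
fake_now = [0.0]
limiter = SlidingWindowLimiter(3, 10, clock=lambda: fake_now[0])
results = [limiter.allow("1.2.3.4") for _ in range(4)]
other_client = limiter.allow("5.6.7.8")
fake_now[0] = 10.0
after_window = limiter.allow("1.2.3.4")
```

Instead of rejecting outright, the same check could be used to delay the response or trigger a captcha, which is closer to the "reduce responsiveness" goal in the question.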
I believe all you can do is put up hoops for the user to jump through. Ultimately there is no foolproof way to distinguish a regular user from a bot.
