Ajax Security Question: Supplying Available usernames dynamically - asp.net-mvc

I am designing a simple registration form in ASP.net MVC 1.0
I want to allow the username to be validated while the user is typing (as per the related questions linked to below)
This is all easy enough. But what are the security implications of such a feature?
How do i avoid abuse from people scraping this to determine the list of valid usernames?
some related questions: 1, 2

To prevent against "malicious" activities on some of my internal ajax stuff, I add two GET variables one is the date (usually in epoch) then I take that date add a salt and SHA1 it, and also post that, if the date (when rehashed) does not match the hash then I drop the request otherwise fulfill it.
Of course I do the encryption before the page is rendered and pass the hash & date to the JS. Otherwise it would be meaningless.
The problem with using IP/cookie based limits is that both can be bypassed.
Using a token method with a good, cryptographically strong, salt (say something like one of Steve Gibson's "Perfect Passwords" https://www.grc.com/passwords.htm ) it would take a HUGE amount of time (on the scale of decades) before the method could reliably be predicted and there for ensures a certain amount security.

you could limit the number of requests to maybe 2 per 10 seconds or so (a real user may put in a name that is taken and modify it a bit and try again). kind of like how SO doesn't let you comment more than once every 30 seconds.
if you're really worried about it, you could take a method above and count how many times they tried in a certain time period, and if it goes above a threshold, kick them to another page.

Validated as in: "This username is already taken"? If you limit the number of requests per second it should help

One common way to solve this is simply by adding a delay in the request. If the request is sent to the server, wait 1 (or more) seconds to respond, then respond with the result (if the name is valid or not).
Adding a time barrier doesn't really effect users not trying to scrape, and you have gotten a 60-requests per minute limit for free.

Bulding on the answer provided by UnkwnTech, which is some pretty solid advice.
You could go a step further and make the client perform some of calculation to create the return hash - this could just be some simple arithmatic like subtrating a few numbers, adding the data and multiplying by 2.
The added arithmatic does mean an out-of-box username scraping script is unlikely to work and forces the client into using up greater CPU.

Related

Circumventing negative side effects of default request sizes

I have been using Reactor pretty extensively for a while now.
The biggest caveat I have had coming up multiple times is default request sizes / prefetch.
Take this simple code for example:
Mono.fromCallable(System::currentTimeMillis)
.repeat()
.delayElements(Duration.ofSeconds(1))
.take(5)
.doOnNext(n -> log.info(n.toString()))
.blockLast();
To the eye of someone who might have worked with other reactive libraries before, this piece of code
should log the current timestamp every second for five times.
What really happens is that the same timestamp is returned five times, because delayElements doesn't send one request upstream for every elapsed duration, it sends 32 requests upstream by default, replenishing the number of requested elements as they are consumed.
This wouldn't be a problem if the environment variable for overriding the default prefetch wasn't capped to minimum 8.
This means that if I want to write real reactive code like above, I have to set the prefetch to one in every transformation. That sucks.
Is there a better way?

Stream de-duplication on Dataflow | Running services on Dataflow services

I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive has and we want to remove data with matching within N-hour time windows. A straight-forward approach is to use an external key-store (BigTable or something similar) where we look-up for keys and write if required but our qps is extremely large making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a timewindow so that all data for a user within a time-window falls within the same group and then, in each group, we use a separate key-store service where we look up for duplicates by the key. So, I have a few questions about this approach
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is to whether we can run such other services in each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID too.
The idea seems off what Dataflow is supposed to do but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) Your welcome to run other services on the Dataflow VMs since they are general purpose but note that you will have to contend with resource requirements of the other applications on the host potentially causing out of memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your usecase.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
Window<T>into(Sessions.withGapDuration(Duration.standardMinutes(N hours))));
Each new element will keep extending the length of the window so it won't close until the gap closes. If you also apply an AfterCount speculative trigger of 1 with an AfterWatermark trigger on a downstream GroupByKey. The trigger would fire as soon as it could which would be once it has seen at least one element and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out an element which isn't an early firing based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description, can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have incoming data stream, each item of which can be uniquely identified by their fields. We also know that duplicates occur pretty far apart and for now, we care about those within a 6 hour window. And regarding the volume of data, we have atleast 100K events every second, which span across a million different users - so within this 6 hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
Window<KV<key,T>>into(Sessions.withGapDuration(Duration.standardMinutes(6 hours))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per-key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close and I was worried if it would scale or I would run into any bottle-necks.
[2] Once I perform Sessioning as above, I have already removed the duplicates based on the firings since I will only care about the first firing in each session which already destroys duplicates. I no longer need the RemoveDuplicates transform which I found was a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] going to be a problem, the alternative is to reduce the key for sessioning to be just user-id, session-id ignoring the sequence-id and then, running a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions but still would leave a lot of open sessions (#users * #sessions per user) which can easily run into millions. FWIW, I dont think we can session only by user-id since then the session might never close as different sessions for same user could keep coming in and also determining the session gap in this scenario becomes infeasible.
Hope my problem is a little more clear this time. Please let me know any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
#Override
public java.lang.String apply(EventInfo input) {
return input.getUniqueKey();
}
})
)
.apply(
Window.named("sessioner").<KV<String, EventInfo>>into(
Sessions.withGapDuration(mSessionGap)
)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
);

Omniture: Creating Specific Context Variables

Was wondering if anyone out there can help.......
My company works in the travel industry and one of the product we provide is the function of buying a flight and hotel together.
One of the advantages of this is that sometimes a visitor can save on a hotel if they buy the package together.
What I want to be able to track is the following:
The hotel which has the saving on it (accomodation code); the saving that they will make; the price of the package that they will pay.
I am new to implementing but have been told by a colleague that I can use a context variable.
Would anyone be able to tell me how I should write this please?
Kind Regards
Yaser
Here is the document entry for Context Data Variables
For example, in the custom code section of the on-page code, within s_doPlugins or via some wrapper function that ultimately makes a s.t() or s.tl() call, you would have:
s.contextData['package.code'] = "accommodation code";
s.contextData['package.savings'] = "savings";
s.contextData['package.price'] = "price";
Then in the interface you can go to processing rules and map them to whatever props or eVars you want.
Having said that...processing rules are pretty basic at the moment, and to be honest, it's not really worth it IMO. Firstly, you have to get certified (take an exam and pass) to even access processing rules. It's not that big a deal, but it's IMO a pointless hoop to jump through (tip: if you are going to go ahead and take this step, be sure to study up on more than just processing rules. Despite the fact that the exam/certification is supposed to be about processing rules, there are several questions that have little to nothing to do with them)
2nd, context data doesn't show up in reports by themselves. You must assign the values to actual props/eVars/events through processing rules (or get ClientCare to use them in a vista rule, which is significantly more powerful than a processing rule, but costs lots of money)
3rd, the processing rules are pretty basic. Seriously, you're limited to just simple stuff like straight duping, concatenating values, etc.
4th, processing rules are limited in setting events, and won't let you set the products string. IOW, You can set a basic (counter) event, but not a numeric or currency event (an event with a custom value associated with it). Reason I mention this is because those price and savings values might be good as a numeric or currency event for calculated metrics. Well since you can't set an event as such via processing rules, you'd have to set the events in your page code anyways.
The only real benefit here is if you're simply looking to dupe them into a prop/eVar and that prop/eVar varies from report suite to report suite (which FYI, most people try to keep them consistent across report suites anyways, and people rarely repurpose them).
So if you are already being consistent across multiple report suites (or only have like 1 report suite in the first place), since you're already having to put some code on the site, there's no real incentive to just pop the values in the first place.
I guess the overall point here is that since the overall goal is to get the values into actual props, eVars and possibly events, and processing rules fail on a lot of levels, there's no compelling reason not to just pop them in the first place.

National Weather Service (NOAA) REST API returns nil for parameters of forecast

I am using the NWS REST API as my weather service for an app I am making. I was initially reluctant to use NWS because of its bad documentation, but I couldn't resist as it is offered completely free.
Now that I am trying to use it, I am running into some difficulty. When making a request for multiple days, the minimum temperature appears nil for several days.
(EDIT: As I have been testing the API more I have found that it is not always the minimum temperatures that are nil. It can be a max temp or a precipitation, it seems completely random. If you would like to make test calls using their web interface, you can do so here: http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdBrowserByDay.htm
and here: http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXML.htm)
Here is an example of a request the minimum temperatures are empty: http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdBrowserClientByDay.php?listLatLon=40.863235,-73.714780&format=24%20hourly&numDays=7
Surprisingly, on their website, the minimum temperatures are available:
http://forecast.weather.gov/MapClick.php?textField1=40.83&textField2=-73.70
You'll see under the Minimum temperatures that it is filled with about 5 (sometimes less, it is inconsistent) blank fields that say <value xsi:nil="true"/>
If anybody can help me it would be greatly appreciated, using the NWS API can be a little overwhelming at times.
Thanks,
The nil values, from what I can understand of the documentation, here and here, simply indicate that the data is unavailable.
Without making assumptions about NOAA's data architecture, it's conceivable that the information available via the API may differ from what their website displays.
Missing values are represented by an empty element and xsi:nil=”true” (R2.2.1).
Nil values being returned seems to involve the time period. Notice the difference between the time-layout keys (see section 5.3.2) in 1 in these requests:
k-p24h-n7-1
k-p24h-n6-1
The data times are different.
<layout-key> element
The key is derived using the following convention:
“k” stands for key.
“p24h” implies a data period length of 24 hours.
“n7” means that the number of data times is 7.
“1” is a sequential number used to keep the layout keys unique.
Here, startDate is the factor. Leaving it off includes more time and might account for some requested data not yet being available.
Per documentation:
The beginning day for which you want NDFD data. If the string is empty, the start date is assumed to be the earliest available day in the database. This input is only needed if one wants to shorten the time window data is to be retrieved for (less than entire 7 days worth), e.g. if user wants data for days 2-5.
I'm not experiencing the randomness you mention. The folks on NOAA's Yahoo! Groups forum might be able to tell you more.

Is there any point to using Any() linq expression for optimisation purposes?

I have a MVC application which returns 2 types of Json responses from 2 controller methods; AnyRemindersExist() and GetAllUserReminders(). The first returns a boolean, 2nd returns an array, both wrapped as Json.
I have a JavaScript timer checking for calendar reminders against a user. It makes the first call (AnyRemindersExist) to check whether reminders exist and whether the client should then make the 2nd call.
For example, if the result of the Json response is false from the Any() query, it doesn't then make the 2nd controller action which makes a LINQ select call. If there are reminders that exist, it then goes further and then requests them (making use of the LINQ SELECT).
Imagine a system ramped up where 100-1000s users use the system and on the client, every 30-60 seconds a request comes in to load in the reminders. Does this Any() call help in anyway in reducing load on the server?
If you're always going to get the actual values afterwards, then no - it would make more sense to have fewer requests, and just always give the full results. I very much doubt that returning no results is slower than returning an indication that there are no results.
EDIT: tvanfosson's latest comment (at the time of this writing) is worth promoting:
You can really only tell by measuring and I'd only resort to it IFF the performance of the select only option didn't meet the requirements.
That's the most important thing about performance: the value of a guess is much less than the value of test data.
I would say that it depends on how the underlying queries are translated. If the any call is translated into an indexed lookup when the select (perhaps due to a join to get related data) must do some sort of table scan, then it will save some work in the case when there are no reminders to be found. It will cause a little extra work when there are reminders. It might be useful if the majority of the calls don't result in any results.
In the general case, though, I would just select the data and only try to optimize IF that turns out to not be fast enough. The conditions under which it will actually save effort on the server are pretty narrow and might only apply if you hand-craft the SQL rather than depend on your ORM.
Any only checks to see if there is at least one item in the Collection that is being returned. Versus using something like Count > 0 which counts the total amount of items in the collection then yes this is more optimal.
If your AnyRemindersExist method is operating on a similar principle then not calling a second call to the server would reduce your load.
So you are asking if not doing work the application doesn't need to do would reduce the workload on the server?
Of course. How would this answer every be "yes, doing extra work for no reason won't effect the server load".
It ultimately depends on how much faster the Any check is compared to getting the results and how often it will be false.
If the Any call takes near as long as the select then it pretty
much never makes sense.
If the Any call is much faster than the select but 90% of the
time it's true, then it probably isn't worth it (best case you
get 10% improvement, worst case it's actually more work).
If the Any call is much faster than the select and 90% of the
time it's false, then it probably makes sense to check if there
are any before actually getting results.
So the answer is it depends on your specific scenario. Ultimately you're going to need to measure both the relative performance (on different typical loads, maybe some queries are more intensive than others) as well as the frequency that there are no results to return.
Actually it should almost never make sense to check Any in this case.
If Any returns false then you don't need to grab the results.
However this means it would have returned no results anyway, so
unless your Any check is significantly faster than a select
returning 0 results, there's no added benefit here.
On the other hand, if Any returns true, then you'll need to get the
results anyway, so in this case Any is purely additional work done.

Resources