How to prevent gaming of website rewards for new visitors - reward-system

I'm about to embark on a website build where a company wants to reward new visitors with a gift. The gift has some monetary value, and I'm concerned about the site being gamed. I'm looking for ways to help reduce the chance that any one person can drain the entire gift inventory.
The plans call for an integration with Facebook, so authenticating with your FB credentials will provide at least a bit of confidence that a new visitor is actually a real person (assuming that scripting the creation of 100's of FB accounts and then authenticating with them is no simple task).
However, there is also a requirement to reward new visitors who do not have FB accounts, and this is where I'm looking for ideas. An email verification system by itself won't cut it, because it's extremely easy to obtain countless number of email address (me+1#gmail.com, me+2#gmail.com, etc). I've been told that asking for a credit card number is too much of a barrier.
Are there some fairly solid strategies or services for dealing with situations like this?
EDIT: The "gift" is virtual - like a coupon

Ultimately, this is an uphill, loosing battle. If there will be incentive to beat the system, someone will try and they will eventually succeed. (See for example: every DRM scheme ever implemented.)
That said, there are strategies to reduce the ease of gaming the system.
I wouldn't really consider FB accounts to be that secure. The barrier to creating a new FB account is probably negligibly higher than creating a new webmail account.
Filtering by IP address is bound to be a disaster. There may be thousands of users behind a proxy on a single IP address (cough, AOL), and a scammer could employ a botnet to distribute each account requests to a unique IP. It is likely to be more trouble than it is worth to preemptively block IPs, but you could analyze the requests later—for example, before actually sending the reward—to see if there's lots of suspicious behavior from an IP.
Requiring a credit card number is a good start, but you've already ruled that out. Also consider that one individual can have 10 or more card numbers between actual credit cards, debit cards, and one-time-use card numbers.
Consider sending a verification code via SMS to PSTN numbers. This will cost you some money (a few cents per message), but it also costs a scammer a decent amount of change to acquire a large number of phone numbers to receive those messages. (Depending on the value of your incentive, the cost a prepaid SIM may make it cost-prohibitive.) Of course, if a scammer already has many SMS-receiving PSTN numbers at his disposal, this won't work.

First thing I wonder is if these gifts need to be sent to a physical address. It's easy to spoof 100 email addresses or FB accounts but coming up with 100 clearly unique physical addresses is much harder, obviously.
Of course, You may be giving them an e-coupon or something so address might not be an option.
Once upon a time I wrote a pretty intense anti-gaming script for a contest judging utility. While this was many months of development and is far too complex to describe in great detail, I can outline the basic features of the script:
For one we logged every detail we could when a user applied for the contest. It was pretty easy to catch obvious similarities in accounts by factoring the average time between logins / submissions from a group of criteria (like IP, browser, etc - all things that can be spoofed so by themselves it is unreliable). In addition, I compared account credentials for obvious gaming - like acct1#yahoo.com, acct2#yahoo.com, etc. by using a combination of levenshtein distance which is not solely reliable - as well as a parsing script that broke apart the various details of the credentials and looked for patterns.
Depending on the scores of each test, we assigned a probability of gaming as well as a list of possible account matches. Then it was up to the admins to exclude them from the results.
You could go on for months refining your algorithm and never get it perfect. That's why my script only flagged accounts and did not take any automatic action.

Since you're talking about inventory, can we therefore assume your gift is an actual physical item?
If so, then delivery of the gift will require a physical address for delivery - requiring unique addresses (or, allowing duplicate addresses but flagging those users for manual review) should be a good restriction.
My premise is this: While you can theoretically run a script to create 100s of Facebook or Google accounts, exercising physical control over hundreds of distinct real world delivery locations is a whole different class of problem.

I would suggest a more 'real world' solution in stead of all the security: make it clear that it is one coupon per address. Fysical (delivery and/or payment) address. Then just do as you want, maybe limit it by email or something for the looks of it, but in the end, limit it per real end-user, not per person receiving the coupon.

Related

Cloud Run: real world experience with custom domains and issue with latency

In researching Google's Cloud Run, one issue listed is:
High request latency with custom domains when invoking from some
regions
https://cloud.google.com/run/docs/issues#latency-domains
Unfortunately it does not say which regions, only ones from which it is worse. I assume very high is in the order of many seconds.
So:
a) is it the region the app is in, or the region the user is in? So like if an app is in eu-west1, and the user is in us-east4, does that result in high latency?
b) have people got some real world examples of numbers?
c) is the only way to avoid it (given the need for a custom domain) to proxy the request to it from a CDN? Alas Cloudflare does not let you send a custom hostname, but I assume others will. Though not sure if they would permit e.g. POST requests

Where does raw geoip data come from?

This question is a general version of a more specific question asked here. However, those answers were unusable.
Question: What is the raw source for geoIP data?
Many websites will tell me where my IP is, but they all appear to be using databases from fewer than 5 companies (most are using a database from MaxMind). These companies offer limited free versions of their databases, but I'm trying to determine what they're using for their source data?
I've tried using Linux/Unix commands such as ping, traceroute, dig, whois, etc., but they don't provide predictably accurate information.
Preamble: I believe this is actually a very valid question for SO website as understanding how such things work is important to understanding how such datasets can be used in software. However the answer to this question is rather complex and full of historical remarks.
First - it is worth mentioning that there is NO unified raw geoip data. Such thing just does not exist. Second - the data for this comes from multiple resources and often is not reliable and/or outdated.
To understand how that comes to be one need to know how Internet came into existence and spread around the world. Short summary is below:
IANA is a global [non-profit] organization which manages assignment of IP blocks to regional organizations: https://www.iana.org/numbers This happens upon request and regional organization requests specified block size
Regional organizations may assign those IP blocks to either ISP directly or to country level sub-organizations (who would assign that to ISP then).
ISP assigns IP addresses to local branches etc.
From above you can easily see that:
There is no single body which is responsible for IP block assignment to this or that location
Decisions how to (and whether to) release information about which IP belongs to which location are not taken uniformly and instead each organizations decides how to (and whether do it at all) release that information
All of above creates a whole lot of mess. It takes a lot of dedication and long time to obtain, aggregate and sort this data. And this is why most up-to-date and detailed geoip datasets are commercial commodity.
Whoever takes on a challenge of building their own dataset should be able to obtain this information directly from end users (ISPs), because higher level organizations do not know to which location each IP address will be assigned. Higher level organizations only distribute IP blocks among applicants (and keep some reserve for faster processing) and it is a lowest level organizations who decide which location gets which IP address and they are not obligated to release this information publicly.
UPD:
To start building your own dataset you can begin with this list of blocks and how they are assigned

Twitter Streaming API to follow thousands of users

I'm considering using the Twitter Streaming API (public streams) to keep track of the latest tweets for many users (up to 100k). Despite having read various sources regarding the different rate limits, I still have couple of questions:
According to the documentation: The default access level allows up to 400 track keywords, 5,000 follow userids. What are the best practices to follow more the 5k users. Creating, for example, 20 applications to get 20 different access tokens?
If I follow just one single user, does the rule of thumb "You get about 1% of all tweets" indeed apply? And how does this changes if I add more users up to 5k?
Might using the REST API be a reasonable alternative somehow, e.g., by polling the latest tweets of users on a minute-by-minute basis?
What are the best practices to follow more the 5k users. Creating, for example, 20 applications to get 20 different access tokens?
You don't want to use multiple applications. This response from a mod sums up the situation well. The Twitter Streaming API documentation also specifically calls out devs who attempt to do this:
Each account may create only one standing connection to the public endpoints, and connecting to a public stream more than once with the same account credentials will cause the oldest connection to be disconnected.
Clients which make excessive connection attempts (both successful and unsuccessful) run the risk of having their IP automatically banned.
A rate limit is a rate limit--you can't get more than Twitter allows.
If I follow just one single user, does the rule of thumb "You get about 1% of all tweets" indeed apply? And how does this changes if I add more users up to 5k?
The 1% rule still applies, but it is very unlikely impossible for one user to be responsible for at least 1% of all tweet volume in a given time interval. More users means more tweets, but unless all 5k are very high-volume tweet-ers you shouldn't have a problem.
Might using the REST API be a reasonable alternative somehow, e.g., by polling the latest tweets of users on a minute-by-minute basis?
Interesting idea, but probably not. You're also rate-limited in the Search API. For GET/statuses/user_timeline, the rate limit is 180 queries per 15 minutes. You can only get the tweets for one user with this endpoint, and the regular GET/search/tweets doesn't accept user id as a parameter, so you can't take advantage of that (also 180 query/15 min rate limited).
The Twitter Streaming and REST API overviews are excellent and merit a thorough reading. Tweepy unfortunately has spotty documentation and Twython isn't too much better, but they both leverage the Twitter APIs directly so this will give you a good understanding of how everything works. Good luck!
To get past the 400 keywords and 5k followers, you need to apply for enterprise access.
Basic
400 keywords, 5,000 userids and 25 location boxes
One filter rule on one allowed connection, disconnection required to adjust rule
Enterprise
Up to 250,000 filters per stream, up to 2,048 characters each.
Thousands of rules on a single connection, no disconnection needed to add/remove rules using Rules API
https://developer.twitter.com/en/enterprise

Erlang fault-tolerant application: PA or CA of CAP?

I have already asked a question regarding a simple fault-tolerant soft real-time web application for a pizza delivery shop.
I have gotten really nice comments and answers there, but I disagree in that it is a true web service. Rather than a web service, it is more of a real-time system to accept orders from customers, control the dispatching of these orders and control the vehicles that deliver those orders in real time.
Moreover, unlike a 'true' web service this system is not intended to have many users - it is just a few dispatchers (telephone operators) and a few delivery drivers that will use it (as for now I have no requirement to provide direct access to the service to the actual customers; only the dispatchers and delivery drivers will have the direct access).
Hence this question is a bit more general.
I have found that in order to make a right choice for a NoSQL data storage option for this application first thing that I have to do is to make a choice between CA, PA and CP according to the CAP theorem.
Now, the Building Web Applications with Erlang book says that "while it [Mnesia] is not a SQL database, it is a CA database like a SQL database. It will not handle network partition". The same book says that the CouchDB database is a PA database.
Having that in mind, I think that the very first thing that I need to do with my application is to decide what the 'fault-tolerance' term means regarding to CAP.
The simple requirement that I have is to have the application available 24/7(R1). The other one is that there is no need to scale, the application will have a very modest amount of users (it is probably not possible to have thousands of dispatchers) (R2).
Now, does R1 require the application to provide Consistency, Availability and Partition Tolerance and with what priorities?
What type of data storage option will better handle the following issues:
Providing 24/7 availability for a dispatcher (a person who accepts phone calls from customers and who uses a CRM) to look up customer records and put orders into the system;
Looking up current ongoing served orders and their status (placed, baking, dispatched, delivering, delivered) in real time;
Keep track of all working vehicles' locations and their payloads in real time;
Recover any part of the system after system crash or network crash to continue providing 1,2 and 3;
To sum it up: What kind of Data Storage (CA, PA or CP) will suite the system described above better? What kind of Data Storage will better satisfy the R1 requirement?
For your 24/ requirement you are searching a database with (High) Availability because you want your requests to succeed everytime (even if they are only error results).
A netsplit would bringt your whole system down, when you have no partition tolerance
Consistency is nice to have, but you can only have 2 of 3.
Your best bet will be a PA solution. I highly recomment a solution which has been inspired by Amazon Dynamo. The best known dynamo implementations are riak and couchdb. Riak even allows you to change PA to some other form by tuning the read and write replicas.
First, don't confuse CAP "Availability" with "High Availability". They have nothing to do with each other. The A in CAP simply means "All DB nodes can answer queries". To get High Availability, you must be in multiple data centers, you must have robust documented procedures for maintenance, expansion, etc. None of that depends on your CAP choice.
Second, be realistic about your requirements. A stock-trading application might have a requirement for 100% uptime, because every second of downtime could loose millions of dollars. On the other hand, I'm guessing your pizza joint might loose tens of dollars for every minute it's down. So it doesn't make sense to spend millions trying to keep it up. Try to compute your actual costs.
Third, always evaluate your choice vs mainstream. You could just go CA (MySQL) and quickly fail-over to the slaves when problems happen. Be realistic about the costs (and risks) of building on new technology. If you really expect your system to run for 5 years without downtime, ask for proof that someone else has run that database for 5 years without downtime.
If you go "AP" and have remote people (drivers, etc.) then you'll need to write an app that stores their data on their phone and sends it in the background (with retries). Of course, you could do this regardless of weather your database was CA or AP.
If you want high uptimes, you can either:
Increase MTBF (Mean Time Between Failures) - Buy redundant power supplies, buy dual ethernet cards, etc..
Decrease MTTR (Mean Time To Recovery) - Just make sure when failure happens you can recover quickly. (Fail over to slave)
I've seen people spend tens of thousands of dollars on MTBF, only to be down for 8 hours while they restore their backup. It makes more sense to ensure MTTR is low before attacking MTBF.

Twitter Rate Limiting IP/OAuth Concerns

I have a series of webapps that collects all terms relating to a subject using the Public Streaming API. So far, I've been taking a very, very arduous route of creating a new account for each stream, setting up a new Twitter application on that account, copying the OAuth tokens, spinning up a new EC2 instance, and setting up the stream.
This allows me to have the streams coming from multiple different IPs, OAuth generation is easy with the generator tool when you create an app, and because they are each in different accounts I don't meet any account limits.
I'm wondering if there's anything I can do to speed up the process, specifically in terms of EC2 instances. Can I have a bunch of streams running off the same instance using different accounts?
If you run multiple consumers from a single machine you may be temporarily banned,
repeated bans may end up getting you banned for longer periods.
At least, this happened to me a few times in the past.
What I had found at the time was:
same credentials, same ip -> block/ban
different credentials, same ip -> mostly ok, but banned from time to time
different credentials, different ip -> ok
This was a few years ago, so I am not sure the same is still true, but I'd expect twitter to have tightened the rules, rather than having relaxed them.
(Also, I think you're infringing their ToS)
You should check the new Twitter API version 1.1. It was released a few days ago and many changes were made on how the rates are calculated.
One big change is that the IP is completely ignored now. So you don't have to create many instances anymore (profit!)
From the Twitter dev #episod:
Unlike in API v1, all rate limiting is per user per app -- IP address has no involvement in rate limiting consideration. Rate limits are completely partitioned between applications.

Resources