Is AmazonAWS creating non-genuine hits? How can I verify? - asp.net-mvc

I have a site that logs a "hit" whenever a user brings up the detail page for a particular item, by saving a record to a Hits table that captures the date/time and the IP address of the requesting machine, so that admins can see how many hits each item gets. We get random instances where items are being hit multiple times a day, always in pairs: in the data it looks like a user viewed an item, but the site logged their hit twice (same item, same date/time, same IP address, etc.). Most hits are only recorded once, and all my testing has led me to believe the site itself is working correctly.
I've noticed that particular IP addresses are causing the double hits. When I do reverse IP lookups, all the "double hits" are tied to IP addresses that trace back to Amazon AWS in northern Virginia, on the other side of the country. Our site is used locally, and the single hits come from IPs that trace back to local areas.
Is a bot hitting my site from afar? Should I block Amazon AWS in Azure (which is where my site is hosted), or is that going to lock out genuine users? Is there a way I can detect in my code whether a hit is genuine (my site is ASP.NET MVC)? Has anyone faced a similar situation in the past?
Note: This IS relevant to software engineering, because part of the question is asking how I can verify in my code that a hit is genuine!

Basically, what I found out (no thanks to the elitist user who downvoted my question and offered no contribution) is that my hit counter is being inflated by web crawlers. The quick and dirty solution is to add a robots.txt file that blocks crawlers from hitting that page. Of course, that comes with the sacrifice that my client's site will no longer come up when the public does a Google search for the product being offered.
One alternative is the hidden-link method, in which we put a hidden page on the site that no human user would ever access. When a bot hits that page, we record its IP in a "blacklist" table. Then, before the real hit counter logs a hit, it checks the visitor's IP against the blacklist.
Another alternative is to maintain a blacklist of known User-Agent strings used by bots and check each request's User-Agent header against that list to determine whether the visitor is a bot.
Neither of these solutions is 100% reliable, though.
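To make that concrete, here is a rough sketch of how both checks could sit in front of the hit logging in ASP.NET MVC; the blacklist lookup, the repository access and the User-Agent fragments are illustrative assumptions, not a definitive list of crawlers.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;

    // Rough sketch only: "blacklistedIps" would be loaded from the table populated by the
    // hidden honeypot page, and the User-Agent fragments are a sample, not a complete list.
    public static class HitFilter
    {
        private static readonly string[] BotUserAgentFragments =
            { "bot", "crawler", "spider", "slurp", "curl", "python-requests" };

        public static bool LooksLikeBot(HttpRequestBase request, ISet<string> blacklistedIps)
        {
            string userAgent = request.UserAgent ?? string.Empty;
            string ip = request.UserHostAddress ?? string.Empty;

            bool suspiciousAgent = BotUserAgentFragments.Any(f =>
                userAgent.IndexOf(f, StringComparison.OrdinalIgnoreCase) >= 0);
            bool blacklisted = blacklistedIps.Contains(ip);   // IPs caught by the hidden page

            return suspiciousAgent || blacklisted;
        }
    }

    // In the item detail action, the hit is only saved when the request passes the check:
    //     if (!HitFilter.LooksLikeBot(Request, blacklist)) { /* insert the Hits record */ }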
These are fairly adequate responses to my question. Of course, since this is StackExchange (or StackOverflow or StackYourMomma or whatever it is), people are just going to downvote your question and act like you're beneath a response because you didn't follow all the little bull crap rules that come along with being a member of the SE/SO/SYM community.

Related

Requesting input on conceptual ideas for disguising browser history

I am working with a Domestic Violence support organisation to build a website and have been asked to provide a "Quick Exit" function.
The purpose is to enable the user to exit the site quickly without closing the browser. I have seen such buttons on similar sites, and the normal scenario is that they simply cause a Google search page to be shown (easy, but it doesn't hide history).
I am looking for ideas to improve on this function to hide/disguise the history stored in the browser as this is currently a fairly significant flaw with the Quick Exit buttons I've seen to date.
I had a concept but I am looking for input on either fleshing out my concept, or other alternative directions to consider.
My concept was to have two domains: let's call them dv-site.com and decoy-site.com. The former is the source of the domestic violence support information, and the latter is some random content; it could be anything, but let's say weather information for the sake of the conversation.
If a user navigates directly to dv-site.com, the server redirects to decoy-site.com but also attaches a session-specific, or perhaps single-use, query string or similar.
decoy-site.com validates the query string and, if valid, loads dv-site.com within an iframe or something like that, so from the user's perspective they are just looking at dv-site.com, though the domain recorded in history is decoy-site.com.
Links within the iframe-loaded site would similarly be redirected with the same or a new query string.
If a user were to click on the browser history and go directly to decoy-site.com, it would not be able to validate the query string and would just load the decoy site like a normal site, i.e. just showing the weather information that exists on that site.
Domestic violence is a serious systemic issue and I would love some input from anyone who has more technical knowledge than I do on fleshing out this concept.
Other aspects I am unsure how to tackle:
ensuring that dv-site.com can get crawled and ranked by search engines, even though users are all redirected, as it is imperative that it appears in search results so it can be found
technical aspects of a redirect that does not appear in history.
I'm unsure if it's possible to do this without all content and engagement being attributed to the decoy site.
For the redirect, I believe that HTTP redirects do not get stored in history; you can use a 302 redirect for that. HTTP has a Set-Cookie header that lets you record a cookie; coupled with those headers, you can give the decoy site access without recording it in history. Then delete the cookie.
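The question doesn't name a platform, so purely as an illustration (sketched in ASP.NET MVC to match the rest of this page), the redirect plus single-use token could look roughly like this; the OneTimeTokenStore, view names and cookie name are placeholders, and the cookie is issued by the decoy domain itself, since cookies don't cross domains.

    using System;
    using System.Collections.Generic;
    using System.Web;
    using System.Web.Mvc;

    // Hypothetical single-use token store (in-memory for the sketch; a real setup would
    // need storage reachable by both domains).
    public static class OneTimeTokenStore
    {
        private static readonly HashSet<string> Tokens = new HashSet<string>();
        public static void Save(string token) { lock (Tokens) Tokens.Add(token); }
        public static bool Consume(string token) { lock (Tokens) return Tokens.Remove(token); }
    }

    // On dv-site.com: answer a direct visit with a 302 to the decoy domain, carrying a
    // single-use token in the query string.
    public class EntryController : Controller
    {
        public ActionResult Index()
        {
            string token = Guid.NewGuid().ToString("N");
            OneTimeTokenStore.Save(token);

            // 302 Found: the browser ends up with decoy-site.com in its history.
            return Redirect("https://decoy-site.com/?t=" + token);
        }
    }

    // On decoy-site.com: validate and consume the token, then hand out the decoy site's own
    // short-lived cookie so the framed requests can be authorised without showing the token again.
    public class DecoyController : Controller
    {
        public ActionResult Index(string t)
        {
            if (!string.IsNullOrEmpty(t) && OneTimeTokenStore.Consume(t))
            {
                Response.Cookies.Add(new HttpCookie("dv_access", "1")
                {
                    HttpOnly = true,
                    Expires = DateTime.UtcNow.AddHours(1)
                });
                return View("SupportFrame");   // page that iframes the support content
            }
            return View("Weather");            // plain decoy content
        }
    }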
As far as PageRank goes, you could add a line to robots.txt as described here (the last point) to force the bot to crawl using a query parameter. Then, in the backend, return the dv site only if that parameter is passed, and otherwise redirect. If Googlebot removes query parameters when publishing results, it will work out; otherwise, it might fail.
Best of luck.

How to track my site's URLs across the web? Not using Google Analytics

Sorry for the simple question but I've searched through the forums for 2 hours.
Goal:
I want to track my website's URLs across the web. I'm currently using Google Analytics for tracking, which is fine for me. But I want to show my users where their links are being clicked when they log in to their account.
What's the best way I can do this? I'm using a PHP backend, if that helps. My goal is to provide: how many times their links are clicked, and where each URL was clicked from.
Edit: I was hoping somebody else would give you a better hint, but since the question doesn't seem to attract many people I'll try to give you a better answer.
So, breaking down your problem into pieces, you want to trigger a piece of code every time a page is accessed. You should be able to do a mapping between the page and its owner at one point, when rendering the statistics. If you want to "cut corners" and have a simple solution, you simply insert into your database a new entry describing the event. You seem to be interested only in the number of times it was accessed, so for a specific pageID you can count how many entries there are; this data can be computed directly from your database.
Now, the interesting part: figuring out where the visitor comes from. If you take a look at the $_SERVER variable for PHP, documented here, you will see that it contains a lot of interesting information. I think you're mostly interested in 'HTTP_REFERER', which contains:
The address of the page (if any) which referred the user agent to the current page. This is set by the user agent. Not all user agents will set this, and some provide the ability to modify HTTP_REFERER as a feature. In short, it cannot really be trusted.
Looking quickly on Stack Overflow, I didn't find a more trustworthy source, so when it's available I guess that's the only data you can use.
If you want to make things a bit more robust, you could log the IP address of the visitor and only count 1 visit per day.
So, to recapitulate: you create a table that logs the visits, containing the pageID and referer (and optionally the IP address and date, to prevent multiple insertions on the same day from the same user if you want to be strict). You add a function call to each page that records the pageID (maybe the URL, or an ID you use internally) and the value of $_SERVER['HTTP_REFERER']. Then, when a user wants to see their stats, you look up every pageID that belongs to them and pull the count of entries for that page along with all the referers.
I hope my explanations are clear. It's definitely not bulletproof, but it's simple and should do the job, and it can be programmed pretty quickly.
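Your backend is PHP, but just to make the shape concrete (sketched in C#/ASP.NET MVC for consistency with the rest of this page), the logging call could look something like this; Request.UrlReferrer is the counterpart of $_SERVER['HTTP_REFERER'], and the Visit/IVisitRepository types are purely illustrative, not an existing library.

    using System;
    using System.Web;

    // Illustrative types only; in PHP this would be a plain INSERT into the visits table.
    public class Visit
    {
        public int PageId { get; set; }
        public string Referer { get; set; }
        public string IpAddress { get; set; }
        public DateTime VisitedOn { get; set; }
    }

    public interface IVisitRepository
    {
        bool ExistsForDay(int pageId, string ipAddress, DateTime day);
        void Add(Visit visit);
    }

    public static class VisitLogger
    {
        public static void Record(int pageId, HttpRequestBase request, IVisitRepository repo)
        {
            var visit = new Visit
            {
                PageId = pageId,
                // The referer header is set by the user agent and cannot really be trusted.
                Referer = request.UrlReferrer == null ? null : request.UrlReferrer.ToString(),
                IpAddress = request.UserHostAddress,
                VisitedOn = DateTime.UtcNow.Date   // date only, to enforce one row per visitor per day
            };

            // The "strict" option from above: skip the insert if this IP already visited today.
            if (!repo.ExistsForDay(pageId, visit.IpAddress, visit.VisitedOn))
                repo.Add(visit);
        }
    }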
Old answer: Piwik (documented here) is pretty similar to Google Analytics. Since you own the database where the data is stored, you can extract all the data you want and manipulate it the way you want, without having to depend on someone else (Google).

Integrating Twitter, Facebook and other services in one single site

I need to develop an application that pulls all the statuses and messages from different services like Twitter and Facebook into my application, and when I post a message it should get posted to all of those services as well. I am using Authlogic for authentication. Can anyone suggest which gems/plug-ins I can use?
I need help with the APIs to get all the tweets/messages displayed in my application, and also with ways to post messages to the corresponding services from my application. Can anyone help me from a design point of view?
Walk through what you'd want to do in your head. Imagine the working site, imagine your webapp working before you start. So your user logs in (handled by authlogic) and sees a textbox called "What are you doing right now?". The user fills in a status message and clicks "post". The status message appears at the top of their previously posted messages.
Start with the easy part. Create a class that posts to two services. Use the twitter gem and rfacebook to post to two already defined services. In the future, you'll want to let the user associate services to their account and you would iterate through the associated services and post the message to each. Once you have this working, you can refactor or polish the UI a bit to round out this feature. I personally would do the "add a social media account to my profile" feature towards the end.
Harder is the reading of the data (strangely enough) because you're going to have to figure out how to store it. You could store nothing but I suspect you'd run into API limits just searching all the time (could design around this). I would keep a little cache of posts associated to the user's social media account. In this way, the data model would look like this:
A user has many social media accounts.
A social media account has many posts. (cache)
Of course, now you need to schedule the caching of the posts. This could be done manually, based on an event (like when they login) or time based. So when the update happens, you load up the posts for that social media account and the user will see the posts the next time they hit the page. For real-time push to the client's browser while they stare at the screen, use faye (non-trivial) and ajax to pull the new posts to the top of the social media stream view.
The time-based one is tricky because you'd either have to have a cron job run or have Rails handle it all with a gem like clockwork. But then you have to leave Rails running. I've also solved this by having a class in /lib do all the work, with a simple web call to kick off the update. But that wasn't in a multi-user use case, so it might not work here. In any case, you'll want some nice reusable code for these problems, since update requests can come from many different sources.
You'll also have to deal with the API limits. When pulling down content from twitter, you won't get everything. That will just have to be known by the user or you'll have to indicate a "break in time" somehow.
The UI should be pretty easy (functionally anyway), because you know which source the post/content is coming from. It'd be easy to throw a little icon next to the post to display which social media site it's coming from.
Anyway, good luck, sounds like a fun project.

voting - stopping abuse from client side - ASP.NET MVC

So I have designed this voting feature which does not let somebody vote for the same article twice in 24 hours. When a person votes, or the server reports that they are still within that 24-hour window, I disable the vote-casting button (and this is all Ajax, by the way).
But what happens when a person closes the browser and comes back, or even just refreshes the page? Obviously they would not be able to cast a vote, because of my algorithm, but they would still succeed in making the call to the server. So if they really wanted, they could keep refreshing the page, clicking on the vote button, and putting unnecessary load on the server. How can I avoid that with some sort of client-side check?
I am using ASP.NET MVC, so session variables are out of question.
Am I being over-concerned by this?
If voting happens only from logged in (known) members then you shouldn't have any problem.
If, on the other hand, everyone can vote then you need to store all user vote events:
timestamp
poll
poll_vote
ip
user agent
user uniqueness cookie
So you'll need a random hash sent out as a cookie. This will ensure that you don't accept another vote for the same poll from the same person.
If the user deletes his cookies, you fall back to plan B, where you don't allow more than (say) 10 votes from the same IP and user agent combination in 24 hours.
The system is not perfect since users can change IPs and (more easily) user agents. You'd need advanced pattern detection algorithms to detect suspicious votes. The good thing about storing all user vote events is that you can process these later on using a scheduler, or outsource the votes to someone else who can process them for you.
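As a rough illustration of that cookie-first, IP-plus-user-agent-fallback check (the repository interface, cookie name and 10-vote threshold are assumptions made for the sketch):

    using System;
    using System.Web;

    // Sketch only; IVoteEventRepository stands in for whatever stores the vote events listed above.
    public interface IVoteEventRepository
    {
        bool HasVoted(int pollId, string voterHash);
        int CountVotes(int pollId, string ipAddress, string userAgent, DateTime since);
    }

    public static class VoteGuard
    {
        private const int MaxVotesPerIpAndAgentPerDay = 10;

        public static bool CanVote(int pollId, HttpRequestBase request, IVoteEventRepository repo)
        {
            // Plan A: the random uniqueness hash previously handed out as a cookie.
            HttpCookie cookie = request.Cookies["voter_id"];
            if (cookie != null && repo.HasVoted(pollId, cookie.Value))
                return false;

            // Plan B: cookie missing or deleted - rate-limit by IP + user agent for 24 hours.
            DateTime since = DateTime.UtcNow.AddHours(-24);
            int recentVotes = repo.CountVotes(
                pollId, request.UserHostAddress, request.UserAgent ?? string.Empty, since);

            return recentVotes < MaxVotesPerIpAndAgentPerDay;
        }
    }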
Good luck
Refreshing is not a problem
If you're doing all this voting using Ajax, refreshing a page won't do anything except load the page using GET.
If you're not using Ajax, you should make sure you return a RedirectToAction/RedirectToRoute action result; that will also help you avoid refresh problems.
How do you recognise users
If you use some sort of user authentication, this re-voting is not a problem. But if your users are plain anonymous, you should store the IP address with each vote. This is how things are usually done, and it makes it possible to avoid session variables as well. Be aware, though, that this technique is not 100% reliable.
Cookies?
You could of course also use cookies with absolute expiration; they'd expire in a day. Advanced users would be able to get around your voting restrictions, but they could get around the other approaches as well. Sessions, by the way, are also based on cookies anyway.
Combination
But if you'd like to make your system as robust as possible, you'll probably use a combination of the above.
The best way would be to track who voted for what and when on the server (probably storing it in a database). In order to do this you must use an authentication system on your site (probably forms authentication) to identify users. Then, every time someone tries to vote, you first check in your data storage whether he has already voted and when, and decide whether to accept the vote or not. This is the most reliable way.
If your site is anonymous (no authentication required to vote), then you could store a persistent cookie on the client computer that lasts for 24 hours and indicates that a vote has already been cast from this computer. Remember, though, that cookies might be disabled or removed and are not a reliable way to identify a given user.
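A minimal sketch of that persistent-cookie approach (the cookie name is made up, and as noted the user can simply delete it):

    using System;
    using System.Web;

    // Sketch only: one cookie per article, expiring 24 hours after the vote.
    public static class VoteCookie
    {
        public static bool HasVotedInLast24Hours(HttpRequestBase request, int articleId)
        {
            return request.Cookies["voted_" + articleId] != null;
        }

        public static void MarkVoted(HttpResponseBase response, int articleId)
        {
            response.Cookies.Add(new HttpCookie("voted_" + articleId, "1")
            {
                HttpOnly = true,
                Expires = DateTime.UtcNow.AddHours(24)   // absolute expiration
            });
        }
    }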
I am using ASP.NET MVC, so session variables are out of question.
Any reason for that? Sessions are perfectly fine in ASP.NET MVC applications. In your case, though, they won't work, because if the user closes the browser he will lose the session.
Obviously they would not be able to cast a vote, because of my algorithm, but they would still succeed in making the call to the server. So if they really wanted, they could keep refreshing the page, clicking on the vote button, and putting unnecessary load on the server.
Automated bots could also put unnecessary load on your server, which is much more of a concern than a single user pressing F5.
If you just want to ensure the user can only vote once per article, then you need to store a Set (i.e. HashSet) of all article IDs that they've already voted on, and check it before allowing the vote.
If you still want a 24-hour limit, then you need to store a Dictionary<articleId, DateTime>; then you can check whether he has already voted for that article and, if he has, when.
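For illustration, the bookkeeping could be as simple as this (in-memory only, per user; a real site would persist it):

    using System;
    using System.Collections.Generic;

    // In-memory sketch of the structures described above, kept per user.
    public class UserVoteHistory
    {
        private readonly HashSet<int> votedArticles = new HashSet<int>();        // articles ever voted on
        private readonly Dictionary<int, DateTime> lastVoteUtc =
            new Dictionary<int, DateTime>();                                     // last vote time per article

        public bool TryVote(int articleId)
        {
            DateTime previous;
            if (lastVoteUtc.TryGetValue(articleId, out previous) &&
                DateTime.UtcNow - previous < TimeSpan.FromHours(24))
            {
                return false;   // already voted on this article within the last 24 hours
            }

            votedArticles.Add(articleId);
            lastVoteUtc[articleId] = DateTime.UtcNow;
            return true;
        }
    }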

Prevent bot from crawling certain areas of site

I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed in search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search results directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated, and would any actions preventing certain directories from being searched mess up my search engine rankings?
Edit: I should add that I'm reading up on the robots.txt protocol, but it relies on co-operation from the web crawler. However, I'd also like to prevent any data-mining users who will ignore the robots.txt file.
I appreciate any help!
You can prevent some malicious clients from hitting your server too heavily by implementing throttling on the server. "Sorry, your IP has made too many requests to this server in the past few minutes. Please try again later." In practice, though, assume that you can't stop a truly malicious user from bypassing any throttling mechanisms that you put in place.
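One way to sketch that in ASP.NET MVC is an action filter that counts requests per IP; the one-minute window, the limit and the in-process dictionary are assumptions (a web farm would need a shared cache).

    using System;
    using System.Collections.Concurrent;
    using System.Web.Mvc;

    // Rough throttling sketch: at most 60 requests per IP per minute, tracked in-process.
    public class ThrottleAttribute : ActionFilterAttribute
    {
        private const int MaxRequestsPerMinute = 60;

        private static readonly ConcurrentDictionary<string, Tuple<DateTime, int>> Counters =
            new ConcurrentDictionary<string, Tuple<DateTime, int>>();

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            string ip = filterContext.HttpContext.Request.UserHostAddress ?? "unknown";
            DateTime now = DateTime.UtcNow;

            Tuple<DateTime, int> entry = Counters.AddOrUpdate(
                ip,
                _ => Tuple.Create(now, 1),
                (_, existing) => now - existing.Item1 > TimeSpan.FromMinutes(1)
                    ? Tuple.Create(now, 1)                          // start a new one-minute window
                    : Tuple.Create(existing.Item1, existing.Item2 + 1));

            if (entry.Item2 > MaxRequestsPerMinute)
            {
                filterContext.Result = new HttpStatusCodeResult(
                    429, "Too many requests from your IP. Please try again later.");
            }
        }
    }

    // Applied by decorating the search-results action with [Throttle].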
Given that, here's the more important question:
Are you comfortable with the information that you're making available for all the world to see? Are your users comfortable with this?
If the answer to those questions is no, then you should be ensuring that only authorized users are able to see the sensitive information. If the information isn't particularly sensitive but you don't want clients crawling it, throttling is probably a good alternative. Is it even likely that you're going to be crawled anyway? If not, robots.txt should be just fine.
It seems like you have two issues.
The first is a concern about certain data appearing in search results. The second is about malicious or unscrupulous users harvesting user-related data.
The first issue will be covered by appropriate use of a robots.txt file as all the big search engines honour this.
The second issue seems more to do with data privacy. The first question which immediately springs to mind is: If there is user information which people may not want displayed, why are you making it available at all?
What is the privacy policy for such data?
Do users have the ability to control what information is made available?
If the information is potentially sensitive but important to the system could it be restricted so it is only available to logged in users?
Check out the Robots exclusion standard. It's a text file that you put on your site that tells a bot what it can and can't index. You will also want to address what happens if a bot doesn't honour the robots.txt file.
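For example, a minimal robots.txt that asks compliant crawlers to stay out of the results area might look like this (the /search/ path is a placeholder for wherever the results actually live):

    User-agent: *
    Disallow: /search/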
A robots.txt file, as mentioned. If that is not enough, then you can:
Block unknown user agents - hard to maintain, and easy for a bot to forge a browser's user agent (although most legitimate bots won't)
Block unknown IP addresses - not useful for a public site
Require logins
Throttle user connections - tricky to tune, and you will still be disclosing information.
Perhaps use a combination. Either way it is a trade-off: if the public can browse to it, so can a bot. Be sure you don't block and alienate real people in your attempts to block bots.
a few options:
force the user to login to view the content
add a CAPTCHA page before the content
embed content in Flash
load dynamically with JavaScript
