I have a C# application that downloads multiple tiny files from websites (torrents). Some sites restrict the number of downloads per IP per day.
I make an HttpWebRequest and, if the stream is a valid torrent, I save it to disk.
Is there a way for my C# application to spoof my IP when performing the HttpWebRequest, so that the download will not fail?
I spaced the downloads out to one every 10 minutes, but no luck; I still get blocked eventually.
I have heard that "TOR" can use different IPs, but I don't want the people using my desktop app to have to install the TOR browser separately.
HttpWebResponse resp = null;
try
{
    var req = (HttpWebRequest)WebRequest.Create("http://www.exampe.com/test.torrent");
    req.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    req.Timeout = 30000;
    req.KeepAlive = true;
    resp = (HttpWebResponse)(req.GetResponse());
}
catch (WebException ex)
{
    // GetResponse throws WebException on timeouts and on 4xx/5xx responses
    Console.WriteLine(ex.Message);
}
Any solutions?
To do so, you need to manipulate TCP/IP packets. This means that you need to capture the outgoing packet created by HttpWebRequest and change its source IP to the spoofed one. Keep in mind that the server sends its replies to the source address in the packet, so with a spoofed address the TCP handshake (and therefore the download) cannot complete on your machine.
I found this forum post that seems to cover what you want to do; check it out: http://pcapdotnet.codeplex.com/discussions/349978
As far as I know, you can do it with the Pcap.Net or SharpPcap libraries.
I'm working on a website that loads the live streams of multiple YouTube channels. At first I was trying to figure out a way to do this without using YouTube's API, but I have decided to give in.
To find out whether a channel is live streaming and to get the live stream links, I've been using:
https://www.googleapis.com/youtube/v3/search?part=snippet&channelId={CHANNEL_ID}&eventType=live&maxResults=10&type=video&key={API_KEY}
However, with the minimum quota being 10000 and each search costing 100, I'm only able to do about 100 searches before I exceed my quota limit, which doesn't help at all. I ended up exceeding the quota limit in about 10 minutes. :(
Does anyone know of a better way to figure out whether a channel is currently live streaming and what the live stream links are, using as few quota points as possible?
I want to reload the YouTube data for each user every 3 minutes, save it into a database, and display the information using my own API, to save server resources as well as quota points.
Hopefully someone has a good solution to this problem!
If nothing can be done about the links, just determining whether the user is live without using 100 quota points each time would be a big help.
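To make the quota problem concrete, here is a rough back-of-the-envelope sketch in Python. The two constants are the figures quoted in the question; the function names are made up for illustration.

```python
# Figures quoted in the question: daily quota and cost of one search call.
DAILY_QUOTA = 10_000
SEARCH_COST = 100


def searches_per_day(quota: int = DAILY_QUOTA, cost: int = SEARCH_COST) -> int:
    """Number of search calls that fit into one day's quota."""
    return quota // cost


def units_needed(channels: int, interval_minutes: int, cost: int = SEARCH_COST) -> int:
    """Quota units needed to poll every channel at a fixed interval for 24 hours."""
    polls_per_day = (24 * 60) // interval_minutes
    return channels * polls_per_day * cost


print(searches_per_day())   # 100
print(units_needed(1, 3))   # 48000
```

So polling even a single channel every 3 minutes through the search endpoint would need several times the daily quota, which is why a cheaper check is needed.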
Since the question only specified that Search API quotas should not be used in finding out if the channel is streaming, I thought I would share a sort of work-around method. It might require a bit more work than a simple API call, but it reduces API quota use to practically nothing:
I used a simple Perl GET request to retrieve a YouTube channel's main page. Several unique elements are found in the HTML of a channel page that is streaming live:
The number of live viewers tag, e.g. <li>753 watching</li>.
The LIVE NOW badge tag: <span class="yt-badge yt-badge-live" >Live now</span>.
To ascertain whether a channel is currently streaming live requires a simple match to see if the unique HTML tag is contained in the GET request results. Something like: if ($get_results =~ /$unique_html/) (Perl). Then, an API call can be made only to a channel ID that is actually streaming, in order to obtain the video ID of the stream.
The advantage of this is that you already know the channel is streaming, instead of using thousands of quota points to find out. My test script successfully identifies whether a channel is streaming, by looking in the HTML code for: <span class="yt-badge yt-badge-live" > (note the weird extra spaces in the code from Youtube).
I don't know what language OP is using, or I would help with a basic GET request in that language. I used Perl, and included browser headers, User Agent and cookies, to look like a normal computer visit.
Youtube's robots.txt doesn't seem to forbid crawling a channel's main page, only the community page of a channel.
Let me know what you think about the pros and cons of this method, and please comment with what might be improved rather than disliking if you find a flaw. Thanks, happy coding!
2020 UPDATE
The yt-badge-live tag seems to have been deprecated; it no longer reliably shows whether the channel is streaming. Instead, I now check the HTML for this string:
{"text":" watching"}
If I get a match, it means the page is streaming. (Non-streaming channels don't contain this string.) Again, note the weird extra whitespace. I also escape all the quotation marks since I'm using Perl.
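Since I only posted the idea in Perl terms, here is a minimal Python sketch of the same check. The marker is the string quoted above; the function names and the User-Agent value are my own placeholders, and the page markup may well change again, so treat this as an illustration rather than a stable recipe.

```python
import urllib.request

# Marker string quoted in the update above; YouTube may change it at any time.
LIVE_MARKER = '{"text":" watching"}'


def looks_live(html: str, marker: str = LIVE_MARKER) -> bool:
    """Return True if the page HTML contains the live-stream marker."""
    return marker in html


def fetch_channel_html(channel_id: str) -> str:
    """Fetch a channel's main page (hypothetical helper, performs a real request)."""
    req = urllib.request.Request(
        "https://www.youtube.com/channel/" + channel_id,
        headers={"User-Agent": "Mozilla/5.0"},  # placeholder browser-like header
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (performs a network request):
#   if looks_live(fetch_channel_html("SOME_CHANNEL_ID")):
#       ...only now spend quota on an API call for this channel...
```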
Here are my two suggestions:
Check my answer where I explain how you can retrieve videos from channels that are livestreaming.
Another option is to use the following URL and request it each time you want to check whether there is a livestream.
https://www.youtube.com/channel/<CHANNEL_ID>/live
Where CHANNEL_ID is the ID of the channel you want to check for a livestream [1].
[1] Note that the URL may not work for all channels (that depends on the channel itself).
For example, for the channel ID UC7_YxT-KID8kRbqZo7MyscQ (link to this channel's livestream: https://www.youtube.com/channel/UC4nprx9Vd84-ly7N-1Ce6Og/live), the page shows whether the channel is livestreaming, but for the channel ID UC4nprx9Vd84-ly7N-1Ce6Og the URL shows the channel's main page instead.
Adding to the answer by Bman70, I tried to eliminate the costly search request that is otherwise needed once you know the channel is streaming live. I did this using two indicators in the HTML response of the channel page of a channel that is streaming live.
function findLiveStreamVideoId(channelId, cb){
    $.ajax({
        url: 'https://www.youtube.com/channel/'+channelId,
        type: "GET",
        headers: {
            'Access-Control-Allow-Origin': '*',
            'Accept-Language': 'en-US, en;q=0.5'
        }}).done(function(resp) {
            //one method to find live video
            let n = resp.search(/\{"videoId[\sA-Za-z0-9:"\{\}\]\[,\-_]+BADGE_STYLE_TYPE_LIVE_NOW/i);
            //If found
            if(n>=0){
                let videoId = resp.slice(n+1, resp.indexOf("}",n)-1).split("\":\"")[1]
                return cb(videoId);
            }
            //If not found, then try another method to find live video
            n = resp.search(/https:\/\/i.ytimg.com\/vi\/[A-Za-z0-9\-_]+\/hqdefault_live.jpg/i);
            if (n >= 0){
                let videoId = resp.slice(n,resp.indexOf(".jpg",n)-1).split("/")[4]
                return cb(videoId);
            }
            //No streams found
            return cb(null, "No live streams found");
        }).fail(function() {
            return cb(null, "CORS Request blocked");
        });
}
However, there's a tradeoff. This method confuses a recently ended stream with currently live streams. A workaround for this issue is to get status of the videoId returned from Youtube API (costs a single unit from your quota).
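For anyone doing this server-side, the second indicator (the live thumbnail URL) can be sketched in Python roughly as follows. The regular expression mirrors the thumbnail pattern used above; the function name is my own, and whether the page still contains such URLs is an assumption you would need to verify.

```python
import re
from typing import Optional

# Same thumbnail pattern as in the JavaScript above, with the video ID captured.
LIVE_THUMBNAIL = re.compile(
    r"https://i\.ytimg\.com/vi/([A-Za-z0-9_-]+)/hqdefault_live\.jpg"
)


def find_live_video_id(html: str) -> Optional[str]:
    """Return the video ID from the first live thumbnail URL, or None if absent."""
    match = LIVE_THUMBNAIL.search(html)
    return match.group(1) if match else None
```

The same caveat applies: a recently ended stream may still match, so the returned ID should be confirmed before being treated as live.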
I found the YouTube API to be very restrictive given the cost of the search operation. The accepted answer did not work for me, as I found the string on non-live streams as well. Web scraping with aiohttp and beautifulsoup was not an option, since the better indicators require JavaScript support. Hence I turned to Selenium. I looked for the CSS selector
#info-text
and then searched its text for the string Started streaming or watching now.
To reduce the load on my tiny server, which would otherwise have needed a lot more resources, I moved this check to a Heroku dyno running a small Flask app.
# import flask dependencies
import os
from flask import Flask, request, make_response, jsonify
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
base = "https://www.youtube.com/watch?v={0}"
delay = 3
# initialize the flask app
app = Flask(__name__)
# default route
@app.route("/")
def index():
    return "Hello World!"
# create a route for webhook
@app.route("/islive", methods=["GET", "POST"])
def is_live():
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--remote-debugging-port=9222')
    driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'), chrome_options=chrome_options)
    url = request.args.get("url")
    if "youtube.com" in url:
        video_id = url.split("?v=")[-1]
    else:
        # a bare video id was passed, so build the watch URL from it
        video_id = url
        url = base.format(url)
    print(url)
    response = { "url": url, "is_live": False, "ok": False, "video_id": video_id }
    driver.get(url)
    try:
        element = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#info-text")))
        result = element.text.lower().find("Started streaming".lower())
        if result != -1:
            response["is_live"] = True
        else:
            result = element.text.lower().find("watching now".lower())
            if result != -1:
                response["is_live"] = True
        response["ok"] = True
        return jsonify(response)
    except Exception as e:
        print(e)
        return jsonify(response)
    finally:
        driver.close()
# run the app
if __name__ == "__main__":
    app.run()
You will, however, need to add the following buildpacks in the app's settings:
https://github.com/heroku/heroku-buildpack-google-chrome
https://github.com/heroku/heroku-buildpack-chromedriver
https://github.com/heroku/heroku-buildpack-python
Set the following Config Vars in settings:
CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver
GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome
You can find the supported Python runtimes here, but anything below Python 3.9 should be fine, since Selenium had problems with improper use of the is operator.
I hope YouTube will provide better alternatives than workarounds.
I know this is an old thread, but I thought I'd share my way of checking, which for example returns a status code that you can use in an app.
This is for a single channel, but you could easily wrap it in a foreach.
<?php
#####
$ytchannelID = "UCd0BTXriKLvOs1ANx3puZ3Q";
#####
$ytliveurl = "https://www.youtube.com/channel/".$ytchannelID."/live";
$ytchannelLIVE = '{"text":" watching now"}';
$contents = file_get_contents($ytliveurl);
if ( strpos($contents, $ytchannelLIVE) !== false ){http_response_code(200);} else {http_response_code(201);}
unset($ytliveurl);
?>
Adding onto the other answers here, I use a GET request to https://www.youtube.com/c/<CHANNEL_NAME>/live and then search for "isLive":true (rather than {"text":" watching"})
I am working on a crawler and have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the pages (every page has 1-10 people's profiles as the result of my query; I extract the proper links, go to the next page and do it again). While running my program I hit the following error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors%3DAGH%2BUniversity%2Bof%2BScience%2Band%2BTechnology%26hl%3Dpl%26view_op%3Dsearch_authors&q=CGMSBFMKrI0YiJHfqgUiGQDxp4NLfGBv6zgPSjfyQ9LBi5F-K1EbGwQ
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
I know it is caused by Google's simple protection against robots. How can I improve my connection
Connection connection =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.followRedirects(true);
so that I don't get temporarily banned? I know there is a way to check the response, like this:
Connection.Response response =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.execute();
int statusCode = response.statusCode();
if (statusCode == 200) { ... }
else if (statusCode == 503) { /* do reconnect magic */ }
But what should I do when I get a 503 error? Do I have to use a proxy? A random wait time between connections? I hope there is a better idea than saving my results to a file, doing a manual hard restart of the router and trying again with a new IP :P
You have already provided your own answers...
Do I have to use a proxy?
Of course. You should already have set up a bunch of proxies for your crawling activity.
A random wait time between connections?
Yes. Use a random wait of between 3000 and 5000 ms.
Alternatively, you could use an online captcha-solving service if you hit the URL https://ipv4.google.com/sorry/IndexRedirect.... Don't hit it too often or you'll get banned.
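The random wait can be sketched like this (in Python rather than Java, purely as an illustration). The 3000-5000 ms range is the one suggested above; the function name, the injected fetch callable and the attempt limit are assumptions of mine, not part of Jsoup or any other library.

```python
import random
import time
from typing import Callable, Optional, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 3,
    min_wait_ms: int = 3000,
    max_wait_ms: int = 5000,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url) -> (status, body); on 503 wait a random time and retry.

    Returns the body on status 200, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 503:
            return None  # some other error: retrying will not help
        if attempt < max_attempts:
            # random pause between 3 and 5 seconds before the next attempt
            sleep(random.uniform(min_wait_ms, max_wait_ms) / 1000.0)
    return None
```

Passing fetch and sleep in as parameters keeps the waiting logic separate from the HTTP client, so it can be tried out without touching the network.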
Happy coding :)
How can I stop bad, unidentified bots from crawling my website? Some bad bots whose names do not appear in cPanel's Apache statistics are eating up my website's bandwidth.
I have tried robots.txt (batgap.com/robots.txt) and also blocking with .htaccess, but there is no improvement in bandwidth usage. I don't know the IPs of those bots, so I am unable to block them by IP address. These bots are consuming so much of the site's bandwidth that I need to increase my allowance on the server.
I'm from Incapsula and we deal with bad bots on a regular basis.
We've recently released bot-related research that provides insight into the scope of the problem ( http://www.incapsula.com/the-incapsula-blog/item/225-what-google-doesnt-show-you-31-of-website-traffic-can-harm-your-business ), and in light of this data I have to agree with @Leonard Challis: you simply cannot handle bot protection manually.
Having said that, there are bot protection solutions, even free ones (us included), that can help you with bad bots.
BTW, just as you mentioned, one byproduct of bad bot visits is a loss of bandwidth.
We've recently become aware of just how surprisingly HUGE bot-related bandwidth usage really is.
This is an interesting topic in itself.
We believe that by avoiding bad bot traffic, hosting providers can greatly improve their efficiency (hopefully using this to drop costs or improve services). Once you imagine the social and business implications of this, you can understand the real scope of the bad bot problem, which goes well beyond the immediate damage done.
I block 'bad bots' by using PHP.
I filter by IP address primarily, then by User-Agent secondarily.
I make the 'bad bot' wait for up to 999 seconds, then return a very small web page.
Usually (always) the internet connection times out and zero (0) bytes are returned.
Best of all, I have delayed them for a few minutes before they get to the next victim.
http://gelm.net/How-to-block-Baidu-with-PHP.htm
Unfortunately robots.txt is sometimes ignored by these "bad bots", though if the problem is more things like genuine search engine spiders that you don't want to see, they ought to take it into account. I presume that with cPanel you can get at the web server (Apache) logs? In there you can look for two things: the IP and the User-Agent. You can find the culprits in there and add them to your robots.txt and .htaccess. Note that .htaccess rules denying IP addresses are far better than just relying on robots.txt, because you are taking the choice out of the bot creator's hands.
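As a rough illustration of the log inspection described above, the following Python sketch totals the bytes served per (IP, User-Agent) pair. It assumes the logs are in Apache's "combined" format; the function name and the regular expression are my own and will need adjusting if your log format differs.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumes the "combined" log format:
# IP ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def bandwidth_by_client(lines: Iterable[str]) -> List[Tuple[Tuple[str, str], int]]:
    """Total bytes served per (IP, User-Agent), largest consumers first."""
    totals: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        size = 0 if match.group("bytes") == "-" else int(match.group("bytes"))
        totals[(match.group("ip"), match.group("agent"))] += size
    return totals.most_common()
```

Running it over the access log, for example with open("access.log") as f: bandwidth_by_client(f), puts the heaviest clients at the top of the list, which gives you the IPs and User-Agents to block.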
If you know specific bots which are doing this you should be able to get IP addresses and user-agents from forums, but if it's a more general thing then really I'm afraid it's more of a manual job.
There are other methods that can be used with varying effect, such as mod_security (http://www.askapache.com/htaccess/modsecurity-htaccess-tricks.html) but this will mean you'll have to access your web server configuration.
Finally, you can check the links that point to your web site (using the link: option on Google). Sometimes, if you have links on spammy forums or the like, this can increase the chances of bots coming to get you. Maybe you can look at the referer URL in the Apache logs, but this is all based on a lot of presumptions and you'd probably be lucky if it had a great effect.
Block Unwanted Robots/Spiders visitors via PHP
Instructions:
Place the following PHP Code in the beginning of your index.php file.
The idea here is to place the code in the main site's PHP home page, the main entry point of the site.
If you have other PHP files that are accessed directly via an URL (not including PHP include or require support type files), then place the code in the beginning of those files.
For most PHP sites and PHP CMS sites, the root's index.php file is the file that is the main entry point of the site.
Keep in mind that your site statistics, i.e. AWStats, will still log the hits under Unknown robot (identified by 'bot' followed by a space or one of the following characters _+:,.;/-), but these bots will be blocked from accessing your site's content.
<?php
// ---------------------------------------------------------------------------------------------------------------
// Banned IP Addresses and Bots - Redirects banned visitors who make it past the .htaccess and/or robots.txt files to a URL.
// The $banned_ip_addresses array can contain both full and partial IP addresses, i.e. Full = 123.456.789.101, Partial = 123.456.789. or 123.456. or 123.
// Use partial IP addresses to include all IP addresses that begin with a partial IP addresses. The partial IP addresses must end with a period.
// The $banned_bots, $banned_unknown_bots, and $good_bots arrays should contain keyword strings found within the User Agent string.
// The $banned_unknown_bots array is used to identify unknown robots (identified by 'bot' followed by a space or one of the following characters _+:,.;/\-).
// The $good_bots array contains keyword strings used as exemptions when checking for $banned_unknown_bots. If you do not want to utilize the $good_bots array, i.e.
// $good_bots = array(), then you must remove the keyword strings 'bot.','bot/','bot-' from the $banned_unknown_bots array or else the good bots will also be banned.
$banned_ip_addresses = array('41.','64.79.100.23','5.254.97.75','148.251.236.167','88.180.102.124','62.210.172.77','45.','195.206.253.146');
$banned_bots = array('.ru','AhrefsBot','crawl','crawler','DotBot','linkdex','majestic','meanpath','PageAnalyzer','robot','rogerbot','semalt','SeznamBot','spider');
$banned_unknown_bots = array('bot ','bot_','bot+','bot:','bot,','bot;','bot\\','bot.','bot/','bot-');
$good_bots = array('Google','MSN','bing','Slurp','Yahoo','DuckDuck');
$banned_redirect_url = 'http://english-1329329990.spampoison.com';
// Visitor's IP address and Browser (User Agent)
$ip_address = $_SERVER['REMOTE_ADDR'];
$browser = $_SERVER['HTTP_USER_AGENT'];
// Declared Temporary Variables
$ipfound = $piece = $botfound = $gbotfound = $ubotfound = '';
// Checks for Banned IP Addresses and Bots
if($banned_redirect_url != ''){
    // Checks for Banned IP Address
    if(!empty($banned_ip_addresses)){
        if(in_array($ip_address, $banned_ip_addresses)){$ipfound = 'found';}
        if($ipfound != 'found'){
            $ip_pieces = explode('.', $ip_address);
            foreach ($ip_pieces as $value){
                $piece = $piece.$value.'.';
                if(in_array($piece, $banned_ip_addresses)){$ipfound = 'found'; break;}
            }
        }
        if($ipfound == 'found'){header("location: $banned_redirect_url"); exit();}
    }
    // Checks for Banned Bots
    if(!empty($banned_bots)){
        foreach ($banned_bots as $bbvalue){
            $pos1 = stripos($browser, $bbvalue);
            if($pos1 !== false){$botfound = 'found'; break;}
        }
        if($botfound == 'found'){header("location: $banned_redirect_url"); exit();}
    }
    // Checks for Banned Unknown Bots
    if(!empty($good_bots)){
        foreach ($good_bots as $gbvalue){
            $pos2 = stripos($browser, $gbvalue);
            if($pos2 !== false){$gbotfound = 'found'; break;}
        }
    }
    if($gbotfound != 'found'){
        if(!empty($banned_unknown_bots)){
            foreach ($banned_unknown_bots as $bubvalue){
                $pos3 = stripos($browser, $bubvalue);
                if($pos3 !== false){$ubotfound = 'found'; break;}
            }
            if($ubotfound == 'found'){header("location: $banned_redirect_url"); exit();}
        }
    }
}
// ---------------------------------------------------------------------------------------------------------------
?>
I have a need to store files on Amazon AWS S3, but in order to isolate the user from the AWS authentication I want to go via an ASP page on my site, which the user will be logged into. So:
The application sends the file using the Delphi Indy library's TIdHTTP.Put (FileStream) routine to the ASP page, along with some authentication stuff (mine, not AWS) on the querystring.
The ASP page checks the auth details and then if OK stores the file on S3 using my Amazon account.
Problem I have is: how do I access the data coming in from the Indy PUT using JScript in the ASP page and pass it on to S3. I'm OK with AWS signing, etc, it's just the nuts and bolts of connecting the two bits (the incoming request and the outgoing AWS request) ...
TIA
R
An HTTP PUT will store the file at the location given in the HTTP header: it "requests that the enclosed entity be stored under the supplied Request-URI".
The disadvantage of the PUT method is that, if you are on a shared hosting environment, it may not be available to you.
So if the web server supports PUT, the file should be available at the given location in the (virtual) file system. The PUT request will be handled by the server and not by ASP:
In the case of PUT, the web server handles the request itself: there is no room for a CGI or ASP application to step in.
The only way for your application to capture a PUT is to operate on the low-level, ISAPI filter level
http://www.15seconds.com/issue/981120.htm
Are you sure you need PUT and cannot use a POST, which will send the file to a URL where your ASP script can read it from the request stream?
OK, I've got a bit further with this. The code at the ASP end is:
var PostedDataSize = Request.TotalBytes ;
var PostedData = Request.BinaryRead (PostedDataSize) ;
var PostedDataStream = Server.CreateObject ("ADODB.Stream") ;
PostedDataStream.Open ;
PostedDataStream.Type = 1 ; // binary
PostedDataStream.Write (PostedData) ;
Response.Write ("PostedDataStream.Size = " + PostedDataStream.Size + "<br>") ;
var XML = AmazonAWSPUTRequest (BucketName, AWSDestinationFileID, PostedDataStream) ;
.....
function AmazonAWSPUTRequest (Bucket, Filename, InputStream)
{
....
XMLHttp.open ("PUT", URL + FRequest, false) ;
XMLHttp.setRequestHeader (....
XMLHttp.setRequestHeader (....
...
Response.Write ("InputStream.Size = " + InputStream.Size + "<br>") ;
XMLHttp.send (InputStream) ;
So I use BinaryRead, write it to a binary stream. If I write out the size of the stream I get the size of the file I POST'ed from my application, so I reckon the data is in there somewhere. I then call a routine (with the stream as a parameter) which sets up the AWS authentication/signing and does a PUT.
The AWS call returns no errors and a file of the correct name is created in the right place, but it has a size of zero! InputStream.Size has a value the same as the stream parameter passed to the routine - i.e. the size of the original file.
Any ideas?
POSTSCRIPT. Found the problem. It's caught me a few times with streams, this one. When you write data to a stream, don't forget to reset the stream position back to zero before trying to read from the stream again. I.e. just before the line:
XMLHttp.send (InputStream) ;
I needed to add:
InputStream.Position = 0 ;
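The same pitfall exists in most stream APIs. Here is a minimal Python sketch of it (not the ADODB.Stream API, just an analogous in-memory stream; the function name is invented for illustration):

```python
import io


def read_all_after_write(payload: bytes, rewind: bool) -> bytes:
    """Write payload to an in-memory stream, optionally rewind, then read it all."""
    stream = io.BytesIO()
    stream.write(payload)   # the position is now at the end of the data
    if rewind:
        stream.seek(0)      # equivalent of InputStream.Position = 0
    return stream.read()    # reads from the current position onwards


print(read_all_after_write(b"file contents", rewind=False))  # b''
print(read_all_after_write(b"file contents", rewind=True))   # b'file contents'
```

Without the rewind the read starts at the end of the data and returns nothing, which matches the zero-byte file that ended up on S3.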
My thanks for the interest and suggestions.
So I'm trying to build a real-time monitoring tool for Twitter keywords using TweetSharp. I'm using the search API to run queries every 10-15 seconds. When I make the calls, I only want to collect tweets that have appeared since the previous update.
var twitter = FluentTwitter.CreateRequest().AuthenticateAs("username", "password").Search().Query().Containing("key word").Take(1000);
var response = twitter.Request();
currentResponseDateTime= Convert.ToDateTime(response.ResponseDate);
var messages = from m in response.AsSearchResult().Statuses
where m.CreatedDate > lastUpdateDateTime
select m;
lastUpdateDateTime = currentResponseDateTime;
My issue is that the Twitter server time differs from the client time by a few seconds. I looked around and tried to get the time I received the response from the Response.ResponseDate property, but it looks like that is set from the local computer's time, i.e. currentResponseDateTime is a few seconds ahead of the Twitter server time. So I end up missing a few of the tweets.
Does anyone know how I can get the current server time from twitter search or REST API?
Thanks
I'm not sure how you would get the local server time of the twitter service, but one approach you could take is to store the date of the most recent twitter update seen in the "lastUpdateDateTime" field. That way, you're guaranteed to get all the messages since the last one you saw, regardless of the offset of the twitter server.
var twitter = FluentTwitter.CreateRequest().AuthenticateAs("username", "password").Search().Query().Containing("key word").Take(1000);
var response = twitter.Request();
currentResponseDateTime= Convert.ToDateTime(response.ResponseDate);
var messages = from m in response.AsSearchResult().Statuses
where m.CreatedDate > lastUpdateDateTime
select m;
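// Max() throws on an empty sequence, so only advance when there are new messages
if (messages.Any()) lastUpdateDateTime = messages.Select(m => m.CreatedDate).Max();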
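The idea is independent of TweetSharp: keep a "high-water mark" taken from the data itself rather than from any clock. A small Python sketch of it, where statuses are represented as hypothetical (created_at, text) tuples and the function name is made up:

```python
from datetime import datetime
from typing import Iterable, List, Tuple

Status = Tuple[datetime, str]  # hypothetical (created_at, text) pair


def new_since(statuses: Iterable[Status], last_seen: datetime) -> Tuple[List[Status], datetime]:
    """Return the statuses created after last_seen and the updated high-water mark."""
    fresh = [s for s in statuses if s[0] > last_seen]
    # If nothing new arrived, keep the previous mark instead of failing on max().
    newest = max((s[0] for s in fresh), default=last_seen)
    return fresh, newest
```

Because both sides of the comparison come from Twitter's own timestamps, any offset between the local clock and the server clock no longer matters.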
Another approach (and one that Twitter recommends) is to pull the Date header from their API server's response, which provides Twitter's notion of time in GMT. This assumes that you can access the server response headers, and that depends on the method you're using to access the API.
For example, hitting https://api.twitter.com/1/help/test.json
$ lynx --dump --head https://api.twitter.com/1/help/test.json
HTTP/1.0 200 OK
Date: Tue, 22 Jan 2013 13:30:36 GMT
...
Reference: how to get the twitter server time (synchronize)? on dev.twitter.com support site.
Quoting Taylor Singletary:
The current time that Twitter "thinks" it is is returned in the "Date" HTTP header of every response to an API call you make. You can also issue a simple HTTP HEAD request to GET help/test to get the header as an initial syncing step for your app.
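Once you have the Date header value, turning it into a clock offset is straightforward. A minimal Python sketch, using the header value from the lynx output above; the function name is my own, and how you obtain the header depends on your HTTP client:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def clock_offset(date_header: str, local_now: datetime) -> timedelta:
    """Server time minus local time, based on an HTTP Date header value."""
    server_now = parsedate_to_datetime(date_header)  # timezone-aware for "GMT"
    return server_now - local_now


# Example with the header shown above and a local clock running 4 seconds ahead.
local = datetime(2013, 1, 22, 13, 30, 40, tzinfo=timezone.utc)
offset = clock_offset("Tue, 22 Jan 2013 13:30:36 GMT", local)
print(offset.total_seconds())  # -4.0
```

Adding this offset to the local time gives an estimate of the server's time, accurate to about a second plus network latency, since the Date header has one-second resolution.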