python: how to fetch a URL? (with improper response headers)

I want to build a small script in Python which needs to fetch a URL. The server is kind of crappy though, and replies with pure ASCII without any headers.
When I try:
import urllib.request
response = urllib.request.urlopen(url)
print(response.read())
I obtain an http.client.BadStatusLine: 100 error because this isn't a properly formatted HTTP response.
Is there another way to fetch a URL and get the raw content, without trying to parse the response?
Thanks

It's difficult to answer your direct question without a bit more information, since we don't know exactly how the (web) server in question is broken.
That said, you might try using something a bit lower-level, a socket for example. Here's one way (python2.x style, and untested):
#!/usr/bin/env python

import socket
from urlparse import urlparse

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    sock = socket.create_connection((host, port), timeout)
    sock.sendall('GET %s HTTP/1.0\n\n' % parsed.path)

    response = [sock.recv(receive_buffer)]
    while response[-1]:
        response.append(sock.recv(receive_buffer))

    return ''.join(response)

print geturl('http://www.example.com/')  # <- the trailing / is needed if no other path element is present
And here's a stab at a python3.2 conversion (you may not need to decode from bytes if, for example, you're writing the response to a file):
#!/usr/bin/env python

import socket
from urllib.parse import urlparse

ENCODING = 'ascii'

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    sock = socket.create_connection((host, port), timeout)
    method = 'GET %s HTTP/1.0\n\n' % parsed.path
    sock.sendall(bytes(method, ENCODING))

    response = [sock.recv(receive_buffer)]
    while response[-1]:
        response.append(sock.recv(receive_buffer))

    return ''.join(r.decode(ENCODING) for r in response)

print(geturl('http://www.example.com/'))
HTH!
Edit: You may need to adjust what you put in the request, depending on the web server in question. Guanidene's excellent answer provides several resources to guide you on that path.

What you need to do in this case is send a raw HTTP request using sockets.
You would need to do a bit of low-level network programming using Python's socket module in this case. (Network sockets return all the information sent by the server as-is, so you can interpret the response however you wish. The high-level urllib module, by contrast, parses the response according to the HTTP protocol - status line, headers, body - and hides this protocol information from you, returning just the data.)
You also need some basic knowledge of HTTP. For your case, you just need to know about the GET request. See its format here - http://djce.org.uk/dumprequest, and see an example session here - http://en.wikipedia.org/wiki/HTTP#Example_session. (If you wish to capture live traces of HTTP requests sent from your browser, you would need packet-sniffing software like Wireshark.)
Once you know the basics of the socket module and HTTP requests, you can go through this example - http://coding.debuntu.org/python-socket-simple-tcp-client - which shows how to send an HTTP request over a socket to a server and read the reply back. You can also refer to this related question on SO.
(You can google python socket http to get more examples.)
(Tip: I am not a Java fan, but still, if you don't find enough convincing examples on this topic under Python, try finding them under Java, and then translate them to Python accordingly.)
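To make the prerequisites above concrete, here is a minimal, untested sketch of the whole idea in Python 3 (it assumes the server closes the connection when it is done; the \r\n line endings are what HTTP formally requires):

import socket

# open a TCP connection and send a bare HTTP GET request by hand
sock = socket.create_connection(('www.example.com', 80), 10)
sock.sendall(b'GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n')

# read everything the server sends, headers (if any) included
chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()
print(b''.join(chunks))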

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
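(That is the Python 2 spelling; in Python 3 the same legacy helper lives in urllib.request:)

import urllib.request
urllib.request.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')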

Related

Do an HTTP request from Lua before HAProxy routes a request

I have a Lua proxy that needs to route requests. Each request's destination is established based on the response to another HTTP request, made with a header from the initial request. My understanding is that HAProxy is event-driven software, so blocking system calls are absolutely forbidden, and my code is blocking because it does an HTTP request.
I read about yielding after the request, but I think it won't help since the HTTP request has already started. The library for doing the request is https://github.com/JakobGreen/lua-requests#simple-requests
local requests = require('requests')

core.register_fetches('http_backend', function(txn)
    local dest = txn.sf:req_fhdr('X-dest')
    local url = "http://127.0.0.1:8080/service"
    local response = requests.get(url .. "/" .. dest)
    local json = response.json()
    return json.field
end)
How do I convert my code to be non-blocking?
You should consider using HAProxy's SPOE which was created exactly for these blocking scenarios.
I managed to do it using Lua. What I was doing wrong was using require('requests'), which is blocking. Ideally, never use an external Lua library with HAProxy. I had to deal with plain sockets and do the HTTP request by hand, and, very importantly, use HAProxy's core method core.tcp() instead of Lua sockets.
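For the record, a rough sketch of that approach might look like the following (untested; parse_json_field is a placeholder for whatever hand-rolled JSON extraction you end up using, since external libraries are out):

core.register_fetches('http_backend', function(txn)
    local dest = txn.sf:req_fhdr('X-dest')
    local sock = core.tcp()                   -- HAProxy's cooperative socket, not LuaSocket
    sock:settimeout(2)
    if sock:connect('127.0.0.1', 8080) then
        sock:send('GET /service/' .. dest .. ' HTTP/1.0\r\n\r\n')
        local response = sock:receive('*a')   -- whole reply, headers included
        sock:close()
        return parse_json_field(response)     -- placeholder: pull the field out of the body
    end
end)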

How to read POST requests in Lua?

I have this Telegram bot written in Lua that I am doing as a hobby for a language network. And I have been reading new messages via the getUpdates API call all the time. Now I want to rewrite it to use webhooks, but I have no experience with that whatsoever. I have googled but didn't find anything certain. I kinda feel that WSAPI is the library to use, but I am not sure. Moreover, I am not really sure I need any special library just for reading POST requests (which is all that the Telegram bot API uses). I tried using sockets:
socket = require 'socket'
server = assert(socket.bind("*", 9000))

function read(client, pattern, prefix)
    local data, emsg, partial = client:receive(pattern, prefix)
    if data then
        return data
    end
    if partial and #partial > 0 then
        return partial
    end
    return nil, emsg
end

while true do
    local client = server:accept()
    client:settimeout(3)
    local msg, err = read(client, '*a')
    if not err then
        print(msg)
        client:close()
    end
end
The print(msg) here gives me the full POST request including headers, which I am probably able to parse (the body is supposed to always be JSON). I am not really that familiar with HTTP requests though, and I'm not sure I can just throw away everything that goes before the first {.
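For the body, I am thinking of something like this (untested), since as far as I understand the headers end at the first blank line:

local body = msg:match('\r\n\r\n(.*)$')  -- everything after the blank line that ends the headers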
My setup is Lua 5.2, Ubuntu x64 16.04 and Nginx. What I need to do is to receive and read POST requests, nothing more.
TL;DR: is it okay to parse the POST request I receive from the code above or am I missing something, like a library that'd make my life easier?
Thanks!

Ruby Telnet: How to send an HTTP request using telnet

I am using the Ruby telnet library to make an HTTP GET request (http://127.0.0.1:3000/test), but I am not able to make the request to my server.
Below is the code that I am trying:
require 'net/telnet'

webserver = Net::Telnet::new('Host' => '127.0.0.1', 'Port' => 3000, 'Telnetmode' => false)
size = 0
webserver.cmd("GET / HTTP/1.1\nHost: 127.0.0.1/test") do |c|
  print c
end
Please let me know what I am doing wrong here.
You need to end your HTTP request with \r\n (carriage return + line feed) line endings and a final blank line, otherwise the HTTP server will wait for more headers:
webserver.cmd("GET /test HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n") do |c|
  print c
end
But telnet really isn't the right thing to use (unless you're just experimenting). If you want to make HTTP requests for a real-world program, you should definitely use a proper HTTP library (net/http at the least, or something like Faraday would be even better). HTTP seems simple, but there are many hidden complexities that mean creating a writer/parser from scratch is a lot of work.
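For example, with net/http from the standard library the whole thing collapses to:

require 'net/http'

puts Net::HTTP.get(URI('http://127.0.0.1:3000/test'))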

Twitter stream API - Erlang client

I'm very new to the Erlang world and I'm trying to write a client for the Twitter Stream API. I'm using httpc:request to make a POST request and I constantly get a 401 error; I'm obviously doing something wrong with how I'm sending the request... What I have looks like this:
fetch_data() ->
    Method = post,
    URL = "https://stream.twitter.com/1.1/statuses/filter.json",
    Headers = "Authorization: OAuth oauth_consumer_key=\"XXX\", oauth_nonce=\"XXX\", oauth_signature=\"XXX%3D\", oauth_signature_method=\"HMAC-SHA1\", oauth_timestamp=\"XXX\", oauth_token=\"XXX-XXXXX\", oauth_version=\"1.0\"",
    ContentType = "application/json",
    Body = "{\"track\":\"keyword\"}",
    HTTPOptions = [],
    Options = [],
    R = httpc:request(Method, {URL, Headers, ContentType, Body}, HTTPOptions, Options),
    R.
At this point I'm confident there's no issue with the signature as the same signature works just fine when trying to access the API with curl. I'm guessing there's some issue with how I'm making the request.
The response I'm getting with the request made the way demonstrated above is:
{ok,{{"HTTP/1.1",401,"Unauthorized"},
     [{"cache-control","must-revalidate,no-cache,no-store"},
      {"connection","close"},
      {"www-authenticate","Basic realm=\"Firehose\""},
      {"content-length","1243"},
      {"content-type","text/html"}],
"<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"/>\n<title>Error 401 Unauthorized</title>\n</head>\n<body>\n<h2>HTTP ERROR: 401</h2>\n<p>Problem accessing '/1.1/statuses/filter.json'. Reason:\n<pre> Unauthorized</pre>\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n</body>\n</html>\n"}}
When trying with curl I'm using this:
curl --request 'POST' 'https://stream.twitter.com/1.1/statuses/filter.json' --data 'track=keyword' --header 'Authorization: OAuth oauth_consumer_key="XXX", oauth_nonce="XXX", oauth_signature="XXX%3D", oauth_signature_method="HMAC-SHA1", oauth_timestamp="XXX", oauth_token="XXX-XXXX", oauth_version="1.0"' --verbose
and I'm getting the events just fine.
Any help on this would be greatly appreciated; I'm new to Erlang and I've been pulling my hair out on this one for quite a while.
There are several issues with your code:
In Erlang you are encoding the parameters as a JSON body, while with curl you are encoding them as form data (application/x-www-form-urlencoded). The Twitter API expects the latter. In fact, you get a 401 because the OAuth signature does not match: you included the track=keyword parameter in the signature computation, while Twitter's server computes the signature without the JSON body, as it should per the OAuth RFC.
You are using httpc with default options. This will not work with the streaming API, as the stream never ends; you need to process results as they arrive. For this, you need to pass the {sync, false} option to httpc. See also the stream and receiver options. (A corrected call is sketched after this list.)
Finally, while httpc can work initially to access the Twitter streaming API, it brings little value compared to the code you need to develop around it to stream from the Twitter API. Depending on your needs, you might want to replace it with a simple client built directly on ssl, especially considering that ssl can decode HTTP packets for you (what is left to you is the HTTP chunked encoding).
For example, if your keywords are rare, you might get a timeout from httpc. Besides, without httpc it might be easier to update the list of keywords, or your code, with no downtime.
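Putting points 1 and 2 together, a corrected call might look like this (untested sketch; note that httpc wants the headers as a list of {Field, Value} pairs, and the OAuth values stay elided as in the question):

fetch_data() ->
    Method = post,
    URL = "https://stream.twitter.com/1.1/statuses/filter.json",
    Headers = [{"Authorization", "OAuth oauth_consumer_key=\"XXX\", ..."}],
    ContentType = "application/x-www-form-urlencoded",
    Body = "track=keyword",   %% form data, matching the curl invocation
    HTTPOptions = [],
    Options = [{sync, false}, {stream, self}],
    {ok, RequestId} = httpc:request(Method, {URL, Headers, ContentType, Body},
                                    HTTPOptions, Options),
    %% results then arrive as messages:
    %% {http, {RequestId, stream_start, Hdrs}}, then {http, {RequestId, stream, BinBodyPart}}, ...
    RequestId.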
A streaming client directly based on ssl could be implemented as a gen_server (or a simple process, if you do not follow OTP principles), or even better a gen_fsm to implement reconnection strategies. You could proceed as follows (a minimal sketch appears after these steps):
Connect using ssl:connect/3,4 specifying that you want the socket to decode the HTTP packets with {packet, http_bin} and you want the socket to be configured in passive mode {active, false}.
Send the HTTP request packet (preferably as an iolist, with binaries) with ssl:send/2,3. It should span several lines separated by CRLF (\r\n): first the request line (GET /1.1/statuses/filter.json?... HTTP/1.1), then the headers including the OAuth headers. Make sure you include Host: stream.twitter.com as well. End with an empty line.
Receive the HTTP response. You can implement this with a loop (since the socket is in passive mode), calling ssl:recv/2,3 until you get http_eoh (end of headers). Note down whether the server will send you data chunked or not by looking at the Transfer-Encoding response header.
Configure the socket in active mode with ssl:setopts/2 and specify you want packets as raw and data in binary format. In fact, if data is chunked, you could continue to use the socket in passive mode. You could also get data line by line or get data as strings. This is a matter of taste: raw is the safest bet, line by line requires that you check the buffer size to prevent truncation of a long JSON-encoded tweet.
Receive data from Twitter as messages sent to your process, either with receive (simple process) or in the handle_info handler (if you implemented this with a gen_server). If data is chunked, you will first receive the chunk size, then the tweet data, and eventually the end of the chunk (cf. RFC 2616). Be prepared for tweets that spread over several chunks (i.e. maintain some kind of buffer). The best approach here is to do the minimum decoding in this process and send tweets to another process, possibly in binary format.
You should also handle errors and the socket being closed by Twitter. Make sure you follow Twitter's guidelines for reconnection.
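To make the steps above concrete, here is a bare-bones, untested sketch of steps 1-4 (error handling, the real OAuth header, and the actual chunk decoding are omitted):

connect() ->
    %% remember to start the ssl application first, e.g. ssl:start()
    {ok, Sock} = ssl:connect("stream.twitter.com", 443,
                             [binary, {packet, http_bin}, {active, false}]),
    Request = ["GET /1.1/statuses/filter.json?track=keyword HTTP/1.1\r\n",
               "Host: stream.twitter.com\r\n",
               "Authorization: OAuth ...\r\n",   %% your signed OAuth header here
               "\r\n"],
    ok = ssl:send(Sock, Request),
    Chunked = read_headers(Sock, false),
    ok = ssl:setopts(Sock, [{packet, raw}, {active, true}]),
    loop(Sock, Chunked, <<>>).

%% loop with ssl:recv/2 until http_eoh, noting whether the body is chunked
read_headers(Sock, Chunked) ->
    case ssl:recv(Sock, 0) of
        {ok, {http_response, _Version, 200, _Reason}} ->
            read_headers(Sock, Chunked);
        {ok, {http_header, _, 'Transfer-Encoding', _, <<"chunked">>}} ->
            read_headers(Sock, true);
        {ok, {http_header, _, _Name, _, _Value}} ->
            read_headers(Sock, Chunked);
        {ok, http_eoh} ->
            Chunked
    end.

loop(Sock, Chunked, Buffer) ->
    receive
        {ssl, Sock, Data} ->
            %% if Chunked is true, decode chunk sizes here (RFC 2616) and keep a
            %% buffer, since a tweet may spread over several chunks
            loop(Sock, Chunked, <<Buffer/binary, Data/binary>>);
        {ssl_closed, Sock} ->
            closed   %% reconnect here, following Twitter's guidelines
    end.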

Streaming Results from Mochiweb

I have written a web-service using Erlang and Mochiweb. The web service returns a lot of results and takes some time to finish the computation.
I'd like to return results as soon as the program finds them, instead of returning them all at the end.
Edit:
I found that I can use a chunked response to stream results, but it seems that I can't find a way to close the connection. So, any idea on how to close a MochiWeb request?
To stream data of yet-unknown size with HTTP 1.1 you can use HTTP chunked transfer encoding. In this encoding, each chunk of data is prepended with its size in hexadecimal. The last chunk is a zero-length chunk: its size is coded as 0 and it carries no data.
If the client doesn't support HTTP 1.1, the server can send the data as plain binary and close the connection at the end of the stream.
In MochiWeb it all works as follows (a minimal sketch appears after these steps):
The HTTP response should be started with Response = Request:respond({Code, ResponseHeaders, chunked}). (By the way, look at the comments in the MochiWeb source.)
Then chunks can be sent to the client with Response:write_chunk(Data). To signal the end of the stream, send a chunk of zero length: Response:write_chunk(<<>>).
When handling of the current request is over, MochiWeb decides whether the connection should be closed or can be reused as an HTTP persistent connection.
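Put together, a minimal handler along these lines might look like this (untested sketch; produce_results/0 stands in for your own computation):

stream_response(Req) ->
    Response = Req:respond({200, [{"Content-Type", "application/json"}], chunked}),
    send_all(Response, produce_results()),  %% placeholder: your result source
    Response:write_chunk(<<>>).             %% zero-length chunk ends the stream

send_all(Response, [Result | Rest]) ->
    Response:write_chunk(Result),           %% each result goes out as its own chunk
    send_all(Response, Rest);
send_all(_Response, []) ->
    ok.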
