The application I'm debugging creates log files with many API calls logged as two events:
timestamp1 request_ip-->{$URL}
timestamp2 response_ip<--{$DATA}
I've recently started feeding the logs into ElasticSearch via LogStash (with Kibana as a web front end).
Is there any way to do a search that includes nearby lines? Assume that request and response are always consecutive, if this helps.
With grep I would have done:
grep -A 1 "-->{$URL}"
How can I do the same with an existing LogStash+ElasticSearch deployment?
I think this is a case for the multiline filter in LogStash. I'm not sure of the exact details, so I won't build the regex for you, but your pattern would be built around request_ip, and the "what" that connects the lines would be a regex around response_ip. I'm not sure that's clear enough, but the multiline filter documentation should give you the leads you need. Hope this helps.
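For illustration, here is a minimal sketch of what that filter could look like, assuming the response line always contains <-- and always immediately follows its request (the pattern is an assumption, not tested against your format):

    filter {
      multiline {
        # any line containing "<--" is a response; glue it onto the
        # preceding request line so both become a single event
        pattern => "<--"
        what => "previous"
      }
    }

Once the two lines are merged into one event, a single Kibana/ElasticSearch query can match on both the URL and the response data, much like your grep -A 1 does.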
I'm looking for a good ping service like pingomatic.com, but for general websites (not necessarily blogs).
Any recommendations?
Thanks in advance!
Oh I just came across a massive list of services:
http://readymadeweb.com/2010/01/01/242-ways-to-ping-how-to-stay-on-search-engine-radar/
I recommend Search Engine Ping, an SEO tool that pings search engines with your website, blog, or affiliate links:
http://statstool.com/search-engine-ping/
I have a MediaWiki installation that I've customized with some of my own extensions. Here is the basic platform, a pretty standard LAMP install:
Ubuntu Server
Apache 2
Mediawiki 1.15
PHP 5.2.6
MySQL 5.0.67
For the actual MW search I use Lucene (EzMwLucene). I also have a custom extension that displays tabular data from a separate database within a MW page. Lucene doesn't index this info (which, in my case, is actually good because it would clutter the expected search results). I didn't do anything to Lucene for this installation other than install it; I wouldn't know how to customize it for my needs, and it may be "too powerful" anyway.
At any rate, I need to create a search for the data in my other database. I have a master table that is updated daily based on data stored in other (normalized) tables. At the moment the search is one of those form-based searches that builds a SQL query from the criteria you enter. That is a lot of work to use, though. I would like it to be more of a "type and submit" search.
I don't think I need a comprehensive "cut & paste" answer, but if anybody has something I can google, I would be very appreciative. I don't want to reinvent the wheel, which is what I would be doing if I followed what I see on Google.
If you would like to see my master database, let me know; I would want to sanitize it to make me more anonymous (whatever that means). Also, if you're familiar with MW and would like to see any of my extension code, again, let me know.
TL;DR: need to make a custom search feature with LAMP (displayed in Mediawiki). Any guidance appreciated.
Thanks SO!
Why do you need to add a custom search? The best answer will depend on that.
For simplicity, you could use the Google Search Engine - http://www.mediawiki.org/wiki/Extension:Google_Custom_Search_Engine
Otherwise it sounds like you need to write a full-text query for the database.
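If you go the full-text route, a rough sketch of the query side might look like the following; the table and column names here are made up, and on MySQL 5.0 the table (or a search copy of it) needs to be MyISAM with a FULLTEXT index:

    // one-time setup (hypothetical names):
    //   ALTER TABLE master_table ADD FULLTEXT(title, description);
    $db = new mysqli('localhost', 'user', 'pass', 'wikidata');
    $q  = $_GET['q'];
    $stmt = $db->prepare(
        'SELECT id, title FROM master_table
         WHERE MATCH(title, description) AGAINST (?)');
    $stmt->bind_param('s', $q);
    $stmt->execute();

That gives you the "type and submit" behaviour: one text box, one query, relevance-ranked results, and no per-criterion SQL building.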
Read on before you say this is a duplicate; it's not (as far as I could see).
I want to get the country code in PHP from the client.
Yes, I know you can do this using external sites or with the likes of geoip_record_by_name, but I don't want to be dependent on an external site, and I can't install PEAR for PHP as I'm using shared Dreamhost hosting.
I thought I could just do something like this:
$output = shell_exec('whois '.$ip.' -H | grep country | awk \'{print $2}\'');
echo "<pre>$output</pre>";
But Dreamhost seems to have an old version of whois (4.7.5), so I get this error on a lot of IPs:
Unknown AS number or IP network. Please upgrade this program.
So unless someone knows how to get a binary of a newer version of whois onto Dreamhost, I'm stuck.
Or is there another way I could get the country code from the client who is loading the page?
Whois is just a client for the whois service, so technically you are still relying on an outside site. For the queries that fail, you could try falling back to another site, such as hostip.info, who happen to have a decent API and seem friendly:
http://api.hostip.info/country.php?ip=4.2.2.2
returns
US
Good luck,
--jed
EDIT: @Mint Here is the link to the API on hostip.info: http://www.hostip.info/use.html
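In PHP, the fallback jed describes could be as small as this sketch (error handling kept minimal on purpose):

    // query hostip.info only when the local whois lookup failed
    $cc = trim(file_get_contents('http://api.hostip.info/country.php?ip=' . urlencode($ip)));
    if ($cc === '' || $cc === 'XX') {   // hostip.info reports "XX" for addresses it can't place
        // fall back to another service, or give up gracefully
    }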
MaxMind provides a free PHP GeoIP country lookup class (there is also a free country+city lookup one).
The bit you want is what is mentioned under "Pure PHP module". This doesn't require you to install anything, or be dependent on them, nor does it need any special PHP modules installed. Just save the GeoIP data file somewhere, then use their provided class to interact with it.
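A minimal sketch of what that looks like, assuming you've saved their legacy geoip.inc class and the GeoIP.dat country database next to your script:

    include 'geoip.inc';

    $gi = geoip_open('GeoIP.dat', GEOIP_STANDARD);                  // open the local data file
    $cc = geoip_country_code_by_addr($gi, $_SERVER['REMOTE_ADDR']); // e.g. "US"
    geoip_close($gi);

Everything runs locally, so there is no external dependency at page-load time; you only need to refresh GeoIP.dat periodically.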
Can you just install a copy of whois into your home directory and pass the full path into shell_exec? That way you're not bound to their upgrade schedule.
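If you do that, the original snippet only needs the full path (the path below is hypothetical), and escaping $ip is a good idea while you're at it:

    $output = shell_exec('/home/youruser/bin/whois ' . escapeshellarg($ip)
            . ' -H | grep -i country | awk \'{print $2}\'');
    echo "<pre>$output</pre>";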
An alternative, somewhat extreme solution to your problem would be to:
Download the CSV format version of MaxMind's country database
Strip out the information you don't need from the CSV with a script and ...
... generate a standard PHP file which contains a data structure containing the IP address as the key and the country code as the value.
Include the resulting file in your usual project files and you now have a completely internal IP => country code lookup table.
The disadvantage is that you would need to regenerate the PHP file from the latest version of the database regularly. Also, it's a pretty nasty way of doing it in general, and performance might not be the best :)
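A rough sketch of both halves, assuming the legacy GeoIP country CSV layout ("1.0.0.0","1.0.0.255",16777216,16777471,"AU","Australia"); note the keys end up being range starts rather than single addresses:

    // generate ip2cc.php from the CSV (run this from cron, not per request)
    $in  = fopen('GeoIPCountryWhois.csv', 'r');
    $out = fopen('ip2cc.php', 'w');
    fwrite($out, "<?php\n\$ip2cc = array(\n");
    while (($f = fgetcsv($in)) !== false) {
        // range start => array(range end, country code)
        fwrite($out, "{$f[2]} => array({$f[3]}, '{$f[4]}'),\n");
    }
    fwrite($out, ");\n");

    // lookup: walk the (sorted) ranges until we pass the visitor's address
    require 'ip2cc.php';
    $n  = sprintf('%u', ip2long($_SERVER['REMOTE_ADDR'])); // keep unsigned on 32-bit PHP
    $cc = null;
    foreach ($ip2cc as $start => $range) {
        if ($n < $start) break;
        if ($n <= $range[0]) $cc = $range[1];
    }

A binary search over the sorted keys would beat the linear scan, but even this shows the shape of it.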
Consider ipcountryphp (my site, my code, my honour), as it provides a local, freely updated database for the lifetime of the internet. It's fast and fully self-contained, and plugs into anything running PHP 5.3 with SQLite3 or later. Very fast seeks and no performance penalties.
Enough with shameless self-promotion, let's get serious:
Relying on querying remote services in real-time to get visitor country can become a major bottleneck for your site's functionality depending on the response speed of the queried server. As a rule of thumb you should never query external services for real-time site functionality (like page loading). Using APIs in the background is great but when you need to query the country of each visitor before the page is rendered, you open yourself up to a world of pain. And do keep in mind you're not the only one abusing free services :)
So queries to 3rd-party services stay in the background, while only local functionality that relies on no 3rd party goes into the layers where users interact. Just my slightly performance-paranoid take on this :)
PS: The script I mentioned above has IPv6 support too.
Here is a site with a script I just used. The only problem is that you would probably need to regenerate the IP data yourself every now and then, which might be a pain, and that's why everyone is telling you to use an external API. That wasn't a solution for me, as I was pulling around 50 IPs at once, which means I would probably get banned. So the solution was to use my own script or to save results to a DB, but then I was again pulling images from external sites. Anyway, here is the site I found the script on:
http://coding-talk.com/f29/country-flag-script-8882/
Here are a few:
http://api.hostip.info/get_html.php?ip=174.31.162.48&position=true
http://geoiplookup.net/geoapi.php?output=json&ipaddress=174.31.162.48
http://ip-api.com/json/174.31.162.48?callback=yourfunction
http://ipinfo.io/174.31.162.48
All return slightly different results.
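All of them can be consumed the same way from PHP. As one hedged example, using ip-api.com's JSON endpoint (field names are what that API currently returns, so treat them as assumptions):

    // look up the country code for one address via ip-api.com
    // (note the free tier is rate-limited)
    $data = json_decode(file_get_contents('http://ip-api.com/json/' . urlencode($ip)), true);
    if ($data && $data['status'] === 'success') {
        echo $data['countryCode']; // e.g. "US"
    }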
Here is another one; just swap the IP for your variable:
http://api.codehelper.io/ips/?callback=codehelper_ip_callback&ip=143.3.87.193
I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers and requesting a list of URLs but DNS servers can limit/throttle my requests or even ban me all together. My knowledge of asking DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
You can register to get access to the entire .com and .net zone files at Verisign
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
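A quick sketch of turning that download into a URL list in PHP (the file layout, one "rank,domain" pair per line inside top-1m.csv, is an assumption based on the current archive):

    // fetch Alexa's top-1m list and print one URL per line
    copy('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip', 'top-1m.csv.zip');
    $zip = new ZipArchive();
    if ($zip->open('top-1m.csv.zip') === true) {
        $csv = $zip->getFromName('top-1m.csv');
        foreach (explode("\n", trim($csv)) as $line) {
            list(, $domain) = explode(',', trim($line), 2); // lines look like "1,google.com"
            echo "http://$domain\n";
        }
        $zip->close();
    }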
The modern terms are URI and URN; URL is the narrower, outdated one. I'd scan for sitemap files, which contain many addresses in one file, and study the classic texts on spiders, wanderers, brokers, and bots, plus RFC 3305 (appendix B, p. 50), which defines a URI regex.
I know the Google Search Appliance has access to this information (as this factors into the PageRank Algorithm), but is there a way to export this information from the crawler appliance?
External tools won't work because a significant portion of the content is for a corporate intranet.
There might be something available from Google, but I have never checked. I usually use the link checker provided by the W3C. It can also detect redirects, which is useful if your server handles 404s by redirecting instead of returning a 404 status code.
You can use Google Webmaster Tools to view, among other things, broken links on your site.
This won't show you broken links to external sites though.
It seems that this is not possible. Under Status and Reports > Crawl Diagnostics there are two styles of report available: the directory drill-down 'Tree View' and the 100-URLs-at-a-time 'List View'. Some people have tried creating programs to page through the List View, but this seems to fail after a few thousand URLs.
My advice is to use your server logs instead. Make sure that 404 and referrer URL logging are enabled on your web server, since you will probably want to correct the page containing the broken link. You could then use a log file analyser to generate a broken link report.
To create an effective, long-term way of monitoring your broken links, you may want to set up a cron job to do the following (a sketch of the pipeline follows the list):
Use grep to extract lines containing 404 entries from the server log file.
Use sed to remove everything except requested URLs and referrer URLs from every line.
Use sort and uniq commands to remove duplicates from the list.
Output the result to a new file each time so that you can monitor changes over time.
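Assuming an Apache combined-format access log (the path and format here are assumptions), the job could be a pipeline along these lines:

    # pull 404 lines, reduce each to "requested-url referrer",
    # de-duplicate, and write a dated report you can diff over time
    # (the grep can false-match a response size of 404; refine as needed)
    grep ' 404 ' /var/log/apache2/access.log \
      | sed -E 's/.*"[A-Z]+ ([^ ]+) [^"]*" 404 [^ ]+ "([^"]*)".*/\1 \2/' \
      | sort | uniq \
      > ~/broken-links-$(date +%F).txt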
A free tool called Xenu turned out to be the weapon of choice for this task. http://home.snafu.de/tilman/xenulink.html#Download
Why not just analyze your webserver logs and look for all the 404 pages? That makes far more sense and is much more reliable.
I know this is an old question, but you can use the Export URLs feature on the GSA admin console, then look for URLs with a state of not_found. This will show you all the URLs that the GSA has discovered but that returned a 404 when it attempted to crawl them.