How do I save the origin html file with Apache Nutch - search-engine

I'm new to search engines and web crawlers. Now I want to store all the original pages in a particular web site as html files, but with Apache Nutch I can only get the binary database files. How do I get the original html files with Nutch?
Does Nutch support it? If not, what other tools can I use to achieve my goal.(The tools that support distributed crawling are better.)

Well, nutch will write the crawled data in binary form so if if you want that to be saved in html format, you will have to modify the code. (this will be painful if you are new to nutch).
If you want quick and easy solution for getting html pages:
If the list of pages/urls that you intend to have is quite low, then better get it done with a script which invokes wget for each url.
OR use HTTrack tool.
EDIT:
Writing a your own nutch plugin will be great. Your problem will get solved plus you can contribute to nutch by submitting your work !!! If you are new to nutch (in terms of code & design), then you will have to invest lot of time building a new plugin ... else its easy to do.
Few pointers for helping your initiative:
Here is a page which talks about writing own nutch plugin.
Start with Fetcher.java. See lines 647-648. That is the place where you can get the fetched content on per url basis (for those pages which got fetched successfully).
pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);
You should add code right after this to invoke your plugin. Pass content object to it. By now, you would have guessed that content.getContent() is the content for url you want. Inside the plugin code, write it to some file. Filename should be based on the url name else it will be difficult to work with that. Url can be obtained by fit.url.

You must do modifications in run Nutch in Eclipse.
When you are able to run, open Fetcher.java and add the lines between "content saver" command lines.
case ProtocolStatus.SUCCESS: // got a page
pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
updateStatus(content.getContent().length);'
//------------------------------------------- content saver ---------------------------------------------\\
String filename = "savedsites//" + content.getUrl().replace('/', '-');
File file = new File(filename);
file.getParentFile().mkdirs();
boolean exist = file.createNewFile();
if (!exist) {
System.out.println("File exists.");
} else {
FileWriter fstream = new FileWriter(file);
BufferedWriter out = new BufferedWriter(fstream);
out.write(content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));
out.close();
System.out.println("File created successfully.");
}
//------------------------------------------- content saver ---------------------------------------------\\

To update this answer -
It is possible to post process the data from your crawldb segment folder, and read in the html (including other data nutch has stored) directly.
Configuration conf = NutchConfiguration.create();
FileSystem fs = FileSystem.get(conf);
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
try
{
Text key = new Text();
Content content = new Content();
while (reader.next(key, content))
{
System.out.println(new String(content.GetContent()));
}
}
catch (Exception e)
{
}

The answers here are obsolete. Now, it is simply possible to get the plain HTML-files with nutch dump. Please see this answer.

In apache Nutch 2.3.1
You can save the raw HTML by edit the Nutch code firstly run the nutch in eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse
After you finish ruunning nutch in eclipse edit file FetcherReducer.java , add this code to the output method, run ant eclipse again to rebuild the class
Finally the raw html will added to reportUrl column in your database
if (content != null) {
ByteBuffer raw = fit.page.getContent();
if (raw != null) {
ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
data = scanner.next();
}
fit.page.setReprUrl(StringUtil.cleanField(data));
scanner.close();
}

Related

Where to save files from Firefox add-on?

I am working on a Firefox add-on which among other stuff generates thumbnails of websites for use by the add-on. So far I've been storing them by their image data URL using simple-storage. Two problems with this: the storage space is limited and sending very long strings around doesn't seem optimal(I assume the browser has optimized ways of loading image files, but maybe not data URLs). I think it shouldn't be a problem to save the files to disk, the question is where though. I googled quite a bit and could not find anything. Is there a natural place for this? Are there any restrictions?
As of Firefox 32, the place to store data for your add-on is supposed to be: [profile]/extension-data/[add-on ID]. This was established by the resolution of "Bug 915838 - Provide add-ons a standard directory to store data, settings". There is a follow-on bug, "Bug 952304 - (JSONStore) JSON storage API for addons to use in storing data and settings" which is supposed to provide an API for easy access.
For the Addon-SDK, you can obtain the addon ID (which you define in package.json) with:
let self = require("sdk/self");
let addonID = self.id;
For XUL and restartless extensions, you should be able to get the ID of your addon (which you define in the install.rdf file) with:
Components.utils.import("resource://gre/modules/Services.jsm");
let addonID = Services.appInfo.ID
You can then do the following to generate a URI for a file in that directory:
userProfileDirectoryPath = Components.classes["#mozilla.org/file/directory_service;1"]
.getService( Components.interfaces.nsIProperties)
.get("ProfD", Components.interfaces.nsIFile).path,
/**
* Generate URI for a filename in the extension's data directory under the preferences
* directory.
*/
function generateURIForFileInPrefExtensionDataDirectory (fileName) {
//Account for the path separator being OS dependent
let toReturn = "file://" + userProfileDirectoryPath.replace(/\\/g,"/");
return toReturn +"/extension-data/" + addonID + "/" + fileName;
}
}
The object myExtension.addonData is a copy that I store of the Bootstrap data provided to entry points in bootstrap.js.

Reading and saving binary image from OpenLDAP server using Groovy

I'm trying to save an image from an OpenLDAP server. It's in binary format and all my code appears to work, however, the image is corrupted.
I then attempted to do this in PHP and was successful, but I'd like to do it in a Grails project.
PHP Example (works)
<?php
$conn = ldap_connect('ldap.example.com') or die("Could not connect.\n");
ldap_set_option($conn, LDAP_OPT_PROTOCOL_VERSION, 3);
$dn = 'ou=People,o=Acme';
$ldap_rs = ldap_bind($conn) or die("Can't bind to LDAP");
$res = ldap_search($conn,$dn,"someID=123456789");
$info = ldap_get_entries($conn, $res);
$entry = ldap_first_entry($conn, $res);
$jpeg_data = ldap_get_values_len( $conn, $entry, "someimage-jpeg");
$jpeg_filename = '/tmp/' . basename( tempnam ('.', 'djp') );
$outjpeg = fopen($jpeg_filename, "wb");
fwrite($outjpeg, $jpeg_data[0]);
fclose ($outjpeg);
copy ($jpeg_filename, '/some/dir/test.jpg');
unlink($jpeg_filename);
?>
Groovy Example (does not work)
def ldap = org.apache.directory.groovyldap.LDAP.newInstance('ldap://ldap.example.com/ou=People,o=Acme')
ldap.eachEntry (filter: 'someID=123456789') { entry ->
new File('/Some/dir/123456789.jpg').withOutputStream {
it.write entry.get('someimage-jpeg').getBytes() // File is created, but image is corrupted (size also doesn't match the PHP version)
}
}
How would I tell the Apache LDAP library that "image-jpeg" is actually binary and not a String? Is there a better simple library available to read binary data from an LDAP server? From looking at the Apache mailing list, someone else had a similar issue, but I couldn't find a resolution in the thread.
Technology Stack
Grails 2.2.1
Apache LDAP API 1.0.0 M16
Have you checked whether the image attribute value is base-64 encoded?
I found the answer. The Apache Groovy LDAP library uses JNDI under the hood. When using JNDI certain entries are automatically read as binary, but if your LDAP server uses a custom name, the library will not know that it's binary.
For those people that come across this problem using Grails, here's the steps to set a specific entry to binary format.
Create a new properties file call "jndi.properties" and add it to your grails-app/conf directory (all property files in this folder are automatically included in the classpath)
Add a line in the properties file with the name of the image variable:
java.naming.ldap.attributes.binary=some_custom_image
Save the file and run the Grails application
Here is some sample code to save a binary entry to a file.
def ldap = LDAP.newInstance('ldap://some.server.com/ou=People,o=Acme')
ldap.eachEntry (filter: 'id=1234567') { entry ->
new File('/var/dir/something.jpg').withOutputStream {
it.write entry.image
}
}

Open XML SDK: opening a Word template and saving to a different file-name

This one very simple thing I can't find the right technique. What I want is to open a .dotx template, make some changes and save as the same name but .docx extension. I can save a WordprocessingDocument but only to the place it's loaded from. I've tried manually constructing a new document using the WordprocessingDocument with changes made but nothing's worked so far, I tried MainDocumentPart.Document.WriteTo(XmlWriter.Create(targetPath)); and just got an empty file.
What's the right way here? Is a .dotx file special at all or just another document as far as the SDK is concerned - should i simply copy the template to the destination and then open that and make changes, and save? I did have some concerns if my app is called from two clients at once, if it can open the same .dotx file twice... in this case creating a copy would be sensible anyway... but for my own curiosity I still want to know how to do "Save As".
I would suggest just using File.IO to copy the dotx file to a docx file and make your changes there, if that works for your situation. There's also a ChangeDocumentType function you'll have to call to prevent an error in the new docx file.
File.Copy(#"\path\to\template.dotx", #"\path\to\template.docx");
using(WordprocessingDocument newdoc = WordprocessingDocument.Open(#"\path\to\template.docx", true))
{
newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
//manipulate document....
}
While M_R_H's answer is correct, there is a faster, less IO-intensive method:
Read the template or document into a MemoryStream.
Within a using statement:
open the template or document on the MemoryStream.
If you opened a template (.dotx) and you want to store it as a document (.docx), you must change the document type to WordprocessingDocumentType.Document. Otherwise, Word will complain when you try to open the document.
Manipulate your document.
Write the contents of the MemoryStream to a file.
For the first step, we can use the following method, which reads a file into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
byte[] buffer = File.ReadAllBytes(path);
var destStream = new MemoryStream(buffer.Length);
destStream.Write(buffer, 0, buffer.Length);
destStream.Seek(0, SeekOrigin.Begin);
return destStream;
}
Then, we can use that in the following way (replicating as much of M_R_H's code as possible):
// Step #1 (note the using declaration)
using MemoryStream stream = ReadAllBytesToMemoryStream(#"\path\to\template.dotx");
// Step #2
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(stream, true)
{
// You must do the following to turn a template into a document.
newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
// Manipulate document (completely in memory now) ...
}
// Step #3
File.WriteAllBytes(#"\path\to\template.docx", stream.GetBuffer());
See this post for a comparison of methods for cloning (or duplicating) Word documents or templates.

Open pdf in browser plugin

How do I (in my controller) send a pdf that opens in the browser. I have tried this but it only downloads the file (both ie and firefox) without asking.
public ActionResult GetIt()
{
var filename = #"C:\path\to\pdf\test.pdf";
// Edit start
ControllerContext.HttpContext.Response.AddHeader("Content-Disposition", String.Format("inline;filename=\"{0}\"", "test.pdf"));
// Edit stop
return File(filename, "application/pdf", Server.HtmlEncode(filename));
}
After adding the edit above it works as it should, thanks.
You need to set the Content disposition HTTP header to inline to indicate to the browser that it should try to use a PDF plugin if it is available.
Something like: Content-Disposition: inline; filename=test.pdf
Note that you cannot force the use of the plugin, it is a decision made by the browser.
This (in addition to the other headers) does the trick for me in a plain .net web app:
Response.AddHeader("Content-Disposition", String.Format("inline;filename=""{0}""", FileName))
I'm not familiar with MVC, but hopefully this helps.
I think this relies on how the client handles PDF files. If it has setup to let Adobe Reader open the files in the browser plugin it will do that, but maybe you have set it up to download the file rather than opening it.
In any case, there is no way of controlling how PDF files will be opened on the user's machine.

How to write a simple .txt content processor in XNA?

I don't really understand how Content importer/processor works in XNA.
I need to read a text file (Content/levels/level1.txt) of the form:
x x
x x
x x
where x's are just integers, into an int[,] array.
Any tips on writting a SIMPLE .txt importer??? By searching google/msdn I only found .x/.fbx file importer examples. And they seem too complicated.
Do you actually need to process the text file? If not, then you can probably skip most of the content pipeline.
Something like:
string filename = "Content/TextFiles/sometext.txt";
string path = Path.Combine(StorageContainer.TitleLocation, filename);
string lineOfText;
StreamReader sr = new StreamReader(path);
while ((lineOfText = sr.ReadLine()) != null)
{
// do something
}
Also, be sure to set the "Build Action" to "None" and the "Copy to Output Directory" to "Copy if newer" on the text files you've added. This tells the content pipeline not to compile the text file but rather copy it to the output directory for use as is.
I got this (more or less) from the RacingGame sample provided by Microsoft. It foregoes much of the content pipeline and simply loads and processes text files (XML) for much of its level data.
XNA 4.0 uses
System.IO.Stream stream = TitleContainer.OpenStream("tilename.txt");
See http://msdn.microsoft.com/en-us/library/bb199094.aspx and also http://blogs.msdn.com/b/shawnhar/archive/2010/12/09/reading-files-in-xna-game-studio-4-0.aspx
There doesn't seem to be a lot of info out there, but this blog post does indicate how you can load .txt files through code using XNA.
Hopefully this can help you get the file into memory, from there it should be straightforward to parse it in any way you like.
XNA 3.0 - Reading Text Files on the Xbox
http://www.ziggyware.com/readarticle.php?article_id=69 is probably a good place to start. It covers creating a basic content processor.

Resources