How to parse (text only) web sites while crawling - parsing

I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches using Tomcat.
But I also want to save the parsed pages during crawling.
So when I start crawling like this:
bin/nutch crawl urls -dir crawled -depth 3
I also want to save the parsed HTML files as text files.
I mean that during the crawl started with the command above, whenever Nutch fetches a page it should also automatically save the parsed (text only) content of that page to a text file, and these files could be named after the fetched URL.
I really need help with this; it will be used in my university language detection project.
Thank you.

The crawled pages are stored in the segments. You can access them by dumping the segment content:
nutch readseg -dump crawl/segments/20100104113507/ dump
You will have to do this for each segment.
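Since a new segment directory is created for each crawl cycle, a small shell loop run from the Nutch root can dump them all in one go. A minimal sketch, assuming the crawl directory is called crawl as in the readseg example above (it would be crawled with the crawl command from the question); the -no* flags should limit the dump to the parsed text only, if I remember the readseg options correctly:

for seg in crawl/segments/*; do
  # dump only the parsed text of each segment into its own output directory
  bin/nutch readseg -dump "$seg" "dump_$(basename "$seg")" \
    -nocontent -nofetch -nogenerate -noparse -noparsedata
done

Each output directory should then contain a plain-text dump listing, for every fetched page, its URL and the extracted text.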

Related

How to expand or embed the contents of a URL in Microsoft Word?

I need to create a Word document that includes information I also have on a web page. Ideally, I want to avoid duplicating the information, so I would like to put the URL of the web page into the document and have Word "expand" the URL into the actual text from the web page.
If the Word file is opened offline, the reader should see the contents of the web page as they were when the Word file was created.
If the Word file is opened online, Word should attempt to update the contents with those currently on the web page.
Sure, it's possible, but I think you'd need to write and install a VSTO add-in (or possibly a macro) to do it.

Extract data from PDF using web harvesting

How can I extract data from PDFs using web harvesting? I am getting all the relevant PDF URLs on a page, but I am not able to extract data out of those PDFs. I am using Web-Harvest version 2.0 to extract the PDF URLs. Please help.
How would I incorporate a PDF command in Web-Harvest to get the text? Is there any other way to do this without running a batch file?
I think Web-Harvest alone is not sufficient for this. You should use wget and PDFBox to get your result. First download all the PDFs from your URLs into a folder, with the help of wget or Web-Harvest itself. Then run the PDFBox command-line tool to get the text out of the PDFs. You can learn more about the PDFBox command line at http://pdfbox.apache.org/commandline/. You can also create a batch file to run these steps in order.
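As a rough sketch of that two-step approach (the site URL is a placeholder, and the PDFBox app jar name depends on the version you download; ExtractText is the command documented at the PDFBox command-line page linked above):

# 1. download every PDF linked from the site into a local folder
wget -r -A pdf -P pdfs http://www.example.com/

# 2. run PDFBox's ExtractText on each downloaded PDF
find pdfs -name '*.pdf' | while read -r f; do
  java -jar pdfbox-app-1.8.10.jar ExtractText "$f" "${f%.pdf}.txt"
done

The same two commands can of course be put into a batch or shell script so the whole thing runs in one step, as suggested above.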

Can Anemone crawl HTML files stored locally on my hard drive?

I'm hoping to scrape several tens of thousands of pages of government data (in several thousand folders) that are online and put it all into a single file. To speed up the process, I figured I'd download the site to my hard drive first before crawling it with something like Anemone + Nokogiri. When I tried the sample code with the government site's online URL, everything worked fine, but when I change the URL to my local file path, the code runs but doesn't produce any output. Here's the code:
url="file:///C:/2011/index.html"
Anemone.crawl(url) do |anemone|
titles = []
anemone.on_every_page { |page| titles.push page.doc.at
('title').inner_html rescue nil }
anemone.after_crawl { puts titles.compact }
end
So nothing gets output with the local file path, but it works successfully if I plug in the corresponding online URL. Is Anemone somehow unable to crawl local directory structures? If it can't, are there other suggested ways of doing this crawling/scraping, or should I simply run Anemone on the online version of the site? Thanks.
You have a couple of problems with this approach:
Anemone expects a web address so that it can issue HTTP requests, but you are passing it a file. You can just load the file with Nokogiri instead and do the parsing through it.
The links in the files might be full URLs rather than relative paths; in that case you would still need to issue HTTP requests.
What you could do is download the files locally, then traverse them using Nokogiri, converting the links to local paths for Nokogiri to load next.
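For the download step, wget can both mirror the site and rewrite the links so they point at the local copies, which covers the last two points above (the URL is just a placeholder for the government site):

# mirror the site locally and convert links to relative local paths
wget --mirror --convert-links --adjust-extension --page-requisites -P localcopy http://www.example.gov/2011/

After that you can glob over the downloaded HTML files and hand each one to Nokogiri directly, without Anemone in the picture.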

Adding a server-side script and RSS feed to SharePoint 2007?

I am investigating whether the functionality of some CGI scripts written in Perl that we run on a web server can be migrated to our SharePoint 2007 server (MOSS).
The CGI scripts are not complicated. Basically they display and process the contents of files that reside in the network file system.
For instance, one script just displays the contents of small text files that are being added to a specific folder.
These files are part of a production process and cannot be moved into a SharePoint document archive.
The CGI scripts are used to give an overview of what is "new in the queue" for this production process.
When the production process has finished, it removes the files from the folder, but new files may arrive in the folder at any time.
I have done some investigation and found that a "Data View" web part would offer good possibilities for displaying the data.
The files would need to be transformed from text to XML format before some XSLT could make them look good in a Data View web part. I guess that could be done by some kind of server-side script?
But how and where do I add such a script to SharePoint?
Would it be a good idea to implement this as an RSS feed instead? But an RSS feed would also require a server-side script, wouldn't it?
I am new to SharePoint development and would appreciate any useful advice.
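If the text-to-XML route is pursued, the transformation itself is the easy part; a minimal sketch of the idea (shown as a shell script purely for illustration, with a made-up queue folder, and escaping only &, < and >):

{
  echo '<?xml version="1.0"?>'
  echo '<queue>'
  for f in /data/queue/*.txt; do
    echo "  <file name=\"$(basename "$f")\">"
    # escape the XML special characters in the file contents
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$f"
    echo '  </file>'
  done
  echo '</queue>'
} > queue.xml

That said, the answer below suggests a custom Web Part, which would avoid the intermediate XML file altogether.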
Why not just write a custom Web Part to read the contents of those text files and display them? This way you won't be making changes to those text files.
Note: the custom Web Part link is to my blog. There are tons of other articles on the net :)

Search words in PDF files

Is it possible to search for words in PDF files with Delphi?
I have code with which I can search many other file types (exe, dll, txt), but it doesn't work with PDF files.
It depends on the structure of the specific PDF.
If the PDF is made of images (scanned pages), then you have to OCR each image and build a full-text index inside the PDF. (To see if it's image-based, open it with Notepad and look for obj tags full of random characters.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.
If the PDF is a standard PDF with uncompressed text streams, then you may be able to open it like any other text file and search for the words.
Here is a page that details some of the structure of a PDF, and here is an SO post on the same topic.
The components/libraries mentioned in the answer to this question should do what you need.
I'm just working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and create an index on the resulting text. We do the same with Word and other Office files; it works pretty well!
Searching directly inside PDF files from Delphi (without an external app) is more difficult, I think. If you find anything, please update here, as I would also be very interested in that!
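For reference, the conversion step is just a matter of running pdftotext over each file and then searching or indexing the output; a quick sketch with made-up paths (pdftotext.exe on Windows, plain pdftotext elsewhere):

# convert every PDF in a folder to plain text
for f in docs/*.pdf; do
  pdftotext "$f" "${f%.pdf}.txt"
done
# list the files whose extracted text contains the search word
grep -l -i "searchword" docs/*.txt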
One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.
It supports almost any Office or Office-like file format, even DWG, MSG, PDF, and files inside ZIP/RAR archives.
The easiest way to use it is to run FiltDump.exe on any files you have, and index the text output.
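For example, something like the following should print whatever text the installed PDF IFilter extracts, redirected into a file for indexing (report.pdf is just a sample file name, and the exact output format depends on the filter):

FiltDump.exe report.pdf > report.txt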
To see which filters are installed on your PC, you can use IFilter Explorer.
Wikipedia has some links on its IFilter page.
Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the co-ordinates of those words - sometimes useful for highlighting.
PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.
