Where are the crawled files stored in the Heritrix web crawler?

I want to know where the crawled files are stored in the Heritrix web crawler.
Thanks in advance.

From the developer manual:
By default, Heritrix writes everything it crawls to disk using ARCWriterProcessor. This processor writes the crawled content as Internet Archive ARC files. The ARC file format is described in the Arc File Format document [1]. Heritrix writes version 1 ARC files.
The ARC files are located in the arcs/ folder of your crawl instance. You can change the location in the settings of the Heritrix web GUI.
Instead of the default ARCWriterProcessor, you can use WARCWriterProcessor (WARC files), MirrorWriterProcessor (no container at all) or Kw3WriterProcessor. AFAIK, you can even configure multiple writers. Note that with MirrorWriterProcessor not all files may be written to disk, depending on the file system you are writing to.
[1] Internet Archive ARC files
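To check what a crawl has written so far, you can simply list the contents of that arcs/ folder. The following is a minimal sketch in plain Java, not part of Heritrix itself. The class name ArcFileLister, the default relative path "arcs" and the .arc / .arc.gz suffix filter are assumptions made for illustration; adjust them to your crawl instance and writer settings.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Heritrix class.
final class ArcFileLister {

    // Returns the sorted names of regular files in arcsDir that end in
    // .arc or .arc.gz (the suffixes are an assumption about the output names).
    static List<String> listArcFiles(Path arcsDir) throws IOException {
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(arcsDir)) {
            return names; // nothing written yet, or a different location is configured
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(arcsDir)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                boolean looksLikeArc = name.endsWith(".arc") || name.endsWith(".arc.gz");
                if (Files.isRegularFile(p) && looksLikeArc) {
                    names.add(name);
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Pass the arcs/ folder of your crawl instance as the first argument.
        Path arcsDir = Paths.get(args.length > 0 ? args[0] : "arcs");
        for (String name : listArcFiles(arcsDir)) {
            System.out.println(name);
        }
    }
}
```

If the list is empty even though the crawl has run, the output location was probably changed in the writer settings mentioned above.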

Related

iOS and OSX Document Packages cross platform

I have an iOS and OS X application which is document-based. I save a complex folder hierarchy inside the document, so I changed my UTI type to a document package.
The problem is that, according to Apple, a document package is just a folder. If I open the document package on a Windows or Linux machine, it is treated as a folder despite having a .abc extension. One solution I found is to zip the folder when saving, but I don't think that is a good approach, because every time I open the file I have to decompress the folder and then compress it again when re-saving.
Is there any other solution to this problem?
I found another, better solution.
The solution is to use an SQLite database as your document, as long as what you save on the file system is text. In my case I also had images, so I created one table for all the images and one table for all the file contents that I used to write to separate files. The document then has a custom extension (.abc) that is opened only by your application.

iOS requirements for data storage (using cache folder versus marking files not to be backedup)

My app was rejected for violating guideline 2.23.
After inspection, it appears that I was indeed not using a correct path for storing downloaded images and data files (i.e. files that I would prefer to have available for offline usage, but which the app can re-download if iOS removes them).
However, after looking at:
https://developer.apple.com/library/ios/qa/qa1719/_index.html
It appears that using a proper "cache" path may not even be enough for iOS > 5? Example:
/var/mobile/Applications/00000000-0000-0000-0000-000000000000/Library/Caches/
Will using the path above make my app pass this requirement? Or am I forced to use the API for marking files as not to be backed up?
Using the caches directory is correct if you can re-download the files. They will not be backed up. You only need to use the "do not backup" flag if the files exist in a location that normally is backed up (e.g. the documents directory).

Open a CBZ/CBR file in iOS

I am looking to integrate opening/viewing CBZ/CBR files in iOS 6 (a simple viewer, like the way UIWebView displays a PDF file, would be fine as well).
Are there any libraries (commercial or free) available for opening these file types?
Thanks in advance
CBR files are renamed .rar files and CBZ files are renamed .zip files, so you can look for a solution from there. I've never come across a library specifically targeted at them, although that doesn't rule out one existing. Since they are just standard compression archives renamed to make them more portable between CBR/CBZ readers, you should be okay with standard decompression libraries.
The library will produce a number of image files when decompression has finished. If you extract an archive with a standard decompression tool, you'll see how they are presented.
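The idea is independent of platform and language. As a rough illustration, here is a sketch in Java (not iOS code) that treats a CBZ file as an ordinary zip archive and lists its image entries. The class name CbzPageLister, the set of image extensions and the choice to sort pages by file name are assumptions for the example. CBR files need a RAR library, which the Java standard library does not provide.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads a CBZ as a plain zip archive.
final class CbzPageLister {

    // Extension check; the list of extensions is an assumption.
    static boolean isImage(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return n.endsWith(".jpg") || n.endsWith(".jpeg")
                || n.endsWith(".png") || n.endsWith(".gif");
    }

    // Returns the image entry names found in the archive, sorted by name.
    static List<String> listPages(InputStream cbz) throws IOException {
        List<String> pages = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(cbz)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && isImage(entry.getName())) {
                    pages.add(entry.getName());
                }
            }
        }
        Collections.sort(pages);
        return pages;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: CbzPageLister <file.cbz>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String page : listPages(in)) {
                System.out.println(page);
            }
        }
    }
}
```

On iOS the same approach applies with whichever zip or RAR decompression library you choose: extract the entries, keep the images, and display them in order.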

CodeModuleManager cannot allocate space for the module

I am trying to write an app that will download and install cod files.
I have the line:
CodeModuleManager.createNewModule(codData.length, codData, codData.length);
which is expected to return a module handle (an int). However, it returns 0, which means that space cannot be allocated for the module to be installed. I searched a bit but couldn't really find any information about what may be causing this. Any ideas?
I found the solution:
I used the COD files inside the deliverables/web directory.
When a COD file is above a certain size, it is partitioned into 2 (or more) COD files. In my case there were two: one named abc.cod and the other abc-1.cod. You need both COD files to perform the installation (this was the real problem).
I noticed that the deliverables/standard folder contains only 1 COD file, presumably because it has not been split into parts there. CodeModuleManager is then unable to allocate space for it as a whole, so the partitioning is necessary after all.
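Before installing, it can help to check that all parts are present. Below is a small sketch in plain Java that collects the sibling files for a module, following the abc.cod / abc-1.cod naming described above. The class name CodSiblings and the assumption that further parts continue as abc-2.cod, abc-3.cod and so on are illustrative only; verify the names against your own deliverables/web directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the BlackBerry API.
final class CodSiblings {

    // Returns base.cod followed by base-1.cod, base-2.cod, ... for as long
    // as consecutive parts exist in 'available'. Returns an empty list if
    // base.cod itself is missing.
    static List<String> collectParts(String baseName, Set<String> available) {
        List<String> parts = new ArrayList<>();
        String first = baseName + ".cod";
        if (!available.contains(first)) {
            return parts;
        }
        parts.add(first);
        // Numbering beyond "-1" is an assumption extrapolated from the example.
        for (int i = 1; available.contains(baseName + "-" + i + ".cod"); i++) {
            parts.add(baseName + "-" + i + ".cod");
        }
        return parts;
    }
}
```

For the case above, collectParts("abc", names) would return abc.cod and abc-1.cod when both are in the set, and each part then has to be downloaded and installed.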
However, even after this you can run into problems, such as your application's icon disappearing when you overwrite COD files (i.e. when you try to update an app that is already installed).
I found it more convenient to work with the .jad file. Just set up the right MIME types for your directory and put the JAD and COD files in there. Then open the JAD file in the browser and your app should be installed or updated automatically and smoothly by the OS itself.
Hope this proves helpful for someone else.

Blackberry - Where is phonegap (www) folder stored on Device?

I am developing for BlackBerry using PhoneGap and I need to copy all my app files to a writable location (I assume the app file location is read-only).
Setting up the plugin to do this in Java is easy enough. The problem I am having is finding the location of the files specific to my app so that I can copy and move them.
From previous research it seems that Class.getResourceAsStream would work, e.g.
getClass().getResourceAsStream("/index.html");
However, I do not understand how this can be specific to my app.
Thanks,
A BlackBerry application is packaged as a *.cod file. It is a kind of modified Java *.jar file with a hierarchical structure (folders and packages inside the archive).
When you run getClass().getResourceAsStream("/index.html"); you get the index.html file from the root package of your *.cod file. If the file was included at compile time, you will get it; otherwise the operation fails.
Since you want to use writable storage, consider the FileConnection API of the RIM SDK.
COD files are stored in a special location (not the filesystem), but you will need to deal with the filesystem if you want to write files to device memory or a memory card.
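Putting the two halves together, the copy itself is just a stream-to-stream loop. The sketch below uses only java.io so it runs anywhere. The class name ResourceCopier is hypothetical, and the in-memory streams in main stand in for the real endpoints: on a device the input would be the stream returned by getResourceAsStream, and the output would be a stream opened on a writable file, for example through a FileConnection (that wiring is an assumption and is not shown).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: copies a packaged resource stream to a writable stream.
final class ResourceCopier {

    // Copies everything from 'in' to 'out' and returns the number of bytes copied.
    // The caller remains responsible for closing both streams.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins: replace with getClass().getResourceAsStream("/index.html")
        // and an output stream on a writable file when running on a device.
        InputStream in = new ByteArrayInputStream("<html></html>".getBytes("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(in, out);
        System.out.println(copied + " bytes copied");
    }
}
```

Check the result of getResourceAsStream for null before copying, since that is how a missing resource is reported.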
