Nutch - crawl domain first, then external URLs

I am new to Nutch and I am trying to make it do some specific crawling, i.e. I want it to first go e.g. 3 levels deep within one specific domain (e.g. wikipedia) - that part can be achieved by modifying the regex-urlfilter file.
But then I want it to crawl all the external links it fetched before, only 1 level deep.
So, my question is: is there any way to get the list of crawled links from the first run so that they can be used as seeds for the second crawl?

You can get the list of crawled urls using this command:
bin/nutch readdb crawl/crawldb -dump file
You can then manually edit the urls/seed.txt file with the output from that command.
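For the handoff between the two runs, something like this should work (a minimal sketch; the dump directory name, the part-file layout, and the wikipedia filter are illustrative assumptions):

bin/nutch readdb crawl/crawldb -dump dump
# Each record in the dump starts with the URL as its first field,
# so keep the first column of lines that begin with "http".
grep -h "^http" dump/part-* | awk '{print $1}' | sort -u > urls/seed.txt
# Optionally drop the domain you already crawled, e.g.:
# grep -v "wikipedia.org" urls/seed.txt > urls/seed_external.txt

Remember to loosen regex-urlfilter.txt for the second run, since it was restricted to the first domain.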

Related

In Laravel 5.6, how do I create relative links?

I'm new to Laravel (using 5.6) and can't get my links to work.
My directory structure is: resources/views/pages/samples
In the samples directory, I have 10 blade files I want to link to (named "sample1.blade.php", etc.). I have a "master" links page in the pages directory (one level up from samples).
I've tried several href variations for a "Sample 1" link (the original anchor markup was stripped from this page), and a few other variations, but can't get any of them to work correctly.
I've also tried adding a base tag to the HTML header but that doesn't help.
Every time I click a link, it says "Sorry, the page you are looking for could not be found."
What am I missing?
Thanks @happymacarts, I didn't realize I had to add a route for every single page in my site.
After adding the routes, the links are working.
I will get into the practice of updating the routes every time I add a page.
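For anyone else who lands here, a minimal sketch of the fix (the route and view names are illustrative; this assumes Laravel 5.6's Route::view helper and the directory layout above):

// routes/web.php - register one route per blade file;
// the dotted view name mirrors resources/views/pages/samples/sample1.blade.php
Route::view('/samples/sample1', 'pages.samples.sample1')->name('samples.sample1');

In the master links page the anchor then becomes <a href="{{ route('samples.sample1') }}">Sample 1</a>, so the URL comes from the named route instead of a hand-built relative path.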

Loading a .trig file with inference to Fuseki using the 'tdbloader' bulk loader?

I am currently writing some Java code that extracts some data and writes it out as Linked Data, using the TriG syntax. I am using Jena, and Fuseki to create a SPARQL endpoint to query and visualize this data.
The data is written so that each source dataset gives me a .trig file containing one named graph. So I want to load those files into Fuseki. Except that it doesn't seem to understand the TriG syntax...
If I remove the named graphs and rename the files as .ttl, everything loads perfectly into the default graph. But if I try to import the TriG files:
using Fuseki's webapp uploader, it either crashes ("Can't make new graphs") or adds nothing except the prefixes, as if graphs other than the default one could not be added (the logs say nothing helpful beyond the error code and description);
using Java code, the process is too slow. I used the technique from "Loading a .trig file into TDB?", but my TriG files are pretty big, so this solution is not very good for me.
So I tried to use the bulk loader, the console command 'tdbloader'. This time everything seems fine, but in the webapp, there is still no data.
You can see the process going fine in the console (quads are reported as added just fine), but the resulting dataset still contains only the default graph and its original data: nothing is added.
So, I don't know what to do. The people behind Jena and Fuseki advise against calling the bulk loader from Java code (as opposed to using the command-line tool), so that is a solution I'd like to avoid.
Did I miss something obvious about how to load TRIG files to Fuseki? Thanks.
UPDATE:
As it seemed to be a problem with my configuration (see the comments of this post for a link to my config file; I cannot post more than 2 links), I tried to add an explicit specification of the named graphs I would like to see added to the dataset in Fuseki.
I added configuration to link (with ja:namedGraph) the external graphs that I added via tdbloader. This seems to work. Great!
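For context, the relevant fragment of an assembler file for this looks roughly as follows (a sketch; prefixes, graph names, and the store location are illustrative, not my exact configuration):

# Dataset assembled from a TDB store, exposing one named graph explicitly.
<#dataset> rdf:type ja:RDFDataset ;
    ja:namedGraph [
        ja:graphName <http://example.org/mygraph> ;
        ja:graph <#mygraph>
    ] .

# The named graph is read from the TDB store populated by tdbloader.
<#mygraph> rdf:type tdb:GraphTDB ;
    tdb:location "DB" ;
    tdb:graphName <http://example.org/mygraph> .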
Now another problem: there is no inference, even though my config file specifies an inference model. I set queries to be applied with the named graphs merged as the default graph, but this does not seem to carry the OWL inference rules... So simple queries work, but I have 1/ to specify the graph I query (with FROM) and 2/ no inference over my data.
The two methods are to use the TDB bulk loader offline, or to POST data into the dataset directly (i.e. HTTP POST operations to http://localhost:3030/ds).
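For the POST route, something along these lines should work (a sketch; the dataset name /ds and the file name are assumptions):

# Send a TriG file to the dataset endpoint; the quads go into
# the named graphs declared inside the file.
curl -X POST --data-binary @data.trig \
     -H 'Content-Type: application/trig' \
     http://localhost:3030/ds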
You can test whether your graphs are there with a query like:
SELECT (count(*) AS ?C) { GRAPH ?g { ?s ?p ?o } }
The named graphs will show up when the Fuseki server is started unless your configuration of the SPARQL services only exports one graph.

I can't find where a string is getting defined -- any tricks to find its source?

I'm using:
Rails 3.2x
Spree 1.2
Ruby 1.9.3x
I'm trying to edit the title of one of my pages, and I cannot find where it is getting defined. It is showing up in my base ERB file as 'title', but that name is sufficiently generic to make it next to impossible to find where it is defined.
I have prodded everywhere I can think of, and I've tried searching for "title =", but nothing is working. I tried calling source_location on it, but that appears to work only on methods.
Any tricks for finding where a variable is defined?
I can't think of an elegant way. A dumb-but-probably-effective way would be to dump a stack trace in your ERB, then see what those locations are doing and whether title is defined there. It has to be set somewhere between the start of the program and the invocation of your ERB.
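In ERB that could be as simple as this (a throwaway sketch; remove it once you've found the assignment):

<%# temporary: write the render call stack to the Rails log %>
<% Rails.logger.debug caller.join("\n") %>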
When I can't find something, I use grep -ri some_string . at the command-line to recursively search all the content of the directory.
It's also a good tactic to let your editor search all the source code, since the ones worth using have the ability to search through all files in a directory.
(From a comment by the asker: "it is created from a mixture of product names, a site config, and something else.")
An alternate trick is to add an HTML-comment section to your ERB file and put the pertinent information about the components used to create the title into that section. Then let the pages be generated and look inside the page's source to determine which table and row ID it is, the site_config filename, etc.
You really should be able to figure it out based on the parts that are concatenated to build the title, and then search your database or files. That information isn't magically created out of thin air by Rails; someone had to tell Rails how to define the title. But people move on, or they don't document correctly, so try the embedded-information trick.
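A sketch of what that comment might look like (the title helper and the Spree site_name preference are assumptions based on the description above):

<!-- title debug:
     raw title: <%= title.inspect %>
     site name: <%= Spree::Config[:site_name].inspect %>
-->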

link analysis using nutch

I am new to Nutch. I have crawled some urls using Nutch, and now I want to get the linkrank of them. I read about it here. The problem is that I can't create the webgraphdb. In my crawl directory I have linkdb, segments and crawldb directories. I need it when I run the command
./nutch webgraph -segment <seg name> -segmentDir <seg dir> -webgraphdb <webgraphdb>
I need to give the address of the webgraphdb. How should I generate it? My Nutch version is 1.7.
The webgraph command is for generating or updating webgraphs. You can pass any path as the value of the -webgraphdb argument; if a directory with that name does not exist, Nutch will create one for you.
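So the whole linkrank flow looks something like this (a sketch, assuming your crawl lives in crawl/; the webgraphdb path is arbitrary):

# build the webgraph from the fetched segments
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
# compute link rank scores over that graph
bin/nutch linkrank -webgraphdb crawl/webgraphdb
# push the scores back into the crawldb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb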

How to localize UI files in PySide

I created an application which loads its UI dynamically from .ui files (added to the application's resources). I haven't translated them with pyside-uic; I load them as-is. Now I want to localize the application, but I don't understand how.
I tried to generate a TS file using pyside-lupdate widget1.ui widget2.ui ... -ts my.ts and got a document with multiple context nodes, and Linguist does not show all the records (only about 7 records, and I do not understand which ones exactly).
So, my question: how to translate dynamically loaded UI files?
Found it. Quite simple, but not always obvious.
Execute pyside-lupdate file1.ui file2.ui ... fileN.ui -ts translations\ru_RU.ts. You get a TS file with multiple contexts (that is OK; I was wrong about it being a problem).
Open the TS file with Linguist. Ensure that the 'Context' option is checked in the View -> Views menu.
...
PROFIT!!!
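For completeness, the runtime side looks roughly as follows (a sketch; it assumes the TS file was compiled to translations/ru_RU.qm with lrelease, and that the .ui files are loaded with QUiLoader after the translator is installed):

# minimal sketch: install the compiled translation before loading any UI
from PySide.QtCore import QTranslator
from PySide.QtGui import QApplication

app = QApplication([])
translator = QTranslator()
translator.load("translations/ru_RU")  # finds ru_RU.qm
app.installTranslator(translator)
# ...load the .ui files with QUiLoader here; translatable strings
# in them are resolved through the installed translator.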
