How to parse the Apache Solr index and remove bad documents

I am using Apache Solr 4.10. Its data is provided by crawling with Apache Nutch (on a Hadoop/HBase stack). Solr uses the local file system for its index storage. Now I have to parse the index and remove some bad documents, i.e. documents that have no content, etc.
How can I parse it? Is there any way to use Hadoop MapReduce for this purpose?
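One way to drop such documents without MapReduce is a delete-by-query against Solr itself. A minimal SolrJ sketch, assuming a Solr 4.10 core at http://localhost:8983/solr/collection1 and a field named content (adjust both to your setup):

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class PurgeEmptyDocs {
        public static void main(String[] args) throws SolrServerException, IOException {
            // Core URL is a placeholder; point it at your Solr 4.10 core.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Delete every document that has no value in the "content" field
            // (field name assumed; check the schema your Nutch job writes to).
            solr.deleteByQuery("*:* -content:[* TO *]");
            solr.commit();

            solr.shutdown();
        }
    }

If you do want to scan the raw index with MapReduce instead, the same filtering logic would go in the mapper, but for simply removing empty documents a delete-by-query is usually enough.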

Related

Google App Engine 32MB max request size limit - how to upload large files?

We've got a setup using Google App Engine with a Docker container running a Laravel application. Our users need to upload large video files (max 1028MB) to the server, which are in turn stored in GCS. But GAE returns a "413 Request Entity Too Large" error from nginx. I've confirmed this is not an issue with our server configs but a restriction on GAE.
This is a pretty common requirement. How do you guys get around this?
What I've tried:
Chunking, using the package https://github.com/pionl/laravel-chunk-upload and dropzone.js to break the file down when sending (still results in a 413).
The Blobstore API is not applicable for us, as we need to constantly retrieve and play the files.
As mentioned by @GAEfan, you can't change this limit on GAE. The recommended approach would be to upload your files directly to Google Cloud Storage and then process them from there.
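A common way to do that is to have the browser upload straight to Cloud Storage via a signed URL, so the large request never passes through App Engine's 32 MB front end. The thread is about Laravel, but the server-side step is the same idea in any language; here is a sketch with the Java Cloud Storage client (bucket and object names are placeholders, and it assumes credentials that are allowed to sign, such as a service account key):

    import java.net.URL;
    import java.util.concurrent.TimeUnit;

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.HttpMethod;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class SignedUploadUrl {
        public static void main(String[] args) {
            Storage storage = StorageOptions.getDefaultInstance().getService();

            // Placeholder bucket and object name for the incoming video.
            BlobInfo blob = BlobInfo.newBuilder(BlobId.of("my-bucket", "uploads/video.mp4")).build();

            // Signed URL valid for 15 minutes; the client PUTs the file straight to GCS,
            // so the upload bypasses the App Engine request size limit entirely.
            URL url = storage.signUrl(blob, 15, TimeUnit.MINUTES,
                    Storage.SignUrlOption.httpMethod(HttpMethod.PUT),
                    Storage.SignUrlOption.withV4Signature());

            System.out.println(url);
        }
    }

The client then does an HTTP PUT of the file to the returned URL; your app only handles the small signed-URL request and works with the GCS object afterwards.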

Document Indexing in ASP.NET

I am tasked with indexing a large number of documents (allowing for full-text search) and then searching this index using ASP.NET.
I am using a Windows Server 2012 environment.
I have done some reading up, but I'm still not sure which indexing service to use.
I have read about 'Microsoft Indexing Services' (I have read this is obsolete) and 'Windows Search' service.
Can anyone recommend a suitable service to use, and ideally give some pointers on how to use it?
Have a look at Lucene.Net. It's a .NET port of the Lucene search library. It's an Apache project and might be just what you're looking for.
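Lucene.Net tracks the Java Lucene API quite closely, so a sketch of the index-then-search flow in Java translates almost class-for-class (the index path, field names and sample text below are placeholders):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class FullTextDemo {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = FSDirectory.open(Paths.get("C:/search-index"));

            // Index one document with a stored id and an analyzed body field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("id", "doc-1", Field.Store.YES));
                doc.add(new TextField("body", "The quick brown fox jumps over the lazy dog", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Full-text search on the body field.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher.search(new QueryParser("body", analyzer).parse("quick fox"), 10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("id"));
                }
            }
        }
    }

In Lucene.Net the same types live under the Lucene.Net.* namespaces: you point an IndexWriter at a directory, add one Document per file you index, and run queries through an IndexSearcher from your ASP.NET code.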

Is it possible to use Nutch 2.x and Apache Gora with plain filesystem as backend storage

Is it possible to use Nutch 2.x and Apache Gora™ with a plain filesystem as the backend storage?
Official site says:
Nutch 2.x: An emerging alternative taking direct inspiration from 1.x,
but which differs in one key area; storage is abstracted away from any
specific underlying data store by using Apache Gora™ for handling
object to persistent mappings.
I want to use the latest version of Nutch (2.1 currently), but I don't want to set up a complex NoSQL or RDBMS backend for storage right now. I want to choose the backend storage later.
I didn't find any docs on using the filesystem as storage for Gora. Is it possible?
You could use the AvroStore, which saves into a file (serialized with Avro).
I say this only theoretically, since I have never used it...
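If you try that route, the wiring would roughly be: set storage.data.store.class to org.apache.gora.avro.store.AvroStore in nutch-site.xml, and point the store at a local file in gora.properties. A sketch only; the gora.avrostore.* property names and the file path are assumptions to verify against the Gora docs for your version:

    # gora.properties - AvroStore backed by a plain local file
    gora.datastore.default=org.apache.gora.avro.store.AvroStore

    # Input/output paths for the Avro file (placeholder location)
    gora.avrostore.input.path=file:///tmp/nutch-webpages.avro
    gora.avrostore.output.path=file:///tmp/nutch-webpages.avro

    # JSON or BINARY encoding of the records
    gora.avrostore.codec.type=JSON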

Apache Mahout with JNDI to connect to MySQLJDBCDataModel

I'm trying to use a MySQL database with Apache Mahout to get database-backed data. From what I have read so far, it seems like I have to use a web server like Tomcat to use JNDI for the database connection. I'm wondering if it is possible to use JNDI outside of a web server.
In short, can I use JNDI without a web server in Mahout?
I know it won't be worth creating a desktop-based recommender system, but for the time being I don't want to run my application inside a web server.
JNDI is a technology that is not specific to Tomcat, no. It is a naming and directory API, part of J2EE, and supported by most J2EE containers, like Tomcat, but also JBoss, etc.
I don't quite understand the question, since you would only use JNDI in the context of an app or web server like Tomcat. But you don't want to use Tomcat. So why do you want to use JNDI?
Certainly you don't need JNDI to use Mahout. Just pass it a DataSource that you configured, rather than looked up.
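For example, a sketch with a plain Connector/J DataSource configured in code (connection details are placeholders; the table and column names shown are Mahout's defaults for MySQLJDBCDataModel):

    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class NoJndiExample {
        public static void main(String[] args) throws Exception {
            // Plain Connector/J DataSource, configured in code instead of looked up via JNDI.
            MysqlDataSource dataSource = new MysqlDataSource();
            dataSource.setServerName("localhost");
            dataSource.setDatabaseName("recommender");
            dataSource.setUser("mahout");
            dataSource.setPassword("secret");

            // Table and column names are the Mahout defaults; pass your own if they differ.
            DataModel model = new MySQLJDBCDataModel(dataSource,
                    "taste_preferences", "user_id", "item_id", "preference", "timestamp");

            System.out.println("Users: " + model.getNumUsers());
        }
    }

From there the DataModel plugs into a recommender exactly as a FileDataModel would; no container and no JNDI lookup is involved.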

Where to save scraped images?

I'm building a Ruby on Rails app that scrapes images off a website. What is the best location to save these images to?
Edit:
To be clear, I know the file system is the best type of storage, but where on the file system? I suppose I have to stay in the RoR app directory, but which folder is best suited for this? public?
On your file server (a static Apache server), on your app server (save somewhere locally on disk and serve via the app server), or on Amazon S3.
But I would suggest not storing them in the database. (Some people think that's alright, so take this as just a suggestion.)
In RoR, under <app_name>/public/images - but the data will be public. If you are worried about privacy, this is probably not right.
If you are concerned about privacy, see the options discussed in "How to store private pictures and videos in Ruby on Rails". But as a suggestion: serving files from the app server can be painful under high traffic, and in my experience it is better off-loaded to a file server or a cloud store like S3.
It's not hard to write a small server that only serves images from a file store outside your website's directory structure. A simple rewrite of the URL can give your code the information it needs to locate the actual file, which it then streams to the browser.
An alternative is to map the image's URL to the image's path on disk in a database, then do a lookup. Make the URL field an indexed column and it will be very fast.
I wrote an image server in Ruby a couple of years ago along those lines and it was a pretty simple task.
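The thread is about Rails, but the "serve from a store outside the webroot" pattern is language-agnostic. As a rough illustration only (Java here, with a placeholder port and directory), the whole thing can be as small as this:

    import java.net.InetSocketAddress;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import com.sun.net.httpserver.HttpServer;

    public class TinyImageServer {
        public static void main(String[] args) throws Exception {
            // Image store outside the web app's own directory tree (placeholder path).
            Path root = Paths.get("/var/data/scraped-images").toAbsolutePath().normalize();

            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
            server.createContext("/images/", exchange -> {
                // Rewrite /images/foo.jpg -> /var/data/scraped-images/foo.jpg, refusing path traversal.
                String name = exchange.getRequestURI().getPath().substring("/images/".length());
                Path file = root.resolve(name).normalize();
                if (!file.startsWith(root) || !Files.isRegularFile(file)) {
                    exchange.sendResponseHeaders(404, -1);
                    exchange.close();
                    return;
                }
                byte[] body = Files.readAllBytes(file);
                // A real server would derive the content type from the file extension.
                exchange.getResponseHeaders().set("Content-Type", "image/jpeg");
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
        }
    }

The same lookup-by-URL idea works whether the mapping comes from the path itself, as above, or from an indexed database column.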
