How to sort websites into categories based on keyword content

I'm writing a web robot which categorizes sites based on their keywords/meta tags/links into a predefined list of categories.
I've been looking at various ontology approaches, including WordNet (for the hypernym/hyponym relations), ResearchCyc and WebKB, and was wondering whether this is as hard a problem as I think it is, or whether it has been solved somewhere before.
Essentially I have large stacks of sorted keyword values and would like to use them to match against a category name. My current thought is to check them against the category name in some kind of ontology hierarchy.
Has anyone else approached an ontology-based problem like this?
Cheers!

You might want to look at text-mining research, specifically keyword mining and subject indexing.
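As a concrete starting point, here is a rough sketch of scoring scraped keywords against category names using WordNet's hypernym/hyponym hierarchy via NLTK. It assumes NLTK with the WordNet corpus downloaded, and the category list is made up:

```python
# Rough sketch: match scraped keywords against category names using
# WordNet path similarity. Assumes NLTK with the WordNet corpus
# installed (nltk.download('wordnet')); the categories are hypothetical.
from nltk.corpus import wordnet as wn

CATEGORIES = ["finance", "sport", "technology", "travel"]  # made-up list

def best_category(keywords):
    """Score each category by the best WordNet similarity of any keyword."""
    scores = {}
    for category in CATEGORIES:
        cat_synsets = wn.synsets(category)
        best = 0.0
        for kw in keywords:
            for kw_syn in wn.synsets(kw):
                for cat_syn in cat_synsets:
                    sim = kw_syn.path_similarity(cat_syn)
                    if sim is not None and sim > best:
                        best = sim
        scores[category] = best
    # Pick the category whose name sits closest to the keywords
    # in the hypernym/hyponym hierarchy.
    return max(scores, key=scores.get)

print(best_category(["football", "league", "goal"]))  # likely 'sport'
```

Path similarity is crude (it just counts edges in the hypernym tree), but it is often enough to rank a handful of category names against a stack of keywords.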

Related

How can I one-hot encode data that has the same values in multiple different properties?

I have data containing candidates who are looking for a job. The original data I got was a complete mess, but I managed to enhance it. Now I am facing an issue which I am not able to resolve.
One candidate record looks like this:
https://i.imgur.com/LAPAIbX.png
Since most ML algorithms cannot work with categorical data directly, I want to encode this. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need is to add a new column for each distinct value that exists in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1, and Tag2 of the original data, without repetition. The way I tried gives me far more attributes than I need, which results in an inaccurate model: I end up with newly created attributes Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3 and so on, for each possible option.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem based on your explanation; I will try to elaborate on how I would approach it. If that does not solve your problem, I may need more explanation. Let's get started.
For all the candidate data that you have, collect a master skill/knowledge list.
This list becomes your columns.
For each candidate, if he has a given skill, that column becomes 1 for his record; otherwise it stays 0.
This is the essence of one-hot encoding; however, since the same skill is scattered across multiple columns, you are struggling to encode it automatically.
An alternative approach could be:
For each candidate, collect all the knowledge skills as one list and assign it to a single knowledge column, and collect the tags as another list in a second column, instead of the current 4 (Knowledge) + 2 (Tag) columns.
Sort the knowledge (and tag) list alphabetically within its column.
Automatic one-hot encoding after this should yield fewer columns than before; see the sketch below.
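Here is a minimal sketch of that alternative in pandas; the column names are hypothetical stand-ins for the ones in your screenshots:

```python
# Minimal sketch: collapse scattered Knowledge*/Tag* columns into one
# list per candidate, then multi-hot encode. Column names are made up.
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["A", "B"],
    "Knowledge1": ["Jscript", "Java"],
    "Knowledge2": ["SQL", None],
    "Tag1": ["remote", "senior"],
})

skill_cols = ["Knowledge1", "Knowledge2"]

# Collapse the scattered columns into one sorted list per candidate,
# dropping empty cells, so 'Jscript' is the same value wherever it appeared.
df["skills"] = df[skill_cols].apply(lambda row: sorted(row.dropna()), axis=1)

# Multi-hot encode: one column per distinct skill, 1 if the candidate has it.
skills_encoded = df["skills"].str.join("|").str.get_dummies()
result = pd.concat([df[["Candidate"]], skills_encoded], axis=1)
print(result)
#   Candidate  Java  Jscript  SQL
# 0         A     0        1    1
# 1         B     1        0    0
```

The key step is collapsing Knowledge1..KnowledgeN into a single list per candidate before encoding, so 'Jscript' produces one column no matter which slot it originally occupied.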
Hope this helps!

Searching by several columns in different tables using one input field

I know that this question and its variations have been discussed many times, but I haven't found a solution that fits my case.
I have many tables, let's assume 10. I would like to provide a way of searching table1.field1, table2.field2, ... table10.field10 using only one form input field. As the user types a search phrase, an autocomplete field should suggest a list of available results, with a note about which table each one comes from.
The first idea that comes to mind is denormalization, i.e. creating a separate table Table_for_search(field1, ..., field10) that contains the data from the tables mentioned above.
My problem is that I wouldn't like to use any heavy solutions like Sphinx, ElasticSearch etc. Light solutions like the meta_search and ransack gems also don't look suitable for my situation.
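To make the denormalization idea concrete, here is a minimal sketch using SQLite from Python's standard library; the table and field names are hypothetical:

```python
# Minimal sketch of the denormalized search-table idea using SQLite.
# Table and field names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT);
    CREATE TABLE streets   (name TEXT);
    INSERT INTO customers VALUES ('John Smith');
    INSERT INTO streets   VALUES ('John Smith Street');

    -- One flat view over all searchable columns, tagged with its source
    -- table so the autocomplete list can say where each hit came from.
    CREATE VIEW table_for_search AS
        SELECT name AS value, 'customers' AS source FROM customers
        UNION ALL
        SELECT name, 'streets' FROM streets;
""")

prefix = "John"  # what the user has typed so far
rows = conn.execute(
    "SELECT value, source FROM table_for_search WHERE value LIKE ? LIMIT 10",
    (prefix + "%",),
).fetchall()
print(rows)  # [('John Smith', 'customers'), ('John Smith Street', 'streets')]
```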
Thanks in advance.

Lucene partial word matching

Lucene does not support partial word matching out of the box, so I need some help building my query.
Let's say I have a document with a field value "Develop".
I would like this document to be returned for the searches "Dev" and "lop".
Maybe by creating several queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a list of words and do the previous example for each word?
What you're asking for is, if I understand you correctly, not feasible on any large-scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see the links at the bottom). Fully fledged substring matching would be very nice to have, but it does not scale: you would never be able to query a decently sized index (say, more than a couple of dozen or hundred documents) in acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop': as long as you are happy with letting users search for syllables, I guess you can do something.
You would have to use a tokenizer that splits up the words in your index according to their syllables, and build the index over those syllables. (I am not sure there are built-in tokenizers for the English language that can do that, and writing one on your own might be tricky...)
An important thing to note: if you index the full words AND the separate syllables, your index will be much larger than if you only index one of the two.
However, I would not suggest indexing only syllables. If you also want to allow your users to search for the full word 'Develop' (which I guess you do), this would result in two queries with a logical AND between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries, they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
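If you do go down this road, here is a rough sketch of the indexing side, using fixed-length character n-grams as a crude stand-in for syllables (the syllable tokenizer itself is the hard part):

```python
# Rough sketch: index sub-word units so partial matches avoid a
# leading-wildcard scan. Uses character n-grams as a crude stand-in
# for a real syllable tokenizer.
from collections import defaultdict

def ngrams(word, n=3):
    """All length-n character substrings of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

index = defaultdict(set)  # sub-word unit -> set of document ids

def add_document(doc_id, text):
    for word in text.lower().split():
        for gram in ngrams(word):
            index[gram].add(doc_id)

add_document(1, "Develop")
add_document(2, "Envelope")

# 'dev' and 'lop' now both resolve with a direct index lookup.
print(index["dev"])  # {1}
print(index["lop"])  # {1, 2}
```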
Stemming
Another way to approximate what you're trying to do could be a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow searching for 'dev' but not for 'lop'...)
Again, I don't think such a stemmer already exists in Lucene. Writing one yourself would be a pain and involve working with/importing huge dictionaries.
Links
These might be worth looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf

Is there an algorithm to find out which words in a search string belong together?

I was thinking about text-driven search based on user input.
Often you are searching in a database of addresses, where you can find customers and so on.
Does anybody have any idea how to find out which of the typed words is the name, which is the street name, and which is the company name?
And secondly, if the name is a double name like "Lee Harvey", how can I find out that the two words Lee and Harvey belong together?
The same problem arises with company names like "frank the baker inc."...
Is there any algorithm or best-practice strategy?
Thanks for links, tutorials, scripts and all other help ;-)
What you basically want is a search engine :) Here are the basic steps you need to follow -
You need to create an 'inverted index' of the content you want searched on.
The index is a 'name' => 'value' pair mapping; you can shape these pairs in whichever way you want, tuned according to your data and needs.
E.g., for your problem of double names, you could split all your names into single words and index them like so -
'lee'=>'lee harvey'
'harvey'=>'lee harvey'
...
this way, when anyone searches for 'lee' they get 'lee harvey'. There are better approaches to this, such as "n-gram" indexing. Check it out...
You could also build separate indexes for names, addresses, emails etc., and when the user types a query, check it against all of your indexes with the approach suggested above, then merge the results. Maybe you could introduce a notion of rank so that you can sort the results and show the most recent or most relevant ones at the top. For that you need to figure out a way to score your terms...
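A minimal sketch of that idea, with one index per field so results can be merged and labelled; the data is made up:

```python
# Minimal sketch of a word-level inverted index, one per field,
# so hits can be merged and labelled by source. Data is made up.
from collections import defaultdict

records = {
    "name": ["Lee Harvey", "Frank the Baker Inc."],
    "street": ["Harvey Road"],
}

indexes = {field: defaultdict(set) for field in records}
for field, values in records.items():
    for value in values:
        for word in value.lower().split():
            indexes[field][word].add(value)

def search(term):
    """Check the term against every field index and merge the hits."""
    term = term.lower()
    return {field: sorted(idx.get(term, ())) for field, idx in indexes.items()}

print(search("harvey"))
# {'name': ['Lee Harvey'], 'street': ['Harvey Road']}
```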
Or don't classify at all: just perform a full-text search, then check the result items for which field contains the search terms. Also, you may display items in separate lists (terms found in name, terms found in address). The only difficulty is if John Smith is living on John Smith Street: you must then decide which list or lists the result item belongs to.

What is the best approach for interpreting a text input for geocoding purposes?

Consider the following site:
http://maps.google.com
It has a main text input where the user can type businesses, countries, provinces, cities, addresses and zip codes. I wonder what the best way to implement a search like this is. I suspect that Google Maps uses a full-text search over all kinds of data in the same table, and that it probably has a parser which classifies the input (e.g. between numeric input, like zip codes and coordinates, and textual input, like businesses and addresses).
With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or with AI tools like artificial neural networks and genetic algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you need to build its index. You wouldn't be querying your database directly (which could get very complicated... inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather new information from your database and add it to the index. It might also be necessary to rebuild the index to remove stale data.
With Lucene you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the Internet), it performs very well, and it is not terribly complicated. By using Lucene you avoid the cost of regular expressions (which are not the most performant text-searching mechanism), and you don't have to write your own parser. It should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
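Lucene itself is Java, but the same aggregate-then-query flow can be sketched in Python with Whoosh, a pure-Python library with a Lucene-like model; the schema and data here are made up:

```python
# Sketch of the aggregate-then-query flow using Whoosh, a pure-Python
# search library with a Lucene-like model. Schema and data are made up.
import tempfile
from whoosh import index
from whoosh.fields import Schema, TEXT
from whoosh.qparser import MultifieldParser

schema = Schema(
    business=TEXT(stored=True),
    city=TEXT(stored=True),
    zipcode=TEXT(stored=True),
)
ix = index.create_in(tempfile.mkdtemp(), schema)

# Periodically gather rows from the source tables and (re)index them.
writer = ix.writer()
writer.add_document(business="Argos", city="Manchester", zipcode="M1 1AA")
writer.add_document(business="Argos", city="Glasgow", zipcode="G1 1AA")
writer.commit()

# One free-text query matched against every field at once.
with ix.searcher() as searcher:
    parser = MultifieldParser(["business", "city", "zipcode"], schema=ix.schema)
    results = searcher.search(parser.parse("Argos Manchester"))
    for hit in results:
        print(hit.fields())  # {'business': 'Argos', 'city': 'Manchester', ...}
```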
I'd keep the data in one database. If the data got too big, or I knew it would be huge, I'd assign an id to each business, address etc., and then have other tables reference this data.
Regular expressions would only be necessary if the user could specify what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (sorry, I'm English)? Maybe you then derive the user's location from their IP. And what happens if they say:
business: Argos Scotland
Now you don't know whether the company name has two words, or whether there is a location next to it. All of this has to be taken into consideration.
P.S. Sorry if that made no sense.
You will need to pre-process the query before doing a full-text search on it. If you are using a GIS database, then you will already have columns like city, area code, country etc. Convert your query into tokens separated by spaces, commas, or both, then check the individual columns for matches. This way you will know which part of the query is the city, which the area code, and so on.
You could also try some naive approximation approaches. For example, six consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street" etc., which appear in many queries, and use some approximation to figure out what the user is looking for. Hope this helps.
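A rough sketch of those heuristics; the patterns and reference sets are illustrative guesses, not a real parser:

```python
# Rough sketch of token-classification heuristics for a geocoding query.
# The patterns and reference sets are illustrative, not a real parser.
import re

KNOWN_CITIES = {"manchester", "london"}        # would come from the GIS tables
PLACE_WORDS = {"road", "street", "restaurant"}

def classify_tokens(query):
    labels = {}
    for token in re.split(r"[,\s]+", query.strip()):
        low = token.lower()
        if re.fullmatch(r"\d{5,6}", token):
            labels[token] = "area code"        # e.g. six consecutive digits
        elif low in KNOWN_CITIES:
            labels[token] = "city"
        elif low in PLACE_WORDS:
            labels[token] = "place word"
        else:
            labels[token] = "free text"        # business name, etc.
    return labels

print(classify_tokens("Argos Manchester 560001"))
# {'Argos': 'free text', 'Manchester': 'city', '560001': 'area code'}
```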
