Identifying the most relevant document in an information retrieval system - machine-learning

I am developing a search engine modeled after Google in my spare time.
I am using the original Google research paper at http://infolab.stanford.edu/~backrub/google.html as my guideline.
As I am developing a very simplified version of Google, I am not using the PageRank algorithm at all for now.
So far I have developed a simple parser and indexer. The result is an inverted index that stores, for each unique word, the number of hits, the hit locations, and the document hash.
Now I am trying to develop a query engine. However, I am finding it hard to identify the most relevant document for a multi-token query.
Specifically, I am having difficulty calculating the proximity of the query words to each other in a document.
I have thought of an algorithm that scans each document for the query words and calculates a proximity score based on how close the query words are to each other. However, I suspect this would take a long time. I think there is a better way to do this that I am not aware of, and the research paper is too general to give an answer.
I am just looking for a pointer in the right direction.
Any sort of help would be very much appreciated.

Look at the inverted index section of "Search Engine Indexing" on Wikipedia http://en.wikipedia.org/wiki/Search_engine_indexing#Inverted_indices
Basically, you want to save the position of each occurrence of a word within a document; this makes it easy to compute proximity. This information is saved in the index.
The key point is to index your documents so you don't need to scan them every time. The search for keywords is done on the index that points to the documents containing those keywords.
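As a rough sketch of how proximity can come straight out of the index, assume the index maps each word to a dict of document id to a sorted list of word positions. That layout and the names `min_window` and `rank_by_proximity` are made up for illustration. The score used here is the smallest span of positions that contains every query word at least once:

```python
import heapq

def min_window(position_lists):
    """Smallest span (in word positions) holding one position from every list.

    Each list must be sorted. Returns None if any list is empty.
    """
    if not position_lists or any(not p for p in position_lists):
        return None
    # One heap entry per query word: (position, list index, offset in that list).
    heap = [(plist[0], i, 0) for i, plist in enumerate(position_lists)]
    heapq.heapify(heap)
    current_max = max(plist[0] for plist in position_lists)
    best = current_max - heap[0][0]
    while True:
        pos, i, off = heapq.heappop(heap)
        best = min(best, current_max - pos)
        if off + 1 == len(position_lists[i]):
            return best  # this word has no later occurrence, so we are done
        nxt = position_lists[i][off + 1]
        current_max = max(current_max, nxt)
        heapq.heappush(heap, (nxt, i, off + 1))

def rank_by_proximity(index, query_words):
    """(doc_id, window) pairs for docs containing all words, tightest first."""
    postings = [index.get(w, {}) for w in query_words]
    if not postings:
        return []
    # Only documents that contain every query word are candidates.
    common = set(postings[0])
    for p in postings[1:]:
        common &= set(p)
    scored = [(doc, min_window([p[doc] for p in postings])) for doc in common]
    return sorted(scored, key=lambda pair: pair[1])
```

Only the posting lists are touched, never the document text, so the cost depends on how often the query words occur rather than on document length.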
P.S. Don't forget that you're trying to keep the index as small as possible, so storing gaps (differences) between word positions instead of absolute positions will save some memory (as explained in J. Zobel, A. Moffat, "Inverted Files for Text Search Engines", page 23).
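To illustrate the gap idea: gaps only save space once they are written with a variable-length code, because small numbers then take fewer bytes. The sketch below uses a simple variable-byte code, which is my own pick for illustration and not necessarily the scheme the paper recommends. All function names are hypothetical.

```python
def to_gaps(positions):
    """Turn sorted positions into gaps, e.g. [3, 7, 12] -> [3, 4, 5]."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps

def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions

def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80  # the low-order group is written last, so flag it
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers
```

With positions like 1000 and 1003, the second value is stored as the gap 3 and fits in one byte, where the absolute position would need two.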

Related

Does Google Firestore and/or their Realtime DB have the querying capability to get posts by location (within x miles), order by date, and limit?

I am currently using Firestore for my iOS app and I need to implement a scalable solution for my posts feed. I need to get posts within, say, 20 miles, order them by date, and limit the number of posts fetched for pagination. Any and all database solutions would be very much appreciated! Thank you!
As a low-budget, low-effort alternative to libraries, we have stored the first few digits of the lat/long coordinates as a document or collection name and then accessed data that way. The first decimal place gives a resolution of roughly 7 miles of latitude (the value for longitude shrinks as you move away from the equator). So in your database you could have a collection or document named something like +33.6-112.0. This would be the reference in Firestore under which you put all data located around (33.6 N, 112.0 W). Be careful with how you round the exact location data before placing it in the respective document or collection.
Then you can retrieve all data at any location you want. This may not give you exactly 20 miles, but some client-side filtering and sorting can handle that. Note that you could make the reference go to any decimal place necessary to reach the level of precision you are looking for, which minimizes database calls (to save you money) and the impact on the user's cell data plan.
This is a rather simple solution with limitations, maybe for an MVP, and if not careful could pull way more data than anticipated.
Below is a chart showing the approximate physical distance between each decimal place at the equator. So for example, the distance between (33.3 N, 0 W) and (33.5 N, 0 W) would be about 14 miles.
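A minimal sketch of the bucketing described above, written in Python only to show the arithmetic (an iOS app would do the same in Swift). The function names, the choice to floor coordinates to the south-west corner of each cell, and the `lat`/`lng`/`created` fields on a post are all assumptions for illustration. The actual Firestore reads for each key are left out.

```python
import math

def _key(row, col, places):
    """Format integer cell indexes as a name such as '+33.6-112.0'."""
    scale = 10 ** places
    return f"{row / scale:+.{places}f}{col / scale:+.{places}f}"

def bucket_key(lat, lng, places=1):
    """Name of the cell containing (lat, lng); coordinates are floored, not rounded."""
    scale = 10 ** places
    return _key(math.floor(lat * scale), math.floor(lng * scale), places)

def neighbor_keys(lat, lng, places=1, ring=1):
    """Keys of the containing cell plus `ring` cells in every direction.

    Wrap-around at the poles and the antimeridian is not handled here.
    """
    scale = 10 ** places
    row, col = math.floor(lat * scale), math.floor(lng * scale)
    return [_key(row + dr, col + dc, places)
            for dr in range(-ring, ring + 1)
            for dc in range(-ring, ring + 1)]

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_posts(posts, lat, lng, radius_miles, limit):
    """Client-side step: keep posts within the radius, newest first, capped at `limit`."""
    close = [p for p in posts
             if haversine_miles(lat, lng, p["lat"], p["lng"]) <= radius_miles]
    close.sort(key=lambda p: p["created"], reverse=True)
    return close[:limit]
```

With one decimal place a 20 mile radius spans several cells, so `ring` would have to be 3 or more, which is one way this approach can pull far more data than expected.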
Neither of those databases has native geospatial querying capabilities. You would have to use some sort of add-on library to help with that. Geofire and Geofirestore are popular for this.

Summarization of text documents (multi-document, e.g. news) by finding events

Respected Sir/Madam,
I want to summarize text documents (any unstructured text, e.g. news data). My first target is to find the important events in the given text data. In the second step, I will select some of these events as the important ones (by some method).
Please suggest some papers on finding events in text (the more recent, the better).
Please also suggest some papers that find events using machine learning or soft computing.
Thank you
chandrtech15#gmail.com
http://www.google.com/cse?cx=011664571474657673452%3A4w9swzkcxiy&cof=FORID%3A0&q=event+extraction#gsc.tab=0&gsc.q=event%20extraction&gsc.page=1 This is a Google search over the ACL (Association for Computational Linguistics) Anthology for "event extraction". There should be many relevant papers in the results.

Lookup telephone area code by latitude and longitude

Looking for a way to get a list of telephone area codes for a given latitude and longitude (and if necessary a given intl. code.) Note, I'm not talking about international dialing prefixes but the area codes within them.
For example, Denver Colorado is covered by the area codes 303 and 720. It's at 39.739 -104.985 and is in NANP 1. So given 39.739,-104.985,1 I'd like to get back [303,720].
Libraries, web services, DB's, or raw data that needs to be parsed into a DB, e.g., a web page of shape points, are all fine and the more global coverage the better, but just NANP 1 would be a great help.
Note I already use MaxMind and could turn the lat-lng into a fake IP and use that as the lookup key, but MaxMind claims only U.S. area codes (whether they truly mean U.S. or actually NANP I haven't tested) and seemingly only 1 per location (e.g. just 303 for Denver.) So it's a possibility, just not a great one.
UPDATE: I found some more relevant information, but no definitive solutions so I'm listing it here rather than in an answer:
I was able to find two U.S. databases http://www.area-codes.com/area-code-database.asp and http://www.nationalnanpa.com/area_codes/index.html (50% down the page, MS Access file.) The former includes lat/lng for $450 and the latter would require nearest-neighbor matching as KeithS talks about (it's probably the same DB underlying the NANPA City Query he found.)
Additionally I found information that implies Teleatlas has area code boundary maps and that ESRI includes area code shape files with copies of ArcGIS. Maponics seems to have data available: there's a Google Maps implementation of Maponics' data at http://www.usnaviguide.com/areacode.htm.
Wow. You'll definitely need some sort of pre-existing database of points. My first thought was ZIPList5 Geocode. It includes lat-long data for each active U.S. ZIP code, so you can throw this data in a DB table, index the hell out of it, and search by just about any geographic info you'd have access to. You can buy one copy for $40, with enterprise-level use for $100. Only problem is that this DB has only the "primary" area code for each ZIP code, so metro areas that have more than one (Dallas, Chicago, NYC) aren't going to show all of them.
You could try a two-pronged approach with some free data I found: for a given latitude and longitude, do a nearest-neighbors search of the data in the USGS Geographic Names Information System; it includes information on every human habitation center, and every named landmark feature, with lat/long coordinates of their centers. You now have your lat/long point mapped to the nearest town/city, ZIP code, county, and state. Now, you can compare that against this list of U.S. Area Codes, to find area codes matching any or all of the identifying information from the USGS. This is all free, and will eventually get you what you need, but you'll probably have to do some work to "massage" the two sets of data into something you can efficiently cross-reference, and/or you'll need to implement a good "search engine" that will accurately find nearest-neighbor named points, and then find area codes for locations matching the names.
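A small sketch of that two-step lookup, assuming the two data sets have already been massaged into a list of (name, state, lat, lng) rows and a dict keyed by (name, state). The rows below are hand-typed stand-ins with approximate coordinates, not real GNIS records, and the function names are hypothetical.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_place(lat, lng, places):
    """Brute-force nearest neighbour over (name, state, lat, lng) rows."""
    return min(places, key=lambda row: haversine_km(lat, lng, row[2], row[3]))

def area_codes_for(lat, lng, places, codes_by_city):
    """Map the point to its nearest named place, then look up that place's codes."""
    name, state, _, _ = nearest_place(lat, lng, places)
    return sorted(codes_by_city.get((name, state), []))

# Illustrative stand-in data only.
PLACES = [
    ("Denver", "CO", 39.739, -104.985),
    ("Colorado Springs", "CO", 38.834, -104.821),
]
CODES_BY_CITY = {
    ("Denver", "CO"): [720, 303],
    ("Colorado Springs", "CO"): [719],
}

print(area_codes_for(39.739, -104.985, PLACES, CODES_BY_CITY))
```

The linear scan is fine for a sketch. For the full data set you would probably want a spatial index (a k-d tree, or a geospatial index in the database) instead of checking every row.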
One more thing to look at is NANPA, which administers area code assignment to begin with. I'm sure they have a more comprehensive downloadable DB, but the only free public access I could find was this search page, which will find area codes for any city with >20k people. You could turn your lat/long data into a city and state, and then hit this search page: NANPA City Query
Here is an option:
http://geocoder.ca/39.739,-104.985?geoit=xml
<TimeZone>America/Denver</TimeZone>
<AreaCode>720,303</AreaCode>

Represent the search result by adding relevant description

I'm developing a simple search engine. If I search for something using my search engine, it produces a list of URLs related to the search query.
I want to present each search result with a small, relevant description under the resulting URL (e.g. if you search for something on Google, you can see that it provides a small description with each resulting link).
Any ideas?
Thanks in advance!
You need to store the position of each word in a webpage while indexing.
Your index should contain: the word ID, the document ID of the document containing this word, the number of occurrences of the word in that document, and all the positions where the word occurred.
For more info you can read the research paper by Google's founders:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
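Here is one way the stored positions could be turned into a description, as a sketch only. It assumes you can also get at the document's words as a list (kept at index time or re-fetched), and it picks the fixed-width window that contains the most query-word hits. The names `best_window_start` and `make_snippet` are made up.

```python
def best_window_start(hit_positions, width):
    """Position of the first hit in the `width`-word window with the most hits."""
    hits = sorted(hit_positions)
    best_start, best_count, left = 0, 0, 0
    for right, pos in enumerate(hits):
        # Drop hits from the left until hits[left]..pos fits inside `width` words.
        while pos - hits[left] >= width:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_start, best_count = hits[left], count
    return best_start

def make_snippet(tokens, hit_positions, width=12):
    """Join the chosen window of words, marking truncation with '...'."""
    start = best_window_start(hit_positions, width) if hit_positions else 0
    end = min(len(tokens), start + width)
    text = " ".join(tokens[start:end])
    if start > 0:
        text = "... " + text
    if end < len(tokens):
        text = text + " ..."
    return text
```

`hit_positions` would be the union of the position lists of all query words for that document, read from the index.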
You can fetch the content of that page's meta description tag and display it as a small description. Google also does this.
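For example, with nothing but Python's standard `html.parser` module, pulling out the meta description could look roughly like this. The class and function names are my own, and pages without such a tag simply yield `None`.

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Remembers the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "description":
            self.description = attr.get("content", "").strip() or None

def meta_description(html):
    """Return the page's meta description, or None if there is none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

Since many pages have no description tag, or a useless one, you would still want a fallback that builds the text from the page body.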

tag generation from a small text content (such as tweets)

I asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords).
And it seems like the accepted suggestion (the point-wise mutual information algorithm) is meant to work on bigger documents.
With this constraint (working on small sets of text), how can I generate tags?
Regards
Two Stage Approach for Multiword Tags
You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.
For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with just a simple for loop over all the tweets. However, if speed is a concern, you could use pylucene to quickly find the tweets that contain each collocation.
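A self-contained sketch of the two stages, using a plain bigram PMI score as a stand-in for the NLTK collocation code mentioned above. The scoring, the whitespace tokenizer, the `min_count` threshold and the function names are my assumptions, not the linked code.

```python
import math
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def top_collocations(tweets, n=10, min_count=2):
    """Stage 1: pool all tweets and return the n bigrams with the highest PMI."""
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        words = tokenize(tweet)
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # bigrams never span two tweets
    total = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scored = []
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs get inflated PMI, so skip them
        pmi = math.log((count / total_bigrams)
                       / ((unigrams[a] / total) * (unigrams[b] / total)))
        scored.append((pmi, (a, b)))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:n]]

def tag_tweets(tweets, collocations):
    """Stage 2: tag each tweet with the collocations that occur in it."""
    tagged = []
    for tweet in tweets:
        words = tokenize(tweet)
        present = set(zip(words, words[1:]))
        tagged.append([" ".join(c) for c in collocations if c in present])
    return tagged
```

Stage 2 here is the simple loop over all tweets. A search index would only be worth adding once that loop becomes too slow.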
Tweet Level PMI for Single Word Tags
As also suggested here, for single-word tags you could calculate the point-wise mutual information of each individual word and the tweet itself, i.e.
PMI(term, tweet) = log [ P(term, tweet) / (P(term)*P(tweet)) ]
Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection. You could then tag the tweet with a few terms that have the highest PMI with the tweet.
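To make the formula concrete, here is a sketch that estimates the three probabilities from raw token counts: P(term, tweet) is the share of all tokens in the collection that are this term in this tweet, P(term) is the term's share of all tokens, and P(tweet) is the tweet's share of all tokens. Those estimates and the function name `pmi_tags` are my assumptions. Other estimates are possible.

```python
import math
from collections import Counter

def pmi_tags(tweets, tweet_index, k=3):
    """Top-k words of one tweet ranked by PMI(term, tweet)."""
    docs = [t.lower().split() for t in tweets]
    collection = Counter(w for d in docs for w in d)
    total = sum(collection.values())
    doc = docs[tweet_index]
    in_doc = Counter(doc)
    p_tweet = len(doc) / total
    scores = {}
    for term, count in in_doc.items():
        p_joint = count / total
        p_term = collection[term] / total
        scores[term] = math.log(p_joint / (p_term * p_tweet))
    # Highest PMI first; ties are broken alphabetically to keep output stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Words that appear everywhere, such as "the", end up with a low or negative score, while words concentrated in the one tweet score highest.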
General Changes for Tweets
Some changes you might want to make when tagging tweets include:
Only use a word or collocation as a tag for a tweet if it also occurs in at least a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet but that are not seen anywhere else, e.g. misspellings and keyboard noise like ##$##$%!.
Scale the number of tags used with the length of each tweet. You might be able to extract 2 or 3 interesting tags for longer tweets. But, for a shorter 2 word tweet, you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags you want to extract given the tweet length.
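Both adjustments can be sketched as a small post-processing step on whatever scores you already have. The thresholds (`min_df=2`, one tag per 4 words) and the function names are placeholders to experiment with, not recommended values.

```python
from collections import Counter

def document_frequency(tweets):
    """Number of tweets each word occurs in (counted once per tweet)."""
    df = Counter()
    for tweet in tweets:
        df.update(set(tweet.lower().split()))
    return df

def select_tags(scored_terms, df, tweet_length, min_df=2, words_per_tag=4):
    """Pick tags from {term: score} for a single tweet.

    Terms seen in fewer than `min_df` tweets are dropped, and the number of
    tags grows with the tweet length (at least one tag is always allowed).
    """
    max_tags = max(1, tweet_length // words_per_tag)
    eligible = [t for t in scored_terms if df[t] >= min_df]
    eligible.sort(key=lambda t: (-scored_terms[t], t))
    return eligible[:max_tags]
```

`scored_terms` can hold PMI values or any other score, since the filter only looks at how many tweets a term appears in.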
I have used a method earlier, for small text content such as SMSes, where I would just repeat the same line two times. Surprisingly, that works well for such content where a noun could well be the topic. I mean, you don't need it to repeat for it to be the topic.

Resources