I am building a language-learning application with Rails and PostgreSQL.
Texts get uploaded. The texts will be of varying length, but let’s assume they’ll be 100-3000 words long.
On upload, each text position gets transformed into a “token”, representing information about the word at that position (base word, noun/verb/adjective/etc., grammar tags, definition_id).
On click of a word in the text, I need to find (and show) all other texts in the database that have words with the same attributes (base_word, part of speech, tags) as the clicked word.
The easiest and most relational way to do this is a join table, TextWord, between Text and Word. Each text_word would represent a position in the text and would contain the text_id, word_id, grammar_tags, start_index, and end_index (rough sketch below).
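For concreteness, here is roughly the lookup that schema would support, as a runnable sketch (the table and column layout is just my guess at how it could look; I use sqlite3 only so the snippet runs standalone, the SQL has the same shape in Postgres):

    # Sketch of the TextWord join-table lookup.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, base_word TEXT, pos TEXT);
    CREATE TABLE text_words (
        text_id INTEGER REFERENCES texts(id),
        word_id INTEGER REFERENCES words(id),
        grammar_tags TEXT,
        start_index INTEGER,
        end_index INTEGER
    );
    CREATE INDEX idx_text_words_word ON text_words (word_id);
    """)
    db.executemany("INSERT INTO texts VALUES (?, ?)",
                   [(1, "Trip report"), (2, "Directions"), (3, "Memoir")])
    db.executemany("INSERT INTO words VALUES (?, ?, ?)",
                   [(10, "leave", "verb"), (11, "left", "noun")])
    db.executemany("INSERT INTO text_words VALUES (?, ?, ?, ?, ?)",
                   [(1, 10, "past", 2, 6),   # "I left Nashville"
                    (2, 11, "", 7, 11),      # "take a left at the light"
                    (3, 10, "past", 5, 9)])

    # Clicked word resolved to word_id 10 ("leave", verb, past tense):
    # find every *other* text containing a token with the same attributes.
    rows = db.execute("""
        SELECT DISTINCT t.id, t.title
        FROM text_words tw JOIN texts t ON t.id = tw.text_id
        WHERE tw.word_id = ? AND tw.grammar_tags = ? AND t.id <> ?
    """, (10, "past", 1)).fetchall()
    print(rows)  # [(3, 'Memoir')] -- the noun "left" in text 2 doesn't match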
However, if a text has between 100 and 3000 words, this would mean 100-3000 entries for each text object.
Is that crazy? Expensive? What problems could this lead to?
Is there a better way?
I can’t use Postgres full text search because, for example, if I click “left” in “I left Nashville”, I don’t want “take a left at the light” to show up. I want only “left” as a verb, as well as other forms of “leave” as a verb. Furthermore, I might want only “left” with a specific definition_id (ex. “Left” used as “The political party”, not “the opposite of right”).
The other option I can think of is to store a JSON on the text object, with the tokens as a big hash of hashes or array of hashes (either way). Does PostgreSQL have a way to search through that kind of nested data structure?
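For illustration, the token structure I have in mind for this option would look something like this (the field names are just the ones described above; the values are made up):

    # Sketch of the per-text token JSON -- just the shape,
    # not a schema I've settled on.
    tokens = [
        {"base_word": "I", "pos": "pronoun", "grammar_tags": [],
         "definition_id": 981, "start": 0, "end": 1},
        {"base_word": "leave", "pos": "verb", "grammar_tags": ["past"],
         "definition_id": 1204, "start": 2, "end": 6},
        {"base_word": "Nashville", "pos": "noun", "grammar_tags": ["proper"],
         "definition_id": 77, "start": 7, "end": 16},
    ]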
A third option is to have the same JSON as option 2 (to store all the positions in a text), and a 2nd json on each word object / definition object / grammar object (to store all the positions across all texts where that object appears). However, this seems like it might take up more storage than a join table, and I’m not sure if it would bring any tangible benefit.
Any advice would be much appreciated.
Thanks,
Michael.
An easy solution would be to have a database with several indexes: one for the base word, one for the part-of-speech, and one for every other feature you're interested in.
When you click on "left", you identify that it's a form of "leave" and a verb in the past tense. Now you go to your indexes and get all token positions for "leave", "verb", and "past tense". You take the intersection of all the index positions, and you are left with the token positions of the forms you're after.
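A minimal sketch of that in Python (all data made up): each index maps a feature value to a set of (text_id, token_no) positions, and a click becomes a set intersection.

    from collections import defaultdict

    base_index = defaultdict(set)  # base word -> positions
    pos_index = defaultdict(set)   # part of speech -> positions
    tag_index = defaultdict(set)   # grammar tag -> positions

    def index_token(text_id, token_no, base, pos, tags):
        position = (text_id, token_no)
        base_index[base].add(position)
        pos_index[pos].add(position)
        for tag in tags:
            tag_index[tag].add(position)

    index_token(1, 1, "leave", "verb", ["past"])  # "I left Nashville"
    index_token(2, 2, "left", "noun", [])         # "take a left at the light"

    # Click on "left" in text 1 -> it's "leave", verb, past tense:
    matches = base_index["leave"] & pos_index["verb"] & tag_index["past"]
    print(matches)  # {(1, 1)} -- the noun "left" is filtered out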
If you want to save space, have a look at Managing Gigabytes, which is an excellent book on the topic. I have in the past used that to fully index text corpora with millions of words (which was quite a lot 20 years ago...)
I have a plaintext column in my Google spreadsheet, several rows (cells) of which have already been filled with a limited number of strings, let's say for simplicity "January", "February", "March", etc.
I would like to format the column such that, when entering text in new (empty) cells, rather than having to type the text from scratch, I instead get to choose from a drop-down list populated with the strings that already exist in other cells of that column (all 12 months, in the example above).
Or, alternatively, to have an auto-complete that would suggest, say, "March" and "May", once I start typing "M". Strangely, I haven't seen this basic feature at work in GSheets for a while, even though the EnableAutocomplete option is checked in the menu.
Among the two options, I would prefer the one with the drop-down list over the autocomplete one, but ultimately either would be of massive help. The idea is, once the number of unique strings becomes high (but there is also a lot of repetition), to reduce the chance of making a typo when entering new values just because they happen to differ by one letter from a string that already exists elsewhere.
Is there a way to do this just via the GUI/addons? I know this is possible to do in Excel for the header row (screenshot below), but I don't know of a way to do that also in GSheets, and in either case, what I need is to have this sort of selection list at the cell- rather than at the header-row level.
What you are looking for is called Data Validation (Data → Data validation in the menu). Set the criteria to "List from a range" and point it at the column that holds your existing values; new cells in the validated range then show a drop-down of those values. You can also select various other options for the criteria, such as whether to show the in-cell drop-down and whether invalid input is rejected outright or only flagged with a warning.
My app has thousands (maybe millions?) of models, let's call them Paragraphs, that contain text. The primary use of that text is to display it on a webpage. Sometimes that text is searched over for various other reasons too.
Some of the words in some of these paragraphs have associated metadata, like formatting, hyperlinks or other data-attributes that have meaning for my javascript in the front end.
Right now, I'm just sticking the final HTML tags straight into the text, so it ends up being stored like this:
<strong>Jimmy</strong> is walking his dog which is <span class="something" data-metadata_id="2343">brown</span>.
This works well for the primary purpose of displaying the text, but is very ugly when I want to search over my text, or do other processing on it. Is there a better way? Is there a gem that handles this sort of thing?
It makes sense to put both versions in your database: a display one and an index one. Disk is cheap. Especially if you're using Solr or similar (very recommended if you're doing string search), you can store (but not index) the HTML, and index (but not store) the plain text version, in two different fields of the same record.
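For the index field, the plain-text version can be derived from the stored HTML. A minimal sketch using only the Python standard library (in a Rails app you would more likely reach for a sanitizer, but the idea is the same):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.parts = []

        def handle_data(self, data):
            self.parts.append(data)

    def to_plain_text(html):
        extractor = TextExtractor()
        extractor.feed(html)
        return "".join(extractor.parts)

    html = ('<strong>Jimmy</strong> is walking his dog which is '
            '<span class="something" data-metadata_id="2343">brown</span>.')
    print(to_plain_text(html))
    # Jimmy is walking his dog which is brown.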
When choosing 'Add' in CRUD, how best to generate a list of choices to pick from a dropdown?
For U/update - just display what's there...
The field contents start with a letter, followed by five numeric digits: {A-I,K-N,Z}#####
Each letter has a different 'max' value for the numeric part.
So when adding a new record, I'd like to offer a listbox with one of each letter and that letter's highest numeric value + 10.
So, if the max 'A' is A00120 and the max 'B' is B00030 (etc.), the listbox would have A00130 and B00040, etc.
This saves the user having to figure out which value is 'next' when generating a new record.
Thanks,
Mark
This time I won't be able to come up with a ready-to-use solution, but I must say - everything is possible with ATK4. You just have to customize and extend it to fit your needs :)
Speaking about your question above - I guess you have to split it into multiple parts.
The first part is how to show a select box on Create and a read-only or disabled field on Update. I guess you can make a custom Form field or attach an action to an existing Form field hook. I'm not sure which is better in this case.
The second part is the data structure. I believe this field should actually be two fields in the DB, and maybe (only maybe) merged together in the ATK model with addExpression(), purely so the UI can easily display the two fields as one. Such a concatenated field might also be useful for searching, but it should definitely not be the only form stored in the DB. Imagine how hard it would be for the DB engine to find the max value of such a field. Store it as type = letter, num = number, and then search like SELECT MAX(num)+10 FROM t WHERE type='A'.
Finally, the third part is how to generate this next number. I read your question three times and came to the conclusion that you actually don't have to show this +10 numeric value in the UI at all if it's predetermined anyway. In fact, that won't work correctly if this is a multi-user system, which I guess it will be. Instead, just show a simple select box with the letters {A-I,K-N,Z} and calculate the next value immediately before inserting the data into the DB. That can be done using the model's insert hook. This is a much more appropriate solution: better for the UI, and more stable, because the value is calculated in the model rather than (incorrectly) in the UI.
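A rough sketch of that insert-time allocation, outside of ATK4 (sqlite3 is used only so the snippet runs standalone; a unique constraint on (type, num) plus a retry loop keeps two concurrent users from grabbing the same number):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE t (
        type TEXT NOT NULL,
        num  INTEGER NOT NULL,
        UNIQUE (type, num)
    )""")
    db.execute("INSERT INTO t VALUES ('A', 120)")

    def insert_next(db, letter, step=10, retries=3):
        for _ in range(retries):
            try:
                with db:  # one transaction per attempt
                    db.execute("""
                        INSERT INTO t (type, num)
                        SELECT ?, COALESCE(MAX(num), 0) + ?
                        FROM t WHERE type = ?
                    """, (letter, step, letter))
                return
            except sqlite3.IntegrityError:
                continue  # another user took that number; recompute
        raise RuntimeError("could not allocate a number for " + letter)

    insert_next(db, "A")
    print(db.execute("SELECT * FROM t ORDER BY num").fetchall())
    # [('A', 120), ('A', 130)]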
I am developing a search engine modeled after Google in my spare time.
I am using the original google research paper located at http://infolab.stanford.edu/~backrub/google.html as my guideline.
As I am developing a very, very simplified version of Google, I am not using the PageRank algorithm at all for now.
So far I have developed a simple parser and indexer, the result being an inverted index containing the number of hits, the hit locations, and a document hash for each unique word.
Now I am trying to develop a query engine. However, I am finding it hard to identify the most relevant document for a multi-token query.
Specifically, I am having difficulty calculating the proximity of the query words to each other in a document.
I have thought of an algorithm that scans each document for the query words and calculates a proximity score based on how close the query words are to each other. However, I suspect this would take a long time; I think there is a better way of which I am not aware, and the research paper is too general to get an answer.
I am just looking for a pointer in the right direction.
Any sort of help would be very, very much appreciated.
Look at the inverted index section of "Search Engine Indexing" on Wikipedia http://en.wikipedia.org/wiki/Search_engine_indexing#Inverted_indices
Basically, you want to save the position information of a given word within a document; this makes it easy to compute proximity. This information is saved in the index.
The key point is to index your documents so you don't need to scan them every time. The search for keywords is done on the index, which points to the documents containing those keywords.
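For example, once the index stores per-document position lists, a natural proximity score for a multi-term query is the smallest window covering one occurrence of every term, computed directly from those lists with no rescanning of the document text. A sketch (positions made up):

    import heapq

    def min_window(position_lists):
        """Smallest span containing one position from each list."""
        # One pointer per list; repeatedly advance the smallest position.
        heap = [(plist[0], i, 0) for i, plist in enumerate(position_lists)]
        heapq.heapify(heap)
        high = max(plist[0] for plist in position_lists)
        best = high - heap[0][0]
        while True:
            _, i, j = heapq.heappop(heap)
            if j + 1 == len(position_lists[i]):
                return best
            nxt = position_lists[i][j + 1]
            high = max(high, nxt)
            heapq.heappush(heap, (nxt, i, j + 1))
            best = min(best, high - heap[0][0])

    # Positions of two query words in one document:
    print(min_window([[3, 40, 95], [7, 60]]))  # 4 -- positions 3 and 7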
P.S. Don't forget that you're trying to keep the index as small as possible, so storing gaps or differences between word positions will save some memory (as explained in J. Zobel, A. Moffat - "Inverted Files for Text Search Engines", page 23).
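The gap idea in miniature: store the differences between successive positions instead of the positions themselves, so the numbers stay small and compress well (e.g. with a variable-byte code):

    def encode_gaps(positions):
        return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

    def decode_gaps(gaps):
        positions, total = [], 0
        for gap in gaps:
            total += gap
            positions.append(total)
        return positions

    positions = [3, 40, 95, 103]
    gaps = encode_gaps(positions)  # [3, 37, 55, 8]
    assert decode_gaps(gaps) == positions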
I have already asked a similar question earlier, but I have noticed a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords).
And it seems like the accepted suggestion (the pointwise mutual information algorithm) is meant to work on bigger documents.
With this constraint (working on small sets of text), how can I generate tags?
Regards
Two Stage Approach for Multiword Tags
You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.
For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with just a simple for loop over all the tweets. However, if speed is a concern, you could use pylucene to quickly find the tweets that contain each collocation.
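A minimal version of that first stage, using NLTK's collocation finder on the pooled tweets (the tweets here are made up, and in practice you'd lowercase, strip URLs, and so on first):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    tweets = [
        "new york is sinking",
        "i love new york pizza",
        "machine learning in new york",
    ]
    words = " ".join(tweets).split()  # pool everything into one document

    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(2)  # ignore bigrams seen only once
    top = finder.nbest(measures.pmi, 5)
    print(top)  # [('new', 'york')]

    # Second stage: tag each tweet with the collocations it contains.
    tags = {t: [c for c in top if " ".join(c) in t] for t in tweets}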
Tweet Level PMI for Single Word Tags
As also suggested here, for single-word tags you could calculate the pointwise mutual information of each individual word and the tweet itself, i.e.
PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection. You could then tag the tweet with a few terms that have the highest PMI with the tweet.
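A rough counts-based sketch of that formula (tweets made up), using the identity PMI(term, tweet) = log[ P(term | tweet) / P(term) ], which follows from the definition above:

    import math
    from collections import Counter

    tweets = [
        ["new", "york", "is", "sinking"],
        ["new", "york", "pizza", "is", "great"],
        ["machine", "learning", "is", "hot", "in", "new", "york"],
    ]

    total_tokens = sum(len(t) for t in tweets)
    term_counts = Counter(word for t in tweets for word in t)

    def pmi(term, tweet):
        p_term = term_counts[term] / total_tokens         # P(term) in the collection
        p_term_in_tweet = tweet.count(term) / len(tweet)  # P(term | tweet)
        return math.log(p_term_in_tweet / p_term)

    tweet = tweets[0]
    scores = sorted(((pmi(w, tweet), w) for w in set(tweet)), reverse=True)
    print(scores)  # "sinking", rare in the collection, scores highest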
General Changes for Tweets
Some changes you might want to make when tagging with tweets include:
Only use a word or collocation as a tag for a tweet if it occurs within a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet but are not seen anywhere else, e.g. misspellings and keyboard noise like ##$##$%!.
Scale the number of tags used with the length of each tweet. You might be able to extract two or three interesting tags for longer tweets, but for a shorter two-word tweet you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags to extract given the tweet length (a small sketch of both adjustments follows this list).
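A sketch of both adjustments with arbitrary thresholds, where doc_freq counts how many tweets contain each term:

    def pick_tags(scored_terms, tweet_len, doc_freq, min_doc_freq=2):
        """scored_terms: [(pmi_score, term), ...] sorted high to low."""
        n_tags = 1 if tweet_len < 5 else 2 if tweet_len < 10 else 3
        usable = [t for score, t in scored_terms if doc_freq[t] >= min_doc_freq]
        return usable[:n_tags]

    doc_freq = {"new": 3, "york": 3, "is": 3, "sinking": 1}
    scored = [(1.39, "sinking"), (0.29, "new"), (0.29, "york"), (0.29, "is")]
    print(pick_tags(scored, tweet_len=4, doc_freq=doc_freq))
    # ['new'] -- "sinking" is filtered out as one-tweet noise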
I have used a method earlier for small text content such as SMSes, where I would just repeat the same line two times. Surprisingly, that works well for such content, where a noun could well be the topic. I mean, a word doesn't need to repeat for it to be the topic.