Rails: elegantly storing metadata for text

My app has thousands (maybe millions?) of records of a model, let's call it Paragraph, that contain text. The primary use of that text is to display it on a webpage. Sometimes the text is also searched over for various other purposes.
Some of the words in some of these paragraphs have associated metadata, like formatting, hyperlinks or other data-attributes that have meaning for my JavaScript on the front end.
Right now, I'm just sticking the final HTML tags straight into the text, so it ends up being stored like this:
<strong>Jimmy</strong> is walking his dog which is <span class="something" data-metadata_id="2343">brown</span>.
This works well for the primary purpose of displaying the text, but is very ugly when I want to search over my text, or do other processing on it. Is there a better way? Is there a gem that handles this sort of thing?

It makes sense to put both versions in your database: a display one and an index one. Disk is cheap. Especially if you're using Solr or similar (highly recommended if you're doing string search), you can store (but not index) the HTML, and index (but not store) the plain-text version, in two different fields of the same record.
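As a concrete illustration of the two-column idea, a save hook can keep the plain-text copy in sync with the HTML. A minimal sketch, assuming hypothetical body_html/body_text columns (strip_tags is Rails' standard sanitizer helper):

class Paragraph < ApplicationRecord
  # body_html: the display version, inline tags and all
  # body_text: the plain-text version you index and search
  before_save :sync_body_text

  private

  def sync_body_text
    # strip_tags removes the markup and leaves the bare words
    self.body_text = ActionController::Base.helpers.strip_tags(body_html)
  end
end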

Related

Querying a Lucene index with arbitrarily long article text to check for all matches within the article (through neo4j)

I'm trying to query the Lucene index I've added to a neo4j field (a "name" field that isn't very long: one to ten words at most).
What I do right now is take all the text in a given webpage, sanitize it with a JavaScript function to keep only words, spaces and alphanumeric characters, and use that to query my index.
.replace(/[^\w\s]|or|and|not|return+/gi, "") // <- escaping the input
I'm not sure if the length of the search text is limited somehow, but results do seem to disappear after about 1050 words (~6500 characters).
Ideally, I'd like to be able to use a couple thousand words in one query, with the end goal of highlighting the matches found within the webpage itself.
Why is my query not returning any results past a certain number of characters? Am I missing some keyword in my escaping regex?
Is what I'm trying to achieve feasible? Is there a better approach I could use?
Thanks for reading :)
(for anyone finding this, I found a somewhat related question here: Handling large search queries on relatively small index documents in Lucene)
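One note on the escaping regex above: the alternation |or|and|not| removes those letter sequences wherever they occur (inside "word", "android", etc.), not just standalone operators, and return+ actually matches "retur" followed by one or more "n"s. Since this thread's main stack is Rails, here is a hedged Ruby sketch of a more targeted sanitizer, based on Lucene's documented query-syntax special characters (the method name is made up):

# Lucene treats these characters, plus the standalone uppercase
# words AND / OR / NOT, as query operators.
LUCENE_SPECIALS = %r{[+\-!(){}\[\]^"~*?:\\/]|&&|\|\|}

def sanitize_for_lucene(text)
  text.gsub(LUCENE_SPECIALS, " ")                 # drop the special characters
      .split                                      # split on whitespace
      .reject { |w| %w[AND OR NOT].include?(w) }  # drop bare boolean operators
      .join(" ")
end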

Storing words in a text

I am building an application for learning languages, with Rails and Postgresql.
Texts get uploaded. The texts will be of varying length, but let’s assume they’ll be 100-3000 words long.
On upload, each text position gets transformed into a “token”, representing information about the word at that position (base word, noun/verb/adjective/etc., grammar tags, definition_id).
On click of a word in the text, I need to find (and show) all other texts in the database that have words with the same attributes (base_word, part of speech, tags) as the clicked word.
The easiest and most relational way to do this is a join table, TextWord, between the Text and Word tables. Each text_word would represent a position in the text, and would contain the text_id, word_id, grammar_tags, start_index, and end_index.
However, if a text has between 100-3000 words, this would mean 100-3000 entries for each text object.
Is that crazy? Expensive? What problems could this lead to?
Is there a better way?
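For concreteness, the join table described above could look like this as a Rails migration (a minimal sketch; the names follow the question, and the composite index is an assumption):

class CreateTextWords < ActiveRecord::Migration[7.0]
  def change
    create_table :text_words do |t|
      t.references :text, null: false, foreign_key: true
      t.references :word, null: false, foreign_key: true
      t.string  :grammar_tags
      t.integer :start_index
      t.integer :end_index
    end
    # Supports "find every position where this word carries these tags"
    add_index :text_words, [:word_id, :grammar_tags]
  end
end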
I can't use Postgres full text search because, for example, if I click "left" in "I left Nashville", I don't want "take a left at the light" to show up. I want only "left" as a verb, as well as other forms of "leave" as a verb. Furthermore, I might want only "left" with a specific definition_id (e.g. "left" used as "the political party", not "the opposite of right").
The other option I can think of is to store a JSON on the text object, with the tokens as a big hash of hashes, or array of hashes (either way). Does Postgresql have a way to search through that kind of nested data structure?
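As an aside on that second option: Postgres can do this. A jsonb column supports containment queries, and a GIN index keeps them fast. A hedged sketch, with the tokens column and its shape being hypothetical:

# tokens jsonb, e.g. [{"base_word":"leave","pos":"verb","tags":["past"]}, ...]
# indexed with: add_index :texts, :tokens, using: :gin
Text.where("tokens @> ?::jsonb", [{ base_word: "leave", pos: "verb" }].to_json)

The @> containment operator asks "does this array contain an element with at least these key/value pairs?".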
A third option is to have the same JSON as option 2 (to store all the positions in a text), and a 2nd json on each word object / definition object / grammar object (to store all the positions across all texts where that object appears). However, this seems like it might take up more storage than a join table, and I’m not sure if it would bring any tangible benefit.
Any advice would be much appreciated.
Thanks,
Michael.
An easy solution would be to have a database with several indexes: one for the base word, one for the part-of-speech, and one for every other feature you're interested in.
When you click on "left", you identify that it's a form of "leave", and a "verb" in the "past tense". Now you go to your indexes and get all token positions for "leave", "verb", and "past tense". You take the intersection of all the index positions, and you are left with the token positions of the forms you're after.
If you want to save space, have a look at Managing Gigabytes, which is an excellent book on the topic. I have in the past used that to fully index text corpora with millions of words (which was quite a lot 20 years ago...)
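A toy Ruby sketch of the intersection step described above (all names and data are hypothetical; each index maps a feature value to the set of [text_id, position] pairs that carry it):

require "set"

base_word_index = { "leave" => Set[[1, 2], [7, 15]] }
pos_index       = { "verb"  => Set[[1, 2], [3, 4], [7, 15]] }
tense_index     = { "past"  => Set[[1, 2], [7, 15], [9, 1]] }

# Set intersection keeps only the positions matching every feature.
hits = base_word_index["leave"] & pos_index["verb"] & tense_index["past"]
# => #<Set: {[1, 2], [7, 15]}>, i.e. "left" as the past tense of "leave"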

Google Sheets import multiple HTML table images

Summary
I'm looking to import a data table from a website that does not appear to have an API. The table is broken down into various images and text. The goal is to have all of the content available in a table to then reference from other sheets.
Issue
When I pull in the data, I get some of the text, none of the other images, and a reference to another table. I looked up some options, but none of them yielded anything but blank cells.
I also tried to use the =IMAGE() formula with direct links to the image URLs, but there is a portion of the URL that is specific to the unit's release date, and as such, too dynamic to account for.
Google Sheets Formula
=IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",3)
Unfortunately, without an API it is going to be difficult to achieve what you're aiming for here. These are the main reasons why:
PROBLEMS AND WORKAROUNDS
This table has nested tables, which therefore need to be accessed separately. If you take a look at: =IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",4)
you will see that table 4 of this HTML page is the stats table of one random character from the main table. If you go for 5 or 6, you will realise that the nested tables are not even numerically ordered, and that you cannot reach them through the main table (i.e. mainTable[0].nestedTable). A labor-intensive approach is to go one by one, finding each character's corresponding stat table and placing it next to that character. For this I recommend extracting only the name field of the main table, so you can align each stat table with its character. You can do that with: =INDEX(IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",3),0,1)
IMPORTHTML cannot access images or links, so it will be very difficult to get the images in the last columns. One way around this is, as you mentioned, to use IMAGE with a direct URL, like this: =IMAGE("https://gamepress.gg/pokemonmasters/sites/pokemonmasters/files/styles/30x30/public/2019-07/Electric.png?itok=fkRfkrFX")
CONCLUSION
To sum up, there is no easy way to solve this problem. The closest you can get is by:
Importing the name column.
Figuring out which tables belong to which character and placing them next to their name.
Getting the image URL of each weakness and type and adding it to each character.
I am sorry this site does not have an API to make things smoother. Good luck with your project, and let me know if you need anything else or if anything is unclear.
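If you can step outside Sheets entirely, a small script makes the nested tables more tractable. A hedged Ruby sketch using Nokogiri (the table position is an assumption mirroring IMPORTHTML's table 3, and the page layout may change or be rendered by JavaScript):

require "open-uri"
require "nokogiri"

url = "https://gamepress.gg/pokemonmasters/database/sync-pair-list"
doc = Nokogiri::HTML(URI.open(url))

tables = doc.css("table")
# Nokogiri's list is 0-based, so tables[2] roughly corresponds to
# IMPORTHTML's table 3; the first cell of each row holds the name.
names = tables[2].css("tr").map { |tr| tr.at_css("td")&.text&.strip }.compact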

Getting inconsistent tab delimiter width when pasting from Google docs spreadsheet

I am trying to create a gadget for some people, where all they need to do is copy the contents of a spreadsheet and paste it into a textbox, which will in turn create a nice table for them to embed in their articles.
I managed to do everything; however, when copying and pasting data into a text editor, Google Docs seems to get the size (width) of the tab delimiter wrong between values. So, instead of getting the default 4 spaces, I am getting 2 in some cases, and so far I have found that the reason is that some of the cells contain strings with spaces. For some reason this seems to confuse Google Docs into supplying the wrong spacing, which in turn ruins my script.
I know I can use comma-separated values here, but the issue is we are trying to give people the ability to simply copy and paste. Look at the example output below:
School Name Location Type No. eligible pupils
In this example, "School Name" is one cell, "Location" is another, "Type" is another and "No. eligible pupils" is the last one. It is clear that the first cell does not have the necessary space on the right.
Any ideas? I thought about converting any run of more than one space to a comma, but users might actually type two spaces themselves... which would break things again.
It turned out to be the code editor that was not showing the tabs right. Using a regexp and another code editor (vim) showed that all of them were actual tabs. :)
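The takeaway: split the pasted rows on the literal tab character rather than relying on its visual width. A minimal Ruby sketch (the row is the header from the example above):

row = "School Name\tLocation\tType\tNo. eligible pupils"
# split("\t", -1) keeps trailing empty cells, so blank columns survive
cells = row.split("\t", -1)
# => ["School Name", "Location", "Type", "No. eligible pupils"]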

How can I smartly extract information from an HTML page?

I am building something that can more or less extract key information from an arbitrary web site. For example, if I crawled a McDonalds page and wanted to figure out programmatically the opening and closing times of McDonalds, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonalds sells chicken wings, or the address of McDonalds.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how to approach this. I have the sites crawled, and the HTML and related information parsed into JSON already. My current approach is something like finding the title tag and checking whether it contains key words like "address" or "location", etc. If it does, I look through the current page and identify chunks of content that resemble an address, such as city or country names, or text containing "St" or "Street".
I am wondering if there is a better approach for finding key data; I'm looking for a nicer starting point, or to bounce some ideas around. Pointers to good articles to read about this would be great as well.
Let me know if this is unclear.
Thanks for the help.
In order to parse such HTML pages you have to have knowledge of their structure. There's no general solution to this problem; each webpage needs its own. However, a good approach is to ensure the HTML code is valid XML too, and then use XPath to access elements at known positions. Maybe there's even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page which give you the specific elements if they exist.
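A hedged Ruby sketch of that per-site XPath approach, using Nokogiri (the file name and both XPath expressions are hypothetical; every site needs its own):

require "nokogiri"

doc = Nokogiri::HTML(File.read("mcdonalds_store_page.html"))

# One hand-written XPath per site, per fact you want to extract.
hours   = doc.xpath('//table[@id="opening-hours"]//td').map(&:text)
address = doc.at_xpath('//*[@itemprop="address"]')&.text&.strip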
