Process to automatically divide a loosely structured document into sub-sections? - machine-learning

I have some documents that are semi-structured. All of them are job posts scraped from various sources. Every document has a requirements section, a qualifications section and so on.
The issue is that there is no fixed format for these. Sometimes section names are bold, sometimes they are in h1 or h2 tags, and sometimes they look the same as the rest of the text.
What would the process of dividing these documents into smaller parts be called? Is there a known term for it that I could search for?

Related

Combining multiple spreadsheet files into one master overview (read only)

Let me start by saying that I know too little about coding etc. to translate some of the solutions given on this platform to my issue, so hopefully someone can help me get started.
I am trying to combine a certain section of multiple Google spreadsheet files with multiple tabs into one file. The names and number of the various tabs differ (and change over time).
To explain: we have, for each person, an overview of their projects (each project on its own tab). Each project/tab contains a number of to-dos. What I need to achieve is to import all the to-dos into a master list, so that we have one master overview (basically a big to-do list that I can sort by date).
Two examples with dummy information. The relevant information starts on line 79:
https://docs.google.com/spreadsheets/d/1FsQd9sKaAG7hKynVIR3sxqx6_yR2_hCMQWAWsOr4tj0/edit?usp=sharing
https://docs.google.com/spreadsheets/d/155J24uQpRC7uGvZEhQdkiSBnYU28iodAn-zR7rUhg1o/edit?usp=sharing
Since this information is dynamic and you are restricted from using Apps Script, you can create a "definitions" or "parameters" sheet where each person must report the NAMES of their projects, the ROW their tasks start on, and the total length. From there you can use the IMPORTRANGE function to get their definitions, and from those definitions you can use further IMPORTRANGE calls to get their task lists. Concatenating the results is going to be a pretty big issue for you, though.
This would unfortunately be much easier to accomplish with a different architecture for your docs/sheets. The more a spreadsheet looks like a database (column headers and rows of data that match those headers), the easier it is to work with. The more it looks like a form or paper worksheet, the more code you would need to parse that format.
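As a rough sketch of the idea (the spreadsheet URL, sheet names, and ranges below are placeholders, not taken from your actual files), the master sheet could pull a person's definitions and then one project's tasks with formulas along these lines:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/<their-spreadsheet-id>/edit", "Definitions!A2:C20")
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/<their-spreadsheet-id>/edit", "ProjectA!A79:D150")

The first formula reads the project names and starting rows that the person reported on their definitions sheet; the second reads one project's task rows, using the range you looked up in those definitions.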

Storing words in a text

I am building an application for learning languages, with Rails and Postgresql.
Texts get uploaded. The texts will be of varying length, but let’s assume they’ll be 100-3000 words long.
On upload, each text position gets transformed into a “token”, representing information about the word at that position (base word, noun/verb/adjective/etc., grammar tags, definition_id).
On click of a word in the text, I need to find (and show) all other texts in the database that have words with the same attributes (base_word, part of speech, tags) as the clicked word.
The easiest and most relational way to do this is a join table, TextWord, between the Text and Word tables. Each text_word would represent a position in the text and would contain the text_id, word_id, grammar_tags, start_index, and end_index.
However, if a text has between 100-3000 words, this would mean 100-3000 entries for each text object.
Is that crazy? Expensive? What problems could this lead to?
Is there a better way?
I can’t use Postgres full-text search because, for example, if I click “left” in “I left Nashville”, I don’t want “take a left at the light” to show up. I want only “left” as a verb, as well as other forms of “leave” as a verb. Furthermore, I might want only “left” with a specific definition_id (e.g. “Left” used as “the political party”, not “the opposite of right”).
The other option I can think of is to store a JSON on the text object, with the tokens as a big hash of hashes, or array of hashes (either way). Does Postgresql have a way to search through that kind of nested data structure?
A third option is to have the same JSON as option 2 (to store all the positions in a text), and a 2nd json on each word object / definition object / grammar object (to store all the positions across all texts where that object appears). However, this seems like it might take up more storage than a join table, and I’m not sure if it would bring any tangible benefit.
Any advice would be much appreciated.
Thanks,
Michael.
An easy solution would be to have a database with several indexes: one for the base word, one for the part-of-speech, and one for every other feature you're interested in.
When you click on "left", you identify that it's a form of "leave" and a verb in the past tense. Now you go to your indexes and get all token positions for "leave", "verb", and "past tense". You take the intersection of all the index positions, and you are left with the token positions of the forms you're after.
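A minimal sketch of that intersection idea in Python (the token attributes and the sample data are invented for illustration; in your Rails app the same lookups would be index scans on the join table):

# Each index maps one feature value to the set of (text_id, position) pairs where it occurs.
from collections import defaultdict

tokens = [
    # (text_id, position, base_word, part_of_speech, tense) -- dummy data
    (1, 2, "leave", "verb", "past"),
    (1, 9, "left", "adjective", None),
    (2, 5, "leave", "verb", "past"),
    (2, 7, "leave", "verb", "present"),
]

base_index, pos_index, tense_index = defaultdict(set), defaultdict(set), defaultdict(set)
for text_id, position, base, pos, tense in tokens:
    base_index[base].add((text_id, position))
    pos_index[pos].add((text_id, position))
    if tense:
        tense_index[tense].add((text_id, position))

# Clicking "left" in "I left Nashville" resolves to base "leave", verb, past tense.
matches = base_index["leave"] & pos_index["verb"] & tense_index["past"]
print(sorted(matches))  # [(1, 2), (2, 5)]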
If you want to save space, have a look at Managing Gigabytes, which is an excellent book on the topic. I have in the past used that to fully index text corpora with millions of words (which was quite a lot 20 years ago...)

Wikipedia pageviews analysis

I've been tasked with a Wikipedia pageviews analysis. This is my first project with this amount of data and I'm a bit lost. When I download the file from the link and unpack it, I can see that it has a table-like structure, with rows looking like this:
1 | 2 | 3 | 4
en.m | The_Beatles_in_the_United_States | 2 | 0
I'm struggling to find out what exactly can be found in each column. My guesses:
language version and additional info (.m = mobile?)
name of the article
My biggest concern is with the last two columns. The last one contains only "0" values and I have no idea what it represents. I'd then assume that the third one shows the number of views, but I'm not sure.
I'd be grateful if someone could help me to understand what exactly can be found in each column or recommend some reading on this subject. Thanks!
After more time spent on this, I've finally found a solution. I'm posting it in case someone has the same problem in the future. Wikimedia explains what can be found in these files. The explanations were painful to find, but you can access them here and here.
Based on that, you can see that the rows have the following structure:
domain code
page_title
count_views
total_response_size (no longer maintained)
Some explanations for each column:
Column 1:
Domain name of the request, abbreviated. (...) Domain_code can now also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as the second part of the domain name (just like with the full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org".
Column 2:
For page-level files, it holds the title of the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin). For project-level files, it is "-".
Column 3:
The number of times this page has been viewed in the respective hour.
Column 4:
The total response size caused by the requests for this page in the respective hour. If I understand it correctly, the response size is no longer maintained due to low accuracy, which is why there are only 0s: the pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.
Hope someone finds it useful.
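As a minimal sketch, a dump line with that structure can be parsed like this in Python (the filename below is a placeholder; the real hourly files are gzip-compressed):

# Parse one hourly pageviews dump into (domain_code, page_title, count_views) records.
import gzip

def parse_pageviews(path):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            domain_code, page_title, count_views, _response_size = parts
            yield domain_code, page_title, int(count_views)

# Example (hypothetical filename): views of one article on mobile English Wikipedia in this hour.
views = sum(v for d, t, v in parse_pageviews("pageviews-20230101-000000.gz")
            if d == "en.m" and t == "The_Beatles_in_the_United_States")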
Line format:
wiki code (subproject.project)
article title
monthly total (with interpolation when data is missing)
hourly counts
(This is from pagecounts-ez, which is the same dataset, just with less filtering.)
It is apparently buggy, though; it takes the first two parts of the domain name as the wiki code, which does not work for mobile domains (which have the form <language>.m.<project>.org).

Notice "Too many URLs"

I get this notice in my Google Analytics panel:
Too many URLs
The number of unique URLs for the view "All site data" exceeds the daily limit. Excess data is displayed in the "(other)" summary row in the reports.
Google Analytics summarizes data when a table has too many rows in one day. When you send too many unique URLs, the excess values are summarized into a single "(other)" row in the reports. This severely limits your ability to perform detailed analysis by URL.
Too many URLs are usually the result of combinations of unique URL parameters. To avoid exceeding the limit, the typical solution is to exclude irrelevant parameters from the URLs. For example, open Admin > View Settings and use the Exclude URL Query Parameters setting to exclude parameters such as sessionid and vid. If your site supports site search, use the Site Search Settings to track the search-related parameters while removing them from the URLs.
How will this affect my website?
I don't understand why I get this notice or how to fix it.
I checked what Google suggests:
Admin > View Settings, then use the Exclude URL Query Parameters setting to exclude parameters such as sessionid and vid.
Can anyone explain how to use this to fix the problem?
Thanks.
It does not affect your website; it affects the GA reports only.
The URL of any pageview is stored in the "page" dimension. Google Analytics can display at most 50,000 distinct values for this dimension for the selected timeframe. In your case there are more than 50,000 values, so any excess pages are grouped together in a row labeled "(other)".
Now, it may be that you really have more than 50,000 distinct URLs on your website, but Google thinks this is unlikely, so it suggests checking whether you are "artificially" inflating the number of distinct values for the page dimension.
A bad but simple example: imagine you allow your users to choose their own background color for your site, and that the choice of color is transmitted in a query parameter. So you might have, e.g.:
index.php?bgcolor=#cc0000
index.php?bgcolor=#ee5500
index.php?bgcolor=#000000
....
Due to the query parameter, these URLs would show up as three different pages, i.e. three different rows in the Google Analytics reports, despite the fact that in all cases the same content is displayed (albeit with a different background color).
The proper strategy in this case would be to go to the admin section, open the view settings, and in the "Exclude URL Query Parameters" box enter the bgcolor parameter (and any other parameters that do not change the content that is displayed). The parameter will then be stripped from the URL before the data is processed, and the pageviews will be aggregated into a single row with index.php as the single value for the page dimension (of course, you have to enter your own query parameter names).
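As a rough illustration of what that exclusion does to the "page" values (this is only a sketch of the normalization, not Google Analytics code; the parameter names are examples):

# Stripping excluded query parameters collapses many URL variants into one page value.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

EXCLUDED = {"bgcolor", "sessionid"}

def strip_excluded(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in EXCLUDED]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

print(strip_excluded("index.php?bgcolor=cc0000"))          # index.php
print(strip_excluded("index.php?bgcolor=ee5500&page=2"))   # index.php?page=2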

tag generation from a small text content (such as tweets)

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small texts, such as user tweets, to generate tags (keywords).
It seems like the accepted suggestion (the pointwise mutual information algorithm) is meant to work on bigger documents.
With this constraint (working on small texts), how can I generate tags?
Regards
Two Stage Approach for Multiword Tags
You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.
For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with just a simple for loop over all the tweets. However, if speed is a concern, you could use pylucene to quickly find the tweets that contain each collocation.
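A rough sketch of both stages with NLTK (the tweet list, the frequency filter, and n = 50 are placeholders; it assumes the punkt tokenizer data has been downloaded):

# Stage 1: pool all tweets and extract the n most interesting bigram collocations.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tweets = ["the beatles play tonight", "the beatles tickets on sale"]  # placeholder data

tokens = [w.lower() for t in tweets for w in nltk.word_tokenize(t)]
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                      # ignore very rare bigrams
collocations = finder.nbest(measures.pmi, 50)    # top n = 50 multiword tags

# Stage 2: tag each tweet with the collocations that occur in it.
def multiword_tags(tweet):
    words = [w.lower() for w in nltk.word_tokenize(tweet)]
    bigrams = set(zip(words, words[1:]))
    return [c for c in collocations if c in bigrams]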
Tweet Level PMI for Single Word Tags
As also suggested here, for single-word tags you could calculate the pointwise mutual information of each individual word and the tweet itself, i.e.
PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection. You could then tag the tweet with the few terms that have the highest PMI with the tweet.
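A small Python sketch of that scoring, estimating the probabilities with simple frequency counts (the corpus is placeholder data; real use would also apply the cut-offs discussed below):

# Score each word in a tweet by PMI(term, tweet) and keep the top-scoring terms as tags.
import math
from collections import Counter

tweets = [
    "the beatles play tonight",
    "beatles tickets on sale tonight",
    "i left nashville this morning",
]
docs = [t.lower().split() for t in tweets]
total_tokens = sum(len(d) for d in docs)
term_counts = Counter(w for d in docs for w in d)

def pmi_tags(doc, top_k=2):
    scores = {}
    for term, count in Counter(doc).items():
        p_term_tweet = count / total_tokens        # estimate of P(term, tweet)
        p_term = term_counts[term] / total_tokens  # P(term)
        p_tweet = len(doc) / total_tokens          # P(tweet)
        scores[term] = math.log(p_term_tweet / (p_term * p_tweet))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(pmi_tags(docs[0]))  # on this toy corpus the scores are dominated by rare words ("the", "play")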
General Changes for Tweets
Some changes you might want to make when tagging with tweets include:
Only use a word or collocation as a tag for a tweet if it occurs in a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet but are not seen anywhere else, e.g. misspellings and keyboard noise like ##$##$%!.
Scale the number of tags used with the length of each tweet. You might be able to extract 2 or 3 interesting tags for longer tweets. But, for a shorter 2 word tweet, you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags you want to extract given the tweet length.
I have used a method in the past for small text content, such as SMS messages, where I would just repeat the same text twice. Surprisingly, that works well for such content, where a noun could well be the topic; the point is that a word doesn't need to repeat naturally for it to be the topic.
