Resume Parsing using Solr and TIKA

I was going through this slide and I'm having a little difficulty understanding the approach.
My two questions are:

1. How does Solr maintain a schema for a semi-structured document like a resume (with fields such as Name, Skills, Education, etc.)?
2. Can Apache Tika extract section-wise information from PDFs? Since every resume will have dissimilar sections, how do I define a common schema of entities?

You define the schema yourself, so that you get the fields you expect and can search the individual fields based on the kinds of queries you want to run. Any values you're unsure about (i.e. where you don't know which field they belong in) can be lumped into a common search field that you rank lower.
You'll have to parse the response from Tika (or a different PDF / docx parser) yourself. Just using Tika by itself will not give you an automagically structured response tuned to the problem you're trying to solve. There will be a lot of manual parsing and trying to make sense of what is what in the uploaded document, and then inserting the relevant data into the relevant field.
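As a concrete illustration of that manual step, here is a minimal sketch (in Python, using the tika bindings) that pulls raw text out of a resume and splits it into sections with naive heading heuristics. The heading names and the catch-all field are assumptions about your own schema, not anything Tika gives you for free.

```python
# Extract raw text with Tika, then split it into sections before indexing.
from tika import parser

SECTION_HEADINGS = {"skills", "education", "experience", "projects"}

def parse_resume(path: str) -> dict:
    raw = parser.from_file(path).get("content") or ""

    doc = {"content_all": raw}   # catch-all field, ranked lower in Solr
    current = None
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.lower().rstrip(":") in SECTION_HEADINGS:
            current = stripped.lower().rstrip(":")
            doc.setdefault(current, [])
        elif current and stripped:
            doc[current].append(stripped)

    return {k: "\n".join(v) if isinstance(v, list) else v for k, v in doc.items()}
```

The resulting dict maps roughly onto fields such as skills, education and experience, plus content_all for everything that could not be classified.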

We have done many implementations using Solr and Elasticsearch, and ran into two challenges:

1. Defining the schema and, more specifically, getting documents to fit that schema.
2. Expanding search terms to more accurate and useful matches. Solr and Elasticsearch can match what they find in the content, but not beyond that content.
You need to use a resume parser like www.rchilli.com, Sovrn, Daxtra, HireAbility or others, take their output, and map it to your schema. The best part is that you get access to taxonomies to enrich your content in Solr.
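To make the mapping step concrete, here is a minimal sketch of pushing a parser's output into Solr. The JSON shape below is entirely hypothetical; each vendor has its own response format, so check their documentation for the real field names.

```python
# Map a (hypothetical) resume-parser response onto your own Solr schema
# and index it via Solr's JSON update endpoint.
import json
import requests

def to_solr_doc(parsed: dict) -> dict:
    return {
        "id":        parsed.get("resume_id"),
        "name":      parsed.get("candidate", {}).get("full_name"),
        "skills":    parsed.get("skills", []),
        "education": [e.get("degree") for e in parsed.get("education", [])],
    }

def index_resume(parsed: dict,
                 solr_url: str = "http://localhost:8983/solr/resumes/update") -> None:
    requests.post(
        f"{solr_url}?commit=true",
        headers={"Content-Type": "application/json"},
        data=json.dumps([to_solr_doc(parsed)]),
    )
```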
You can use any one based on your budget and needs. But for us RChilli worked best.
Let me know if you need any further help.

Is there a known workaround for the max token limit on the input to GPT-3?

For a bit of context, I recently started working on a personal project that accepts the URL of some recipe web page, pulls the HTML, converts the HTML to simplified markdown (this is the GPT-3 part), then sends that markdown to a thermal receipt printer in my kitchen, which prints it out.
Recipe web pages have a wide variety of structures, and they are notorious for including long and often irrelevant articles before the recipe, for the sake of SEO.
My plan was to use the fine-tuning API for davinci2, and feed it a bunch of straight up recipe HTML as input and cleaned, recipe-only markdown as output. I notice though that the maximum input token count for both training and inference is 4096. The HTML for a web page can be much larger than that, like 20k tokens.
I am wondering if anyone has found a workaround for training and driving GPT-3 with more tokens than 4096.
I'm open to other suggestions as well. For instance, I've considered passing just the visible text on the page, rather than the full HTML tree, but there is much less context present in that form, and the model seems more easily confused by all of the links and other navigational elements present in the page. I have also considered only allowing this project to accept "printer-friendly" versions of recipes, which tend to be much smaller and would easily come in under the 4096-token limit, but not all sites offer a printer-friendly version, and I don't want this to be a limitation.
I don't know of any workarounds, but have you thought of filtering the HTML elements based on some basic rules? You could include only paragraph elements, or elements with certain characteristics, such as containing a list, which is something most recipes have.
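A rough sketch of that kind of rule-based filtering, using Python with requests and BeautifulSoup (the tag choices are just a guess at what usually carries recipe text, not a proven heuristic):

```python
# Keep only paragraphs and list items, which is where most recipe
# ingredients and steps live, and drop obvious navigation/boilerplate.
import requests
from bs4 import BeautifulSoup

def extract_recipe_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements that rarely contain recipe content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()

    kept = []
    for el in soup.find_all(["p", "li"]):
        text = el.get_text(" ", strip=True)
        if text:
            kept.append(text)
    return "\n".join(kept)

if __name__ == "__main__":
    print(extract_recipe_text("https://example.com/some-recipe"))
```

Even a filter this crude can shrink the input considerably before it ever reaches the model.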
This framework might be useful to you: https://github.com/Xpitfire/symbolicai
The basic idea is:
You could stream over your input data and build up a stack of chunks on the side.
Next, in your training procedure, you need to account for having loosely connected chunks of data. You can handle this by indexing or clustering the chunks before designing your prompts.
This means that if you want to create a query for a question related to your long data stream, you can search through your indexes and retrieve the related information.
Then you piece together a few-shot prompt that has one "section" relating to your query and another for the facts you want to include.
Finally, you can feed that into your model and provide examples of what you want the model to be tuned to.
I know this is explained at a fairly high level, but if you follow the link I provided, things might become clearer.
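Here is a very rough sketch of that chunk / index / retrieve / prompt loop in plain Python. It uses naive keyword overlap instead of a real embedding index, and the final completion call is left as a hypothetical `call_model` stand-in rather than any specific API:

```python
# Chunk a long document, retrieve the chunks most relevant to the query,
# and assemble a prompt that stays under the model's token limit.
import re
from typing import List

def chunk_text(text: str, max_words: int = 500) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def relevance(chunk: str, query: str) -> int:
    # Naive score: how many query terms appear in the chunk.
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))
    return sum(1 for term in re.findall(r"\w+", query.lower()) if term in chunk_terms)

def build_prompt(document: str, query: str, examples: str, top_k: int = 3) -> str:
    chunks = chunk_text(document)
    best = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)[:top_k]
    facts = "\n---\n".join(best)
    return f"{examples}\n\nRelevant excerpts:\n{facts}\n\nTask: {query}\n"

# answer = call_model(build_prompt(html_text, "Extract the recipe as markdown", FEW_SHOT_EXAMPLES))
```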

Best way to link specific labels in the text of a webpage to a specific client login

I have a website I am developing that will be deployed to several different clients. All of the functionality is the same and the vast majority of the language used is the same. However, some of the clients are in different industries, so specific words and phrases within some pages need to be changed based on the company of the individual logged into the site. What is the best way to accomplish this?
In the past I have seen people use string database tables but that seems rather cumbersome. I thought about using localization but I don't want another developer to get confused because it isn't a change in spoken languages.
For this you can use something like a word list. I don't know whether "word list" is a well-known concept or not, but let me try to explain it.
You can store the information that distinguishes each login from the others (based on the company) in one table in your database, and in another table map each default/English word to the corresponding word you want to use for that company.
I am assuming these words do not change very often, so on application start you can load the mapping into a convenient in-memory data structure.
All the text you want to process then goes through a word-list processor: a piece of code that identifies which group the login belongs to and which words need to be replaced, swaps those words according to that group, and returns the transformed text, which you can display in the UI.
The advantage is that once the data is loaded into the in-memory structure, you don't need to read the values from your DB on every request.
Moreover, if the word lists change, or if you want to let users adjust the words to their own preferences, you can modify the in-memory structure directly and refresh it in the DB asynchronously later.
Also, since the mapping lookup is done in memory, it is faster than DB calls.
And since it is just code, typically a single method, it is entirely up to you which text to process and which to ignore.
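A minimal sketch of that processor in Python (the group names, word pairs and function names here are made up for illustration; in practice the mapping would be loaded from your database tables at startup):

```python
# In-memory word-list processor: replace default terms with the
# client group's preferred wording before rendering.
import re

# client_group -> {default word -> client-specific word}, loaded at app start
WORD_LISTS = {
    "healthcare": {"customer": "patient", "order": "prescription"},
    "finance":    {"customer": "client",  "order": "trade"},
}

def translate(text: str, client_group: str) -> str:
    mapping = WORD_LISTS.get(client_group, {})
    for default_word, replacement in mapping.items():
        # Whole-word, case-insensitive replacement.
        text = re.sub(rf"\b{re.escape(default_word)}\b", replacement,
                      text, flags=re.IGNORECASE)
    return text

print(translate("Your order has been sent to the customer.", "healthcare"))
# -> "Your prescription has been sent to the patient."
```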
This is a technique we used in our application when we had a similar requirement, and I hope it helps!
Better alternatives and suggestions are always welcome, since we would also like to improve our own solution to this problem. Thanks.

Elastic Search - Get the matching field

I'm using Elasticsearch to implement search on a web app (Rails + Tire). When querying the ES server, is there a way to know which field of the returned JSON matched the query?
The simplest way is to use the highlight feature, see support in Tire: https://github.com/karmi/tire/blob/master/test/integration/highlight_test.rb.
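For reference, this is roughly what the underlying highlight request looks like against the raw Elasticsearch API (shown here with Python and requests; the index and field names are made up for illustration, and Tire wraps the same feature for you):

```python
# Ask Elasticsearch to highlight matches so the response shows
# which fields each hit matched on.
import json
import requests

query = {
    "query": {"multi_match": {"query": "solr", "fields": ["title", "body"]}},
    "highlight": {"fields": {"title": {}, "body": {}}},
}

resp = requests.post(
    "http://localhost:9200/articles/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)

for hit in resp.json()["hits"]["hits"]:
    # The keys of "highlight" are the fields that matched the query.
    print(hit["_id"], list(hit.get("highlight", {}).keys()))
```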
Do not use the Explain API for anything other than debugging purposes, as it will negatively affect performance.
Have you tried using the Explain API from Elasticsearch? The output of explain gives you a detailed explanation of why a document was matched, and its relevance score.
The algorithm(s) used for searching the records are often much more complex than a single string match. Also, given that a term can match multiple fields (possibly with different weights), it may not be easy to come up with a simple answer. But by looking at the output of the Explain API, you should be able to construct a meaningful message.
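Keeping the caveat above in mind (debugging only), requesting the explanation is just a matter of adding "explain": true to the search body; again the index and field names below are illustrative:

```python
# Search with explain enabled; each hit then carries an _explanation
# describing which fields and terms contributed to its score.
import json
import requests

query = {
    "explain": True,
    "query": {"match": {"body": "solr"}},
}

resp = requests.post(
    "http://localhost:9200/articles/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_explanation"]["description"])
```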

Opening a PDF file and searching for names there

I have a PDF file, and I want to search for names in it.
How can I open the PDF and get all its text with Ruby?
Are there any algorithms to find names?
What should I use as a search engine: Sphinx, or something simpler (just LIKE SQL queries)?
To find proper names in unstructured text, the technical name for the problem you are trying to solve is Named Entity Recognition or Named Entity Extraction. There are a number of different natural language toolkits and research papers which implement various algorithms to try to solve this problem. None of them will get perfect accuracy, but it may be good enough for your needs. I haven't tried it myself but the web page for Stanford Named Entity Recognizer has a link for Ruby Bindings.
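The question asks about Ruby, but just to illustrate the NER approach, here is a short sketch in Python using pypdf for text extraction and spaCy for entity recognition (both stand-ins for pdf-reader and the Stanford NER bindings mentioned above):

```python
# Extract the PDF text, then keep the entities the model tags as PERSON.
import spacy
from pypdf import PdfReader

def names_in_pdf(path: str) -> set:
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    nlp = spacy.load("en_core_web_sm")  # small English model, installed separately
    doc = nlp(text)
    # Expect some false positives and misses, as noted above.
    return {ent.text for ent in doc.ents if ent.label_ == "PERSON"}

print(names_in_pdf("resume.pdf"))
```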
Tough question. These areas are still active research topics in the semantic web field. I can only suggest some directions, but I would be curious to know your final choice.
I'd use pdf-reader: https://github.com/yob/pdf-reader
You could use a Bloom filter backed by a dictionary and assume that words not found in the dictionary are names... not always realistic, but it is a first approach.
To catch more names, you could also check for words beginning with a capital letter (crude, but again a simple starting point). A potential resource: http://snippets.dzone.com/posts/show/4235
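A tiny sketch of those two heuristics combined (the common-word set is a stand-in for a real dictionary or Bloom filter):

```python
# Treat capitalised words that are not in the dictionary as candidate names.
import re

COMMON_WORDS = {"the", "and", "with", "university", "experience", "education"}

def candidate_names(text: str) -> list:
    return [
        word
        for word in re.findall(r"\b[A-Z][a-z]+\b", text)
        if word.lower() not in COMMON_WORDS
    ]

print(candidate_names("John Smith studied at Stanford University."))
# -> ['John', 'Smith', 'Stanford'] -- crude, as noted, but a start.
```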
For your search engine, the two main choices with Rails are Sphinx and Solr.
Hope this helps!

Rails CMS: static files or database records?

I'm trying to figure out the cut-off for when a "text entry" should be stored in the database vs. as a static file. Are there any rules of thumb here? The text entries will be at most several paragraphs and will have links to images and tables (and hyperlinks to other text entries). Some criteria for the text entries:

1. I'm thinking of using DITA as the content format
2. The text should be searchable
3. If the text is revised, a new version will be created
thanks in advance, Chuck
The "rails way" would be using a database.
The solution will be more scalable, therefore faster and probably easier to develop with (using migration and so on). Using the file system, you will have to build lots of functions on your own, that are already implemented for database usage.
You could create a Model (e.g.) Document and easily use existing versioning systems, like paper_trail. When using an indexed search, you can just have an has_many relation enabling you to realise the depencies between the models (destroy a model means to destroy the search index).
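In Rails this falls out of ActiveRecord plus paper_trail, but purely to illustrate the "each revision is a new row" data model, here is a minimal sketch with SQLite (the table and column names are made up):

```python
# Store every revision of an entry as its own row instead of overwriting.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry_versions (
        id         INTEGER PRIMARY KEY,
        entry_id   INTEGER NOT NULL,
        version    INTEGER NOT NULL,
        body       TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_revision(entry_id: int, body: str) -> None:
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM entry_versions WHERE entry_id = ?",
        (entry_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO entry_versions (entry_id, version, body) VALUES (?, ?, ?)",
        (entry_id, latest + 1, body),
    )

save_revision(1, "First draft of the entry.")
save_revision(1, "Revised entry with a link to another entry.")
print(conn.execute(
    "SELECT version, body FROM entry_versions WHERE entry_id = 1 ORDER BY version"
).fetchall())
```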
Rather than a cut-off, you could look at what databases provide and ask yourself if those features would be useful. Take Isolation (the I in ACID): if you have any worries that multiple people could be trying to edit an entry at the same time, a database would handle that well while you'd have to handle the locks yourself working with files. Or Atomicity: you might want to update two things at once (e.g. an index page and an entry page) and know they will either both succeed or both fail.
Databases do a number of things beyond ACID, such as taking advantage of multiple datatypes, making querying easier, and allowing for scaling. It's a question worth asking since most databases end up having data stored in a bunch of files on disk. Would you end up writing a mini-database if you used files yourself?
Besides, if you're using Rails you might as well take advantage of its ActiveRecord functionality, which also makes it possible to use the many plugins that expect a database.
I'd use a database for even a small, single user rails app.
