What I understood from the Document AI docs is that the best fit for extracting information from a report such as a medical test result is the Form Parser processor. It does a good job on reports where there is exactly one value per label, such as patient name or patient age. But I am trying to get the table of test results as a map of key-value pairs, where the key is the test name and the value is the result.
With a custom processor I tried to define a label with a property that can appear multiple times, but that does not maintain the link between testName and testValue.
The report looks like the following:
The desired result would probably be:
{
  "name": "Jon Doe",
  "age": 76,
  "tests": [
    {
      "testName": "CRP",
      "testValue": 51
    },
    {
      "testName": "Creatinine",
      "testValue": 0.8
    }
  ]
}
I think it would be something similar to a table.
https://cloud.google.com/document-ai/docs/handle-response
The Form Parser Processor allows for Table Parsing when it can detect tables in the document. This sample code shows how the formFields and tables can be extracted.
https://cloud.google.com/document-ai/docs/handle-response#forms_and_tables
This Form Parser Codelab also shows a few more examples, like transforming the formFields & Tables into a Pandas DataFrame.
https://codelabs.developers.google.com/codelabs/docai-form-parser-v1-python
You can also create a Custom Document Extractor processor that makes a custom model for the specific document structure, but you will have to label example documents and train a new version.
Note, this creates an Entity Extraction processor, which works differently than the Form Parser (and doesn't currently extract form fields & tables in the same way).
You'll need to label each entity individually, train the processor, and use this sample code to get the entity information from the processing response.
https://cloud.google.com/document-ai/docs/handle-response#entities_nested_entities_and_normalized_values
Related
I am trying to train a neural network to extract specific information from text. I need to find entity information such as attributes, dependencies, etc.
The text input will be like this:
doc = "Student has name, surname, id, and number..."
I have preprocessed this doc and have POS tags and dependency information to feed my network. The wanted output, on the other hand, looks like this:
Entity: name....
Attribute: name, type...
Relation: name, cardinality, etc.
I do have a dataset of the wanted outputs, but I have a problem feeding it to the system.
I have embedded X but have trouble embedding Y.
Example of one Y row :
y = [("Student","Course"),0,(("name","id","address"),("title","level","credits"),((0,0,1),(0,0,0)),((0,0,0),(0,0,0)),((0,1,0),(0,1,0))),(("takes"),"Student","Course",((1,"N")),(("N",1)))]
So I want to create a model with multiple inputs and outputs, but I am stuck on how to feed it.
Overall, I have a structure like a list of lists or tuples. The length of the lists may vary depending on the input.
Any ideas?
I am trying to extract previous job titles from a CV using spaCy and named entity recognition.
I would like to train spaCy to detect a custom named entity type: 'JOB'. For that I have around 800 job titles from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spacy, do I need to integrate these job titles in sentences added to provide context or not?
In general, in a CV the job title kind of stands on its own and is not really part of a full sentence.
Also, if I need to provide coherent context for each of the 800 titles, it will be too time-consuming for what I'm trying to do, so maybe there are other solutions than NER?
Generally, Named Entity Recognition relies on the context of words; otherwise the model would not be able to detect entities in previously unseen words. Consequently, the list of titles alone would not help you train a model. You could instead run string matching to find any of those 800 titles in CV documents, and you would be guaranteed to find all occurrences of them, but no unknown titles.
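A minimal sketch of the string-matching route, using only the standard library. The function name and the choice of case-insensitive, whole-word matching are assumptions made for this example. spaCy's PhraseMatcher would be another way to do the same thing.

```python
import re


def find_job_titles(text: str, titles: list[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every listed title found in the text."""
    # Longest titles first so "Senior Data Engineer" wins over "Data Engineer".
    ordered = sorted(set(titles), key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
```

The returned offsets can double as character-span annotations if you later decide to train a model on the matched CVs.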
If you could find 800 (or fewer) real CVs and replace the job names with those in your list (or others!), then you would be all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web as you can and see where this gets you. If it is not enough data, you can augment it, for example by swapping the job titles in the data for some of the titles in your list.
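The augmentation idea could look roughly like this. It assumes you have marked the job title in a real CV line with a placeholder (the "{JOB}" marker is invented for this example) and emits (text, annotations) pairs with character offsets. The layout mirrors the character-offset "entities" annotations that spaCy accepts, but check it against the spaCy version you use.

```python
import random


def augment(template: str, titles: list[str], n: int, seed: int = 0) -> list[tuple[str, dict]]:
    """Fill a CV line containing a {JOB} slot with titles and record the span offsets."""
    rng = random.Random(seed)
    start = template.index("{JOB}")  # raises ValueError if the slot is missing
    examples = []
    for title in rng.sample(titles, k=min(n, len(titles))):
        text = template.replace("{JOB}", title, 1)
        # The entity span covers exactly the substituted title.
        examples.append((text, {"entities": [(start, start + len(title), "JOB")]}))
    return examples
```

Because the surrounding line comes from a real CV, the model still sees realistic context around each substituted title.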
I have a large set of documents. Each document contains multiple keywords. I would like to create an OLAP cube that calculates the co-occurrence of keywords in this set. Is it possible to build such a solution using an OLAP cube? If so, what would be the attributes of the fact table, the dimensions, the measure and the aggregation function? Also, what tool do you suggest?
An example of a document (in JSON form, although the form doesn't really matter):
example of document
In the template for training CRF++, how can I include a custom dictionary.txt file for listed companies, another for popular European foods, for example, or for just about any category?
Then I would provide sample training data for each category, from which it learns how those specific named entities are used in context for that category.
In this way, both I and the system can be sure it correctly understood how certain named entities are structured in a text, whether a tweet or a Pulitzer Prize-winning news article, instead of providing hundreds of megabytes of data.
This would be rather cool. The model would have a definite dictionary of known entities (which does not need to be expanded) and a statistical approach to how those known entities are structured in human text.
PS - Just for clarity, I am not yearning for a regex NER. Those are only cool if you have lots in the dictionary, lots of rules and lots of dull time.
I think what you are describing is a gazetteer list (dictionary.txt).
You would have to include a corresponding feature for each word in the training data and then reference it in the template file.
For Example: Your list contains the entity: Hershey's
and training data has a sentence: I like Hershey's chocolates.
So when you arrange the data in CoNLL format (for CRF++), you can add a column with values 0 or 1, indicating whether the word is present in the dictionary. It will be 0 for all words except Hershey's.
You also have to include this column as a feature in the template file.
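Here is a small sketch of both steps: generating the extra column and referencing it from the template. The helper name, the B-BRAND label and the lowercase single-token lookup are assumptions for illustration, and multi-word dictionary entries would need phrase matching instead. The template lines use the standard CRF++ %x[row,col] macros, where column 0 is the token and column 1 is the dictionary flag.

```python
def to_crfpp_rows(tokens: list[str], labels: list[str], gazetteer: set[str]) -> str:
    """Emit one 'token<TAB>in_dict<TAB>label' line per token; the label is the last column."""
    lines = []
    for token, label in zip(tokens, labels):
        # Assumption: the gazetteer stores lowercase single tokens.
        in_dict = "1" if token.lower() in gazetteer else "0"
        lines.append(f"{token}\t{in_dict}\t{label}")
    # CRF++ separates sentences with an empty line.
    return "\n".join(lines) + "\n\n"


# Template fragment exposing both input columns to CRF++.
TEMPLATE = """\
# Unigram
U00:%x[0,0]
U01:%x[-1,0]
U02:%x[1,0]
U10:%x[0,1]
U11:%x[-1,1]/%x[0,1]

# Bigram
B
"""

if __name__ == "__main__":
    sentence = ["I", "like", "Hershey's", "chocolates", "."]
    tags = ["O", "O", "B-BRAND", "O", "O"]
    print(to_crfpp_rows(sentence, tags, {"hershey's"}), end="")
```

U10 lets the model use the dictionary flag of the current word, and U11 combines it with the flag of the previous word.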
To get a better understanding on Template File and NER training with CRF++, you can watch the below videos and comment your doubts :)
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4
EDIT: (after viewing the OP's comment)
Sample Training Data with extra features: https://pastebin.com/fBgu8c67
I've added 3 features. The IsCountry feature value (1 or 0) can be obtained from a gazetteer list of countries. The other 2 features can be computed offline. Note that the headers are added to the file for reference only and should not be included in the training data file.
Sample Template File for the above data : https://pastebin.com/LPvAGCVL
Note that the test data should be in the same format as the training data, with the same features and the same number of columns.
I have a trained model in Amazon ML that can classify a piece of text into several categories. For example, if I give a piece of text to Amazon ML, it will return whether this piece belongs to the "subject" category, the "content" category, etc.
I wonder if it's possible to send a full text and get a response telling me which part is the subject and which part is the content.
You should think of each ML model as a function, and then you can write code that combines these functions.
For example, one model can represent the function "isSubject", and the second model "isBody", etc. These functions can be built on Amazon ML, and they will return a score between 0 and 1, that you can convert into Boolean in the code that is calling these functions/models.
You can now take your full text, cut it into pieces, send each piece to the various models you trained, and use the scores that the models return as the basis for your decision regarding the full text.
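A sketch of that glue code is below. The models are represented by plain Python callables that return a score between 0 and 1. In practice each callable would wrap a request to one of your Amazon ML models, which is not shown here. The function name, the 0.5 threshold and the idea of picking the highest-scoring category are assumptions for illustration.

```python
from typing import Callable, Optional

Scorer = Callable[[str], float]


def label_pieces(
    pieces: list[str],
    models: dict[str, Scorer],
    threshold: float = 0.5,
) -> list[tuple[str, Optional[str]]]:
    """Assign each piece the category whose model scores it highest, if above threshold."""
    labelled = []
    for piece in pieces:
        scores = {name: score(piece) for name, score in models.items()}
        if not scores:
            labelled.append((piece, None))
            continue
        best = max(scores, key=lambda name: scores[name])
        # Convert the winning score into a yes/no decision.
        labelled.append((piece, best if scores[best] >= threshold else None))
    return labelled
```

How you cut the full text into pieces (by line, paragraph or sentence) depends on your data. Pieces that no model is confident about come back with None, so you can inspect them separately.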