We are exploring the Deep Java Library (DJL) for a question-and-answer application, as described at http://djl.ai/examples/docs/BERT_question_and_answer.html
public static String predict() throws IOException, TranslateException, ModelException {
    // String question = "How is the weather";
    // String paragraph = "The weather is nice, it is beautiful day";
    String question = "When did BBC Japan start broadcasting?";
    String paragraph =
            "BBC Japan was a general entertainment Channel. "
                    + "Which operated between December 2004 and April 2006. "
                    + "It ceased operations after its Japanese distributor folded.";

    QAInput input = new QAInput(question, paragraph);
    logger.info("Paragraph: {}", input.getParagraph());
    logger.info("Question: {}", input.getQuestion());

    Criteria<QAInput, String> criteria =
            Criteria.builder()
                    .optApplication(Application.NLP.QUESTION_ANSWER)
                    .setTypes(QAInput.class, String.class)
                    .optFilter("backbone", "bert")
                    .optEngine(Engine.getDefaultEngineName())
                    .optProgress(new ProgressBar())
                    .build();

    try (ZooModel<QAInput, String> model = criteria.loadModel()) {
        try (Predictor<QAInput, String> predictor = model.newPredictor()) {
            return predictor.predict(input);
        }
    }
}
However, instead of a static "paragraph", we want to use indexed (Lucene/Solr) data to answer the question. How can we do that?
Sorry, this answer is not exactly for Solr, but I hope it can help if you are open to trying OpenSearch.
OpenSearch provides a semantic search feature. It runs a text-embedding deep learning model on top of its model-serving framework, which uses DJL as the ML engine. The semantic search feature converts sentences to dense vectors and saves them to an OpenSearch index at ingest time. Users can then run a k-NN search to find similar sentences in the OpenSearch index.
OpenSearch doesn't support QA models yet, but there are plans to support more NLP models. Others have shown interest in this QA feature; see the "[Feedback] Machine Learning Model Serving Framework - Experimental Release" thread on the OpenSearch forum. You can also add your requirement there: the more people ask for the feature, the sooner OpenSearch will prioritize it.
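In the meantime, one common workaround is a retrieve-then-read pipeline: use the search index to fetch candidate paragraphs for the question, then run the QA model over each candidate and keep the best-scoring answer. The original question is in Java/DJL, but the flow is the same in any language; below is a rough Python sketch where pysolr (a Solr core hypothetically named "docs" with a "content" field) and a Hugging Face QA pipeline are stand-ins I chose for illustration, not anything from the answer above.

# Retrieve-then-read sketch. Assumptions: a Solr core called "docs" with a
# "content" text field, pysolr installed, and a Hugging Face QA pipeline
# standing in for the DJL predictor from the question.
import pysolr
from transformers import pipeline

solr = pysolr.Solr("http://localhost:8983/solr/docs", timeout=10)
qa = pipeline("question-answering")  # downloads a default extractive QA model

def answer(question, rows=5):
    # 1) Retrieve candidate paragraphs from the index with a plain keyword query.
    hits = solr.search("content:(%s)" % question.rstrip("?"), rows=rows)
    # 2) Run the QA model over each candidate and keep the highest-scoring answer.
    best = None
    for hit in hits:
        content = hit["content"]
        context = content[0] if isinstance(content, list) else content
        result = qa(question=question, context=context)
        if best is None or result["score"] > best["score"]:
            best = result
    return best["answer"] if best else None

print(answer("When did BBC Japan start broadcasting?"))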
I am new to clustering techniques and I highly value any input you can provide for my problem below.
Basically, I want to cluster URLs based on their structural patterns.
For example:
cluster1 - simple URLs https://domain/path/file
cluster2 - shortened URLs
cluster3 - redirect URLs
....
cluster k - new URL pattern
Given a URL dataset, I want to understand how many different URL pattern clusters exist and then visually see the difference.
What I see in the existing methods is clustering domain-wise (clustering URLs of the same website together), and that is not what I am expecting. When I try NLP-based (word-based) similarity clustering, the same thing happens, since URLs of the same website tend to contain the same words with only small differences.
Instead, I want to focus on the URL structure and identify URL patterns. Removing all the special characters and just creating a bag of words for each URL nullifies the URL structure. Can anyone help me identify a suitable clustering technique, as well as a vectorizing technique, to find different URL pattern clusters?
Thanks in advance
Matheesha
Here is an example of how to cluster text.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ") #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *eating:* climbing, eating
- *google:* google, squooshy
- *feedback:* feedback
- *face:* face, map
- *impressed:* impressed
- *extension:* extension
- *key:* belly, best, key, kitten, merley
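The example above clusters raw tokens by edit distance. To get at URL structure rather than shared words (which is what the question is after), one option, purely a suggestion and not part of the answer above, is to reduce each URL to a structural skeleton (digits and letter runs collapsed to placeholder tokens, host dropped) and then cluster character n-grams of those skeletons. A rough sketch, with made-up example URLs and illustrative skeleton rules:

# Sketch: cluster URLs by structural pattern rather than by shared words.
# The skeletonisation rules and the number of clusters are illustrative only.
import re
from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

urls = [
    "https://example.com/path/file.html",
    "https://other.org/a/b/c?id=123",
    "http://bit.ly/x7Yq2",
    "https://example.com/redirect?url=https://target.com",
]

def skeleton(url):
    p = urlparse(url)
    path = re.sub(r"\d+", "D", p.path)        # collapse digit runs
    path = re.sub(r"[A-Za-z]+", "W", path)    # collapse letter runs
    query = "Q" if p.query else ""            # keep only the presence of a query
    return f"{p.scheme}//{path}{query}"       # host intentionally dropped

skeletons = [skeleton(u) for u in urls]
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X = vec.fit_transform(skeletons)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for url, label in zip(urls, km.labels_):
    print(label, url)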
I'm trying to detect contact information in a huge list of offers I get. The offers contain text without any given structure; some examples could be the following:
if you're interested, send an email to test#test.com
want to know more? call 000 000 000
come to the public viewing on 25nd of january
public viewing is on the coming wednesday
is this what you're searching for? We're looking forward to hearing from you
As you can see, there are multiple possibilities:
there is no date for a viewing, but there's a phone number
there is no date for a viewing, but there's an email
there is a date for a viewing
there is no detailed information
The tricky point is that there can also be other, unrelated dates in the text, so I can't just parse out dates.
What is the best way to solve something like this? I've already tried regex. I think I could get it to work, but there is an enormous number of cases, which makes it very hard.
I've also looked into NLP libraries like https://spacy.io/ or prodi.gy, but I feel like I'm not on the right track.
The original texts are written in German.
In 2020, how should I go about this?
You can use an NLP-powered rule-based matcher. With spaCy you explored the right tool, you just didn't go deep enough with it. And it's available in German.
Here are some example patterns:
# call-number pattern
call_pattern = [{'LOWER': 'call'},
                {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'ddd'}, {'ORTH': ')', 'OP': '?'},
                {'SHAPE': 'ddd'},
                {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]

# e-mail pattern
email_pattern = [{'LIKE_EMAIL': True}]

# pattern for public viewing
public_viewing_pattern = [{'LOWER': 'public'},
                          {'LOWER': 'viewing'},
                          {'POS': 'AUX', 'OP': '?'},
                          {'POS': 'ADP', 'OP': '?'},
                          {'ENT_TYPE': 'DATE', 'OP': '+'}]  # token is part of a DATE entity
Then, you iterate over your patterns and apply them:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
# or, for German:
# import de_core_news_sm
# nlp = de_core_news_sm.load()

matcher = Matcher(nlp.vocab)
# spaCy 2.x signature; in spaCy 3.x use matcher.add("call_pattern", [call_pattern])
matcher.add("call_pattern", None, call_pattern)
matcher.add("email_pattern", None, email_pattern)
matcher.add("public_viewing_pattern", None, public_viewing_pattern)

# the example offers from the question
sentences = [
    "if you're interested, send an email to test#test.com",
    "want to know more? call 000 000 000",
    "come to the public viewing on 25nd of january",
    "public viewing is on the coming wednesday",
    "is this what you're searching for? We're looking forward to hearing from you",
]

found = {'numbers': [], 'emails': [], 'public_viewings': []}
for sent in sentences:
    doc = nlp(sent)
    matches = matcher(doc)
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'call_pattern':
            found['numbers'].append(doc[start:end])
        if doc.vocab.strings[match_id] == 'email_pattern':
            found['emails'].append(doc[start:end])
        if doc.vocab.strings[match_id] == 'public_viewing_pattern':
            found['public_viewings'].append(doc[start:end])

print(found)
result:
{'numbers': [call 000 000 000], 'emails': [test#test.com], 'public_viewings': [public viewing on, public viewing on 25nd, public viewing on 25nd of, public viewing on 25nd of january, public viewing is, public viewing is on, public viewing is on the, public viewing is on the coming, public viewing is on the coming wednesday]}
P.S.: The repeated matches are caused by a bug in spaCy versions prior to 2.1. Just add some manual validation for overlapping matches (keep the one with the greatest length) and you'll be good.
The hard part will be to generalize enough and get your patterns right, but they are very powerful and you can do all sorts of tweaks to them. Check the spaCy online demo for testing, and refer to the documentation for more complex stuff.
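A small sketch of the manual validation mentioned above, keeping only the longest of any overlapping matches. It builds on the nlp and matcher objects from the snippet above and assumes a spaCy version that ships spacy.util.filter_spans (2.1.4+); on older versions you would sort the spans by length yourself.

# Keep only the longest span among overlapping matches.
# filter_spans prefers longer, then earlier, spans (spaCy >= 2.1.4).
from spacy.util import filter_spans

doc = nlp("come to the public viewing on 25nd of january")
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
for span in filter_spans(spans):
    print(span.text)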
So I have a task in front of me to build a custom NER model for the pharmaceutical industry, where I have a finite list of drugs and over 4,000 text files on which NER is supposed to be done. I have also tried entity matching using spaCy, but it is showing an error. So now I plan to use sklearn-crfsuite, but for that my data needs to be in CoNLL format and should be annotated. I would really appreciate it if someone could guide me in annotating my text files! Is there any way I can initiate automatic annotation on the text files using the drug list I have? It is a humongous effort for an individual to achieve the same manually. I also had a look at the question asked in the link mentioned below.
NER model to recognize Indian names
But no one has actually addressed my question. I would really appreciate it if someone could help me out.
spaCy code:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc


data = pd.read_excel(r'C:\Users\xyz\pname.xlsx')
ld = list(set(data['Product']))

nlp = spacy.load('en')
entity_matcher = EntityMatcher(nlp, ld, 'DRUG')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names)

doc = nlp('Hi bnbbn, ope all is well. In preparation for the bcbcb is there anything that BGTD requires specifically? We had sent you the US centric Briefing Package to align with our previous discussion on having bkjnsd included in the Wave 1 IMOVAX POLIO submission plan. If you would like, we can set-up a BGTD specific meeting after the June 20th meeting to discuss any jk specific product questions you may have as the product mix is a bit different between countries.')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
When I run my script, this is the error I get:
[T002] Pattern length (11) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.
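Not from the thread, but as a rough sketch of the automatic annotation the question asks about: use PhraseMatcher to find the drug names from the list and dump character offsets, which can later be converted to CoNLL/BIO tags. It assumes a newer spaCy (2.1+ removes the 10-token limit behind the T002 error; the matcher.add signature below is the 3.x one), and the drug list and file paths are placeholders.

# Rough sketch: auto-annotate drug mentions from a fixed list (assumes spaCy 3.x;
# the drug list and folder path are placeholders).
import glob
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")                                 # tokenizer alone is enough for matching
drugs = ["ibuprofen", "paracetamol", "imovax polio"]    # replace with the list from your Excel file
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")        # case-insensitive matching
matcher.add("DRUG", [nlp.make_doc(d) for d in drugs])

annotations = []                                        # (text, [(start_char, end_char, label), ...])
for path in glob.glob("texts/*.txt"):                   # placeholder folder of the 4,000 files
    text = open(path, encoding="utf8").read()
    doc = nlp(text)
    ents = [(doc[s:e].start_char, doc[s:e].end_char, "DRUG")
            for _, s, e in matcher(doc)]
    annotations.append((text, ents))

print(annotations[:1])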
{
"blogid": 11,
"blog_authorid": 2,
"blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
"blog_timestamp": "2018-03-17 00:00:00",
"blog_title": "Amazon India Fashion Week: Autumn-",
"blog_subtitle": "",
"blog_featured_img_link": "link to image",
"blog_intropara": "Introductory para to article",
"blog_status": 1,
"blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
"blog_type": "Blog",
"blog_tags": "1,4,6",
"blog_uri": "Amazon-India-Fashion-Week-Autumn",
"blog_categories": "1",
"blog_readtime": "5",
"ViewsCount": 0
}
Above is one sample blog as returned by my API. I have a JSON array of such blogs.
I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in the title/subtitle) and contents. I have no user data, i.e. there is no logged-in user data (such as ratings or reviews). I know that without user data it will not be accurate, but I'm just getting started with data science and ML. Any suggestion/link is appreciated. I prefer using Java, but Python, PHP or any other language also works for me. I need an easy-to-implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites that would be an inventory from which to predict. For each site you will need to list one or more features: number of tags, number of posts, average time between posts in days, etc.
Since this sounds like it is for learning and you are not worried too much about accuracy, numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers: instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than k-NN, which is considered to be among the simpler ML algorithms, a good place to start.
Good luck.
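A minimal sketch of the k = 3 neighbour idea above, assuming only simple numeric features per blog (the feature values are made up for illustration):

# k = 3 nearest neighbours over simple numeric blog features (illustrative values).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

blog_ids = [11, 12, 13, 14, 15]
# features: [number of tags, number of categories, read time, views]
features = np.array([
    [3, 1, 5, 0],
    [2, 1, 4, 10],
    [5, 2, 7, 3],
    [3, 1, 5, 1],
    [1, 3, 2, 40],
], dtype=float)

X = StandardScaler().fit_transform(features)   # keep features on one scale
nn = NearestNeighbors(n_neighbors=4).fit(X)    # 4 = the query blog itself + 3 others
_, idx = nn.kneighbors(X[0:1])                 # neighbours of blog 11
print([blog_ids[i] for i in idx[0][1:]])       # drop the query blog itself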
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time-series data. This is a broad request. Just like you, when faced with this request, I'd need to dive into the data and research the best approach. Some approaches require different sets of data, e.g. collaborative vs. content-based filtering.
A few things may have been missed on the user side that can be used as a sort of rating: you do not need a login feature to get information; a cookie ID or IP-based DMA, geo and viewing duration should be available to the web server.
On the blog side, you need to process the texts to identify related terms; I gave examples of other blog features above.
I am aware that this is a lot of hand-waving, but there's no actual code question here. To reiterate, my intuition is that this question might not be at the right address.
I really want to help, but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF generated words with scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the words-with-scores TF-IDF output. You can treat those (over a certain score) as tags and run another similarity analysis, or count the overlap.
You have already started on this path, and this answer assumes you are to continue. IMO the best path is to see which dedicated recommender engines can help you without constructing statistics piecemeal (numeric w/ Euclidean, tags w/ Jaccard, text w/ TF-IDF).
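A sketch of the piecemeal heuristic described above, combining the three statistics with adjustable weights. The weights and similarity values are placeholders to show the shape of the computation, not recommended settings:

# Combine Jaccard (tags), TF-IDF cosine (text) and Euclidean (numeric) into one score.
# Weights are placeholders; tune them or learn them from click data later.
def combined_score(jaccard, tfidf_cosine, euclidean, w=(0.3, 0.5, 0.2)):
    euclid_sim = 1.0 / (1.0 + euclidean)   # turn the distance into a similarity
    return w[0] * jaccard + w[1] * tfidf_cosine + w[2] * euclid_sim

# Pairwise statistics of the query blog against candidate blogs (made-up values).
candidates = {
    12: (0.40, 0.61, 2.0),
    13: (0.10, 0.22, 5.5),
    14: (0.55, 0.48, 1.2),
    15: (0.05, 0.10, 8.0),
}

scores = {bid: combined_score(*stats) for bid, stats in candidates.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print(top3)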
I want to tag text based on the category it belongs to ...
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How can I do this using OpenNLP or other NLP engines?
MY WORK
I tried an NER model, but it needs a large training corpus, which I don't have.
My Need
Is there any ready-made training corpus available for NER or classification (it must contain scientific and engineering words)?
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib. With Doccat you would get a probability distribution for each chunk of text.
With Doccat, your sample would produce something like this:
"Clutch and gear is monitored using microchip " -> mechanical 0.85847568, electronic 0.374658
With Doccat you will lose the keyword -> class-label mapping, so if you really need it, Doccat might not cut it.
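Doccat is part of OpenNLP (Java). Just to illustrate the same idea, a per-category probability distribution for a whole sentence, here is a small stand-in sketch using scikit-learn rather than the OpenNLP API; the tiny training set is made up and far too small for real use:

# Stand-in for the Doccat idea: a classifier that outputs a probability per category.
# The training sentences and labels are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "clutch and gear assembly torque",
    "gearbox bearing vibration",
    "microchip circuit voltage sensor",
    "software monitors hydrogen levels",
]
train_labels = ["mechanical", "mechanical", "electronic", "computer"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

sentence = "Clutch and gear is monitored using microchip"
for label, prob in zip(clf.classes_, clf.predict_proba([sentence])[0]):
    print(label, round(prob, 3))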
As for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite NER model building. You can create a file/list of as many terms for each category as you can think of, then create a file with a bunch of sentences, then use the addon to create an NER model from the seed terms and the file of sentences. See this post, where I described it before with a code example. You will have to pull down the addon from SVN.
OpenNLP: foreign names does not get recognized