Text Classification of News Articles Using Spacy - machine-learning

Dataset: CSV files containing around 1,500 rows with the columns (Text, Labels), where Text is a news article in the Nepali language and Labels is its genre (Health, World, Tourism, Weather, and so on).
I am using spaCy to train my text classification model. So far, I have converted the dataset to a dataframe
and then into a spaCy-acceptable format with this code:
dataset['tuples'] = dataset.apply(
    lambda row: (row['Text'], row['Labels']), axis=1)
training_data = dataset['tuples'].tolist()
This gives me my training dataset as a list of tuples like [('text...','label...'),('text...','label...')]
Now, how can I do text classification here?
In spaCy's documentation, I found
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
Do we have to add labels that match the labels in our dataset, or should we use POSITIVE/NEGATIVE as well? Does spaCy generate the labels from our dataset after training?
Any suggestions please?

You have to add your own labels. So, in your case:
textcat.add_label('Health')
textcat.add_label('World')
textcat.add_label('Tourism')
...
spaCy will then be able to predict only the categories that you added in the block of code above.
The training data has a special format: each element of your data list is a tuple that contains:
The text
A dictionary with a single element: the key is cats and the value is another dictionary. That inner dictionary has all of your categories as keys, with 1 or 0 as values indicating whether each category is correct for the text.
So, your data should look like this:
[('text1', {'cats' : {'category1' : 1, 'category2' : 0, ...}}),
('text2', {'cats' : {'category1' : 0, 'category2' : 1, ...}}),
...]
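As a minimal sketch, the (text, label) tuples from the question could be converted into this format as follows. The helper name to_spacy_format and the two sample texts are made up for illustration; the label list is the one given in the question.

```python
def to_spacy_format(pairs, labels=None):
    """Convert (text, label) pairs into (text, {'cats': {...}}) tuples."""
    if labels is None:
        # Assumption: if no label list is given, use every label seen in the data.
        labels = sorted({label for _, label in pairs})
    return [
        # Mark the matching category with 1 and every other category with 0.
        (text, {"cats": {name: int(name == label) for name in labels}})
        for text, label in pairs
    ]


# Hypothetical sample in the shape produced by dataset['tuples'].tolist()
training_data = [
    ("text about hospitals", "Health"),
    ("text about trekking", "Tourism"),
]
labels = ["Health", "World", "Tourism", "Weather"]
train = to_spacy_format(training_data, labels)
```

Passing the label list explicitly keeps the category keys identical for every example, even when a label does not occur in a given batch of data.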

Related

Google Sheets pulling out specific data from multiple rows and columns to put into a logistic regression function

I have a spreadsheet of multiple years of student annual writing assessment scores.
Each row is the data for one test (Test Year, Student ID, Test score with subsections, etc.).
I need to fill in each student’s data into a logistic regression model with the following variables:
SUMPRODUCT FUNCTION where I need the selected data to appear:
Spreadsheet and corresponding cells needed in logistic regression function
B Constant Y3 -16.266 [Generate a number ‘1’ to balance the sumproduct function.]
B T1AvgScore 0.911 [Student’s first year test average score] I need a function to put the data here
B T3AvgScore 2.399 [Student’s third year test average score] I need a function to put the data here
B T3SF2 0.434 [Student’s third year subsection ‘Sentence Fluency (SF)’ score] I need a function to put the data here
B T3Conv2 0.251 [Student’s third year subsection ‘Conventions (Conv)’ score] I need a function to put the data here
y* = ln(p/(1-p)) [Calculated from the above sumproduct function]
p = exp(y*)/(exp(y*)+1) [calculation for the prediction percent]
Thanks in advance for any assistance!
Well I'm not clear if I'm answering what you're looking for, but I have the formulas that pull the Average Score values from the AWA sheet for a given student number. See the tab I added to your sheet, Example-GK.
The query formula is simply:
=query(AWA,"select F where A = "&E$15&" and B = '"&D19&"' ",0)
where E15 holds the specified StudentID (a numeric value, so no single quotes are used), and D19 holds the specific year.
I also added the ability to select the StudentID number from a dropdown list, in E15 on that sheet. Or the StudentName could be used for the selection criteria, instead of the StudentID, if that was available and easier for you to use. For now, the StudentName is ignored, since it wasn't available in the data.
Let me know if this is what you're looking for. One issue is there might be more years of data for some students. There are other ways of listing the years, which might help you. I'll see if I can add that function.
Update Sept 9,2020:
If I've understood your comments correctly, and for each model there is a set of constants that apply to all students (see below for the Model 3 constants), then I may have a generic set of formulae that calculate the probabilities for each student, using all three models, provided there is sufficient data for that student.
See my updated Example-GK in your sheet.
And let me know if I still haven't understood how your final probabilities are calculated from the individual student data values.
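For reference, the calculation described in the question (a sum-product of the coefficients with the student's values, followed by the logistic transform) can be sketched in Python. The coefficients are the ones listed in the question; the function name and the student scores are hypothetical and only there to make the example runnable.

```python
import math

# Coefficients as listed in the question.
COEFFICIENTS = {
    "Constant": -16.266,
    "T1AvgScore": 0.911,
    "T3AvgScore": 2.399,
    "T3SF2": 0.434,
    "T3Conv2": 0.251,
}


def predict_probability(scores, coefficients=COEFFICIENTS):
    """Return (y*, p) for one student's scores."""
    # The constant term is multiplied by 1, as in the question's sumproduct.
    values = dict(scores, Constant=1.0)
    y_star = sum(coefficients[name] * values[name] for name in coefficients)
    # p = exp(y*) / (exp(y*) + 1)
    p = math.exp(y_star) / (math.exp(y_star) + 1)
    return y_star, p


# Hypothetical student scores, not taken from the spreadsheet.
example_scores = {"T1AvgScore": 3.0, "T3AvgScore": 4.0, "T3SF2": 4.0, "T3Conv2": 4.0}
y_star, p = predict_probability(example_scores)
```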

In EazyBi, How can I do a substring of a dimension name

I have a dimension whose member names have this structure, for example: JIRA-525:Ticket Summary.
I'd like to extract the second part, but I have had no success.
I tried to create a custom field on that dimension and apply a string operation. I know this won't give me the expected result, but even a basic string function is not working, and the grid shows errors:
left([Concept].CurrentMember.Name,10)
What should I do differently?
The error message could give us some hints on what is wrong here. If you created a new calculated measure with the formula, it should return results.
Define a new calculated measure (measure not member, sometimes people mix them up) with the formula below:
[Concept].CurrentMember.AllProperties
Does that return any results with the dimension on rows? Does it have a "Name" property?
Ultimately, I would suggest creating a new calculated measure with ExtractString() function - https://docs.eazybi.com/eazybi/analyze-and-visualize/calculated-measures-and-members/mdx-function-reference/extractstring. But first see what properties are available.
If you are looking to change the dimension member names, then that can be done only with JavaScript calculated custom fields - https://docs.eazybi.com/eazybijira/data-import/custom-fields/javascript-calculated-custom-fields.

Alteryx: Creating multiple Histograms from one dataset

I have a data set that contains the following information: date, item #, and the unit price for that item on that date. What I would like to create is one histogram per item (my dataset has 17 unique items), charting the frequency of the unit prices. Is this possible in Alteryx?
What you really want is the ability to group by items within your data set. I think the closest thing to this for your specific use case is the summarize tool. You can group by item and then use the percentile operation to generate several points within the data range to add to a histogram.

Simple way to analyze data based on common key

What would be the simplest way to process all the records that were mapped to a specific key and output multiple records for that data?
For example (a synthetic one), assume my key is a date and the values are intra-day timestamps with measured temperatures. I'd like to classify the temperatures into high/average/low within the day (for instance, above or below 1 stddev from the average).
The output would be the original temperatures with their new classifications.
Using Combine.PerKey(CombineFn) allows only one output per key using the #extractOutput() method.
Thanks
CombineFns are restricted to a single output value because that allows the system to do additional parallelization: combining different subsets of the values separately, and then combining their intermediate results in an arbitrary tree reduction pattern, until a single result value is produced for each key.
If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:
(1) Use Combine.perKey() to calculate the stats per day
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics that can be used to do lookups in processElement.
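The steps above can be sketched in plain Python. This is not Dataflow SDK code: the function names only mirror the roles of a CombineFn's accumulator methods, the dictionary stands in for the side input, and the sample readings are made up. The point is that a (count, sum, sum of squares) accumulator can be merged in any order, which is what allows the tree reduction.

```python
import math


def create_accumulator():
    return (0, 0.0, 0.0)  # count, sum, sum of squares


def add_input(acc, value):
    n, s, ss = acc
    return (n + 1, s + value, ss + value * value)


def merge_accumulators(a, b):
    # Partial results combine element-wise, in any order.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def extract_output(acc):
    # The single output per key: (mean, population standard deviation).
    n, s, ss = acc
    mean = s / n
    variance = max(ss / n - mean * mean, 0.0)  # guard against rounding below zero
    return mean, math.sqrt(variance)


def stats_per_key(records):
    """Step (1): one statistics value per key."""
    accs = {}
    for key, value in records:
        accs[key] = add_input(accs.get(key, create_accumulator()), value)
    return {key: extract_output(acc) for key, acc in accs.items()}


def classify(records, stats):
    """Steps (3) and (4): reprocess the input, looking up each key's stats."""
    for key, value in records:
        mean, std = stats[key]
        if value > mean + std:
            label = "high"
        elif value < mean - std:
            label = "low"
        else:
            label = "average"
        yield key, value, label


# Hypothetical (day, temperature) readings.
readings = [("2015-01-01", 10.0), ("2015-01-01", 12.0),
            ("2015-01-01", 20.0), ("2015-01-02", 5.0)]
stats = stats_per_key(readings)
classified = list(classify(readings, stats))
```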
Why not use a GroupByKey operation followed by a ParDo? The GroupBy would group all the values with a given key. Applying a ParDo then allows you to process all the values with a given key. Using a ParDo you can output multiple values for a given key.
In your temperature example, the output of the GroupByKey would be a PCollection of KV<Integer, Iterable<Float>> (I'm assuming you use an Integer to represent the Day and a Float for the temperature). You could then apply a ParDo to process each of these KVs. For each KV you could iterate over the Floats representing the temperatures and compute the high/average/low thresholds. You could then classify each temperature reading using those stats and output a record representing the classification. This assumes the number of measurements for each Day is small enough to fit easily in memory.
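The logic of this approach can be illustrated in plain Python (again, not the Dataflow SDK). Here group_by_key stands in for GroupByKey and classify_group for the body of the ParDo; both names and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_by_key(records):
    """Collect all values that share a key, as GroupByKey would."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped


def classify_group(day, temperatures):
    """Process one key's values and emit one output record per reading."""
    temps = list(temperatures)  # assumes one day's readings fit in memory
    avg = mean(temps)
    std = pstdev(temps)
    for t in temps:
        if t > avg + std:
            label = "high"
        elif t < avg - std:
            label = "low"
        else:
            label = "average"
        yield day, t, label


# Hypothetical (day, temperature) readings with an integer day key.
readings = [(1, 10.0), (1, 12.0), (1, 20.0), (2, 5.0)]
output = [
    record
    for day, temps in group_by_key(readings).items()
    for record in classify_group(day, temps)
]
```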

Google prediction api, Having the output as a list that have a variable size

I want to train a model that will allow me to generate a LIST of tags related to a certain text. My output list will have a variable size depending on the context. In the examples that I found, the model always returns one output.
I am wondering if the Google Prediction API can help me and if there are any examples.
It seems this is what you will get when using a CATEGORICAL rather than a REGRESSION model. From the Google documentation: https://developers.google.com/prediction/docs/developer-guide#examples
When you submit a query, the API tries to find the category that most closely describes your query. In the Hello World example, the query "mucho bueno" returns the result
...Most likely: Spanish, Full results:
[
{"label":"French","score":-46.33},
{"label":"Spanish","score":-16.33},
{"label":"English","score":-33.08}
]
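As a rough sketch, a variable-size tag list could be derived from such a full result by keeping every label whose score clears a cutoff. The JSON below is the result quoted above; the threshold value is hypothetical, and whether thresholding these scores gives useful tags is an assumption, not something the quoted documentation states.

```python
import json

# Full results as quoted from the Hello World example.
full_results = json.loads("""
[
  {"label": "French", "score": -46.33},
  {"label": "Spanish", "score": -16.33},
  {"label": "English", "score": -33.08}
]
""")

# Rank labels from highest to lowest score.
ranked = sorted(full_results, key=lambda r: r["score"], reverse=True)
most_likely = ranked[0]["label"]

# Hypothetical cutoff: keep every label scoring at or above it.
SCORE_THRESHOLD = -40.0
tags = [r["label"] for r in ranked if r["score"] >= SCORE_THRESHOLD]
```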
