Low scores for currency entities in IBM Watson NLU entity extraction - named-entity-recognition

I'm trying to extract some entities and relations from text documents using NLU and WKS. I got good results, but I would like to understand why Watson NLU does not recognize some entities of my custom model in similar documents, for example:
Text 1 in Portuguese: "Dá à causa o valor de R$ 10.000,00" ("Assigns to the case the value of R$ 10,000.00") => DIDN'T WORK
Text 2 in Portuguese: "Dá à causa o valor de R$ 20.000,00" => WORKED!
Text 3 in Portuguese: "Dá à causa o valor de R$ 10.000,01" => WORKED!
Watson recognizes my entities and relations in Texts 2 and 3 but not in Text 1. The same thing happens with:
Text 4 in Portuguese: "Dá à causa o valor esperado de R$ 20.000,00" ("valor esperado" = "expected value") => DIDN'T WORK
Text 5 in Portuguese: "Dá à causa o valor de R$ 20.000,00" => WORKED!
A sample of a tagged document (screenshot omitted).
Dataset:
Training set: 250 documents (85%)
Test set: 35 documents (12%)
Blind set: 10 documents (3%)
I have already tried other splits.
All documents contain the entities and the relation, once per document, with variations.
I have already tagged more documents with this scenario, but it didn't improve the results. Another test was to tag every currency value in the documents.
What can I do to improve the results?


Complex multirow/multicolumn table

I am trying to create a complex table to define a lot of concepts, which can have subconcepts with their own definitions.
Right now, my problem is controlling the width of every column of the table. Because the definitions run to more than one line and can get pretty long, LaTeX does not break the lines the way I'm used to in normal tables. For that, I usually use p{0.X\columnwidth} instead of l/c/r, but it doesn't seem to work here.
Right now, the goals to attain for this table are:
Make it autofit the content inside each cell in an automatic/semi-automatic way (i.e., without inserting manual line breaks);
Vertically and horizontally center every cell that is not a definition;
Make the whole table fit inside a page.
Here is a small example already built:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{multirow}
\begin{document}
\begin{table}[]
\begin{tabular}{p{0.1\columnwidth}p{0.1\columnwidth}p{0.1\columnwidth}p{0.5\columnwidth}}
\hline
\multirow{11}{*}{A} & \multicolumn{3}{l}{Definition of A} \\ \cline{2-4}
& \multirow{5}{*}{A1} & \multicolumn{2}{l}{Definition of A1} \\ \cline{3-4}
& & A11 & Definition of A11 \\ \cline{3-4}
& & A12 & Definition of A12 \\ \cline{3-4}
& & A13 & Definition of A13 \\ \cline{3-4}
& & A14 & Definition of A14 \\ \cline{2-4}
& A2 & \multicolumn{2}{l}{Definition of A2} \\ \cline{2-4}
& \multirow{4}{*}{A3} & \multicolumn{2}{l}{Definition of A3} \\ \cline{3-4}
& & A31 & Definition of A31 \\ \cline{3-4}
& & A32 & Definition of A32 \\ \cline{3-4}
& & A33 & Definition of A33 \\ \hline
\multirow{4}{*}{B} & \multicolumn{3}{l}{Definition of B} \\ \cline{2-4}
& B1 & \multicolumn{2}{l}{Definition of B1} \\ \cline{2-4}
& B2 & \multicolumn{2}{l}{Definition of B2} \\ \cline{2-4}
& B3 & \multicolumn{2}{l}{Definition of B3} \\ \hline
\multirow{3}{*}{C} & \multicolumn{3}{l}{Definition of C} \\ \cline{2-4}
& C1 & \multicolumn{2}{l}{Definition of C1} \\ \cline{2-4}
& C2 & \multicolumn{2}{l}{Definition of C2} \\ \hline
\end{tabular}
\end{table}
\end{document}
Unfortunately multirow can't deal well with this kind of table, because it doesn't get information from LaTeX's tabular about the heights of the table cells. So you would have to count the actual number of lines that the cells occupy, which is annoying.
This kind of table is much easier to do with the tabularray package, which has its own implementation of the table mechanism, one that does have this kind of information available. Actually, it is possible to reuse most of your code by just replacing \usepackage{multirow} with \usepackage{tabularray} and the tabular environment with tblr. However, although the \multirows will then be properly positioned, you will not get the full power of tabularray; for example, the entries will not be horizontally centered. Moreover, support for \multirow and \multicolumn in tabularray will disappear in a future version. So I rewrote the table to use the syntax that tabularray has for these cases, \SetCell. A few remarks to begin with:
In tabular you would use expressions like 0.7\textwidth+2\tabcolsep to calculate the width of a \multicolumn. However, tabularray doesn't use \tabcolsep; instead it has leftsep and rightsep parameters. These are available as \leftsep and \rightsep inside a cell, but unfortunately they are not set outside the cell, for example in a width specification. Therefore I (ab)used \tabcolsep in the width calculation and set colsep=\tabcolsep, which sets both leftsep and rightsep.
tabularray uses \raggedright internally for l and p columns, but with the standard LaTeX definitions this makes hyphenation very difficult. Therefore I used the ragged2e package to get better definitions for this. It has to be included before tabularray to work properly.
I defined commands \Mdois and \Mtres for the \multicolumn{2} and {3} cells. I included the \justifying command (from the ragged2e package) to get fully justified text. You can leave this out if you don't want the text to be justified.
I included cells={c,m} to get the cells both horizontally and vertically centered by default. I also used m{} type column specifiers for vertical centering in the larger text columns.
I used rowsep=1pt to bring the rows closer together. Still, I had to use \small to make everything fit on the page. Actually, the table sticks into the page footer, so you might have to use an even smaller font size, like \footnotesize.
TeX doesn't hyphenate the first word of a paragraph unless you include something like \hspace{0pt} before it. I made a macro \HH for this and put it before some long words like "Confidencialidade". You could actually put that in the column definition if you want.
The texts that were originally in a \multirow or \multicolumn still have braces {...} around them, although that is no longer necessary (they are not parameters of \SetCell, \Mdois or \Mtres). I thought it was too much work to remove them, and they do no harm.
So here is the solution. I reduced it to the minimum necessary for the table.
\documentclass[sigplan]{acmart}
\usepackage[portuguese]{babel}
\usepackage{calc}
\usepackage[newcommands]{ragged2e}
\usepackage{tabularray}
\UseTblrLibrary{booktabs}
\begin{document}
\begin{table*}[tbp]
\newcommand\HH{\hspace{0pt}}
\NewTableCommand{\Mdois}{\SetCell[c=2]{preto=\justifying,wd=0.7\textwidth+2\tabcolsep}}
\NewTableCommand{\Mtres}{\SetCell[c=3]{preto=\justifying,wd=0.8\textwidth+4\tabcolsep}}
\SetTblrInner{rowsep=1pt}
\small
\centering
\caption{Especificação das Métricas do CVSS}
\label{tab:metricas-cvss}
\begin{tblr}{colspec={ m{0.1\textwidth} m{0.1\textwidth} m{0.1\textwidth} Q[preto=\justifying,wd={0.6\textwidth}] },colsep=\tabcolsep,cells={c,m}}
\toprule
\SetCell[r=11]{c}{\textbf{Métricas de Base}} & \Mtres {Todas as métricas que servem de base para o cálculo da vulnerabilidade, de forma preliminar.} \\ \cline{2-4}
& \SetCell[r=5]{c}{\textbf{Métricas de Explorabilidade}} & \Mdois {As métricas de explorabilidade refletem valores relativos ao componente afetado, bem como propriedades de uma vulnerabilidade que leve a cabo um ataque bem-sucedido.} & \\ \cline{3-4}
& & \SetCell[r=1]{c,m}\textbf{Vetor de Ataque (AV)} & \SetCell[r=1]{c,m}
Contexto pela qual a exploração da vulnerabilidade é possível - o valor será tanto maior quanto mais um atacante conseguir atingir (lógica e fisicamente) para explorar o componente vulnerável. \\ \cline{3-4}
& & \HH\textbf{Complexidade do Ataque (AC)} & Condições para além do controlo do atacante que devem existir para explorar a vulnerabilidade. \\ \cline{3-4}
& & \textbf{Privilégios Necessários (PR)} & Nível de privilégios que um atacante deve possuir antes de explorar a vulnerabilidade com sucesso – quantos menos privilégios forem necessários, mais alta será a pontuação. \\ \cline{3-4}
& & \textbf{Interação do Utilizador (UI)} & Captura a necessidade de um utilizador humano, diferente do atacante, participar no comprometimento bem-sucedido do componente vulnerável \\ \cline{2-4}
& \textbf{Contexto (S)} & \Mdois {Captura se uma vulnerabilidade num componente vulnerável afeta os recursos noutros componentes para além do seu contexto de segurança.} \\ \cline{2-4}
& \SetCell[r=4]{c}{\textbf{Métricas de Impacto}} & \Mdois {Capturam os efeitos de uma vulnerabilidade explorada com sucesso no componente que sofre o pior resultado que está mais direta e previsivelmente associado ao ataque.} \\ \cline{3-4}
& & \HH\textbf{Confidencialidade (C)} & Mede o impacto sobre a confidencialidade dos recursos de informação geridos por um componente de software graças a uma vulnerabilidade explorada com sucesso – a classificação é maior quando a perda para o componente afetado é maior. \\ \cline{3-4}
& & \textbf{Integridade (I)} & Mede o impacto na integridade de uma vulnerabilidade explorada com sucesso – a classificação é maior quando a consequência para o componente afetado é maior. \\ \cline{3-4}
& & \HH\textbf{Disponibilidade (A)} & Mede o impacto na disponibilidade do componente afetado resultante de uma vulnerabilidade explorada com sucesso. Como a disponibilidade se refere à acessibilidade dos recursos de informação, os ataques que consomem largura de banda da rede, ciclos do processador ou espaço em disco impactam a disponibilidade do componente afetado – a classificação é maior quando a consequência para o componente impactado é maior. \\ \hline
\SetCell[r=4]{c}{\textbf{Métricas Temporais}} & \Mtres {As métricas temporais medem o estado atual das técnicas de exploração ou disponibilidade de código, a existência de \textit{patches} ou uma solução de recurso (comumente designado em inglês por \textit{workaround}), ou a certeza da descrição da vulnerabilidade} \\ \cline{2-4}
& \textbf{Exploração da Maturidade do Código (E)} & \Mdois {Mede a probabilidade de a vulnerabilidade ser atacada e é normalmente baseada no estado atual das técnicas de exploração, disponibilidade de código de exploração ou exploração ativa \textit{in-the-wild}.} \\ \cline{2-4}
& \textbf{Nível de Remediação (RL)} & \Mdois {É um fator importante para a priorização - quanto menos oficial EXPLICAR O QUE É OFICIAL?! e permanente for uma correção, maior será a pontuação de uma vulnerabilidade.} \\ \cline{2-4}
& \textbf{Confiança no Reporte (RC)} & \Mdois {Mede o grau de confiança na existência da vulnerabilidade e na credibilidade dos detalhes técnicos conhecidos - quanto mais uma vulnerabilidade é validada pelo fornecedor ou outras fontes confiáveis, maior será a pontuação.} \\ \hline
\SetCell[r=3]{c}{\textbf{Métricas Ambientais}} & \Mtres {As métricas ambientais permitem a personalização da pontuação do CVSS, dependendo da importância do ativo de TI afetado para a organização de um utilizador, medida em termos de controlos de segurança complementares e/ou alternativos implementados, confidencialidade, integridade e disponibilidade.} \\ \cline{2-4}
& \textbf{Requisitos de Segurança (CR, IR, AR)} & \Mdois {Permitem que o analista personalize a pontuação CVSS, dependendo da importância do ativo de TI afetado para a organização de um utilizador, medida em termos de confidencialidade, integridade ou disponibilidade - cada requisito de segurança tem três valores possíveis: Baixo, Médio ou Alto.} \\ \cline{2-4}
& \textbf{Métricas de Base Modificadas} & \Mdois {Permitem que o analista substitua as métricas básicas individuais com base em características específicas do ambiente de um utilizador; delas fazem parte a Modified Attack Vector (MAV), Modified Attack Complexity (MAC), Modified Privileges Required (MPR), Modified User Interaction (MUI), Modified Scope (MS), Modified Confidentiality (MC), Modified Integrity (MI) ou Modified Availability (MA).} \\
\bottomrule
\end{tblr}
\end{table*}
\end{document}

LaTeX URL not formatted [duplicate]

I'm trying to break a long URL using LaTeX.
I have three links; the first one, which contains hyphens, doesn't work, but the other two work because they don't contain hyphens.
I used \url{the_url_to_break} like this:
\hline
\textbf{Documentation} & Riche et peut être téléchargée gratuitement sur \url{https://www.ssi.gouv.fr/guide/ebios-2010-expression-des-besoins-et-identification-des-objectifs-de-securite/}
& Riche et peut être téléchargée gratuitement sur \url{https://clusif.fr/management_des_risques/}
& Catalogue de pratiques de sécurité et d’autres documents peuvent être téléchargés gratuitement sur \url{https://resources.sei.cmu.edu/library/asset-view.cfm?assetID=309051}\\
This is the output (screenshot omitted): the first URL does not break and overflows its column.
Can you help me, please?
The xurl package will add more possible break points:
\documentclass{article}
\usepackage{url}
\usepackage{tabularx}
\usepackage{xurl}
\begin{document}
\begin{tabularx}{\linewidth}{XXXX}
\hline
\textbf{Documentation} & Riche et peut être téléchargée gratuitement sur \url{https://www.ssi.gouv.fr/guide/ebios-2010-expression-des-besoins-et-identification-des-objectifs-de-securite/}
& Riche et peut être téléchargée gratuitement sur \url{https://clusif.fr/management_des_risques/}
& Catalogue de pratiques de sécurité et d’autres documents peuvent être téléchargés gratuitement sur \url{https://resources.sei.cmu.edu/library/asset-view.cfm?assetID=309051}\\
\end{tabularx}
\end{document}
For such relatively narrow columns, it might look better to left-align the text instead of justifying it:
\documentclass{article}
\usepackage{url}
\usepackage{array}
\usepackage{tabularx}
\usepackage{xurl}
\newcolumntype{Y}{>{\raggedright\arraybackslash}X}
\begin{document}
\begin{tabularx}{\linewidth}{YYYY}
\hline
\textbf{Documentation} & Riche et peut être téléchargée gratuitement sur \url{https://www.ssi.gouv.fr/guide/ebios-2010-expression-des-besoins-et-identification-des-objectifs-de-securite/}
& Riche et peut être téléchargée gratuitement sur \url{https://clusif.fr/management_des_risques/}
& Catalogue de pratiques de sécurité et d’autres documents peuvent être téléchargés gratuitement sur \url{https://resources.sei.cmu.edu/library/asset-view.cfm?assetID=309051}\\
\end{tabularx}
\end{document}

randomForestSRC error and VIMP

I am trying to perform a random forest survival analysis following the randomForestSRC vignette in R. I have a data frame containing 59 variables, 14 of which are numeric and the rest factors. Two of the numeric ones are TIME (days until death) and DIED (0/1, dead or not). I'm running into two problems:
trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ .,
                    data = train, nsplit = 10, na.action = "na.impute")
works fine, and printing trainrfsrc gives: Error rate: 17.07%. However, exploring the error rate with:
plot(gg_error(trainrfsrc))+ coord_cartesian(y = c(.09,.31))
returns:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
or:
a <- gg_error(trainrfsrc)
a
   error ntree
1     NA     1
2     NA     2
3     NA     3
4     NA     4
5     NA     5
6     NA     6
7     NA     7
8     NA     8
9     NA     9
10    NA    10
It is NA for all 1000 trees. How come there's no error rate for each number of trees tried?
The second problem is when trying to explore the most important variables using VIMP, such as:
plot(gg_vimp(trainrfsrc)) + theme(legend.position = c(.8,.2))+ labs(fill = "VIMP > 0")
it returns:
In gg_vimp.rfsrc(trainrfsrc) : rfsrc object does not contain VIMP information. Calculating...
Any ideas? Thanks
Setting err.block = 1 (or some integer between 1 and ntree) should fix the problem of the error being returned as NA. You can check the help file for rfsrc to read more about err.block.
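For example, a minimal sketch against the asker's original call (err.block is documented on the rfsrc help page; adding importance = TRUE for the VIMP issue is my assumption from that same page, not something tested on this data):

library(randomForestSRC)
library(ggRandomForests)

## err.block = 1 records the cumulative error rate on every tree, so
## gg_error() gets one value per tree instead of NA.
## importance = TRUE stores VIMP in the fitted object up front, which
## should avoid gg_vimp()'s "Calculating..." recomputation (assumption).
trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ ., data = train,
                    nsplit = 10, na.action = "na.impute",
                    err.block = 1, importance = TRUE)
plot(gg_error(trainrfsrc))  # per-tree error curve is now available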

How do I correct a bad sentence alignment (parallel corpus)?

I have a parallel corpus (pt-en), but it has some alignment mistakes, usually because Portuguese sentences are longer, and when a translation is done by humans (the gold standard) they cut the sentences in half for better understanding. This doesn't happen very often, but I need a perfect alignment to train my system.
Example of the correct alignment: (1 - 1); (2, 3 - 2); (4 - 3); (5 - 4), i.e., English sentences 2 and 3 together correspond to Portuguese sentence 2. (A length-based realignment sketch follows the example below.)
EN
1 - Once the individuals at high risk for events are identified, we propose that they be treated according to the current prevention guidelines, mainly in regard to the use of statins and acetylsalicylic acid2.
2 - We suggest that further studies approach the determination of coronary artery calcium scores in women, assess a greater number of males at more advanced ages and of different social classes than that of the population studied (middle-class individuals).
3 - We also suggest that the coronary calcium scores reported for Brazilian populations should be compared with those of populations in the USA, because these are the patterns available in the literature.
4 - If differences are found, a prospective study about the value of coronary artery calcification in our population should be carried out.
5 - Our study is a starting point.
PT
1 - Uma vez identificados os indivíduos sob risco elevado de eventos, propomos que estes sejam tratados conforme as diretrizes atuais de prevenção, principalmente, no que se refere ao uso de estatinas e ácido acetilsalicílico2.
2 - Para seguimento de nosso trabalho, tornam-se necessários determinações em mulheres, número maior de homens em idades mais avançadas e de classes sociais diferentes da população estudada (indivíduos de classe média), comparação dos escores de cálcio coronariano descritos em populações brasileiras aos dos EUA, já que esses são os padrões disponíveis na literatura.
3 - Caso haja diferenças, seria importante um estudo prospectivo do valor da calcificação arterial coronariana em nossa população.
4 - Nosso estudo é um início para esse processo.
Thanks in advance!
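One common way to repair such 1-1/2-1 mismatches automatically is a length-based dynamic-programming aligner in the spirit of Gale & Church (1993). Below is a minimal Python sketch of the idea, not the original algorithm: the align function and the ratio default are made up for illustration, the cost is a crude character-length ratio rather than the published probabilistic model, and only 1-1, 2-1 and 1-2 beads are considered.

def align(src, tgt, ratio=1.1):
    """Toy length-based sentence aligner; src/tgt are lists of sentences."""
    def cost(s_chunk, t_chunk):
        # Penalize beads whose target length strays from ratio * source length.
        s_len = sum(len(s) for s in s_chunk)
        t_len = sum(len(t) for t in t_chunk)
        return abs(t_len - ratio * s_len) / (s_len + t_len)

    INF = float("inf")
    n, m = len(src), len(tgt)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in ((1, 1), (2, 1), (1, 2)):   # allowed bead shapes
                if i + di <= n and j + dj <= m:
                    c = best[i][j] + cost(src[i:i + di], tgt[j:j + dj])
                    if c < best[i + di][j + dj]:
                        best[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    assert best[n][m] < INF, "no alignment with these bead shapes"
    beads, i, j = [], n, m
    while i > 0 or j > 0:                             # recover beads, 1-based
        di, dj = back[i][j]
        beads.append((list(range(i - di + 1, i + 1)),
                      list(range(j - dj + 1, j + 1))))
        i, j = i - di, j - dj
    return beads[::-1]

On the five English and four Portuguese sentences above, this should recover the (2, 3 - 2) bead, since PT sentence 2 is roughly as long as EN sentences 2 and 3 combined. In practice, a dedicated tool such as hunalign or vecalign will do much better than this toy.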

How to decode these characters? á é í

I'm querying the MediaWiki API to get Wikipedia data into my FileMaker database. When I load the data in a browser, the characters show up properly, but when it comes into FileMaker, characters with diacritics get converted to these odd characters: á is converted to √° (square root symbol + degree symbol), é is converted to √© (square root symbol + copyright symbol), í is converted to √≠ (square root symbol + not-equals symbol), and more. What character encoding is that? Thank you!!
As @Joni suggests in his comment, this is UTF-8 misinterpreted as MacRoman. The letter á is C3 A1 (hex) in UTF-8, and in MacRoman C3 is "√" and A1 is "°". So you should just try to set the program to interpret the data as UTF-8.
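You can reproduce the mojibake in two lines of Python (a quick sketch; mac_roman is Python's name for the MacRoman codec):

'á'.encode('utf-8').decode('mac_roman')    # -> '√°'  (the garbling)
'√°'.encode('mac_roman').decode('utf-8')   # -> 'á'   (the repair)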
I'm sure this isn't the full list, but it did what I needed. Here is a lookup for the codes (garbled form, correct character, closest plain-ASCII equivalent); a snippet that generates the full √ list follows the table:
√© é e
√° á a
√≠ í i
√≥ ó o
√∂ ö o
√º ü u
√¥ ô o
√® è e
√ß ç c
√± ñ n
√∏ ø o
√´ ë e
√§ ä a
√• å a
√Å Á A
√∫ ú u
√ª û u
√Ø ï i
√â É E
√† à a
√¶ æ ae
√Æ î i
√¢ â a
√£ ã a
√î Ô O
√ü ß ss
√ì Ó O
√≤ ò o
√Ω ý y
√ñ Ö O
√™ ê e
√Ä À A
√ò Ø O
√Ö Å A
√∞ ð eth
√á Ç C
√Ç Â A
√π ù u
√í Ò O
√¨ ì i
√ú Ü U
√à È E
√û Þ Th
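Rather than extending the list by hand: every Latin-1 character in U+00C0-U+00FF encodes in UTF-8 as C3 xx, so the complete "√" table can be generated with a few lines of Python (a sketch, not part of the original answer):

# Decode each character's UTF-8 bytes as MacRoman to get its garbled form.
for cp in range(0xC0, 0x100):
    ch = chr(cp)                                      # e.g. 'á'
    garbled = ch.encode('utf-8').decode('mac_roman')  # e.g. '√°'
    print(garbled, ch)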
You're all correct about the misinterpreted characters: the Troi URL FMP plug-in I was using to set FMP's user agent (as the MediaWiki API requires) was responsible for pulling in the garbled characters. The solution was to bypass the plug-in: the FMP script performs an AppleScript "do shell script" that runs curl -A to set the user agent and query the API, pulls the response back into FMP, and all characters come through properly!
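For reference, the AppleScript step looks roughly like this (the user-agent string and the query are placeholders, not the exact ones I used):

do shell script "curl -A 'MyFMPApp/1.0 (me@example.com)' 'https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Example'"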
