Human annotation tool for corpora in NLP [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I am trying to build my own training corpus for Named Entity Recognition, but I don't know if there is already an existing tool for this or if I have to implement one myself.
Basically, what I need to do is take a corpus and manually tag it word by word, which is pretty tedious, but it has to be done.
Can anyone tell me if there is already an existing one and where to get it?
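For context, word-by-word NER tags are commonly stored as one BIO tag per token (B- opens an entity, I- continues it, O is outside any entity). Below is a minimal sketch of converting span annotations into that form. The tokens, span indices, and the PER/LOC labels are made up for illustration, and the function name is hypothetical rather than part of any tool mentioned here.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start_token, end_token_exclusive, label) spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Illustrative data only: one sentence with two hand-marked entities.
tokens = ["Barack", "Obama", "visited", "Paris", "yesterday"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Most annotation tools export character or token spans, so a small conversion step like this is usually all that sits between the tool's output and a token-level training file.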

I had a good experience working with BRAT.
GATE is another option, but it is a very complex annotation tool with a steeper learning curve.
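If you go with BRAT, the annotations end up in standoff .ann files, where each text-bound entity is a tab-separated line of ID, type plus character offsets, and the covered text. A rough sketch of reading those lines back in follows. It only handles text-bound (T) annotations and skips everything else; the sample content and the helper names are invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    fragments: List[Tuple[int, ...]]  # one (start, end) pair per fragment
    text: str


def parse_text_bound(ann_source: str) -> List[TextBound]:
    """Collect the text-bound (T) annotations from standoff content."""
    entities = []
    for line in ann_source.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes, etc. are ignored in this sketch
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ';'.
        fragments = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        entities.append(TextBound(ann_id, label, fragments, text))
    return entities


# Made-up sample for the sentence "Barack Obama visited Paris".
sample = (
    "T1\tPerson 0 12\tBarack Obama\n"
    "T2\tLocation 21 26\tParis\n"
    "R1\tVisited Arg1:T1 Arg2:T2\n"
)
entities = parse_text_bound(sample)
for entity in entities:
    print(entity.ann_id, entity.label, entity.fragments, entity.text)
```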

We had a nice experience using DataTurks. They provide a nice, intuitive UI that lets you add collaborators, plus insights into the data, a leaderboard for annotators, and some other funky features.
https://dataturks.com

For online annotation of a text or HTML corpus of relatively short documents I also recommend BRAT. You will have to go under the hood of the Python web application if you want to do anything custom. It also failed to work for me on large HTML documents (100 or so pages).
I have also used stand-alone apps:
Protege + Knowtator: a bit cumbersome to set up and use, but it works;
Gate: also cumbersome, and it somewhat works. Back up your annotations at regular intervals, as you might get surprised by a stack trace that also wiped or corrupted your annotated corpus (which is just serialized Java objects).
If you are dealing with PDF documents, we built a web-based PDF annotation tool: NOTA. It accepts anything printed to PDF, including scans. We do commercial OCR on our end to recover text from images. There is a REST API to create color-coded annotation schemas and pre-populate documents with annotations, as well as a REST API for exporting formatted text and annotation offsets. There is also a JS API you can use to customize any annotation workflow, add metadata to annotations, etc. Relationships are not supported out of the box. Large documents (200+ pages) are supported. Email us and we can give you an API key to try it out. Details and documentation links can be found here. It is free for small research projects.

I co-develop the web-based text annotation tool tagtog.net.
There is nothing to install, and you can define the types of entities you want to annotate. Additionally, you can annotate relationships, document labels, and much more. You can upload your documents in many different formats, including PDF or Markdown. You can annotate collaboratively with your team. We have taken great care to make the interface easy and beautiful.
You can start right away with a free account. Also I would be happy to help you with any doubt or issue you may have; just ping me or write us an email to the address shown on the website, tagtog.net.

Our annotation tool Prodigy is very scriptable, and is designed for active learning. It integrates especially well with our NLP library spaCy.
We've paid particular attention to the Named Entity Recognition (NER) annotation workflows, as entity recognition can otherwise be very slow. I have a tutorial video on this:
https://www.youtube.com/watch?v=l4scwf8KeIA

There is a tool called Dataturks. It is a super simple, fully online NLP annotation tool, so I can even easily push my teammates to complete datasets for our projects.

Try TagEditor.
It is a desktop application designed to annotate text for training with the spaCy library.
You can tag named entities, dependencies, parts of speech, and text categories, and output a JSON file.

Related

Create an interactive map in iOS [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
I need to create an interactive map in iOS.
I have to build something like the Expo app:
an image or a map view in the background, with the route drawn between 3-4 points.
I don't know whether to use Google Maps, Apple Maps, images or something else.
Developing a spatially aware application is no trivial matter on any platform. It will require careful planning and architecture design UP FRONT or you'll find yourself doing a lot of "extreme programming" (tons of refactoring). In order to develop a spatially aware application you will need several items:
A familiarity with a map API: Apple's MapKit API is fine, but there are others such as Mapbox which offer additional services such as offline caching, custom basemaps, etc.
A custom basemap: The basemap you're seeing here is certainly a custom job and probably not open source, so you'll need to come up with one of your own. Unfortunately, every map API has a different approach to this, so you'll need to do some research to determine the right solution for your API.
Map features: You'll need to understand how to add features to your map. Some APIs call these Annotations, while others simply call them Features (like ESRI). In either case, you will need to generate your own feature geometry using the Core Location API and whatever components the map API utilizes. You will also need to create custom graphics for these annotations, unless you can find something suitable in the public domain. If you intend to add polylines (for directions) or polygons (to highlight an area) you will also need to define your own custom symbology (line color, width, fill colors, etc). Again, not every API uses the term symbology to describe these details, but hopefully you get the idea.
Data storage: You'll need to decide how you're going to store and retrieve data for the map view. You can store everything online in a custom web service. You could also use something like the Parse API if you don't have the resources for your own web service. Alternatively, you could store everything locally in a SQLite database or using Core Data. In either case, you will need to have a plan for querying the location data in an efficient manner. SQLite supports R*Tree indices, which are a good way to store a geometry's bounding box (envelope) information, but you still need to roll your own INSERT and SELECT queries. Most likely you'll need to come up with some combination of the two.
Learn the language: Overall, you absolutely must learn the language of the map APIs. It's vital that you are familiar with the language of spatially aware applications, including the fundamentals of location technology, if you intend to be successful in this project. I would suggest beginning to do some research into the iOS MapKit API, and maybe an open source solution like Mapbox. Learning GeoJSON isn't a bad idea even if you don't intend to use it in your app. It is very simple and could help you learn a lot about spatial technology very quickly.
As you can see, there's a LOT going on in a spatially aware application, and this list is just a starting point. I am not trying to dissuade you from your goal, but just be aware that this isn't a "drag and drop" sort of project.
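To make the GeoJSON and bounding-box points above a little more concrete, here is a small sketch in Python (used only because it is compact; it is not iOS code). It builds a GeoJSON LineString feature for a route through four waypoints and computes the envelope that a spatial index such as an R*Tree would store. The coordinates, the feature name, and the helper functions are all invented for illustration.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical waypoints as (longitude, latitude); GeoJSON orders coordinates lon, lat.
waypoints: List[Tuple[float, float]] = [
    (9.0953, 45.5200),
    (9.0970, 45.5212),
    (9.0991, 45.5205),
    (9.1010, 45.5221),
]


def route_feature(points: List[Tuple[float, float]], name: str) -> Dict:
    """Wrap the waypoints in a GeoJSON Feature with a LineString geometry."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [list(p) for p in points]},
        "properties": {"name": name},
    }


def bounding_box(points: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return (min_lon, min_lat, max_lon, max_lat), i.e. the feature's envelope."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))


feature = route_feature(waypoints, "sample route")
bbox = bounding_box(waypoints)

print(json.dumps(feature, indent=2))
print("envelope:", bbox)
```

Whatever map API you pick, the same two pieces tend to show up: an ordered list of coordinates that becomes the polyline, and an envelope you can query cheaply before touching the full geometry.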

PDF Parsing with Text and Coordinates [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 1 year ago.
I am currently using PDFBox to parse a PDF, and I am trying to figure out how to retrieve data about the text, such as the font (bold, size, etc.) and the location of the text on the page.
Any suggestions?
After poking around the (hard to find) PDFBox docs, I found this little gem.
Apparently one of the examples shows exactly how to do everything you asked. Basically, you subclass PDFTextStripper and override the processTextPosition method. There, you query the TextPosition for whatever information you need.
For future reference, you can find the Javadoc here: http://pdfbox.apache.org/apidocs/index.html
Edit 2018-04-02: original link is dead, but example can be found in the SVN repo here.
One of the best things for text extraction from PDFs is TET, the text extraction toolkit. TET is part of the PDFlib.com family of products.
PDFlib.com is Thomas Merz's (the author of the "PostScript and PDF Bible") company.
TET's first incarnation is a library. That one can probably do everything you want, including providing positional information about each text element on the page. Oh, and it can also extract images. It recombines and merges images which are fragmented into pieces.
pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. Obviously you'd need Acrobat as well to make use of this.
And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user workstations. Both of these are free (as in beer) to use for private, non-commercial purposes.
Lastly, TET also comes with a commandline interface.
TET is really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) only spit out garbage.
A few months ago I tested their desktop standalone tool, and what they say on their webpage is true. It has a very good commandline. The tool handled some of my "problematic" PDF test files to my full satisfaction.
This thing is my recommendation for any sophisticated and challenging PDF text extraction requirement.
TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...
Give it a try.
The GetPageText function with extract option 3 or 4 in Quick PDF Library returns a CSV string for the selected page which includes the text (either individual words or a piece of text) and the related font name, text color, text size and co-ordinates on the page.
Note: it is a commercial library and I work for the company that sells it.
PDF files can be parsed with tabula-py, or tabula-java.
I wrote a full tutorial on how to use tabula-py in this article. You can also use tabula in a web browser, as long as you have Java installed.

free visual editor for graph (dot) files [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Is there a free (as in "cheers"), linux-compatible, interactive visual editor for graphviz or other graphs? aptitude seems to be drawing a blank.
edit: "cheers" means both "beer" and "speech". meta-edit: I guess it should be "free as in beach".
edit 2: Maybe a suitable svg editor would be a more realistic goal. I basically want something that can be used to conveniently create a collection of labeled shapes and lines which connect them. Actually it would probably make more theoretical sense to extract the graph from this data, since it includes both semantic data (the graph) and presentation data (the way it's arranged on the screen, the colours used, etc). Is there a way to lay out labeled shapes conveniently with inkscape or some other free vector graphics editor? I really need rearranging of the nodes, and (re)flowing of the text in them, to happen with maximum convenience.
I've also realized that this is really a superuser question. I was going to repost it over there when I found an existing question that seems likely to provide me with an answer: dia.
edit 3: dia seems useful except that it doesn't seem to be possible to get the textual contents of node objects to wrap in any useful manner (ie any way other than by inserting manual line breaks). This is kind of a dealbreaker, since it screws most of the convenience factor that's my incentive to do things this way rather than with a text editor or a pen and paper. But it supports some sort of event model and Python-based scripting, so I'm going to dig around a bit and see if I can use python to wrap the text in response to content changes. Unless one of you lovely people has a better idea..? Basically I want to have the option to explicitly set the node size via GUI interaction, and have the contents wrap and rescale (within a certain range of font sizes) to fit it. Rich text would be pretty useful.
In other words, this is actually a valid SO question at this point, since it appears to require coding.
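As a rough illustration of the wrapping idea from edit 3, here is a sketch that does the wrapping outside any GUI: it reflows each node's text with Python's textwrap and writes the result as DOT labels, using the \n escape for line breaks. It does not touch dia's scripting API at all. The node ids, the sample text, and the function names are assumptions for the example, and node ids are assumed to already be valid DOT identifiers.

```python
import textwrap
from typing import Dict, List, Tuple


def dot_label(text: str, width: int = 20) -> str:
    """Wrap text to the given width and join the lines with a literal backslash-n."""
    lines = textwrap.wrap(text, width=width) or [""]
    escaped = [line.replace('"', '\\"') for line in lines]  # only quotes are escaped here
    return "\\n".join(escaped)


def to_dot(nodes: Dict[str, str], edges: List[Tuple[str, str]], width: int = 20) -> str:
    """Emit a small digraph with box-shaped nodes and wrapped labels."""
    out = ["digraph G {", "  node [shape=box];"]
    for node_id, text in nodes.items():
        out.append(f'  {node_id} [label="{dot_label(text, width)}"];')
    for src, dst in edges:
        out.append(f"  {src} -> {dst};")
    out.append("}")
    return "\n".join(out)


# Illustrative graph: two nodes, one edge.
nodes = {"a": "Collect the requirements from every stakeholder", "b": "Ship it"}
edges = [("a", "b")]
dot_source = to_dot(nodes, edges, width=20)
print(dot_source)
```

Changing the width argument and regenerating the file is a crude stand-in for resizing a node in a GUI, but it keeps the graph itself in plain text.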
To save time for those eager to try existing programs that handle DOT graphs:
dotty can display DOT graphs and with a little luck you can move its nodes with a mouse, nothing more, and you can easily segfault as a bonus (I tried the latest stable graphviz)
lefty is only a special-purpose language interpreter used by dotty, nothing to look at
KGraphEditor is an empty wishful project (a Qt window and a few buttons)
gvedit is not really a graph editor: it provides a simple text editor and you hit F5 to run a layout tool and open a picture; you can actually get more functionality from configuring your own favourite text editor
grappa is an abandoned Java applet, which I failed to run
interestingly, dia can export to DOT ("PyDia DOT Export"), but due to its buggy printing, you have to post-process the files to use them
graphedit can read in a DOT graph and you can move its nodes around and change their colors
Eclipse people started working on DOT support in GEF4, so it can display DOT graphs
GraphUI has a very interesting demonstration video, but beware: although it might seem that the graph is being created by clicking and dragging, in reality all editing happens through the keyboard, using shortcuts. On the plus side, contextual instructions are always available showing which shortcuts do what.
DotEditor claims to offer a tree editor and to let you modify node attributes, color, and shape with the mouse.
The graph editors mentioned in other answers, yEd (a Java application) and JointJS/Rappid (a JavaScript thing), apparently have nothing to do with DOT (tried both).
I believe there exists no working DOT-handling graph editor out there at all.
JointJS is a JavaScript graph editing library based on Backbone: http://www.jointjs.com/
The author also provides Rappid, an online graph editor which might suit your needs, though I don't know whether it can import dot files.

Music analysis software [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 years ago.
Greetings
I may have imagined this, but does anyone know if Last.fm previously used some form of open source project to analyse music and determine similar music?
As it's now moved to a paid version, I'd like to make something which can add known music to my playlist. (I hate scanning my computer for similar music manually.)
Failing that, does anyone know of any system that I could use to replace this? Ideally I'd like some form of API / source code that I can use to automate the whole process into batch jobs.
Thanks,
[edit]
Ideally I was looking for something more along the lines of content matching. I'm the type of person who just throws all my music into one unorganized location. Then being lazy I would ideally expect a playlist to be generated giving me a similar music type of playlist.
Last.fm uses http://www.audioscrobbler.net/ - it also provides access to its database via an API.
[/edit]
Music similarity is not an easy problem.
There are two general approaches to solving this problem.
Approach 1.
Throw data at the problem. This is the approach LastFM and Pandora take. It's basically one huge database which is maintained by either a community or a group of experts. Note that to use this approach you will need clean metadata or some kind of audio fingerprinting solution like MusicBrainz. Once you have the feature database you can use algorithms such as the Pearson correlation coefficient to find similar items.
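As a toy illustration of the Pearson step, here is a sketch that ranks tracks by how well their scores correlate with a seed track. The "feature database" is a made-up table in which the same five listeners score each track; real services obviously work with far larger and messier data, and the function names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant vectors; treated as "no correlation" here
    return cov / sqrt(var_x * var_y)


# Hypothetical database: each track is scored by the same five listeners.
ratings: Dict[str, List[float]] = {
    "track_a": [5, 4, 1, 1, 3],
    "track_b": [4, 5, 2, 1, 3],
    "track_c": [1, 2, 5, 4, 3],
}


def most_similar(seed: str, db: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank every other track by its correlation with the seed, best first."""
    others = [(name, pearson(db[seed], vec)) for name, vec in db.items() if name != seed]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("track_a", ratings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```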
Approach 2.
Throw algorithms at the problem. In particular, computer audition algorithms. This means you calculate vectors of various features a song contains and using neural nets and a variety of other techniques you find other songs with similar vectors. This approach has been used successfully for automatic genre classification and query by example.
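A very reduced sketch of the "similar vectors" idea: instead of neural nets, this uses plain cosine similarity for a query-by-example lookup. The three-number feature vectors are invented placeholders (assumed to be already normalised to the range 0 to 1); in practice they would come from an audio analysis step, and the track names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Placeholder features per track, e.g. (energy, brightness, sparseness).
features: Dict[str, List[float]] = {
    "ambient_1": [0.20, 0.10, 0.80],
    "ambient_2": [0.25, 0.15, 0.75],
    "metal_1": [0.90, 0.85, 0.10],
}


def query_by_example(seed: str, db: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k tracks whose vectors point in the most similar direction."""
    scored = [(name, cosine(db[seed], vec)) for name, vec in db.items() if name != seed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


neighbours = query_by_example("ambient_1", features)
for name, score in neighbours:
    print(f"{name}: {score:.3f}")
```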
If you are looking for open source software for music analysis, Marsyas can do pretty much everything the commercial stuff can do. It's the brainchild of George Tzanetakis, and on his web site you can find many papers about the state of affairs in computer audition.
There's a web API at The Echo Nest that includes a get_similar web service that allows you to retrieve similar artists to a set of seed artists. You can use this to help build playlists. The Echo Nest also has a set of web APIs that will perform a detailed analysis of a track (similar to the aforementioned Marsyas) that one could use as the basis for an acoustic-based song similarity method. (Caveat, I work at the Echo Nest). Of course, if you use iTunes, there are some canned solutions. iTunes now has a music recommender / playlist generator that will build playlists of songs from similar artists. Similarly, the company Mufin has an iTunes add-on which will perform acoustic analysis of your tracks and use this analysis to build playlists.
If you are interested in building your own music similarity system, I suggest that you take a look at the proceedings of ISMIR (the International Society for Music Information Retrieval). There's quite a bit of research around music similarity and playlisting that you'll find helpful. You can find the proceedings at ismir.net
Wouldn't it be simpler/more efficient to query (or build?) some internet database based on genre/style/etc? I used last.fm and similar sites but never felt they did anything more than this (at least the results weren't indicating that) ;)
I am not very sure what exactly you want, but how about MusicBrainz?
To be clear, AudioScrobbler is the tech built by Last.fm to run their service. They collect stats on the tracks which people listen to (also 'Like's of tracks and artists).
So Last.fm does social similarity... users who listened to X also listened to Y - you like X so maybe you will also like Y.
Given a large enough user base submitting stats, social similarity is likely to provide better results than computer analysis approaches. For example, try querying the Last.fm API for similar artists to someone you know - probably comes up with some good matches and a few obscure or oddball ones, which nonetheless reflect real people's listening habits. The more obscure the artist you search for the more likely you'll get weird matches.
Even if you could get the automatic genre classification method described by George Tzanetakis to work well, you would be missing out on the subjective judgements of quality supplied by real people. E.g. two tracks both look like 'Jazz', but there are many different kinds of Jazz... and I might be interested in non-Jazz albums that a favourite jazz musician has played on. Social similarity would be more likely to capture that info.
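The "users who listened to X also listened to Y" idea can be sketched with nothing more than co-occurrence counts. This is only a toy model under made-up data, not how Last.fm actually computes similarity; the listening histories and function names below are invented.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical listening histories: user -> set of artists they played.
histories: Dict[str, Set[str]] = {
    "user_1": {"Miles Davis", "John Coltrane", "Herbie Hancock"},
    "user_2": {"Miles Davis", "John Coltrane"},
    "user_3": {"Miles Davis", "Radiohead"},
    "user_4": {"Radiohead", "Portishead"},
}


def co_listens(data: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """For each artist, count how many users also listened to each other artist."""
    counts: Dict[str, Counter] = {}
    for artists in data.values():
        for a, b in permutations(sorted(artists), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_listened_to(artist: str, counts: Dict[str, Counter], k: int = 3) -> List[Tuple[str, int]]:
    """Return up to k artists most often played by the same users."""
    return counts.get(artist, Counter()).most_common(k)


counts = co_listens(histories)
suggestions = also_listened_to("Miles Davis", counts)
print(suggestions)
```

With so few users the odd match (Radiohead next to Miles Davis) shows up immediately, which is the same effect described above for obscure artists.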
I used to use Predixis Magic Mixer. It performs a brief analysis of the audio in a file, produces a "fingerprint" and compares it to fingerprints in a central database. If the track is listed, it sets an identification code, which is the result of the analysis of the entire file, in the client copy. If not, it does a full analysis on the client computer (which takes a while), uploads that to the central database and keeps the local copy as well. From that information it can set up a playlist that relates tunes to one another depending on the actual sounds. I have not used it for a few years, so I don't know if the central database servers are still in operation, but a web search says no. It should still work, but every file will require full analysis.

What do you use to capture webpages, diagram/pictures and code snippets for later reference? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
What do you use to capture webpages, diagram/pictures and code snippets for later reference?
Evernote http://www.evernote.com and delicious http://www.delicious.com
Evernote
Notepad2's clipboard feature (Notepad2.exe /c as a link in Launchy)
Windows Clippings or PrintKey
Firefox extension Page Saver
Delicious
Microsoft OneNote.
I just have an emacs instance running on my home machine, under screen. Wherever I am (and have network) I can connect to it remotely. I stick all useful URLs, birthday present ideas, future dates, code snippets, ideas for docs, etc. in there.
I rarely have doodles/diagrams I need to capture, I tend to draw them in ascii in my file if needed.
I must admit I'm a bit stuck if I have no network/wifi somewhere, but that's rarely the case.
I find Google Notebook very good for drive-by code snippeting, and Google Bookmarks, especially when used with the Google Toolbar, for web pages.
The benefit of these tools is that they are available from any PC on the web, though good semantic organisation using labels is recommended.
Here's my response to a similar question:
The combination of OneNote with a tablet PC is awesome! I was a bit of a skeptic at first. I used the trial version and then forgot about it. A year later I had an unruly collection of files, project related emails, notebooks and scraps of paper all scattered throughout my life. I went back to OneNote and all my problems went away. Some highlights:
Everything is searchable. The character recognition is good enough that my chicken-scratch meeting notes can be searched. Text within images is searchable.
OneNote syncs with Outlook so finding meeting notes is a breeze.
I now embed all files into OneNote - pdfs, spreadsheets, word docs, images, web clippings.
OneNote is constantly saving all changes so, combined with a scheduled automated backup, everything is in one place and is safe.
There are some built-in collaboration tools I have yet to try but that look useful.
It is SO worth the price. It allows you to get started on a project and avoid all that time spent deciding how to organize things.
Zotero is a nice plugin for Firefox.
SnagIt captures everything you could want, and lets you annotate it.
I prefer to use the good old url for delicious
Apart from that, I use the Scrapbook extension in Firefox when I want to save something to disk. It's possible to tag the page, edit it and remove those stupid ads before saving it.
I also have a Wiki on a stick that I carry around on a USB key for code snippets that should go to other clients when I'm travelling around.
Mostly, my code snippets are embedded into projects I carry on the same USB key, which allows me to demonstrate some technologies right off to the client and get his advice based on a demonstration, not a listing of code...
For screenshots, I use a mix of ScrapBook and ScreenGrab. They are both Firefox plugins that are pretty amazing when you need to get a screenshot of a page for editing. Works great for consulting.
https://addons.mozilla.org/en-US/firefox/addon/427
https://addons.mozilla.org/en-US/firefox/addon/1146
Delicious Bookmarks extension for Firefox
It's a little primitive, but I've been using TiddlyWiki (a self-contained, single-file wiki) http://www.tiddlywiki.com/ which works well for basic text and markup. I combine it with a plugin to sync it with Outlook's notes (http://syncoutlooknotes.tiddlyspot.com/#SyncOutlookNotes) so that I can then sync it to my BlackBerry using the standard Outlook-BlackBerry sync mechanism. This has the significant advantage that I can look at my notes and even write new notes when I'm out and about, away from my laptop, or just don't feel like lugging the laptop around to a meeting that I don't really need it for.
I'd prefer using something more advanced like OneNote, but being able to take my notes with me on the little BlackBerry has turned out to be a significant advantage.
Google Notebook is a very convenient tool. You can clip and save any part of a web page without leaving your browser tab. The Notebook plug-in automatically saves the clippings as separate notes in your notebooks and keeps the links back to the original web pages. You can organize your clippings later by moving them between your notebooks and/or tagging them. Very good for code snippets and references.
