Is there a way to test the performance of AutoML Translation without having to create a glossary of <= 100 pairs of source/target words? - translation

I need to get a better translation from AR to EN than what I'm getting from Google Translate. I've been looking into AutoML Translation recently for this purpose, and created a test file of 3 entries (2 x 3 that is - 3 pairs of AR (source) and EN (target) words). In the documentation it talks about the test and validation phases requiring <= 100 pairs of words. As I'm not yet sure AutoML Translation will fit the bill, I wanted to know if there was a way to test its performance without going to all the trouble of creating a model with 100 pairs of words. Any other tips for creating a simple NodeJS file to use to translate a .txt or .html document living in my GCP bucket (or wherever else such a file should be living to enable the system to translate it) would be appreciated, too.

Related

How can I get predictions from these pretrained models?

I've been trying to generate human pose estimations, I came across many pretrained models (ex. Pose2Seg, deep-high-resolution-net ), however these models only include scripts for training and testing, this seems to be the norm in code written to implement models from research papers ,in deep-high-resolution-net I have tried to write a script to load the pretrained model and feed it my images, but the output I got was a bunch of tensors and I have no idea how to convert them to the .json annotations that I need.
total newbie here, sorry for my poor English in advance, ANY tips are appreciated.
I would include my script but its over 100 lines.
PS: is it polite to contact the authors and ask them if they can help?
because it seems a little distasteful.
Im not doing skeleton detection research, but your problem seems to be general.
(1) I dont think other people should teaching you from begining on how to load data and run their code from begining.
(2) For running other peoples code, just modify their test script which is provided e.g
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/master/tools/test.py
They already helps you loaded the model
model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
cfg, is_train=False
)
if cfg.TEST.MODEL_FILE:
logger.info('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
else:
model_state_file = os.path.join(
final_output_dir, 'final_state.pth'
)
logger.info('=> loading model from {}'.format(model_state_file))
model.load_state_dict(torch.load(model_state_file))
model = torch.nn.DataParallel(model, device_ids=cfg.GPUS).cuda()
Just call
# evaluate on Variable x with testing data
y = model(x)
# access Variable's tensor, copy back to CPU, convert to numpy
arr = y.data.cpu().numpy()
# write CSV
np.savetxt('output.csv', arr)
You should be able to open it in excel
(3) "convert them to the .json annotations that I need".
That's the problem nobody can help. We don't know what format you want. For their format, it can be obtained either by their paper. Or looking at their training data by
X, y = torch.load('some_training_set_with_labels.pt')
By correlating the x and y. Then you should have a pretty good idea.

How to use external data in an OSRM profile

It this Mapbox blog post, Lauren Budorick shares how they got working a routing engine with OSRM that uses elevation data in order to give cyclists better routes... AMAZING!
I also want to explore the potential of OSRM's routing when plugging in external (user-generated) data, but I'm still having a hard time grasping how OSRM's profiles work. I think I get the main idea, that every way (or node?) is piped into a few functions that, all toghether, scores how good that path is.
But that's it, there are plenty of missing parts in my head, like what do each of the functions Lauren uses in her profile do. If anyone could point me to some more detailed information on how all of this works, you'd make my next week much, much easier :)
Also, in Lauren's post, inside source_function she loads a ./srtm_bayarea.asc file. What does that .asc file looks like? How would one generate a file like that one from, let's say, data stored in a pgsql database? Can we use some other format, like GeoJSON?
Then, when in segment_function she uses things like source.lon and target.lat, are those refered to the raw data stored in the asc file? Or is that file processed into some standard that maps everything to comply it?
As you can see, I'm a complete newbie on routing and maybe GIS in general, but I'd love to learn more about this standards and tools that circle around the OSRM ecosystem. Can you share some tips with me?
I think I get the main idea, that every way (or node?) is piped into a few functions that, all toghether, scores how good that path is.
Right, every way and every node are scored as they are read from an OSM dump to determine passability of a node and speed of a way (used as the scoring heuristic).
A basic description of the data format can be found here. As it reads, data immediately available in ArcInfo ASCII grids includes SRTM data. Currently plaintext ASCII grids are the only supported format. There are several great Python tools for GIS developers that may help in converting other data types to ASCII grids - check out rasterio, for example. Here's an example of a really simple python script to convert NED IMGs to ASCII grids:
import sys
import rasterio as rio
import numpy as np
args = sys.argv[1:]
with rio.drivers():
with rio.open(args[0]) as src:
elev = src.read()[0]
profile = src.profile
def shortify(x):
if x == profile['nodata']:
return -9999
elif x == np.finfo(x).tiny:
return 0
else:
return int(round(x))
out_elev = [map(shortify, row) for row in elev]
with open(args[0] + '.asc', 'a') as dst:
np.savetxt(dst, np.array(out_elev),fmt="%s",delimiter=" ")
source.lon and target.lat e.g: source and target are nodes provided as arguments by the extraction process. Their coordinates are used to look up data at each location during extraction.
Make sure to read thoroughly through the relevant wiki page (already linked).
Feel free alternately to open a Github issue in
https://github.com/Project-OSRM/osrm-backend/issues with OSRM
questions.

Finding features for classifying document into printable or non-printable

I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to take some unsupervised methods to help limit this by determining whether a document is with high probability not club related (e.g. Biophysics paper, there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned the non-printable/club material category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar-minded option would be to collect documents that are tied to courses / school work, e.g. course syllabi, available course documents online (homeworks, papers, etc.) and do feature extraction / selection on these. The assumption is that students would be abusing the printing to generally print material relevant to their studies.
While for .pdf and .txt based document this approach should have reasonable performance, I am at a loss at how to classify image based documents, besides perhaps using the title of the document and other meta data. A clever violator could simply convert all their text documents to image format to circumvent this system. However that is outside the scope of this question and should be saved for a future question / research. For now the scope is just text based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment but it ended way longer than what I imagined.
While this is an interesting problem I'm not sure ML will get you what you need easily.
Firstly your classification problem is of the type A vs the World and A isn't strictly defined. Unless you know exactly what kind of stuff the clubs print you can't really say that new material belong or no to that class.
This will prove particularly difficult when you will need to assemble a large enough training set to be able to cover whatever can or cannot be printed. Such task will be extremely tedious, and as you said you won't have access to what the clubs usually print out so at best you will have a large class imbalance in your training set.
As the goal is to make the system automated (I mean if there is human interaction anyway, it's faster to check what will be printed than to make a ML algorithm that will provide a score that a human will have to investigate anyway) the number of false positives and false negatives will also be problematic. There will be cases where the clubs won't be able to print things they have the right to.
As you said you could simplify greatly the problem by classifying Course Material and Not Course Material. For that I will look towards BoW because some words are more present than others in papers or course material (everything remotely technical). The number of words as well as the overall size of the file seem like sensible things to extract. The structure is often also particular : it might be a good idea to extract such things : "number of lines with less than x words", "number of lines per page", "number of pictures" (if that's something you can extract from the file), ...
For pictures the major thing to check would be if this a scan of something (often they will scan and print course related things I guess), for that the format of the image is already a good indication but I don't see other things that would be particularly "course related".
So for me, if you can't really define precisely one of your two classes don't go with classification or reduce the problem to something you can really define (course related things).
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a several layers rejection mechanism.
I would suggest these 3 levels:
compare the md5 of the file they want to print with a database of all the md5 of the black-listed documents.
if the 1) is passed, compare repeat 1) but at a page level, rather than at document level (perhaps they want to print just few pages rather than the entire document).
if 2) is passed you can compare the page they want to print with the pages of the black-listed documents document using an image similarity method, like SSIM. if you get a high score between the page they want print and one of the black-listed items do not print, and update your md5 database accordingly.
if 3) is passed: print!
A few words about SSIM: this method is quite robust to noise, so even a smart student who added some sort of niose to the image will be caught
However:
you have to find a proper way to extract a region of interest (ROI) from the page and the db of documents (if the two ROIs are in two different area of the page, SSIM will be negative)
SSIM might be slow! definitely a C implementation is needed here.
I think SSIM is not rotational invariant, hence the check will fail if they print the page upside down (unless you have a smart way to rotate the page).

How to tag text based on its category using OpenNLP?

I want to tag text based on the category it belongs to ...
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How to do this using openNLP or other NLP engines.
MY WORKS
I tried NER model , but It needs large number of training corpus which I don't have ?
My Need
Do any ready made training corpus available for NER or classification (it must contains scientific and engineering words).. ?
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib. With Doccat you would get a prob distribution for each chunk of text.
with doccat your sample would produce something like this:
"Clutch and gear is monitored using microchip " -> mechanical 0.85847568, electronic 0.374658
with doocat you will lose the keyword->classlabel mapping, so if you really need it doccat might not cut it.
as for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite the creation of NER model building. You can create a file/list of as many of the terms for each category as you can think of, then create a file of a bunch of sentences, then use the addon to create an NER model using the seed terms and the file of sentences. see this post where I described it before with code example. You will have to pull down the addon from SVN.
OpenNLP: foreign names does not get recognized

How does "DHT search engine" work?

I'm interested in the Btdigg.org which is called a "DHT search engine". According to this article, it doesn't store any content and even has no database. Then how does it work? Doesn't it need to gather meta infos and store them in database like other normal search engines? After a user submit a query, it scans the DHT network and return the results in "real time"? Is this possible?
I don't have specific insight into BTDigg, but I believe the claim that there is not database (or something that acts like a database) is a false statement. The author of that article might have been referring to something more specific that you might encounter in a traditional torrent site, where actual .torrent files are stored for instance.
This is how a BTDigg-like site works:
You run a bunch of DHT nodes, specifically with the purpose of "eaves dropping" on DHT traffic, to be introduced to info-hashes that people talk about.
join those swarms and download the metadata (.torrent file) by using the ut_metadata extension
index the information you find in there, map it to the info-hash
Provide a front-end for that index
If you want to luxury it up a bit you can also periodically scrape the info-hashes you know about to gather stats over time and maybe also figure out when swarms die out and should be removed from the index.
So, the claim that you don't store .torrent files nor any content is true.
It is not realistic to search the DHT in real-time, because the DHT is not organized around keyword searches, you need to build and maintain the index continuously, "in the background".
EDIT:
Since this answer, an optimization (BEP 51) has been implemented in some DHT clients that lets you query which info-hashes they are hosting, significantly reducing the cost of indexing.
For a deep understanding of DHT and its applications, see Scott Wolchok's paper and presentation "Crawling BitTorrent DHTs for Fun and Profit". He presents the autonomous search engine idea as a sidenote to his study of DHT's security:
PDF of his paper:
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
His presentation at DEFCON 18 (parts 1 & 2)
http://www.youtube.com/watch?v=v4Q_F4XmNEc
http://www.youtube.com/watch?v=mO3DfLtKPGs
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
The method used in Section 3 seems to suggest a database to store all the torrent data is required. While performance is better, it may not be a true DHT search engine.
Section 8, while less efficient, seems to be a DHT search engine as long as the keywords are the store values.
From Section 3, Bootstrapping Bittorent Search:
"The system handles user queries by treating the
concatenation of each torrent's filenames and description as a
document in the typical information retrieval model and using an
inverted index to match keywords to torrents. This has the advantage
of being well supported by popular open-source relational DBMSs. We
rank the search results according to the popularity of the torrent,
which we can infer from the number of peers listed in the DHT"
From Section 8, Related Work:
the usual approach to distributing search using a DHT is
with an inverted index, by storing each (keyword, list of matching
documents) pair as a key-value pair in the DHT. Joung et al. [17]
describe this approach and point out its performance problems: the
Zipf distribution of keywords among files results in very skewed load
balance, document information is replicated once for each keyword in
the document, and it is difficult to rank documents in a distributed
environment
It is divided into two steps.
To achieve bep_0005 protocol got infohash, you do not need to implement all protocol requires only now find_node (request), get_peers (response), announce_peer (response). Here's one of my open source dhtspider.
To achieve bep_0009 protocol got metainfo index it, here are my own a bittorrent search engine, every day can get unique infohash 300w +, effective metainfo 50w +.

Resources