A data extraction - Need Ideas - parsing

Consider there are n rows of text similar to the ones below:
"Sony KDL46NX720 BRAVIA 46" 3D LED Backlit HDTV - 1080p, 1920 x 1080, 16:9, 120Hz, HDMI, USB, WiFi Ready » for $1148.99 at Tiger Direct"
"Samsung NV40 10.5 MP Digital Camera - Silver - 3x Zoom Lens » for $64.99 at eBay"
"Gateway NV57H27u 15.6" Notebook, Intel Core i3-2310M (2.10GHz), 4GB DDR3 Memory, 500GB HDD, DVD Super Multi-Drive, Windows 7 Home Premium 64-Bit (Pink) - LX.WZF02.002 » for $399.99 at Buy.com"
I would like to parse these strings and classify each of them as "TV, camera, laptop" etc.
The text attributes may or may not be similar.
How can this be comprehensively done?
What code/tools should I use?
What language?
I do not want to do a keyword search.
Can these strings be classified using class/attribute logic?
Can I use Protege to build the class/sub-class hierarchy?
I am totally new to this field of data-mining. So excuse my ignorance!
Thanks in advance.

Regular expressions can do the work, even in JavaScript.
EDIT:
var criteria = {
    camera : {
        identifier : /camera/i,            // case-insensitive, so "Digital Camera" matches
        resolution : /(\d+)\s*x\s*(\d+)/,  // e.g. "1920 x 1080"
        value : /\$(\d+)/,                 // "$" must be escaped inside a regex
        // ...
    },
    notebook : {
        identifier : /notebook/i,
        ram : /(\d+)GB\s*(DDR.)/           // note the backslash: \d, not d
        // ...
    }
    // ...
};
Then write a simple engine that uses this structure to analyze each line, as sketched below.
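A minimal sketch of such an engine (classify and its return shape are illustrative, not a finished design):

function classify(line, criteria) {
    for (var category in criteria) {
        if (!criteria[category].identifier.test(line)) continue;
        var attributes = {};
        for (var attr in criteria[category]) {
            if (attr === 'identifier') continue;
            var m = line.match(criteria[category][attr]);
            if (m) attributes[attr] = m.slice(1); // keep only the captured groups
        }
        return { category: category, attributes: attributes };
    }
    return null; // no identifier matched
}

// Usage: classify('Samsung NV40 10.5 MP Digital Camera ... for $64.99 at eBay', criteria)
// reports category 'camera' with value ['64'].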
EDIT 2:
This is not easy at all, because you have to feed some sort of knowledge database, but it is possible; you can feed it with pages like this one:
http://en.wikipedia.org/wiki/List_of_CPU_power_dissipation
But it is work for more than one person, or for more than one day, depending on how much intelligence you want your code to have.
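For instance, a tiny sketch of that kind of knowledge feed (the cpuModels list is hypothetical; in practice it would be scraped from a page like the one above):

// Hypothetical knowledge base: CPU model names harvested from a page
// like the Wikipedia list linked above.
var cpuModels = ['Core i3-2310M', 'Core i5-2410M' /* , ... */];

function findCpu(line) {
    for (var i = 0; i < cpuModels.length; i++) {
        if (line.indexOf(cpuModels[i]) !== -1) return cpuModels[i];
    }
    return null; // no known CPU in this line
}

// A hit like 'Core i3-2310M' in the Gateway line is strong evidence
// that the item is a notebook, independent of any keyword.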

Related

Can I get some 3D model file(s) from 3D ultrasound?

Does anybody know whether it is possible to get a model file from the doctor after a 3D ultrasound of a pregnant woman? I mean something like a DICOM (.dcm) file or an .stl file, or something similar that I can work with and finally print with a 3D printer.
Thanks a lot.
A quick search for "dicom 3d ultrasound sample" turned up one that you might be able to use for internal testing. You can get the file from here.
Hello,
The first problem you will face is the file format.
Because of the way the images are generated, 3D ultrasound data have voxels that are expressed in a spherical system. DICOM (as it stands now) only supports voxels in a Cartesian system.
So the manufacturers have a few choices:
They can save the data in a proprietary format (e.g., Kretzfile for GE, MVL for Samsung).
They can save the data in private tags inside a DICOM file (GE, Hitachi, Philips).
They can re-format the voxels to be Cartesian, but then the data has been transformed, and nobody likes that. In any case, since they also need to save the original (untransformed) data, the companies that do offer Cartesian voxels usually save them in the same way as the original, so they are not stored in normal DICOM tags but in their proprietary version.
This is why most of the standard software that can do 3D from CT or MR will not be able to cope with the data files.
The second problem is the noise. Ultrasound datasets are inherently very noisy! Again, standard 3D reconstruction software was designed for CT or MR and has problems with this.
I do have a product that will read most of the 3D ultrasound files and create an STL model directly from the datasets (spherical or Cartesian). It is called baby SliceO (http://www.tomovision.com/products/baby_sliceo.html)
Unfortunately, it is not free, but you can try it without a license. Give it a try and let me know if you like it...
Yves

Sobel edge detection filter not correct output: can it be because of some parameters

I am using http://shakithweblog.blogspot.kr/2012/12/getting-sobel-filter-application.html for the Zynq processor.
I am using his filter design in the PL part and running the HDMI test.
I am inputting this file
and my filtered output is coming like this:
I am trying to display 1920 * 1080 pixels.
Now let's assume it's difficult for you to see my exact design, or to download and check it, or that you are not familiar with the Zynq board at all. Is it still possible to guess why the filter output could look like this, and what I can try to make it correct? I need some suggestions.
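One generic sanity check is to diff the hardware output against a known-good software Sobel run on the same frame; that separates kernel bugs from stride or clamping bugs. A minimal sketch, assuming an 8-bit grayscale buffer with stride equal to the 1920-pixel width (illustrative only, not the poster's PL design):

// Reference 3x3 Sobel over an 8-bit grayscale buffer, row-major.
// Border pixels are left at 0.
function sobel(gray, width, height) {
    var out = new Uint8ClampedArray(width * height);
    for (var y = 1; y < height - 1; y++) {
        for (var x = 1; x < width - 1; x++) {
            var i = y * width + x;
            // horizontal gradient Gx: [-1 0 +1; -2 0 +2; -1 0 +1]
            var gx = -gray[i - width - 1] + gray[i - width + 1]
                   - 2 * gray[i - 1] + 2 * gray[i + 1]
                   - gray[i + width - 1] + gray[i + width + 1];
            // vertical gradient Gy: [-1 -2 -1; 0 0 0; +1 +2 +1]
            var gy = -gray[i - width - 1] - 2 * gray[i - width] - gray[i - width + 1]
                   + gray[i + width - 1] + 2 * gray[i + width] + gray[i + width + 1];
            out[i] = Math.abs(gx) + Math.abs(gy); // |Gx|+|Gy|, clamped to 255 by the array
        }
    }
    return out;
}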

RedPitaya case or housing options?

The RedPitaya is a great looking instrument, but I'm afraid that I'll kill my new (expensive) device by stray voltage or ESD off my bench, within a few days.
Is it planned to make an optional "professional" case or similar to protect it?
Has anyone already created a 3D model so a printable case or housing could be made?
As a quick fix, I would:
use a piece of plain printer paper on the bench, underneath the Red Pitaya (it's usually more conductive than the typical plastic coating on a bench, but still not so conductive as to short anything on the bottom of the board), and
more importantly, each time you approach the bench, first touch the outside of one of the golden SMA jacks.
A quick Google search would probably answer the question, but for the sake of completeness, I will answer it with what I found in my own quick search.
Purchase Options
Nowadays, there are several cases available for the Pitaya:
Available on RS Components, Reichelt, among others:
RS Code: RS819-4077
Manufacturer: Red Pitaya
Manufacturer Ref: 1600 0715 001
Approximate cost: 20 Euros + Shipping + taxes
What appear to be 3D-printed cases from providers:
Nylon plastic closed-top case: approximately 30 Euros
Nylon plastic open-top case
Printing Options
If you happen to have your own 3D printer then you can print one of many available designs.
Closed Top case
Open Top case
A shielded case GitHub project, on Thingiverse, on Youmagine
Others can be found on http://www.yeggi.com/, http://grabcad.com/, ...

Tesseract on iOS - bad results

After spending over 10 hours compiling Tesseract with libc++ so it works with OpenCV, I'm having trouble getting any meaningful results. I'm trying to use it for digit recognition; the image data I'm passing is a small square (50x50) image with either one or no digits in it.
I've tried using both eng and equ tessdata (from Google Code); the results differ, but neither reliably guesses a digit. Using eng data I get '4\n\n' or '\n\n' as a result most of the time (even when there's no digit in the image), with confidence anywhere from 1 to 99.
Using equ data I get '\n\n' with confidence 0-4.
I also tried binarizing the image and the results are more or less the same; I don't think there's a need for it though, since the images are filtered pretty well.
I'm assuming that there's something wrong, since the images are pretty easy to recognize compared to even the simplest of the example images.
Here's the code:
Initialization:
_tess = new TessBaseAPI();
// dataPath must point at the parent of the tessdata directory
_tess->Init([dataPath cStringUsingEncoding:NSUTF8StringEncoding], "eng");
// restrict recognition to digits only
_tess->SetVariable("tessedit_char_whitelist", "0123456789");
_tess->SetVariable("classify_bln_numeric_mode", "1");
Recognition:
// recognize the full image rectangle (left = 0, top = 0)
char *text = _tess->TesseractRect(imageData, (int)bytes_per_pixel, (int)bytes_per_line, 0, 0, (int)imageSize.width, (int)imageSize.height);
I'm getting no errors. TESSDATA_PREFIX is set properly and I've tried different methods for recognition. imageData looks ok when inspected.
Here are some sample images:
http://imgur.com/a/Kg8ar
Should this work with the regular training data?
Any help is appreciated; it's my first time trying Tesseract out and I could have missed something.
EDIT:
I've found this:
_tess->SetPageSegMode(PSM_SINGLE_CHAR);
I'm assuming it must be used in this situation; I tried it but got the same results.
I think Tesseract is a bit overkill for this stuff. You would be better off with a simple neural network, trained explicitly for your images. At my company, we recently tried to use Tesseract on iOS for an OCR task (scanning utility bills with the camera), but it was too slow and inaccurate for our purposes (scanning took more than 30 seconds on an iPhone 4 at a tremendously low FPS). In the end, I trained a neural network specifically for our target font, and this solution not only beat Tesseract (it could scan stuff flawlessly even on an iPhone 3GS), but also a commercial ABBYY OCR engine that we were given a sample of by the company.
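For a flavor of that approach (a toy sketch only, not the network described above: a one-vs-all logistic classifier per digit over raw 50x50 grayscale pixels; all names here are illustrative):

// samples: [{ pixels: Float32Array(2500) scaled to 0..1, label: 0..9 }]
var SIZE = 2500, RATE = 0.01;

function sigmoidScore(w, pixels) {
    var sum = w[SIZE]; // bias term lives in the last slot
    for (var i = 0; i < SIZE; i++) sum += w[i] * pixels[i];
    return 1 / (1 + Math.exp(-sum));
}

function train(samples, epochs) {
    var weights = [];
    for (var d = 0; d < 10; d++) weights.push(new Float32Array(SIZE + 1));
    for (var e = 0; e < epochs; e++) {
        samples.forEach(function (s) {
            for (var d = 0; d < 10; d++) {
                // gradient step on the logistic loss for digit d
                var err = (s.label === d ? 1 : 0) - sigmoidScore(weights[d], s.pixels);
                for (var i = 0; i < SIZE; i++) weights[d][i] += RATE * err * s.pixels[i];
                weights[d][SIZE] += RATE * err;
            }
        });
    }
    return weights;
}

function predict(weights, pixels) {
    var best = 0, bestScore = -1;
    for (var d = 0; d < 10; d++) {
        var score = sigmoidScore(weights[d], pixels);
        if (score > bestScore) { bestScore = score; best = d; }
    }
    return best; // digit with the highest score
}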
This course's material would be a good start in machine learning.

What is the trailing x

I noticed that there are a lot of technologies that use X in their names, like DirectX, PhysX, and X server... Is there something in common? Or is there a reason to choose X?
According to Wikipedia, the X in DirectX 'stands in' for the various Direct APIs - Direct3D, DirectSound, DirectPlay etc. Seems like a reasonable explanation.
PhysX probably plays on the whole DirectX 'thing' - but I expect it's named as such 'cause it sounds a bit like physics.
X Server serves X. :p
The meaning of the X varies by usage: in PhysX it seems to be the kewl[sic] way to spell Physics, whereas X Server (part of the X Window System) takes its name from being the natural evolution of a system named W (probably short for Window, or just the letter after V, the name of the system on which it ran).
DirectX has already been explained in another answer; so there's that.
But the main reason, most of the time, is that Poor Literacy Is Kewl[sic].
