I need to extract some data from a 1 GB XML file into tables using
ETS and DETS.
I searched the whole web and also here, but I didn't find any simple example
of how to handle a big XML file.
To begin with, I just want to understand how to read the file without loading all of it into memory.
Thanks.
What you need is a SAX XML parser called Erlsom. For small files, it's possible to load the whole document into memory and then parse it, as in the answer I gave to this question. For your case, though, files this big need the SAX method. The SAX examples are here.
SAX ensures that you do not load the whole file into memory to parse it: the parser hands you each token as it reads it. You will need a solid grasp of tail recursion, pattern matching and stateful programming.
EDIT: Now, download erlsom and extract it into your Erlang lib directory, the location where all the built-in applications are located. Rename the extracted folder to erlsom-1.0. Create a file called Emakefile in the erlsom-1.0 folder, put this inside that file, and save:
{"src/*", [verbose,report,warn_obsolete_guard,{outdir, "ebin"}]}.
The erlsom-1.0 folder, should look like this:
erlsom-1.0
|-doc/
|-ebin/
|-examples/
|-include/
|-src/
|-Emakefile
The other files do not matter. Now open an Erlang shell whose pwd() is the erlsom-1.0 folder and run make:all(). like this:
Eshell V5.9 (abort with ^G)
1> make:all().
Recompile: src/ucs
Recompile: src/erlsom_writeHrl
Recompile: src/erlsom_write
Recompile: src/erlsom_ucs
Recompile: src/erlsom_simple_form
Recompile: src/erlsom_sax_utf8
Recompile: src/erlsom_sax_utf16le
Recompile: src/erlsom_sax_utf16be
Recompile: src/erlsom_sax_list
Recompile: src/erlsom_sax_lib
Recompile: src/erlsom_sax_latin1
Recompile: src/erlsom_sax
Recompile: src/erlsom_pass2
Recompile: src/erlsom_parseXsd
Recompile: src/erlsom_parse
Recompile: src/erlsom_lib
Recompile: src/erlsom_compile
Recompile: src/erlsom_add
Recompile: src/erlsom
up_to_date
2>
That's it. As long as the erlsom-1.0 folder is in your Erlang lib directory, you can call the erlsom functions from any Erlang shell, whatever its pwd() is.
Have you checked the xmerl library?
To read a big file without loading it entirely into memory, you could use file:open/2, doing something like this:
{ok, FileHandler} = file:open(File, [read, raw, read_ahead]),
{ok, Line} = file:read_line(FileHandler)
Also, for working with XML in Erlang you have xmerl, which unfortunately is pretty poorly documented.
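xmerl also ships a streaming parser, xmerl_sax_parser, as part of OTP. As a rough sketch only (the module name xml_count, the function count_elements/1 and the table name are made up for illustration), the following counts how often each element name occurs. Only the counters are kept in an ETS table, not the document itself; as far as I know, xmerl_sax_parser:file/2 reads the file piece by piece rather than all at once.

```erlang
-module(xml_count).
-export([count_elements/1]).

%% Stream File through xmerl_sax_parser and count start tags per element name.
%% Returns {ok, SortedCounts} or {error, ParserResult}.
count_elements(File) ->
    Tab = ets:new(element_counts, [set, public]),
    EventFun =
        fun({startElement, _Uri, LocalName, _QName, _Attrs}, _Location, T) ->
                %% Insert {LocalName, 0} if missing, then add 1 to the counter.
                ets:update_counter(T, LocalName, 1, {LocalName, 0}),
                T;
           (_OtherEvent, _Location, T) ->
                %% Ignore every other SAX event in this sketch.
                T
        end,
    Result = xmerl_sax_parser:file(File, [{event_fun, EventFun},
                                          {event_state, Tab}]),
    Counts = lists:sort(ets:tab2list(Tab)),
    ets:delete(Tab),
    case Result of
        {ok, _State, _Rest} -> {ok, Counts};
        Other -> {error, Other}
    end.
```

For real extraction you would match on the elements you care about and write rows with ets:insert/2 or dets:insert/2 instead of bumping a counter.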
I don't know Erlang, but it seems that it is possible to integrate C libraries. If you are willing to go down that path, I recommend having a look at the expat library. It is the quickest, most lightweight XML parser library I know. A simple callback mechanism calls your code for each XML element, and you can decide for yourself whether to keep it in memory or skip it.
I know this is very low-level, but for very large data it is, sadly, often the only way to do it.
Here is something I found googling: http://dudefrommangalore.blogspot.de/2009/04/erlang-xml-parser-comparison.html
I have a pretty simple doubt, but I can't seem to find a proper solution for it anywhere.
I have 2 erlang modules, module1.erl and module2.erl. As defined by the submission guidelines for my project, both modules belong to different parts and are hence in different folders part1 and part2 respectively under the same directory. This is how the structure looks:
src/
  part1/
    module1.erl
  part2/
    module2.erl
Now module2 depends on module1 and calls various functions of module1 as module1:method(). Everything works when module1.erl and module2.erl are in the same folder, but now that they're in different folders and I try to run module2.erl from the part2 folder, I can't figure out how to get module2 to compile and call the functions of module1.
When the emulator is launched from module2's directory with default options, the directory containing module1 is not in the code path. This can be verified with:
1> code:get_path().
[".","/usr/local/lib/erlang/lib/kernel-8.2/ebin",
"/usr/local/lib/erlang/lib/stdlib-3.17/ebin",
"/usr/local/lib/erlang/lib/xmerl-1.3.28/ebin",
"/usr/local/lib/erlang/lib/wx-2.1.1/ebin",
"/usr/local/lib/erlang/lib/tools-3.5.2/ebin",
"/usr/local/lib/erlang/lib/tftp-1.0.3/ebin",
"/usr/local/lib/erlang/lib/syntax_tools-2.6/ebin",
"/usr/local/lib/erlang/lib/ssl-10.6/ebin",
"/usr/local/lib/erlang/lib/ssh-4.13/ebin",
"/usr/local/lib/erlang/lib/snmp-5.11/ebin",
"/usr/local/lib/erlang/lib/sasl-4.1.1/ebin",
"/usr/local/lib/erlang/lib/runtime_tools-1.17/ebin",
"/usr/local/lib/erlang/lib/reltool-0.9/ebin",
"/usr/local/lib/erlang/lib/public_key-1.11.3/ebin",
"/usr/local/lib/erlang/lib/parsetools-2.3.2/ebin",
"/usr/local/lib/erlang/lib/os_mon-2.7.1/ebin",
"/usr/local/lib/erlang/lib/odbc-2.13.5/ebin",
"/usr/local/lib/erlang/lib/observer-2.10.1/ebin",
"/usr/local/lib/erlang/lib/mnesia-4.20.1/ebin",
"/usr/local/lib/erlang/lib/megaco-4.2/ebin",
"/usr/local/lib/erlang/lib/inets-7.5/ebin",
"/usr/local/lib/erlang/lib/hipe-4.0.1/ebin",
"/usr/local/lib/erlang/lib/ftp-1.1/ebin",
"/usr/local/lib/erlang/lib/eunit-2.7/ebin",
"/usr/local/lib/erlang/lib/et-1.6.5/ebin",
"/usr/local/lib/erlang/lib/erts-12.2/ebin",
"/usr/local/lib/erlang/lib/erl_interface-5.1/ebin",
[...]|...]
This list contains . but not ../part1, so compiling module1 from the part2 directory fails:
2> c(module1).
{error,non_existing}
There are several ways to work around this. A few simple ones:
Call c("../part1/module1.erl"). As per the documentation of c:
...Module can be either a module name or a source file path, with or without .erl extension...
Here we passed the relative path to the source file of module1.
Alternatively, invoke erl with the -pa option, which adds the path of part1 to the code path for that session of the Erlang emulator.
part2$ erl -pa "../part1"
Erlang/OTP 25 [DEVELOPMENT] [erts-12.2] [source-c1ab4b5424] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit:ns]
Eshell V12.2 (abort with ^G)
1> code:get_path().
["../part1",".","/usr/local/lib/erlang/lib/kernel-8.2/ebin",
"/usr/local/lib/erlang/lib/stdlib-3.17/ebin",
"/usr/local/lib/erlang/lib/xmerl-1.3.28/ebin",
"/usr/local/lib/erlang/lib/wx-2.1.1/ebin",
"/usr/local/lib/erlang/lib/tools-3.5.2/ebin",
"/usr/local/lib/erlang/lib/tftp-1.0.3/ebin",
"/usr/local/lib/erlang/lib/syntax_tools-2.6/ebin",
"/usr/local/lib/erlang/lib/ssl-10.6/ebin",
"/usr/local/lib/erlang/lib/ssh-4.13/ebin",
"/usr/local/lib/erlang/lib/snmp-5.11/ebin",
"/usr/local/lib/erlang/lib/sasl-4.1.1/ebin",
"/usr/local/lib/erlang/lib/runtime_tools-1.17/ebin",
"/usr/local/lib/erlang/lib/reltool-0.9/ebin",
"/usr/local/lib/erlang/lib/public_key-1.11.3/ebin",
"/usr/local/lib/erlang/lib/parsetools-2.3.2/ebin",
"/usr/local/lib/erlang/lib/os_mon-2.7.1/ebin",
"/usr/local/lib/erlang/lib/odbc-2.13.5/ebin",
"/usr/local/lib/erlang/lib/observer-2.10.1/ebin",
"/usr/local/lib/erlang/lib/mnesia-4.20.1/ebin",
"/usr/local/lib/erlang/lib/megaco-4.2/ebin",
"/usr/local/lib/erlang/lib/inets-7.5/ebin",
"/usr/local/lib/erlang/lib/hipe-4.0.1/ebin",
"/usr/local/lib/erlang/lib/ftp-1.1/ebin",
"/usr/local/lib/erlang/lib/eunit-2.7/ebin",
"/usr/local/lib/erlang/lib/et-1.6.5/ebin",
"/usr/local/lib/erlang/lib/erts-12.2/ebin",
[...]|...]
2> c(module2).
{ok,module2}
3> c(module1).
Recompiling /home/nalin/source/erlang/part2/../part1/module1.erl
{ok,module1}
4> module2:exec().
"Module2"
5> module1:exec().
"Module1"
6>
I hope this is enough to get you going. Also, take the opportunity to read through the Compilation and Code Loading documentation to get an idea of what goes on.
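The same effect can also be had from inside a running emulator, using compile:file/2 and code:add_patha/1. The helper below is only a sketch: the module name sibling and the function load/2 are hypothetical, and it assumes you are happy to have the .beam file written next to the source.

```erlang
-module(sibling).
-export([load/2]).

%% Compile Module from the source file in Dir (for example "../part1"),
%% write the .beam file into Dir, prepend Dir to the code path and
%% load the freshly compiled module.
load(Dir, Module) ->
    Source = filename:join(Dir, atom_to_list(Module)),
    case compile:file(Source, [{outdir, Dir}, report]) of
        {ok, Module} ->
            true = code:add_patha(Dir),
            %% Drop any old version before loading the new one.
            code:purge(Module),
            code:load_file(Module);
        error ->
            {error, compile_failed}
    end.
```

From the part2 directory, sibling:load("../part1", module1) followed by c(module2). should then let module2 call module1.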
I am working on a tool that deals with BEAM files, and we want to be able to assume the code was compiled with -Werror, so we don't have to repeat validations that are already done by the erl_lint compiler pass.
Is there a way to figure out if the BEAM was built with -Werror?
I'd expect beam_lib:chunks/2 to help here, but unfortunately it doesn't seem to have what I'm looking for:
beam_lib:chunks("sample.beam", [debug_info, attributes, compile_info]).
% the stuff returned says nothing about -Werror, even if I compile with -Werror
It seems that this information is always stripped.
However, if you are in control of the compilation process, you can put additional info into the BEAM files, which will then be accessible through M:module_info(compile) and via BEAM chunks as well.
For example in rebar:
{erl_opts, [debug_info, {compile_info, [{my_key, my_value}]}]}.
And then:
1> my_module:module_info(compile).
[{version,"7.6.6"},
{options,[debug_info, ...
{my_key,my_value}]
The same is true for "discoverability" of this key directly from beam chunks:
2> beam_lib:chunks("my_beam.beam", [compile_info]).
{ok, ... {my_key,my_value}]}]}}
Meaning that you can easily "stamp" your BEAM files with some meta-information. So a workaround may be to stamp those BEAM files with such a mark.
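A tool could then look for the stamp roughly like this. This is a sketch, not a definitive implementation: the module name beam_stamp and the function stamp/2 are invented for illustration, and it only reads the compile_info chunk through beam_lib:chunks/2.

```erlang
-module(beam_stamp).
-export([stamp/2]).

%% Look up Key in the compile_info chunk of BeamFile.
%% Returns {ok, Value} if the key is present, not_found if it is absent,
%% or {error, Reason} if the chunk cannot be read.
stamp(BeamFile, Key) ->
    case beam_lib:chunks(BeamFile, [compile_info]) of
        {ok, {_Module, [{compile_info, Info}]}} ->
            case lists:keyfind(Key, 1, Info) of
                {Key, Value} -> {ok, Value};
                false -> not_found
            end;
        {error, beam_lib, Reason} ->
            {error, Reason}
    end.
```

With the rebar setting above, beam_stamp:stamp("my_beam.beam", my_key) would be expected to return {ok, my_value}; a file compiled without the stamp returns not_found.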
I have two modules in the same src folder. mod1 declares a function I wish to use in module mod2:
-module(mod1).
-export([myfunc/1]).
myfunc(A) -> {ok}.
In the other module I do not import mod1:
-module(mod2).
If I type "mod1:" in mod2, the editor recognizes "myfunc". The problem is that at run time, when I call mod1:myfunc(A), I get "undefined function mod1:myfunc/1".
I don't understand why I get this error if IntelliSense detects my mod1 function in mod2.
From the shell, you could try doing mod1:module_info(exports) to see the list of all the exported functions, though if your module is written as it is above, it should be generating a syntax error.
If, however, I'm wrong, and you actually do have it written properly in your module, (ie, it's just a typo here), try doing the following at the erlang shell:
c(mod1).
c(mod2).
And see if that works for you. This will compile and load the modules for you. If you don't have the module compiled (ie, it's just a .erl file in the directory), that's insufficient.
EDIT
Also, make sure that the beam files are being loaded properly when Erlang launches. This is typically done by launching erl with erl -pa /path/to/beams.
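To check from code whether a module can be loaded and really exports a given function, something along these lines may help. It is a sketch; the module name check_export and the function exported/3 are hypothetical.

```erlang
-module(check_export).
-export([exported/3]).

%% Try to load Mod from the code path, then report whether Fun/Arity
%% is exported. Returns true, false, or {error, Reason} when the module
%% cannot be loaded (for example because no .beam file is on the path).
exported(Mod, Fun, Arity) ->
    case code:ensure_loaded(Mod) of
        {module, Mod} ->
            erlang:function_exported(Mod, Fun, Arity);
        {error, Reason} ->
            {error, Reason}
    end.
```

In the situation above, check_export:exported(mod1, myfunc, 1) returning {error, nofile} would point to a missing or unreachable mod1.beam rather than to a problem in the source.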
I've got a newbie question.
I'm trying to parse an XML message with pattern matching in functions.
A sample of a message is:
<msg> <action type="xxx"... /> </msg>
What I would like to be able to do is (sort of):
decode_msg_in( << $<,$m,$s,$g,$>, Message/binary, $<,$/,$m,$s,$g,$> >>, R ) ->
The decode does not work (obviously; it's only an indication of what I'd like to do).
Is this even possible?
Does anyone have an idea? Or do I need to "iterate" over the whole message as a list, building new "words"?
Regards
/P
I think you need to read about bit syntax expressions, binary comprehensions, and an XML parser library called erlsom; download it here. That will bring you up to speed on what you want to do.
EDIT: The XML message may reach your server as a binary or as a string. Whichever way it does, the parser can turn the XML data into Erlang terms. Using the erlsom library, here is an example for your XML structure. I have the erlsom library in my code path.
C:\Windows\System32>erl
Eshell V5.9 (abort with ^G)
1> XML = "<msg><action type=\"xxx\"/>message</msg>".
"<msg><action type=\"xxx\"/>message</msg>"
2> erlsom:simple_form(XML).
{ok,{"msg",[],[{"action",[{"type","xxx"}],[]},"message"]},
[]}
3> {_,Parsed,_} = erlsom:simple_form(XML).
{ok,{"msg",[],[{"action",[{"type","xxx"}],[]},"message"]},
[]}
4> Parsed.
{"msg",[],[{"action",[{"type","xxx"}],[]},"message"]}
5> {_,_,[{_,[{_,ActionType}],_},Message]} = Parsed.
{"msg",[],[{"action",[{"type","xxx"}],[]},"message"]}
6> ActionType.
"xxx"
7> Message.
"message"
8>
You can see above that it comes down to easy pattern matching. The resulting structure will be clean as long as the senders send well-formed XML. If you suspect malformed XML may hit your server, wrap the parser call in try [CALL] of [GoodResult] -> [Action1] catch _Error:_Reason -> [Action2] end. Note that if the XML body is very large, you need to use the SAX method to parse it, to avoid a big memory footprint. SAX examples are included in the library documentation.
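As for the literal question about the binary pattern: a head like the decode_msg_in one cannot work as written, because in a binary pattern only the last segment may have an unspecified size. If all you want is to strip a fixed wrapper, you can compute the size of the middle part first. A minimal sketch (the module name msg_match and the function strip_msg/1 are made up, and this is no substitute for a real XML parser):

```erlang
-module(msg_match).
-export([strip_msg/1]).

%% Return {ok, Body} when Bin is exactly <msg>Body</msg>, otherwise error.
%% The closing tag "</msg>" is 6 bytes long.
strip_msg(<<"<msg>", Rest/binary>>) when byte_size(Rest) >= 6 ->
    %% Everything except the last 6 bytes is the candidate body.
    BodySize = byte_size(Rest) - 6,
    case Rest of
        <<Body:BodySize/binary, "</msg>">> -> {ok, Body};
        _ -> error
    end;
strip_msg(_) ->
    error.
```

This only checks the outer wrapper byte for byte; attributes on msg, surrounding whitespace or nested msg elements would need the parser approach shown above.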
In the recent Erlang R14, inets' file httpd.hrl has been moved from:
src/httpd.hrl
to:
src/http_server/httpd.hrl
The Erlang Web framework includes the file in several places using the following directive:
-include_lib("inets/src/httpd.hrl").
Since I'd love the Erlang Web to compile with both versions of Erlang (R13 and R14), what I'd need ideally is:
-ifdef(OLD_ERTS_VERSION).
-include_lib("inets/src/httpd.hrl").
-else.
-include_lib("inets/src/http_server/httpd.hrl").
-endif.
Even though it is possible to retrieve the ERTS version via:
erlang:system_info(version).
that is not possible at pre-processing time.
How should I deal with these situations? Is a parse transform the only way to go? Are there better alternatives?
Not sure if you'll like this hackish trick, but you could use a parse transform.
Let's first define a basic parse transform module:
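-module(erts_v).
-export([parse_transform/2]).
parse_transform(AST, _Opts) ->
    %% Print the abstract format, then return it unchanged so the
    %% compiler still receives a list of forms.
    io:format("~p~n", [AST]),
    AST.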
Compile it, then you can include both headers in the module you want this to work for. This should give the following:
-module(test).
-compile({parse_transform, erts_v}).
-include_lib("inets/src/httpd.hrl").
-include_lib("inets/src/http_server/httpd.hrl").
-export([fake_fun/1]).
fake_fun(A) -> A.
If you're on R14B and compile it, you should have the abstract format of the module looking like this:
[{attribute,1,file,{"./test.erl",1}},
{attribute,1,module,test},
{error,{3,epp,{include,lib,"inets/src/httpd.hrl"}}},
{attribute,1,file,
{"/usr/local/lib/erlang/lib/inets-5.5/src/http_server/httpd.hrl",1}},
{attribute,1,file,
{"/usr/local/lib/erlang/lib/kernel-2.14.1/include/file.hrl",1}},
{attribute,24,record,
{file_info,
[{record_field,25,{atom,25,size}},
{record_field,26,{atom,26,type}},
...
What this tells us is that we can use both headers, and the valid one will automatically be included while the other will error out. All we need to do is remove the {error,...} tuple and get a working compilation. To do this, fix the parse_transform module so it looks like this:
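-module(erts_v).
-export([parse_transform/2]).
parse_transform(AST, _Opts) ->
    walk_ast(AST).
%% Drop the first failed include of either httpd.hrl location and keep
%% the rest of the forms untouched.
walk_ast([{error,{_,epp,{include,lib,"inets/src/httpd.hrl"}}}|AST]) ->
    AST;
walk_ast([{error,{_,epp,{include,lib,"inets/src/http_server/httpd.hrl"}}}|AST]) ->
    AST;
walk_ast([H|T]) ->
    [H|walk_ast(T)];
%% Base case: no failed include was found.
walk_ast([]) ->
    [].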
This will then remove the error include, only if it's on the precise module you wanted. Other messy includes should fail as usual.
I haven't tested this on all versions, so if the behaviour changed between them, this won't work. On the other hand, if it stayed the same, this parse_transform will be version independent, at the cost of having to control the compilation order of your modules (the parse transform must be compiled first), which is simple enough with Emakefiles and rebar.
If you are using makefiles, you can do something like
ERTS_VER=$(shell erl +V 2>&1 | egrep -o '[0-9]+.[0-9]+.[0-9]+')
then match the string and define a macro in the erlc arguments or in the Emakefile.
There is no other way, AFAIK.
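If you would rather keep the decision in Erlang, a build script can compute the macro before compiling. The sketch below takes a different route from matching the version string: it simply checks which header location exists in the installed inets. The module name inets_opts and the function erl_opts/0 are hypothetical, and it assumes the inets src directory is installed at all, which is not the case on every system.

```erlang
-module(inets_opts).
-export([erl_opts/0]).

%% Return the compile options needed for the -ifdef(OLD_ERTS_VERSION)
%% switch: define the macro only when the old header location exists.
erl_opts() ->
    case code:lib_dir(inets) of
        {error, bad_name} ->
            %% inets is not installed; define nothing.
            [];
        LibDir ->
            OldHeader = filename:join([LibDir, "src", "httpd.hrl"]),
            case filelib:is_regular(OldHeader) of
                true -> [{d, 'OLD_ERTS_VERSION'}];
                false -> []
            end
    end.
```

The result can be passed straight to the compiler, for example compile:file("my_module", inets_opts:erl_opts()), where my_module stands for whichever module contains the conditional include.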