Where exactly in IPFS.create() or IPFS.add() does my node propagate an updated distributed hash table upon adding a file? - bootstrapping

Original question: Does the IPFS.add() method automatically update my local DHT and propagate it to other peers?
In order to test whether the IPFS.add() method alone allows other peers to download content from my PC, I ran this script on my Windows PC:
import * as IPFS from "ipfs";
const node = await IPFS.create();
var file = await node.add("Shiiiiiiiiiiitttt");
and ran this code on my macbook to fetch the file:
import * as IPFS from "ipfs";
const node = await IPFS.create();
//fetching Shiiiiiiiiiiitttt
const stream = node.cat("QmetK5x9nLUG5jDwp7Un25n47exuNjDZ3cKvnKfC6Hebmi")
const decoder = new TextDecoder()
let data = ''
for await (const chunk of stream) {
  // chunks of data are returned as a Uint8Array, convert them back to a string
  data += decoder.decode(chunk, { stream: true })
  console.log("decoding")
}
// As long as IPFS kept running on the owner node, there was no need to call ipfs.dht.provide
console.log(data)
What I found through this test is that, as long as I keep running the jsipfs daemon (or the script itself) on the Windows PC that added the file or text, I can retrieve it from other devices using IPFS.cat(). This confuses me deeply, since IPFS also has a separate method, IPFS.dht.provide(), and my current understanding of IPFS is that propagating an updated DHT to other peers is necessary for them to be able to fetch files. From my test, however, I can only conclude that something within IPFS.add() must propagate an updated distributed hash table, or at least something similar, to other peers so that they know I have the file. I'm having a very difficult time finding the source methods for this automatic DHT propagation upon adding a file, and I would appreciate any help finding said methods, or an under-the-hood explanation of what happens during IPFS.add().

Check out the IPFS docs at https://docs.ipfs.tech/concepts/ ; they are far more important than you might first think.
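In short: ipfs.add() itself mainly writes the blocks into your local repo; making them findable is handled by separate subsystems (in js-ipfs, typically the preload nodes and the periodic provider/reprovider, depending on your configuration and version), which would explain why you never had to call ipfs.dht.provide() yourself. If you want to make the announcement explicit, or rule the preload nodes out while testing, here is a rough sketch; treat the option names and the dht.provide return type as assumptions to verify against your js-ipfs version:
import * as IPFS from "ipfs";

// Assumption: a recent js-ipfs where dht.provide yields query events;
// older releases resolved a plain promise instead.
const node = await IPFS.create({
  // Turning preload off makes it easier to see whether the DHT
  // announcement alone lets other peers find the block.
  preload: { enabled: false }
});

const file = await node.add("some test content");

// Explicitly tell the DHT that this node can serve the new CID.
for await (const event of node.dht.provide(file.cid)) {
  console.log("provide query event:", event);
}
Either way, the fetching peer still downloads the blocks over Bitswap from whichever node actually has them; the DHT/provide step only answers the "who has this CID?" question.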

Related

Blocked when reading from os.Stdin when piping output in Docker

I'm trying to pipe the output (logs) of a program to a Go program which aggregates/compresses the output and uploads it to S3. The command to run the program is "/program1 | /logShipper". The logShipper is written in Go and it simply reads from os.Stdin and writes to a local file. The local file is processed by another goroutine and uploaded to S3 periodically. There are some existing Docker log drivers, but we are running the container on a fully managed provider and the log-processing charge is pretty expensive, so we want to bypass the existing solution and just upload to S3.
The main logic of the logShipper is simply to read from os.Stdin and write to a file. It works correctly when running on the local machine, but when running in Docker the goroutine blocks at reader.ReadString('\n') and never returns.
go func() {
    reader := bufio.NewReader(os.Stdin)
    mu.Lock()
    output = openOrCreateOutputFile(&uploadQueue, workPath)
    mu.Unlock()
    for {
        text, _ := reader.ReadString('\n')
        now := time.Now().Format("2006-01-02T15:04:05.000000000Z")
        mu.Lock()
        output.file.Write([]byte(fmt.Sprintf("%s %s", now, text)))
        mu.Unlock()
    }
}()
I did some research online but couldn't find why it's not working. One possibility I'm considering is that Docker redirects stdout somewhere, so the pipe doesn't work the same way as it does on a Linux box (it looks like nothing can be read from program1). Any help or suggestion on why it's not working is welcome. Thanks.
Edit:
After doing more research I realized it's bad practice to handle the logs this way. I should rely more on Docker's log drivers to handle log aggregation and shipping. However, I'm still interested in finding out why it doesn't read anything from the pipe's source program.
I'm not sure about the way Docker handles output, but I suggest you extract the file descriptor with os.Stdin.Fd() and then resort to using the golang.org/x/sys/unix package as follows:
// Long way; for the short way, jump
// straight down to it.
//
// Retrieve the file descriptor and
// cast it to int, because the Fd method
// returns uintptr.
fd := int(os.Stdin.Fd())
// Extract the file descriptor flags
// (FcntlInt itself wants a uintptr again).
// It's safe to drop the error, since if it's there
// and it's not nil, you won't be able to read from
// Stdin anyway, unless it's a notice
// to try again, which mostly should not be
// the case.
flags, _ := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
// check whether nonblocking reading is enabled
nb := flags&unix.O_NONBLOCK != 0
// if it is not, just enable it with
// unix.SetNonblock, which is also the
// -- SHORT WAY HERE --
err := unix.SetNonblock(fd, true)
The difference between the long way and the short way is that the long way will tell you definitively whether the problem is the absence of the nonblocking flag or not.
If this is not the case, then I have no other ideas, personally.
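Separately, and not an explanation of the Docker behaviour: the loop in the question drops the error returned by ReadString, so once the writing side of the pipe closes, the goroutine spins on io.EOF instead of exiting. A minimal, self-contained sketch of the same loop with the error handled (printing to stdout instead of the shipper's output file, just to keep it short):
package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
    "time"
)

func main() {
    reader := bufio.NewReader(os.Stdin)
    for {
        text, err := reader.ReadString('\n')
        if len(text) > 0 {
            now := time.Now().Format("2006-01-02T15:04:05.000000000Z")
            // The real shipper would write to its output file under the
            // mutex here instead of printing.
            fmt.Printf("%s %s", now, text)
        }
        if err != nil {
            if err != io.EOF {
                fmt.Fprintln(os.Stderr, "read error:", err)
            }
            return // the pipe was closed; stop instead of spinning
        }
    }
}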

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
for params in train_param_set]
Running this as above, the scheduler gets one worker to read the file, and then it spills that data to disk to share it with the other workers. However, loading the data usually means reading from a large HDF5 file, which can be done concurrently, so I was wondering if there is a way to force all workers to read this file concurrently (i.e. they all compute the root task) instead of having them wait for one worker to finish and then slowly transfer the data from that worker.
I know there is the client.run() method which I can use to get all workers to read the file concurrently, but how would you then get the data you've read to feed into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.
Revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker

client_id = client.id

def load_dataset():
    worker = get_worker()
    data = {'load_dataset-0': load_data_func('path/to/data')}
    info = worker.update_data(data=data, report=False)
    worker.scheduler.update_data(
        who_has={key: [worker.address] for key in data},
        nbytes=info['nbytes'],
        client=client_id)
client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
for params in train_param_set]
Bear in mind that it uses internal API methods Worker.update_data and Scheduler.update_data and works fine as of distributed.__version__ == 1.21.6 but could be subject to change in future releases.
As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly:
from dask.distributed import wait

future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new

Can Dataflow sideInput be updated per window by reading a gcs bucket?

I’m currently creating a PCollectionView by reading filtering information from a gcs bucket and passing it as side input to different stages of my pipeline in order to filter the output. If the file in the gcs bucket changes, I want the currently running pipeline to use this new filter info. Is there a way to update this PCollectionView on each new window of data if my filter changes? I thought I could do it in a startBundle but I can’t figure out how or if it’s possible. Could you give an example if it is possible.
PCollectionView<Map<String, TagObject>> tagMapView =
    pipeline.apply(TextIO.Read.named("TagListTextRead")
                       .from("gs://tag-list-bucket/tag-list.json"))
            .apply(ParDo.named("TagsToTagMap").of(new Tags.BuildTagListMapFn()))
            .apply("MakeTagMapView", View.asSingleton());

PCollection<String> windowedData =
    pipeline.apply(PubsubIO.Read.topic("myTopic"))
            .apply(Window.<String>into(
                SlidingWindows.of(Duration.standardMinutes(15))
                              .every(Duration.standardSeconds(31))));

PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
                .withSideInputs(tagMapView)
                .of(new ExtractJsonObjectFn()));
You probably want something like "use an at-most-1-minute-old version of the filter as a side input" (since in theory the file can change frequently, unpredictably, and independently of your pipeline, so there's no way to completely synchronize changes of the file with the behavior of the pipeline).
Here's a (granted, rather clumsy) solution I was able to come up with. It relies on the fact that side inputs are implicitly also keyed by window. In this solution we're going to create a side input windowed into 1-minute fixed windows, where each window will contain a single value of the tag map, derived from the filter file as of some moment inside that window.
PCollection<Long> ticks = p
    // Produce 1 "tick" per second
    .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(1)))
    // Window the ticks into 1-minute windows
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    // Use an arbitrary per-window combiner to reduce to 1 element per window
    .apply(Count.globally());

// Produce a collection of tag maps, 1 per each 1-minute window
PCollectionView<TagMap> tagMapView = ticks
    .apply(MapElements.via((Long ignored) -> {
        ... manually read the json file as a TagMap ...
    }))
    .apply(View.asSingleton());
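For completeness, consuming this windowed side input from the main ParDo looks the same as for any other side input. A rough sketch against the Dataflow 1.x SDK used in the question; the DoFn body and the way the view is passed in are purely illustrative (the question constructs ExtractJsonObjectFn without arguments, so adapt as needed):
class ExtractJsonObjectFn extends DoFn<String, MY_DATA> {
    private final PCollectionView<TagMap> tagMapView;

    ExtractJsonObjectFn(PCollectionView<TagMap> tagMapView) {
        this.tagMapView = tagMapView;
    }

    @Override
    public void processElement(ProcessContext c) {
        // The runner maps each main-input window onto the matching
        // 1-minute side-input window and returns that window's value.
        TagMap tags = c.sideInput(tagMapView);
        // ... parse c.element(), filter it against 'tags',
        // and c.output(...) the result ...
    }
}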
This pattern (joining against slowly changing external data as a side input) comes up repeatedly, and the solution I'm proposing here is far from perfect; I wish we had better support for this in the programming model. I've filed a BEAM JIRA issue to track this.

Using Neo4j with React JS

Can we use the graph database Neo4j with React JS? If not, is there any alternative option for including a graph database in React JS?
Easily, all you need is neo4j-driver: https://www.npmjs.com/package/neo4j-driver
Here is the simplest usage:
neo4j.js
//import { v1 as neo4j } from 'neo4j-driver'
const neo4j = require('neo4j-driver').v1
const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))
const session = driver.session()
session
  .run(`
    MATCH (n:Node)
    RETURN n AS someName
  `)
  .then((results) => {
    results.records.forEach((record) => console.log(record.get('someName')))
    session.close()
    driver.close()
  })
It is best practice to always close the session after you get the data; it is inexpensive and lightweight.
It is best practice to only close the driver once your program is done (as with MongoDB). You will see extreme errors if you close the driver at a bad time, which is incredibly important to note if you are a beginner. You will see errors like 'connection to server closed', etc. In async code, for example, if you run a query and close the driver before the results are parsed, you will have a bad time.
You can see in my example that I close the driver afterwards, but only to illustrate proper cleanup. If you run this code in a standalone JS file to test, you will see that Node.js hangs after the query and you need to press CTRL + C to exit; adding driver.close() fixes that. Normally, the driver is not closed until the program exits or crashes (which is never in a backend API), and in the frontend not until the user logs out.
Knowing this now, you are off to a great start.
Remember: session.close() immediately every time, and be careful with driver.close().
You could easily put this code in a React component or action creator and render the data.
You will find it no different from hooking up and working with Axios.
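For example, here is a minimal sketch of a hooks-based component doing just that; the component name, query, and rendering are purely illustrative, and it assumes the 1.x driver API shown above:
import React, { useEffect, useState } from 'react'
import { v1 as neo4j } from 'neo4j-driver'

const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))

function NodeNames() {
  const [names, setNames] = useState([])

  useEffect(() => {
    const session = driver.session()
    session
      .run('MATCH (n:Node) RETURN n.name AS someName')
      .then((results) => {
        setNames(results.records.map((record) => record.get('someName')))
        session.close() // close the session as soon as the data is in
      })
  }, [])

  return (
    <ul>
      {names.map((name) => <li key={name}>{name}</li>)}
    </ul>
  )
}

export default NodeNames
With a class component, the same call would go in componentDidMount instead of useEffect.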
You can also run statements in a transaction, which is beneficial for write-locking the affected nodes. You should research that thoroughly first, but the transaction flow looks like this:
const session = driver.session()
const tx = session.beginTransaction()

tx
  .run(query)
  .then(/* same as a normal session.run */)
  .catch(/* handle errors */)

// the difference is that you can chain multiple queries in one transaction:
const tx1 = await tx.run(query1).then(/* use the results */)
const tx2 = await tx.run(query2).then(/* use the results */)

// then, once you are ready to commit the changes:
if (results.good !== true) {
  tx.rollback()
  session.close()
  throw error
}

await tx.commit()
session.close()

const finalResults = { tx1, tx2 }
return finalResults

// in my experience, you have to await tx.commit()
// in async/await code, otherwise it may not commit properly --
// that operation is not instant
tl;dr;
Yes, you can!
You are mixing two different technologies together. Neo4j is a graph database and React.js is a front-end framework.
You can connect to Neo4j from JavaScript - http://neo4j.com/developer/javascript/
Interesting topic. I am using the driver in a React app and recently experienced some issues. I am closing the session every time a lifecycle hook completes, as in your example. When there were more intensive queries I would see a timeout error. Going back over my setup, I decided to experiment by closing the driver in some of the more expensive queries, and it looks like (still need more testing) the crashes are gone.
If you are deploying a real-world application, I would urge you to think about authentication and authorization when using a DB-React-only setup, as you would have to store the username/password of the Neo4j server in the client. I am looking into options for having the Neo4j server issue a token and receive it back for authorization, but the best practice is surely to have a Node.js server in the middle, with something like Passport handling authentication.
So, all in all, maybe the best scenario is to only use the driver in Node and have the browser always communicate with the Node server using axios...

Wireshark Dissector in Lua error: "Tree item ProtoField/Protocol handle is invalid"

I'm new to Lua altogether, and this is my first attempt at writing a Wireshark dissector.
I want to analyze unencrypted SSH with a Lua script. As a first step, I wrote a script to extract the packet length and the padding length.
Here is my script:
do
    local p_test = Proto("test", "Test.");
    local f_packet_length = ProtoField.uint32("packet_length")
    local f_padding_length = ProtoField.uint8("padding_length")
    p_test.fields = {
        f_packet_length,
        f_padding_length
    }
    function p_test.dissector(buf, pkt, root)
        local offset = 0
        local buf_len = buf:len()
        local t = root:add(p_test, buf:range(offset))
        t:add(f_packet_length, buf:range(offset, 4))
        offset = offset + 4
        t:add(f_padding_length, buf:range(offset, 1))
        offset = offset + 1
    end
    local tcp_table = DissectorTable.get("tcp.port")
    tcp_table:add(22, p_test)
end
After I ran the code through Evaluate Lua and applied the test filter, I found the following error in the Packet Details pane:
Lua Error: [string "do..."]:19: Tree item ProtoField/Protocol handle is invalid (ProtoField/Proto not registered?)
Line 19 corresponds to the t:add(f_packet_length, ...) line.
Could anyone help to explain this error?
Thanks in advance.
Your code above will work fine if it's in a real Lua script for Wireshark... either by being in a .lua file in the personal plugins directory, or by being loaded with the "-X lua_script:<filename>" command line switch.
But you can't register a new protocol in the Tools->Evaluate Lua window, because by that point it's too late to register a new protocol (or new fields). Unfortunately the error Wireshark reports isn't clear about that, because it sort of half works; but really it's not working and cannot work.
The problem is that new protocol registration happens in two phases internally. In the first phase, the Lua scripts are loaded and executed, which adds the protocol and fields to an internal temporary table. After all the Lua scripts have loaded, the second phase moves the new protocols and fields from the temporary table into their final run-time tables and registers them, and then Wireshark finishes loading and you see the GUI. That second phase happens once and only once, when Wireshark first starts up. Running the Tools->Evaluate Lua window happens after all that, so it's too late.
