libgit2sharp: Diff of a Tree with an unsaved file

I would like to extend the GitDiffMargin add-in so that when a user is modifying a file in Visual Studio she can see the updated diff in the margin even without saving the file.
Is it possible with libgit2sharp to do such a diff between a Tree and another Tree that I would build myself?

As far as I understand it, this question can be split into 3 sub-questions:
How to diff two Trees
How to build a new Tree by modifying one existing file (Blob) in it
How to create a Blob from the content of a file that hasn't been previously saved to disk.
How to diff two Trees:
API: repo.Diff.Compare<T>(Tree, Tree)
Tests: DiffTreeToTreeFixture.cs
How to build a new Tree by modifying one existing file (Blob) in it:
API: TreeDefinition.From(Tree), TreeDefinition.Add(string, Blob, Mode) and repo.ObjectDatabase.CreateTree(TreeDefinition)
Tests: TreeDefinitionFixture.cs and ObjectDatabaseFixture.cs
How to create a Blob from the content of a file that hasn't been previously saved to disk:
API: repo.ObjectDatabase.CreateBlob(Stream, string)
Tests: ObjectDatabaseFixture.cs
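Putting the three answers together, a rough C# sketch (the repository path, the file path src/Program.cs, and the buffer contents are placeholders, and the exact CreateBlob overload may vary between libgit2sharp versions; treat this as a sketch, not definitive API usage):

using System.IO;
using System.Text;
using LibGit2Sharp;

using (var repo = new Repository("path/to/repo"))
{
    // 3. Create a Blob from the unsaved editor buffer, without touching the disk
    var buffer = new MemoryStream(Encoding.UTF8.GetBytes("current, unsaved text of the file"));
    Blob blob = repo.ObjectDatabase.CreateBlob(buffer, "src/Program.cs");

    // 2. Build a new Tree from HEAD's Tree, swapping in the new Blob for that file
    Tree headTree = repo.Head.Tip.Tree;
    TreeDefinition definition = TreeDefinition.From(headTree)
        .Add("src/Program.cs", blob, Mode.NonExecutableFile);
    Tree newTree = repo.ObjectDatabase.CreateTree(definition);

    // 1. Diff the two Trees
    TreeChanges changes = repo.Diff.Compare<TreeChanges>(headTree, newTree);
}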

Related

Copy folders into Azure data storage (azure data factory)

I am trying to copy folders with their files from FTP into Azure data storage by looping through the folders and, for each folder, copying its content into a container that has the folder's name. For this I used a Get Metadata, a ForEach, and a Copy Data activity. For now I am able to copy all the folders into the same container, but what I want is to have multiple containers in the output, each named after one of the FTP folders and containing that folder's files.
PS: I am still new to Azure Data Factory.
Any advice or help is very welcome :)
You need to add a Get Metadata activity before the ForEach. The Get Metadata activity will get the files in the current directory and pass them to the ForEach. You connect it to your Blob storage folder.
Try something like this:
Set up a JSON source:
Create a pipeline and use a Get Metadata activity to list all the folders in the container/storage, selecting Child Items as the field.
Feed the Metadata output (the list of container contents) into a Filter activity and keep only the folders (a JSON sketch of this Filter activity follows after these steps).
Input the list of folders to a ForEach activity.
Inside the ForEach, set the current item() to a variable and use it as a parameter for a parameterized source dataset, which is a clone of the original source.
This results in listing the files of each folder in your container.
Feed this to another Filter activity, this time filtering on files: use @equals(item().type,'File')
Now create another pipeline in which the copy activity runs for each file whose name matches the name of its parent folder.
Create parameters in the new child pipeline to receive the current folder and file name of the iteration from the parent pipeline, to evaluate for the copy.
Inside the child pipeline, start with a ForEach whose input is the list of file names inside the folder, received via the parameter: @pipeline().parameters.filesnamesreceived
Use a variable to hold the current item and use an If Condition to check whether the file name and folder name match.
Note: consider dropping the file extension as per your requirement, since the metadata holds the complete file name along with its extension.
If True -> the names match; copy from source to sink.
Here the hierarchy is preserved, and you can also use "Prefix" to specify the file path, since the copy preserves the hierarchy. It utilizes the service-side filter for Blob storage, which provides better performance than a wildcard filter.
The sub-path after the last "/" in the prefix is preserved. For example, if you have the source container/folder/subfolder/file.txt and configure the prefix as folder/sub, then the preserved file path is subfolder/file.txt, which fits your scenario.
This copies files like /source/source/source.json to /sink/source/source.json
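For reference, a hedged JSON sketch of the folder Filter activity described in the steps above (the activity name GetFolderList is a placeholder for your own Get Metadata activity):

{
    "name": "FilterOnlyFolders",
    "type": "Filter",
    "dependsOn": [ { "activity": "GetFolderList", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('GetFolderList').output.childItems", "type": "Expression" },
        "condition": { "value": "@equals(item().type, 'Folder')", "type": "Expression" }
    }
}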
AzCopy is a simpler solution for this than Data Factory, and its dry-run mode can be used to check which files/folders would be copied. For a single blob, the Azure CLI can also be used:
az storage blob copy start \
--destination-container destContainer \
--destination-blob myBlob \
--source-account-name mySourceAccount \
--source-account-key mySourceAccountKey \
--source-container myContainer \
--source-blob myBlob
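For the dry run mentioned above, a hedged AzCopy (v10) sketch; account names, container names, the folder, and the SAS tokens are placeholders:

azcopy copy "https://<source-account>.blob.core.windows.net/<container>/<folder>?<SAS>" \
            "https://<dest-account>.blob.core.windows.net/<dest-container>?<SAS>" \
            --recursive \
            --dry-run   # prints what would be copied without copying anything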

Is there any way to avoid reading old files from a folder with Apache Beam's TextIO watchForNewFiles(Duration, condition)?

Use case: at Dataflow job start-up we provide an initial file name to read, and later on the job should watch for new files in that directory, treating all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
Used like this, the job considers the old files as new files and reads all the files in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
This reads only that particular file and does not pick up the upcoming new files.
Can anyone suggest an approach to achieve this use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
    .apply(FileIO.match().filepattern("gs://folder-Name/*")
        .continuously(Duration.standardSeconds(30),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
    .setCoder(MetadataCoderV2.of())
    .apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? What is the expected number of files? Should you really use a Dataflow streaming pipeline, or would a Cloud Function answer your need? What exactly is your issue, that files get read again when you restart your pipeline?
You can, as suggested by danielm, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, then all files matching it will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline you will again read all the files matching the pattern.
If you want to avoid that, you can manually move the old files to another path between stopping the old pipeline and starting the new one.
You could also consider consuming GCS notifications on file creation with PubsubIO and using these events to know which files to process in your pipeline (a sketch follows after the folder layout below).
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You put the files to be processed in the input folder, and inside your pipeline you move each file to its corresponding state folder.
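As a hedged sketch of the notification-based approach (the subscription name is a placeholder; GCS OBJECT_FINALIZE notifications expose the bucket and object name as the bucketId and objectId message attributes):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

// Turn each file-creation notification into the gs:// path of the newly created file
PCollection<String> newFilePaths = p
    .apply(PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my-project/subscriptions/gcs-file-notifications"))
    .apply(MapElements.into(TypeDescriptors.strings())
        .via(msg -> "gs://" + msg.getAttribute("bucketId") + "/" + msg.getAttribute("objectId")));
// newFilePaths can then be fed to FileIO.matchAll() + FileIO.readMatches() + TextIO.readFiles()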

How to predownload a transformers model

I want to perform a text generation task in a Flask app and host it on a web server. However, when downloading the GPT models, the Elastic Beanstalk-managed EC2 instance crashes because the download takes too much time and memory.
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
These are the lines causing the issue. The GPT model is approximately 445 MB, and I am using the transformers library. Instead of downloading the model at this point, I was wondering if I could pickle the model and then bundle it as part of the repository. Is that possible with this library? Otherwise, how can I preload this model to avoid the issues I am having?
Approach 1:
Search for the model here: https://huggingface.co/models
Download the model files from these links:
pytorch-model: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-pytorch_model.bin
tensorflow-model: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5
The config file: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json
Source: https://huggingface.co/transformers/_modules/transformers/configuration_openai.html#OpenAIGPTConfig
You can manually download the model (in your case the TensorFlow model .h5 and the config.json file) and put it in a folder (let's say model) in the repository. (You can try compressing the model and then decompressing it once it's on the EC2 instance, if needed.)
Then you can load the model in your web server directly from that path instead of downloading it (the model folder contains the .h5 and config.json):
model = TFOpenAIGPTLMHeadModel.from_pretrained("model")
# model folder contains .h5 and config.json
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
# this is a light download
Approach 2:
Instead of using links to download, you can download the model in your local machine using the conventional method.
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
This downloads the model. Now you can save the weights in a folder using the save_pretrained function.
model.save_pretrained('/content/') # saving inside content folder
Now, the content folder should contain a .h5 file and a config.json.
Just upload them to the repository and load from that folder.
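Putting both steps together, a rough sketch (the folder name model is just an example; note that the tokenizer also needs to be saved if you want to load it offline as well):

from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel

# 1) On a machine with enough bandwidth/memory: download once, then save locally
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model.save_pretrained("model")      # writes tf_model.h5 and config.json
tokenizer.save_pretrained("model")  # writes the vocabulary/merges files

# 2) In the Flask app: load from the bundled folder, no network download
model = TFOpenAIGPTLMHeadModel.from_pretrained("model")
tokenizer = OpenAIGPTTokenizer.from_pretrained("model")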
Open https://huggingface.co/models and search for the model you want. Click on the model name and finally click on "List all files in model". You will get a list of the files you can download.

Determine if files are part of any package

Given a list of files, e.g. foo/src/main.cpp, foo/src/bar.cpp, foo/README.md, is it possible to determine which of those files are part of a Bazel package?
In my example the output would be foo/src/main.cpp, foo/src/bar.cpp, since the README.md is not part of the build.
One way to do this would be to call bazel query on each file and see if it results in an output, but that is quite inefficient and so I was wondering if there is an easier way.
Background: I am trying to determine whether changes in a set of files have an impact on a target, and I want to use bazel query somepath(//some/target, set($FILES)) for that, but this will fail if any of the files in $FILES is not part of a BUILD file.
How about flipping it around and querying for all the source files of the target with:
bazel query 'kind("source file", deps(//some:target))'
and then checking whether the result contains any of the files in the set.
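A hedged sketch of that check (changed_files.txt is assumed to contain the changed files as absolute paths, one per line):

# List every source file of the target as an absolute path
# (--output=location prints "path:line:col: source file //label")
bazel query 'kind("source file", deps(//some:target))' --output=location \
  | cut -d: -f1 | sort -u > target_sources.txt

# The intersection is the set of changed files that are actually inputs of the target
grep -Fxf target_sources.txt changed_files.txt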

Git commit issues with multiple files having same name

I am working on a project that contains multiple files with the same file name. I am using Git to maintain local versions of my changes. After staging the modified files, I notice that files with the same name appear with status "R", implying that one file replaces another with the same name in a different directory tree. How do I make sure both are committed without one being replaced by the other? I could not find relevant help on this in any of the Git documentation.
Since this is a proprietary code, I am pasting only sample directory structure:
M <Proj_Root_Folder>/<dirA>/<dirAA>/file1.h
M <Proj_Root_Folder>/<dirA>/<dirAA>/file2.h
M <Proj_Root_Folder>/<dirA>/<dirAA>/file3.h
R <Proj_Root_Folder>/<dirB>/<dirBA>/file4.h -> <Proj_Root_Folder>/<dirA>/<dirAA>/file4.h
In git "R" means "rename". Git thinks that ///file4.h is a file that has been moved from where it was originally. Most likely because the file looks simmer.
