Parsing PDF by line

I've been able to parse a PDF by page multiple ways, the latest being this (not my code):
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "oldy.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
$strategy = new-object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
$currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
[string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default, [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
I found a post here that suggested using LocationTextExtractionStrategy instead and splitting the text into lines on '\n'.
However, I will admit that the .NET code here is confusing me, and I'm not sure how to modify it to parse by line.
Can anyone help?
Thanks.

Only a first experiment, but it works as expected:
# Download http://sourceforge.net/projects/itextsharp/
Add-Type -Path itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList MyFile.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
# extract a page and split it into lines
$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)
Write-Host "Page $($page) contains $($text.Length) lines. This is line 5:"
Write-Host $text[4]
#foreach ($line in $text)
#{
# any tasks
#}
}
$reader.Close()
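The commented-out loop above is where the per-line work goes. A minimal sketch of that step, assuming the page text has already been extracted into a string: Get-TextLine is a hypothetical helper name, and the sample string only stands in for the result of GetTextFromPage.

```powershell
# Hypothetical helper: split extracted page text into lines.
# Splits on LF and trims a trailing CR so Windows-style line endings also work.
function Get-TextLine {
    param([string]$PageText)
    foreach ($line in $PageText.Split([char]0x000A)) {
        $line.TrimEnd([char]0x000D)
    }
}

# Sample text standing in for the output of GetTextFromPage
$sample = "Invoice 42`r`nCustomer: ACME`nTotal: 10.00"
$lines = @(Get-TextLine -PageText $sample)

# Keep only the lines that contain a keyword
$hits = $lines | Where-Object { $_ -match 'Total' }
$hits
```

Inside the page loop, the same helper could be fed the string returned for each page.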

Related

I need to upload a file and take the file name from the address

The code below opens an initial window where you enter the number of files you are uploading; in the next window you select the files to upload. There you see the full path of each selected file, but I need only the file name, NOT the full path that is shown, e.g. System.Windows.Forms.TextBox, Text: C:\Users\Gourav Bid\Desktop\pwd.txt. I only need to extract pwd.txt.
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")
###### Initial Form Design ######
$objForm = New-Object System.Windows.Forms.Form
$objForm.Text = "File Upload"
$objForm.Size = New-Object System.Drawing.Size(350,200)
$objForm.StartPosition = "CenterScreen"
$objForm.KeyPreview = $True
$objForm.Add_KeyDown({if ($_.KeyCode -eq "Escape")
{$objForm.Close()}})
###### Label & TextBox Design ######
$objLabel = New-Object System.Windows.Forms.Label
$objLabel.Location = New-Object System.Drawing.Point(40,20)
$objLabel.Size = New-Object System.Drawing.Size(280,20)
$objLabel.Text = "Enter The Number Of Artefacts File Required :"
$objForm.Controls.Add($objLabel)
$objTextBox = New-Object System.Windows.Forms.TextBox
$objTextBox.Location = New-Object System.Drawing.Point(40,40)
$objTextBox.Size = New-Object System.Drawing.Size(260,20)
$objForm.Controls.Add($objTextBox)
###### Operation Buttons & Cancel Button Design ######
$OKButton = New-Object System.Windows.Forms.Button
$OKButton.Location = New-Object System.Drawing.Point(40,80)
$OKButton.Size = New-Object System.Drawing.Size(70,20)
$OKButton.Text = "OK"
$OKButton.Add_Click{(Button), $objForm.Close()}
$objForm.Controls.Add($OKButton)
$CancelButton = New-Object System.Windows.Forms.Button
$CancelButton.Location = New-Object System.Drawing.Point(120,80)
$CancelButton.Size = New-Object System.Drawing.Size(70,20)
$CancelButton.Text = "Cancel"
$CancelButton.Add_Click({$objForm.Close()})
$objForm.Controls.Add($CancelButton)
function Button {
[int]$artefacts = $objTextBox.Text
$d = 100 + $artefacts*30
$OKLoc = $d-70
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")
$Form1 = New-Object System.Windows.Forms.Form
$Form1.Size = New-Object System.Drawing.Size(400,$d)
$Form1.Text = "Select Artefacts Files"
$Form1.StartPosition = "CenterScreen"
$Form1.AutoScroll = $True
$Form1.KeyPreview = $True
############################################## Buttons ##############################################
$OKButton1 = New-Object System.Windows.Forms.Button
$OKButton1.Location = New-Object System.Drawing.Size(20,$OKLoc)
$OKButton1.Size = New-Object System.Drawing.Size(75,23)
$OKButton1.Text = "OK"
$OKButton1.Add_Click({$x=$objTextBox.Text;$Form1.Close();{split};Write-Host "$arte1"})
$Form1.Controls.Add($OKButton1)
$CancelButton1 = New-Object System.Windows.Forms.Button
$CancelButton1.Location = New-Object System.Drawing.Size(150,$OKLoc)
$CancelButton1.Size = New-Object System.Drawing.Size(75,23)
$CancelButton1.Text = "Cancel"
$CancelButton1.Add_Click({$Form1.Close()})
$Form1.Controls.Add($CancelButton1)
############################################## Buttons ##############################################
if ($artefacts -eq 1) {
$Label = New-Object System.Windows.Forms.Label
$Label.Location = New-Object System.Drawing.Size(15,20)
$Label.Size = New-Object System.Drawing.Size(50,20)
$Label.Text = "File 1"
$Form1.Controls.Add($Label)
$arte = New-Object System.Windows.Forms.TextBox
$arte.Location = New-Object System.Drawing.Size(80,15)
$arte.Size = New-Object System.Drawing.Size(280,20)
$Form1.Controls.Add($arte)
}
$Button = New-Object System.Windows.Forms.Button
$Button.Location = New-Object System.Drawing.Size(270,$OKLoc)
$Button.Size = New-Object System.Drawing.Size(75,23)
$Button.Text = "Browse"
$Button.Add_Click({Button})
#Write-Host $test
$Form1.Controls.Add($Button)
#####################################################Function Button#####################################################
function Read-OpenFileDialog([string]$WindowTitle, [string]$InitialDirectory, [string]$Filter = "All files (*.*)|*.*", [switch]$AllowMultiSelect) {
Add-Type -AssemblyName System.Windows.Forms
$openFileDialog = New-Object System.Windows.Forms.OpenFileDialog
$openFileDialog.Title = $WindowTitle
if (![string]::IsNullOrWhiteSpace($InitialDirectory)) {
$openFileDialog.InitialDirectory = $InitialDirectory
}
$openFileDialog.Filter = $Filter
if ($AllowMultiSelect) {
$openFileDialog.MultiSelect = $true
}
$openFileDialog.ShowHelp = $true
# Without this line the ShowDialog() function may hang depending on system configuration and running from console vs. ISE.
$openFileDialog.ShowDialog() > $null
if ($AllowMultiSelect) {
return $openFileDialog.Filenames
} else {
return $openFileDialog.Filename
}
}
function Button {
for ($a=0;$a -lt $artefacts;$a++) {
$b = $a+1
$arte[$a].Text = Read-OpenFileDialog -WindowTitle "Select File $b" -InitialDirectory "C:\" -Filter "ALL files (*.*)|*.*"
}
Write-Host "$arte"
$arte1=$arte
}
function Split {
$arte1.split('\')[1]#.split(',')[0]
Write-Host "$arte1"
}
#####################################################Function Button
if ($artefacts -gt 1) {
$b = 20
$c = 15
for ($a=0;$a -lt $artefacts;$a++) {
$d = $a + 1
$Label = New-Object System.Windows.Forms.Label
$Label.Location = New-Object System.Drawing.Size(15,$b)
$Label.Size = New-Object System.Drawing.Size(50,20)
$Label.Text = "File $d"
$Form1.Controls.Add($Label)
[Array]$arte += New-Object System.Windows.Forms.TextBox
$arte[$a].Location = New-Object System.Drawing.Size(80,$c)
$arte[$a].Size = New-Object System.Drawing.Size(280,20)
$Form1.Controls.Add($arte[$a])
$b = $b+30
$c = $c+30
}
}
$Form1.Add_Shown({$Form1.Activate()})
[void] $Form1.ShowDialog()
}
$objForm.Topmost = $false
$objForm.Add_Shown({$objForm.Activate()})
[void] $objForm.ShowDialog()
You can use Split-Path if you only need the file name:
Split-Path "C:\Users\Gourav Bid\Desktop\pwd.txt" -Leaf
outputs
pwd.txt
In your example, you can replace
Write-Host "$arte"
with
$arte | % { Write-Host (Split-Path $_.Text -Leaf) }
This iterates over $arte and extracts the file name from the .Text property of each TextBox.
Write-Host "$arte" prints the whole TextBox object; you need to specify which property you want, in this case .Text.
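To try the file-name extraction on its own, here is a small sketch. The path is a sample built from the temp directory (an assumption, so that it runs on any machine); [System.IO.Path]::GetFileName is shown as a .NET alternative to Split-Path.

```powershell
# Sample full path, standing in for the value of a TextBox .Text property
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'pwd.txt'

# Cmdlet approach: keep only the last path element
$leaf = Split-Path $path -Leaf

# .NET approach: same result for a plain file path
$name = [System.IO.Path]::GetFileName($path)

$leaf
$name
```

Neither call needs the file to exist; both only operate on the path string.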

WebClient.Downloadfile API is not working in powershell

(new-object System.Net.WebClient).Downloadfile("https://www.dropbox.com/sh/tsyz48qg0rq3smz/QAstBLgPgN/version.txt", "C:\Users\Brangle\Desktop\version.txt") downloads invalid data.
I need to download the version.txt file, but what actually ends up in version.txt at the destination is some XML content.
Thanks in advance.
You are downloading the Dropbox page that presents your file in Dropbox-themed HTML, not the file itself. You need to extract the real URL, which you can do with the following code:
$wc = New-Object system.net.webclient;
$s = $wc.downloadString("https://www.dropbox.com/sh/tsyz48qg0rq3smz/QAstBLgPgN/version.txt");
$r = [regex]::matches($s, "https://.*token_hash.*(?=`")");
$realURL = $r[$r.count-1].Value;
$wc.Downloadfile($realURL, "U:\version.txt");
The regex looks for a URL that starts with https://, has the string token_hash in the middle, and ends just before a double-quote character ". The line in question is:
FilePreview.init_text("https://dl.dropboxusercontent.com/sh/tsyz48qg0rq3smz/QAstBLgPgN/version.txt?token_hash=AAEGxMpsE-T4xodBPd3A6uPTCr0uqh7h4B2YUSmTDJHmjg", 0, null, 0)
Hope this helps.
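The regex can be checked against the quoted line without any network access. In this sketch the first pattern is the one from the answer; the second, which never crosses a double quote, is an alternative added here for illustration and is not part of the original answer.

```powershell
# Line taken from the page source quoted above
$line = 'FilePreview.init_text("https://dl.dropboxusercontent.com/sh/tsyz48qg0rq3smz/QAstBLgPgN/version.txt?token_hash=AAEGxMpsE-T4xodBPd3A6uPTCr0uqh7h4B2YUSmTDJHmjg", 0, null, 0)'

# Greedy pattern from the answer: runs up to the last double quote on the line
$greedy = [regex]::Matches($line, 'https://.*token_hash.*(?=")')
$greedyUrl = $greedy[$greedy.Count - 1].Value

# Alternative (assumption): stop at the first double quote after the URL
$strictUrl = [regex]::Match($line, 'https://[^"]*token_hash[^"]*').Value

$greedyUrl
$strictUrl
```

On this line both patterns give the same URL. The stricter one may be safer if a page ever puts several quoted strings on one line.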
Here is the function:
function download-dropbox($Url, $FilePath) {
$wc = New-Object system.net.webclient
$req = [System.Net.HttpWebRequest]::Create($Url)
$req.CookieContainer = New-Object System.Net.CookieContainer
$res = $req.GetResponse()
$cookies = $res.Cookies | % { $_.ToString()}
$cookies = $cookies -join '; '
$wc.Headers.Add([System.Net.HttpRequestHeader]::Cookie, $cookies)
$newurl = $url + '?dl=1'
mkdir (Split-Path $FilePath) -force -ea 0 | out-null
$wc.downloadFile($newurl, $FilePath)
}
Re: LogMeIn - they use cookie-based authentication, so you can't use the previous code. Try this; it gets a cookie from the first response and then uses that to download with WebClient:
$url = "https://secure.logmein.com/fileshare.asp?ticket=01_L5vwmOrmsS3mnxPO01f5FRbWUwVKlfheJ5HsfpTV"
$wc = New-Object system.net.webclient
$req = [System.Net.HttpWebRequest]::Create($url)
$req.CookieContainer = New-Object System.Net.CookieContainer
$res = $req.GetResponse()
$cookie = $res.Cookies.Name + "=" + $res.Cookies.Value
$wc.Headers.Add([System.Net.HttpRequestHeader]::Cookie, $cookie)
$newurl = $url + "`&download=1"
$wc.downloadFile($newurl, "c:\temp\temp.zip")

Combining Email code with if statement

I'm new to Powershell and am having trouble joining together two scripts I have.
What I want to do is check the length of all the csv files within a particular folder and if any of them are 0 Kb, I want to send off an alert email. So far I have a script which sends an email successfully and I have a script which checks the size successfully, but I am having trouble joining the two together.
Ideally it would send the name of the files which are empty in the body of the email.
The code below checks the file size and if it is greater than 1Kb it returns true.
$file = 'FilePath\File1.csv'
$Result = if (Test-Path $file) { (Get-Item $file).length -gt 1kb }
if ($Result -eq "True") {"File1.csv Contains Data"} ELSE {"File1.csv is Empty!"}
$file = 'FilePath\File2.csv'
$Result = if (Test-Path $file) { (Get-Item $file).length -gt 1kb }
if ($Result -eq "True") {"File2.csv Contains Data"} ELSE {"File2.csv is Empty!"}
$file = 'FilePath\File3.csv'
$Result = if (Test-Path $file) { (Get-Item $file).length -gt 1kb }
if ($Result -eq "True") {"File3.csv Contains Data"} ELSE {"File3.csv is Empty!"}
$file = 'FilePath\File4.csv'
$Result = if (Test-Path $file) { (Get-Item $file).length -gt 1kb }
if ($Result -eq "True") {"File4.csv Contains Data"} ELSE {"File4.csv is Empty!"}
$file = 'FilePath\FileName5.csv'
$Result = if (Test-Path $file) { (Get-Item $file).length -gt 1kb }
if ($Result -eq "True") {"File5.csv Contains Data"} ELSE {"File5.csv is Empty!"}
$file = 'FilePath\FileName6.csv'
$Result = if (Test-Path $file) { (Get-Item $file).length -gt 1kb }
if ($Result -eq "True") {"File6.csv Contains Data"} ELSE {"File6.csv is Empty!"}
Below is the email portion
$subject = "Emailtest"
$body = "test"
$emailTo = "jbloggs#Madeup.com"
$emailFrom ="JohnSmith#123.com"
$smtpServer = "mail.madeup.com"
$smtp = new-object Net.Mail.SmtpClient($smtpServer)
$credentials=new-object system.net.networkcredential("username","password")
$smtp.credentials=$credentials.getcredential($smtpserver,"25","basic")
$smtp.Send($emailFrom, $emailTo, $subject, $body)
Thank you for any help.
That's an awful lot of (manual) work just to check for empty files. What happens when you add a seventh - do you have to edit the script?
# $FilePath is the folder that holds the CSV files
$EmptyFiles = (Get-ChildItem -Path $FilePath -Filter *.csv |
Where-Object {$_.Length -eq 0} | Select-Object -ExpandProperty Name)
# Only build and send the alert when at least one empty file was found
if ($EmptyFiles) {
$MsgBody = "The following files are empty:"
$EmptyFiles | ForEach-Object {$MsgBody += "`n$_"}
$MsgBody # Just to output to console
$secpasswd = ConvertTo-SecureString "password" -AsPlainText -Force
$credentials = New-Object System.Management.Automation.PSCredential ("username", $secpasswd)
$subject = "Emailtest"
$emailTo = "jbloggs#Madeup.com"
$emailFrom = "JohnSmith#123.com"
$smtpServer = "mail.madeup.com"
# -From is mandatory for Send-MailMessage
Send-MailMessage -SmtpServer $smtpServer -Subject $subject -From $emailFrom -To $emailTo -Credential $credentials -Body $MsgBody
}
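The empty-file check can be tried without a mail server. This is only a sketch: the temp folder and the two sample files are made up for the demonstration, and the place where the mail would be sent is marked with a comment.

```powershell
# Build a throwaway folder with one empty and one non-empty CSV (sample data)
$folder = Join-Path ([System.IO.Path]::GetTempPath()) ("csvcheck_" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $folder | Out-Null
New-Item -ItemType File -Path (Join-Path $folder 'File1.csv') | Out-Null   # 0 bytes
Set-Content -Path (Join-Path $folder 'File2.csv') -Value 'a,b,c'

# Names of all CSV files with a length of zero bytes
$emptyFiles = @(Get-ChildItem -Path $folder -Filter *.csv |
    Where-Object { $_.Length -eq 0 } |
    Select-Object -ExpandProperty Name)

if ($emptyFiles.Count -gt 0) {
    $msgBody = "The following files are empty:`n" + ($emptyFiles -join "`n")
    $msgBody   # this is where Send-MailMessage would be called
} else {
    'No empty files found.'
}

# Clean up the sample folder
Remove-Item -Path $folder -Recurse -Force
```

Wrapping the result in @() keeps $emptyFiles an array even when only one file matches, so .Count behaves the same in every case.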

PowerShell parsing a PDF and extracting multiple lines

I'm using iTextSharp to search a PDF for a keyword and extract any line(s) that contain it. What I'd like to do is extract not only the line(s) with the keyword but also the lines that follow.
Line with keyword and the next line, line with keyword and the next 2 lines, etc.
I've been hung up on this for a while, trying arrays, hash tables, iterators... none of them work right. Any help is appreciated. This is the basic design I've been working with:
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList anypdf.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
$lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
foreach ($line in $lines) {
if ($line -match $searchstring) {
$line = $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
$line = $line -replace "\\([\S])", '$1'
Write-host $line
}
}
}
I can't take credit for the logic that strips out the unwanted characters from the PDF, and that may be why I haven't figured this out yet. The above code gets me any line that contains the keyword. The problem seems to be the PDF is split into pages and those pages are split into lines (which are each an array of characters). It would be nice and efficient if I could simply create a hash table of every line in the PDF from the start.
That's what Select-String was invented for.
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
[char[]]$reader.GetPageContent($page) -join "" -split "`n" `
| Select-String $searchstring -Context 0,2 `
| % {
$_ -replace "^\[\(|\)\]TJ$", "" `
-split "\)\-?\d+\.?\d*\(" -join "" `
-replace "\\([\S])", '$1'
}
}
I don't quite understand all the splitting and joining and replacing you're doing there, so you may need to adjust that.
Also, the above doesn't include the after context, since I wouldn't know where you want it to go. It can be accessed via $_.Context.PostContext.
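To see how the match and its after context fit together, here is a small sketch on plain strings instead of PDF content. The sample lines are made up; .Line and .Context.PostContext are properties of the MatchInfo objects that Select-String returns.

```powershell
# Sample lines standing in for the lines of one PDF page
$lines = 'alpha', 'keyword here', 'first after', 'second after', 'omega'

$results = $lines | Select-String -Pattern 'keyword' -Context 0,2 | ForEach-Object {
    # .Line is the matching line; .Context.PostContext holds the lines after it
    @($_.Line) + $_.Context.PostContext
}
$results
```

Changing the second number in -Context changes how many following lines are returned with each match.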

How to parse PDF content to database with powershell

I have a pdf document that I would like to extract content out of. The issue I am having is this... I search for the IMEI keyword, and it finds it, but I need the actual IMEI value which is the next item in the loop.
In the PDF the value looks like this:
IMEI 90289393092
The script below returns this instead:
-0.1 -8.8 9.8 -0.1 446.7 403.9 Tm (IMEI:) Tj
I only want to have the value:
90289393092
Script I am using:
Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\PDF\DOC001.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
$lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
foreach ($line in $lines) {
if ($line -match "IMEI") {
$line = $line -replace "\\([\S])", '$1'
$line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
}
}
}
This is how to use itextsharp.dll to read a PDF as plain text:
Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList c:\ps\a.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
$strategy = new-object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
$currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
[string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default , [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
$Reader.Close();
And this may be the regex you need, but I haven't tested it:
[regex]::matches( $text, '(?<=IMEI\s+)(\d+)(?=\s+)' ) | select -expa value
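A quick way to test such a pattern is to run it against a sample string first. The sketch below uses a variation of the regex above, which is an assumption and not the answer's exact pattern: it allows an optional colon after IMEI (the raw content stream shows "IMEI:") and drops the trailing whitespace requirement so a value at the end of the text still matches.

```powershell
# Sample text standing in for the extracted PDF text
$text = 'Model X100 IMEI 90289393092 Colour Black'

# Digits that follow "IMEI" or "IMEI:" and at least one whitespace character
$imei = [regex]::Match($text, '(?<=IMEI:?\s+)\d+').Value
$imei

# Same pattern with a colon and the value at the very end of the text
$imei2 = [regex]::Match('IMEI: 12345', '(?<=IMEI:?\s+)\d+').Value
$imei2
```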

Resources