PowerShell parsing a PDF and extracting multiple lines

I'm using iTextSharp to search a PDF for a keyword, and extract any line(s) that contain that keyword. What I'd like to do is not only extract the line(s) with the keyword but subsequent lines.
Line with keyword and the next line, Line with keyword and the next 2 lines, etc.
I've been hung up on this for a while, trying arrays, hash tables, iterators... none of them are working right. Any help is appreciated. This is the basic design I've been working with:
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList anypdf.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
$lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
foreach ($line in $lines) {
if ($line -match $searchstring) {
$line = $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
$line = $line -replace "\\([\S])", $matches[1]
Write-host $line
}
}
}
I can't take credit for the logic that strips out the unwanted characters from the PDF, and that may be why I haven't figured this out yet. The above code gets me any line that contains the keyword. The problem seems to be the PDF is split into pages and those pages are split into lines (which are each an array of characters). It would be nice and efficient if I could simply create a hash table of every line in the PDF from the start.

That's what Select-String was invented for.
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
[char[]]$reader.GetPageContent($page) -join "" -split "`n" `
| Select-String $searchstring -Context 0,2 `
| % {
$_ -replace "^\[\(|\)\]TJ$", "" `
-split "\)\-?\d+\.?\d*\(" -join "" `
-replace "\\([\S])", $_.Matches.Value
}
}
I don't quite understand all the splitting, joining, and replacing you're doing there, so you may need to adjust that.
Also, the above doesn't include the after context, since I wouldn't know where you want it to go. It can be accessed via $_.Context.PostContext.
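For illustration, here is a minimal sketch of emitting each matching line together with its after context. The sample lines and the value of $searchstring are made up; in the loop above the input would be the split page content instead.

```powershell
# Hypothetical stand-in for the lines of one page.
$lines = 'first line', 'line with keyword', 'following line 1', 'following line 2', 'last line'
$searchstring = 'keyword'   # assumed search term

$results = $lines | Select-String -Pattern $searchstring -Context 0,2 | ForEach-Object {
    $_.Line                  # the line that matched
    $_.Context.PostContext   # up to two lines after the match
}
$results
```

Changing the second number passed to -Context changes how many following lines are captured.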

Related

Parse and change the output of a system through Powershell

First, I should say that I have little to no experience with PowerShell so far. An upstream system generates output in the wrong format for me, so I want to use PowerShell to change it. The output from that system looks like this:
TEST1^|^9999^|^Y^|^NOT IN^|^('1','2','3')^|^N^|^LIKE^|^('4','5','6','7')^|^...^|^Y^|^NOT IN^|^('8','9','10','11','12')
TEST2^|^9998^|^Y^|^NOT IN^|^('4','5','6')^|^N^|^LIKE^|^('6','7','8','9')^|^...^|^Y^|^NOT IN^|^('1','2','15','16','17')^|^Y^|^NOT IN^|^('18','19','20','21','22')
Each line has a starting part (TEST1^|^9999^|^) followed by 1 to n tuples (example: Y^|^NOT IN^|^('1','2','3')^|^).
This is how I want it to look:
TEST1^|^9999^|^Y^|^NOT IN^|^('1','2','3')
TEST1^|^9999^|^N^|^LIKE^|^('4','5','6','7')
TEST1^|^9999^|^Y^|^NOT IN^|^('8','9','10','11','12')
TEST2^|^9998^|^Y^|^NOT IN^|^('4','5','6')
TEST2^|^9998^|^N^|^LIKE^|^('6','7','8','9')
TEST2^|^9998^|^Y^|^NOT IN^|^('1','2','15','16','17')
TEST2^|^9998^|^Y^|^NOT IN^|^('18','19','20','21','22')
So the tuples should be printed one per line, with the starting part prepended.
My approach is to do the equivalent of AWK in PowerShell, but so far I don't understand how to handle an indeterminate number of tuples and repeat the starting block.
I thank you so much in advance for your help!
I'd split the lines at ^|^ and recombine the fields of the resulting array in a loop. Something like this:
$sp = '^|^'
Get-Content 'C:\path\to\input.txt' | % {
$a = $_ -split [regex]::Escape($sp)
for ($i=2; $i -lt $a.length; $i+=3) {
"{0}$sp{1}$sp{2}$sp{3}$sp{4}" -f $a[0,1,$i,($i+1),($i+2)]
}
} | Set-Content 'C:\path\to\output.txt'
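As a quick check of the same idea without touching any files, the sketch below runs the split-and-recombine step on a single in-memory row. The sample row is shortened from the question, and the loop condition is an added guard (not part of the answer above) that skips a trailing tuple with fewer than three fields.

```powershell
$sp  = '^|^'
# Shortened sample row taken from the question.
$row = "TEST1^|^9999^|^Y^|^NOT IN^|^('1','2','3')^|^N^|^LIKE^|^('4','5','6','7')"

# -split takes a regex, so the separator has to be escaped.
$a = $row -split [regex]::Escape($sp)

# Fields 0 and 1 are the prefix; every following group of three is one tuple.
$out = for ($i = 2; $i + 2 -lt $a.Length; $i += 3) {
    ($a[0], $a[1], $a[$i], $a[$i + 1], $a[$i + 2]) -join $sp
}
$out
```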
The data looks quite regular so you could loop over it using | as the delimiter and counting the following cells in 3s:
$data = @"
TEST1^|^9999^|^Y^|^NOT IN^|^('1','2','3')^|^N^|^LIKE^|^('4','5','6','7')^|^Y^|^NOT IN^|^('8','9','10','11','12')
TEST2^|^9998^|^Y^|^NOT IN^|^('4','5','6')^|^N^|^LIKE^|^('6','7','8','9')^|^Y^|^NOT IN^|^('1','2','15','16','17')^|^Y^|^NOT IN^|^('18','19','20','21','22')
"@
$data.split("`n") | % {
$ds = $_.split("|")
$heading = "$($ds[0])|$($ds[1])"
$j = 0
for($i = 2; $i -lt $ds.length; $i += 1) {
$line += "|$($ds[$i])" -replace "\^(\((?:'\d+',?)+\))\^?",'$1'
$j += 1
if($j -eq 3) {
write-host $heading$line
$line = ""
$j = 0
}
}
}
Parsing an arbitrary-length string record into row records is quite error prone. A simple solution would be to process the data row by row and create the output.
Here is a simple illustration of how to process a single row. Processing the whole input file and writing the output is left as a trivial exercise for the reader.
$s = "TEST1^|^9999^|^Y^|^NOT IN^|^('1','2','3')^|^N^|^LIKE^|^('4','5','6','7')^|^Y^|^NOT IN^|^('8','9','10','11','12')"
$t = $s.Split([char[]]')', [StringSplitOptions]::RemoveEmptyEntries) # Split on the closing parenthesis
$testNum = ([regex]::match($t[0], "(?i)(test\d+\^\|\^\d+)")).value # Hunt for the 1st column values
$t[0] = $t[0] + ')' # Restore the split char that was removed
for($i=1;$i -lt $t.Length; ++$i) { $t[$i] = $testNum + $t[$i] + ')' } # Prepend the 1st column and restore the split char
$t
TEST1^|^9999^|^Y^|^NOT IN^|^('1','2','3')
TEST1^|^9999^|^N^|^LIKE^|^('4','5','6','7')
TEST1^|^9999^|^Y^|^NOT IN^|^('8','9','10','11','12')

Parsing a value in powershell

So I run this command
$driverRAIDv = $data|where-object{$_.Name -eq "$serverName" -and ($_.Description1 -match "hpsa")} | select -ExpandProperty version
And it returns this value:
HP HPSA Driver (v 5.0.0-28OEM)
I want to take this value/variable and parse it so that I only have 5.0.0-28OEM
Try this, matches anything between ( and ) brackets:
EDIT: ..and removes the v followed by a space:
$driverRAIDv = $data|where-object{$_.Name -eq "$serverName" -and ($_.Description1 -match "hpsa")} | select -ExpandProperty version
$regex = "(?<=\().*(?=\))"
[regex]::matches($driverRAIDv,$regex).Value -replace "v "
which returns:
5.0.0-28OEM
or you could use following regex which will match anything between (v and )
$regex="(?<=\(v\s).*(?=\))"
Give this a try. Basically, we're creating a "calculated property" that contains an expression to parse out the portion of the value that you want.
$driverRAIDv = $data|where-object{$_.Name -eq "$serverName" -and ($_.Description1 -match "hpsa")} | select -Property @{ Label = 'Version'; Expression = { [void]($_.Version -match ('v\s(.*?)\)')); $matches[1]; }; };
I tested this out using the following code:
@{ Version = 'HP HPSA Driver (v 5.0.0-28OEM)'} | select -Property @{ Label = 'Version'; Expression = { [void]($_.Version -match ('v\s(.*?)\)')); $matches[1]; }; };
If you check out the help for the Select-Object command, you will see that you can create calculated properties, which basically modify the value of a property, through a PowerShell expression. To do this, create a Hashtable that contains two items: Label and Expression. The Label can simply be set to the same property value that you're modifying. The Expression is a PowerShell ScriptBlock that performs some type of operation, and returns a result. In the example I gave above, we are running a regular expression against the Version property, and returning it as the result.
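A minimal, self-contained sketch of a calculated property follows. The $item object and its Name value are invented stand-ins for one element of $data; only the Version string comes from the question.

```powershell
# Hypothetical stand-in for one element of $data.
$item = [pscustomobject]@{
    Name    = 'server01'
    Version = 'HP HPSA Driver (v 5.0.0-28OEM)'
}

$parsed = $item | Select-Object -Property Name, @{
    Label      = 'Version'
    # Return the capture group only when the pattern actually matches.
    Expression = { if ($_.Version -match 'v\s(.*?)\)') { $Matches[1] } }
}
$parsed.Version
```

Guarding the expression with if avoids picking up a stale $Matches value from an earlier, unrelated match when the pattern does not apply.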
You have many ways to do it depending on your needs and variance of what you will get.
One way could be to use replace operator. Say, you got
$driverRAIDv="HP HPSA Driver (v 5.0.0-28OEM)"
$driverRAIDv -replace 'HP HPSA Driver \(v ','' -replace '\)',''
would get you 5.0.0-28OEM
In the same line:
$driverRAIDv = ($data|where-object{$_.Name -eq "$serverName" -and ($_.Description1 -match "hpsa")} | select -ExpandProperty version) -replace 'HP HPSA Driver \(v ','' -replace '\)',''
A non-regex option would be to split on (, ) and (space), and select the last non-empty string.
$driverRAIDv = $data|
where {$_.Name -eq $serverName -and $_.Description1 -match "hpsa"} |
foreach {$_.version.Split([char[]]'() ')| where {$_}| select -last 1}

Parsing PDF by line

I've been able to parse a PDF by page multiple ways, the latest being this (not my code):
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "oldy.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
$strategy = new-object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
$currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
[string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default, [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
I found a post here that suggested using LocationTextExtractionStrategy instead and splitting each line out by '\n'.
However, I will admit that the .NET code here is confusing me and I'm not sure how to modify it to parse by line.
Can anyone help?
thanks.
Only a first experiment, but it works as expected:
# Download http://sourceforge.net/projects/itextsharp/
Add-Type -Path itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList MyFile.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
# extract a page and split it into lines
$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)
Write-Host "Page $($page) contains $($text.Length) lines. This is line 5:"
Write-Host $text[4]
#foreach ($line in $text)
#{
# any tasks
#}
}
$reader.Close()

How to parse PDF content to database with powershell

I have a pdf document that I would like to extract content out of. The issue I am having is this... I search for the IMEI keyword, and it finds it, but I need the actual IMEI value which is the next item in the loop.
In the PDF the value looks like this:
IMEI 90289393092
but the script below returns this:
-0.1 -8.8 9.8 -0.1 446.7 403.9 Tm (IMEI:) Tj
I only want to have the value:
90289393092
Script I am using:
Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\PDF\DOC001.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
$lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
foreach ($line in $lines) {
if ($line -match "IMEI") {
$line = $line -replace "\\([\S])", $matches[1]
$line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
}
}
}
This is how to use itextsharp.dll to read a PDF as plain text:
Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList c:\ps\a.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
$strategy = new-object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
$currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
[string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default , [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
$Reader.Close();
And this may be the regex you need, but I haven't tested it:
[regex]::matches( $text, '(?<=IMEI\s+)(\d+)(?=\s+)' ) | select -expa value
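To try the lookbehind idea without a PDF, here is a small sketch against a made-up text sample. The sample text is an assumption, and the pattern is a slight variation of the one above: it tolerates an optional colon after IMEI and does not require whitespace after the digits.

```powershell
# Hypothetical extracted text; only the IMEI line mirrors the question.
$text = "Model X100`nIMEI 90289393092`nColour Black"

# .NET regex supports variable-length lookbehind, so \s+ is allowed here.
$imei = [regex]::Matches($text, '(?<=IMEI:?\s+)\d+') | ForEach-Object { $_.Value }
$imei
```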

Text parsing in Powershell: Identify a target line and parse the next X lines to create objects

I am parsing text output from a disk array that lists information about LUN snapshots in a predictable format. After trying every other way to get this data out of the array in a useable manner, the only thing I can do is generate this text file and parse it. The output looks like this:
SnapView logical unit name: deleted_for_security_reasons
SnapView logical unit ID: 60:06:01:60:52:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX
Target Logical Unit: 291
State: Inactive
This repeats all through the file with one line break between each group. I want to identify a group, parse each of the four lines, create a new PSObject, add the value for each line as a new NoteProperty, and then add the new object to a collection.
What I can't figure out is, once I identify the first line in the block of four lines, how to then process the text from lines two, three, and four. I'm looping through each line, finding the start of a block, and then processing it. Here's what I have so far, with comments where the magic goes:
$snaps = get-content C:\powershell\snaplist.txt
$snapObjects = @()
foreach ($line in $snaps)
{
if ([regex]::ismatch($line,"SnapView logical unit name"))
{
$snapObject = new-object system.Management.Automation.PSObject
$snapObject | add-member -membertype noteproperty -name "SnapName" -value $line.replace("SnapView logical unit name: ","")
#Go to the next line and add the UID
#Go to the next line and add the TLU
#Go to the next line and add the State
$snapObjects += $snapObject
}
}
I have scoured the Google and StackOverflow attempting to figure out how I can reference the line number of the object I'm iterating through, and I can't figure it out. I may rely on foreach loops too much and so that's affecting my thinking, I don't know.
As you say, I think you're thinking too much foreach when you should be thinking for. The below modification should be more along the lines of what you're looking for:
$snaps = get-content C:\powershell\snaplist.txt
$snapObjects = @()
for ($i = 0; $i -lt $snaps.length; $i++)
{
if ([regex]::ismatch($snaps[$i],"SnapView logical unit name"))
{
$snapObject = new-object system.Management.Automation.PSObject
$snapObject | add-member -membertype noteproperty -name "SnapName" -value ($snaps[$i]).replace("SnapView logical unit name: ","")
# $snaps[$i+1] Go to the next line and add the UID
# $snaps[$i+2] Go to the next line and add the TLU
# $snaps[$i+3] Go to the next line and add the State
$snapObjects += $snapObject
}
}
A while loop may be even cleaner because then you can increment $i by 4 instead of 1 when you hit this case, but since the other 3 lines won't trigger the "if" statement... there's no danger, just a few wasted cycles.
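A sketch of the indexed approach with the three follow-up lines filled in is shown below. The sample data is invented to match the format in the question, the property names follow the question's comments, and the skip-ahead by three lines is the optimization mentioned above rather than a requirement.

```powershell
# Hypothetical sample standing in for Get-Content C:\powershell\snaplist.txt
$snaps = @(
    'SnapView logical unit name: snap01'
    'SnapView logical unit ID: 60:06:01:60:52:AA'
    'Target Logical Unit: 291'
    'State: Inactive'
    ''
    'SnapView logical unit name: snap02'
    'SnapView logical unit ID: 60:06:01:60:52:BB'
    'Target Logical Unit: 292'
    'State: Active'
)

# Stop early enough that $i + 3 is always a valid index.
$snapObjects = for ($i = 0; $i + 3 -lt $snaps.Count; $i++) {
    if ($snaps[$i] -match '^SnapView logical unit name:\s*(.*)$') {
        [pscustomobject]@{
            SnapName = $Matches[1]
            UID      = $snaps[$i + 1] -replace '^SnapView logical unit ID:\s*', ''
            TLU      = $snaps[$i + 2] -replace '^Target Logical Unit:\s*', ''
            State    = $snaps[$i + 3] -replace '^State:\s*', ''
        }
        $i += 3   # skip the three lines just consumed
    }
}
$snapObjects
```

Collecting the loop output directly into $snapObjects avoids rebuilding the array with += on every block.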
Another possibility
function Get-Data {
$foreach.MoveNext() | Out-Null
$null, $returnValue = $foreach.Current.Split(":")
$returnValue
}
foreach($line in (Get-Content "C:\test.dat")) {
if($line -match "SnapView logical unit name") {
$null, $Name = $line.Split(":")
$ID = Get-Data
$Unit = Get-Data
$State = Get-Data
New-Object PSObject -Property @{
Name = $Name.Trim()
ID = ($ID -join ":").Trim()
Unit = $Unit.Trim()
State = $State.Trim()
}
}
}
Name ID Unit State
---- -- ---- -----
deleted_for_security_reasons 60:06:01:60:52:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX 291 Inactive
switch -regex -file C:\powershell\snaplist.txt {
'^.+me:\s+(\S*)' {$SnapName = $Matches[1]}
'^.+ID:\s+(\S*)' {$UID = $Matches[1]}
'^.+it:\s+(\S*)' {$TLU = $Matches[1]}
'^.+te:\s+(\S*)' {
New-Object PSObject -Property @{
SnapName = $SnapName
UID = $UID
TLU = $TLU
State = $Matches[1]
}
}
}
Try this:
Get-Content "c:\temp\test.txt" | ConvertFrom-String -Delimiter ": " -PropertyNames Intitule, Value
If you have multiple blocks, try this:
$template=@"
{Data:SnapView logical unit name: {UnitName:reasons}
SnapView logical unit ID: {UnitId:12:3456:Zz}
Target Logical Unit: {Target:123456789}
State: {State:A State}}
"@
Get-Content "c:\temp\test.txt" | ConvertFrom-String -TemplateContent $template | % {
[pscustomobject]@{
UnitName=$_.Data.UnitName
UnitId=$_.Data.UnitId
Target=$_.Data.Target
State=$_.Data.State
}
}

Resources