Parsing PDF files using SAP Conversion Agent

Part 1: Development in PI Design (IR)
Part 2: Development in Conversion Agent Studio
Part 3: Deploying Conversion Agent Project
Part 4: Development in PI Configuration (ID)
Part 5: Test the Scenario

Development in Conversion Agent Studio

Open Conversion Agent (CA) Studio and create a new project of type Parser, as shown.

Select Parser as the project type.

Enter the name of the project.

Enter the name of the parser.

Enter the name of the parser script.

Select the schema.

Select the File option.

Select the PDF file.

Select the PDF option.

Select PDF to Unicode (UTF-8).

Select Custom Format.

Click Finish.

In Conversion Agent Explorer, expand the project node, double-click the BillParser_Script.tgp file, and start development as shown below:

Review Method-1 and Method-2 below and choose either one; Method-2 is the recommended approach.

Method-1: Select the text and drag it directly to an element in the Schema

Select the text in the Source view and drag it to an element in the Schema view.

Here, the text “Mr. SHATAVAHANA CHANDA” is selected in the source PDF file and dragged to the element ‘Name’ in the Schema view.

Similarly, drag the remaining address content:

Select the text “FUJITSU CONSULTING INDIA PVT LTD,” and drag it to the element ‘Company’ in the schema.

Select the text “# 703-704, ADITYA TRADE CENTER,” and drag it to the element ‘Door’ in the schema.

Select the text “AMEERPET.BEHIND INDU PUBLICE SCHOOL” and drag it to the element ‘Street’ in the schema.

Select the text “HYDERABAD 500016” and drag it to the element ‘City’ in the schema.

At Mark1 in the script above: if the parameter default_transformers = RemoveMarginSpace() is set, the unwanted margin spaces are removed. This makes the file easier to parse.

At Mark2: Content is used to parse content from the file and has options such as:

A. (opening_marker, closing_marker), which accept numeric values and restrict parsing to a range. For example, with (opening_marker=0, closing_marker=10), only the content within that range is parsed. The next scenario covers this in more detail.

B. In the current scenario we have chosen value = LearnByExample("<Text selected from File>").

C. data_holder is mandatory here, because it specifies the schema path of the element that receives the selected content.
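
The interplay of options A and C can be pictured with a short Python sketch. This is not Conversion Agent script syntax: the function `extract_content`, the schema path `/Bill/Name`, and the padded sample line are hypothetical, and the sketch assumes the two markers behave like character offsets counted from the start of the text being searched.

```python
def extract_content(text, opening_marker, closing_marker, data_holder, result):
    """Copy the text between two offsets into `result` under the schema path.

    Assumption: markers are plain character offsets into `text`.
    """
    value = text[opening_marker:closing_marker].strip()
    result[data_holder] = value
    return result


# Hypothetical line from the converted PDF, padded with trailing spaces.
source_text = "Mr. SHATAVAHANA CHANDA          "

# Offsets 0..22 cover the name; "/Bill/Name" is an assumed schema path.
parsed = extract_content(source_text, 0, 22, "/Bill/Name", {})
print(parsed)
```

The point of the sketch is only that the offsets decide which text is read and data_holder decides where it goes; neither depends on what the text actually says.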

Method-2 (options A and C) is preferable to Method-1 (options B and C). Method-1 works as long as the customer name is Mr. SHATAVAHANA CHANDA, but in real cases Vodafone has many customers. If all of their bills are to be processed, the customer name and address differ from bill to bill, and the parser fails with an error saying it cannot parse the file. It is therefore better to follow Method-2, as shown below.
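
The reasoning above can be illustrated with a simplified Python model. It is a sketch of the argument, not of how Conversion Agent works internally: both functions, the second customer name, and the fixed 30-character field width are assumptions made for the example.

```python
def parse_by_example(line, learned_text):
    """Method-1 style (simplified): succeed only if the learned text is present."""
    if learned_text not in line:
        raise ValueError("can't parse the file: expected text not found")
    return learned_text


def parse_by_offset(line, start, end):
    """Method-2 style (simplified): take whatever sits between two offsets."""
    return line[start:end].strip()


bills = [
    "Mr. SHATAVAHANA CHANDA        ",
    "Ms. EXAMPLE CUSTOMER          ",  # hypothetical second customer
]

for line in bills:
    # The offset-based parser does not care which name is on the bill.
    print(parse_by_offset(line, 0, 30))

try:
    # The example-based parser was taught one specific name and rejects others.
    parse_by_example(bills[1], "Mr. SHATAVAHANA CHANDA")
except ValueError as err:
    print("Method-1 failed:", err)
```

Under these assumptions the positional approach handles both bills, while the approach tied to one customer's text fails on the second bill.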

Method-2: Parsing using Insert Offset Content

To avoid that problem, follow step 2 and step 3 as shown.

After setting default_transformers = RemoveMarginSpace(), go to the Source view on the right, select the text, right-click, and choose Insert Offset Content.