Incorrect characters in document properties from PDF files
Displaying or printing content with a font on a PDF page can use a variety of encodings. Extracting text from a PDF page requires a mapping from each encoded page character to a Unicode code point.
To extract content from a PDF file as Unicode, fonts include a table that maps each PDF character to its Unicode equivalent. Some fonts include multiple look-alike characters. For example, a font might include a dash, a minus sign, and a hyphen. Though they seem very similar, each one is a different character and is drawn slightly differently. Each one also has a different Unicode code point. The mapping table determines which Unicode character is used.
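As a small illustration of the look-alike problem, the following Python sketch prints the code point and official Unicode name of four characters that are easy to confuse on the page. The particular characters chosen are only examples; the function name describe is hypothetical, and a given font may map to others.

```python
import unicodedata

# Example look-alike characters: the ASCII hyphen-minus plus three
# visually similar Unicode characters.
LOOKALIKES = ["\u002d", "\u2010", "\u2013", "\u2212"]


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in LOOKALIKES:
    print(describe(ch))
# U+002D HYPHEN-MINUS
# U+2010 HYPHEN
# U+2013 EN DASH
# U+2212 MINUS SIGN
```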
When Ricoh ProcessDirector extracts document property values from a PDF file using a control file created in Ricoh ProcessDirector Plug-in for Adobe Acrobat, it reads the value in Unicode. The value is then recorded in the Document Properties File (DPF), which requires data to be encoded in UTF-8. UTF-8 uses multi-byte sequences to represent Unicode code points outside the ASCII range. As a result, each character in the value is stored as its Unicode equivalent when it is added to the DPF.
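To make the multi-byte behavior concrete, this Python sketch shows the UTF-8 bytes for an ASCII hyphen-minus and for two look-alike characters outside the ASCII range. It only demonstrates UTF-8 itself; it does not read or write a DPF, and the helper name utf8_hex is hypothetical.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


print(utf8_hex("-"))       # 2d        (ASCII: one byte)
print(utf8_hex("\u2010"))  # e2 80 90  (HYPHEN: three bytes)
print(utf8_hex("\u2212"))  # e2 88 92  (MINUS SIGN: three bytes)
```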
Problems can occur when values from the DPF are written back into the PDF file. If the Unicode characters do not have PDF equivalents in the font, incorrect characters are inserted. These problems occur most often with subsetted and Identity-H fonts. Additional problems can occur when you search for specific characters but the Unicode code points in the DPF are not the characters you expect.
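One way to see why a search fails is to list the non-ASCII characters that a property value actually contains. The Python sketch below does that for a made-up value; the value and the function name non_ascii_report are hypothetical and are not part of Ricoh ProcessDirector.

```python
import unicodedata


def non_ascii_report(value):
    """List (position, code point, name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(value)
        if ord(ch) > 127
    ]


# Hypothetical extracted value: it looks like "INV-2024" but contains
# U+2010 HYPHEN instead of the ASCII hyphen-minus.
value = "INV\u20102024"

print(value == "INV-2024")      # False
print(non_ascii_report(value))  # [(3, 'U+2010', 'HYPHEN')]
```

A comparison against the ASCII spelling fails even though the two strings look the same, and the report shows which code point is responsible.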
The ideal solution is to update the input PDF file so that it includes complete fonts instead of subsets. Another option is to add a step to your workflow that corrects the DPF properties. You can use the native2ascii utility to normalize the DPF to an ASCII character encoding. The Unicode code points that were encoded as UTF-8 are normalized to the form \u####. You can then use an editor or a filter script to change each problem character from its \u#### form to the ASCII character that is required. After the ASCII version of the DPF is updated, use the native2ascii utility again to convert the DPF back to the required UTF-8 encoding.
The native2ascii utility converts text to Latin-1 characters and Unicode escape sequences in the form \u####. It is shipped with Ricoh ProcessDirector.
- On AIX and Linux, the native2ascii utility is stored in:
- On Windows, the native2ascii.exe utility is stored in:
The utility is also provided with the Java Development Kit, which you can download from this site: http://www.oracle.com/technetwork/java/javase/downloads
Instructions for using the utility (for Java 6) are here: http://download.oracle.com/javase/6/docs/technotes/tools/#intl
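The normalize, correct, and restore steps described in this topic can be sketched in Python as follows. This is only an illustration of the idea, not the native2ascii implementation: the functions emulate the \u#### conversion in both directions, and the FIXES table, the function names, and the sample value are all hypothetical. The sketch works on a single string and makes no assumptions about the layout of a DPF.

```python
import re


def to_ascii_escapes(text):
    """Write every non-ASCII character as one or more \\u#### escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # UTF-16 code units, so characters above U+FFFF become
            # a surrogate pair of two escapes.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def from_ascii_escapes(text):
    """Turn \\u#### escapes back into the characters they represent."""
    decoded = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Recombine surrogate pairs into single characters.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")


# Hypothetical correction table: escape sequence -> required ASCII character.
FIXES = {"\\u2010": "-", "\\u2212": "-"}


def fix_value(value):
    """Normalize to escapes, apply the corrections, and convert back."""
    escaped = to_ascii_escapes(value)
    for escape, replacement in FIXES.items():
        escaped = escaped.replace(escape, replacement)
    return from_ascii_escapes(escaped)


print(to_ascii_escapes("INV\u20102024"))  # INV\u20102024
print(fix_value("INV\u20102024"))         # INV-2024
```

Working on the escaped form keeps the correction step in plain ASCII, so the replacement can be done with any editor or simple filter. Characters that are not listed in the correction table pass through the round trip unchanged.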