tesseract 4.1.1
pdfrenderer.cpp
// File:        pdfrenderer.cpp
// Description: PDF rendering interface to inject into TessBaseAPI
//
// (C) Copyright 2011, Google Inc.
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

// Include automatically generated configuration file if running autoconf.
#ifdef HAVE_CONFIG_H
#include "config_auto.h"
#endif

#include <locale>   // for std::locale::classic
#include <memory>   // std::unique_ptr
#include <sstream>  // for std::stringstream
#include "allheaders.h"
#include "baseapi.h"
#include <cmath>
#include "renderer.h"
#include <cstring>
#include "tprintf.h"

/*

Design notes from Ken Sharp, with light editing.

We think one solution is a font with a single glyph (.notdef) and a
CIDToGIDMap which maps all the CIDs to 0. That map would then be
stored as a stream in the PDF file, and when Flate compressed should
be pretty small. The font, of course, will be approximately the same
size as the one you currently use.

I'm working on such a font now. The CIDToGIDMap is trivial: you just
create a stream object which contains 128k bytes (2 bytes per possible
CID, and your CIDs range from 0 to 65535), and where you currently have
"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".

Note that if, in future, you were to use a different (i.e. not 2 byte)
CMap for character codes you could trivially extend the CIDToGIDMap.

The following is an explanation of how some of the font stuff works.
This may be too simple for you, in which case please accept my
apologies; it's hard to know how much knowledge someone has. You can
skip all this anyway, it's just for information.

The font embedded in a PDF file is usually intended just to be
rendered, but extensions allow for at least some ability to locate (or
copy) text from a document. This isn't something which was an original
goal of the PDF format, but it's been retrofitted, presumably due to
popular demand.

To do this reliably the PDF file must contain a ToUnicode CMap, a
device for mapping character codes to Unicode code points. If one of
these is present, then this will be used to convert the character
codes into Unicode values. If it's not present then the reader will
fall back through a series of heuristics to try and guess the
result. This is, as you would expect, prone to failure.

This doesn't concern you of course, since you always write a ToUnicode
CMap. Because you are writing the text in text rendering mode 3 it
would seem that you don't really need to worry about the font, but in
the PDF spec you cannot have an isolated ToUnicode CMap; it has to be
attached to a font, so in order to get even copy/paste to work you
need to define a font.

This is what leads to problems. Tools like pdfwrite assume that they
are going to be able to (or even have to) modify the font entries, so
they require that the font being embedded be valid, and to be honest
the font Tesseract embeds isn't valid (for this purpose).


To see why, let's look at how text is specified in a PDF file:

(Test) Tj

Now that looks like text but actually it isn't. Each of those bytes is
a 'character code'. When it comes to rendering the text a complex
sequence of events takes place, which converts the character code into
'something' which the font understands. It's entirely possible via
character mappings to have that text render as 'Sftu'.

For simple fonts (PostScript type 1), we use the character code as the
index into an Encoding array (256 elements), each element of which is
a glyph name, so this gives us a glyph name. We then consult the
CharStrings dictionary in the font. That's a complex object which
contains pairs of keys and values; you can use the key to retrieve a
given value. So we have a glyph name, we then use that as the key to
the dictionary and retrieve the associated value. For a type 1 font,
the value is a glyph program that describes how to draw the glyph.

For CIDFonts, it's a little more complicated. Because CIDFonts can be
large, using a glyph name as the key is unreasonable (it would also
lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
as the key. CIDs are just numbers.

But... We don't use the character code as the CID. What we do is use
a CMap to convert the character code into a CID. We then use the CID
to key the CharStrings dictionary and proceed as before. So the 'CMap'
is the equivalent of the Encoding array, but it's a more compact and
flexible representation.

Note that you have to use the CMap just to find out how many bytes
constitute a character code, and it can be variable. For example you
can say if the first byte is 0x00->0x7f then it's just one byte, if it's
0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
have seen CMaps defining character codes up to 5 bytes wide.

Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
TrueType CIDFonts. The thing is that TrueType fonts are accessed using
a Glyph ID (GID) (and the LOCA table) which may well not be anything
like the CID. So for this case PDF includes a CIDToGIDMap. That maps
the CIDs to GIDs, and we can then use the GID to get the glyph
description from the GLYF table of the font.

So for a TrueType CIDFont, character-code->CID->GID->glyf-program.

Looking at the PDF file I was supplied with we see that it contains
text like:

<0x0075> Tj

So we start by taking the character code (117) and look it up in the
CMap. Well you don't supply a CMap, you just use the Identity-H one
which is predefined. So character code 117 maps to CID 117. Then we
use the CIDToGIDMap. Again you don't supply one, you just use the
predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
were supplied with only contains 116 glyphs.
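
[Illustration, not part of the original notes.] The lookup chain just
described can be modelled roughly as below. This is a sketch only, not
code from Tesseract or from any PDF reader; the function names and the
glyph_count parameter are hypothetical, and -1 is used here simply to
signal "no such glyph".

```cpp
#include <cstdint>

// Identity-H: a 2-byte character code is used directly as the CID.
inline std::uint16_t IdentityHCodeToCid(std::uint16_t code) { return code; }

// "/CIDToGIDMap /Identity": the CID is used directly as the GID.
inline std::uint16_t IdentityCidToGid(std::uint16_t cid) { return cid; }

// Returns the glyph index a reader would look up, or -1 when the font has
// no glyph at that index. Valid GIDs are 0 .. glyph_count - 1.
inline int ResolveGlyph(std::uint16_t code, int glyph_count) {
  const int gid = IdentityCidToGid(IdentityHCodeToCid(code));
  if (gid >= glyph_count) return -1;
  return gid;
}
```

With a 116-glyph font, character code 0x0075 (117) resolves to GID 117,
which is out of range, so ResolveGlyph reports -1.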

Now for Latin that's not a huge problem, you can just supply a bigger
font. But for more complex languages that *is* going to be more of a
problem. Either you need to supply a font which contains glyphs for
all the possible CID->GID mappings, or we need to think laterally.

Our solution using a TrueType CIDFont is to intervene at the
CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
font with just one glyph, the .notdef glyph at GID 0. This is what I'm
looking into now.

It would also be possible to have a 'PostScript' (i.e. type 1 outlines)
CIDFont which contained 1 glyph, and a CMap which mapped all character
codes to CID 0. The effect would be the same.

It's possible (I haven't checked) that the PostScript CIDFont and
associated CMap would be smaller than the TrueType font and associated
CIDToGIDMap.

--- in a followup ---

OK there is a small problem there. If I use GID 0 then Acrobat gets
upset about it and complains it cannot extract the font. If I set the
CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
mad......
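
[Illustration, not part of the original notes.] A minimal sketch of the
uncompressed map described above, assuming 2-byte big-endian entries and
one entry per possible CID. The helper names are hypothetical; the real
renderer additionally compresses the buffer before writing it to the
PDF stream.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds an uncompressed CIDToGIDMap: 65536 entries, each a 2-byte
// big-endian GID, so 128 KiB in total. Every CID maps to the same GID.
inline std::vector<std::uint8_t> BuildConstantCidToGidMap(std::uint16_t gid) {
  std::vector<std::uint8_t> map(2 * 65536);
  for (std::size_t cid = 0; cid < 65536; ++cid) {
    map[2 * cid] = static_cast<std::uint8_t>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<std::uint8_t>(gid & 0xFF);  // low byte
  }
  return map;
}

// Reads back the GID stored for a given CID.
inline std::uint16_t LookupGid(const std::vector<std::uint8_t>& map,
                               std::uint16_t cid) {
  const std::size_t i = 2 * static_cast<std::size_t>(cid);
  return static_cast<std::uint16_t>((map[i] << 8) | map[i + 1]);
}
```

With gid = 1 the buffer is the byte pattern 00 01 repeated 65536 times,
so every CID, including 117, resolves to GID 1.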

*/

namespace tesseract {

// If the font is 10 pts, nominal character width is 5 pts
static const int kCharWidth = 2;

// Used for memory allocation. A codepoint must take no more than this
// many bytes, when written in the PDF way. e.g. "<0063>" for the
// letter 'c'
static const int kMaxBytesPerCodepoint = 20;

/**********************************************************************
 * PDF Renderer interface implementation
 **********************************************************************/
TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
                                 bool textonly)
    : TessResultRenderer(outputbase, "pdf"),
      datadir_(datadir) {
  obj_ = 0;
  textonly_ = textonly;
  offsets_.push_back(0);
}

void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
  offsets_.push_back(objectsize + offsets_.back());
  obj_++;
}

void TessPDFRenderer::AppendPDFObject(const char *data) {
  AppendPDFObjectDIY(strlen(data));
  AppendString(data);
}

// Helper function to prevent us from accidentally writing
// scientific notation to an HOCR or PDF file. Besides, three
// decimal places are all you really need.
static double prec(double x) {
  double kPrecision = 1000.0;
  double a = round(x * kPrecision) / kPrecision;
  if (a == -0)
    return 0;
  return a;
}
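
// Illustration only, not part of Tesseract: a standalone sketch of the
// rounding helper above under the hypothetical name PrecSketch. It shows
// the two effects that matter for PDF output: values are rounded to three
// decimal places, and a result of negative zero is normalized so that a
// stream never prints "-0".

```cpp
#include <cmath>

// Rounds to three decimal places and returns +0.0 for any zero result.
inline double PrecSketch(double x) {
  const double kPrecision = 1000.0;
  const double a = std::round(x * kPrecision) / kPrecision;
  if (a == 0.0) return 0.0;  // true for both +0.0 and -0.0
  return a;
}
```

// For example, PrecSketch(1.23456) gives 1.235, and PrecSketch(-0.0001)
// gives a positive zero rather than -0.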

static long dist2(int x1, int y1, int x2, int y2) {
  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
}

// Viewers like evince can get really confused during copy-paste when
// the baseline wanders around. So I've decided to project every word
// onto the (straight) line baseline. All numbers are in the native
// PDF coordinate system, which has the origin in the bottom left and
// the unit is points, which is 1/72 inch. Tesseract reports baselines
// left-to-right no matter what the reading order is. We need the
// word baseline in reading order, so we do that conversion here. Returns
// the word's baseline origin and length.
static void GetWordBaseline(int writing_direction, int ppi, int height,
                            int word_x1, int word_y1, int word_x2, int word_y2,
                            int line_x1, int line_y1, int line_x2, int line_y2,
                            double *x0, double *y0, double *length) {
  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
    Swap(&word_x1, &word_x2);
    Swap(&word_y1, &word_y2);
  }
  double word_length;
  double x, y;
  {
    int px = word_x1;
    int py = word_y1;
    double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
    if (l2 == 0) {
      x = line_x1;
      y = line_y1;
    } else {
      double t = ((px - line_x2) * (line_x2 - line_x1) +
                  (py - line_y2) * (line_y2 - line_y1)) / l2;
      x = line_x2 + t * (line_x2 - line_x1);
      y = line_y2 + t * (line_y2 - line_y1);
    }
    word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
                                                 word_x2, word_y2)));
    word_length = word_length * 72.0 / ppi;
    x = x * 72 / ppi;
    y = height - (y * 72.0 / ppi);
  }
  *x0 = x;
  *y0 = y;
  *length = word_length;
}
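
// Illustration only, not part of Tesseract: a sketch of the two geometric
// steps used by GetWordBaseline, with hypothetical names (PointSketch,
// ProjectOntoLine, PixelsToPdfPoints). The projection here is parameterized
// from the first line endpoint rather than the second; for the infinite
// line through both endpoints the projected point is the same.

```cpp
struct PointSketch {
  double x;
  double y;
};

// Orthogonally projects p onto the line through a and b. If a and b
// coincide the line is degenerate and a is returned.
inline PointSketch ProjectOntoLine(PointSketch p, PointSketch a,
                                   PointSketch b) {
  const double dx = b.x - a.x;
  const double dy = b.y - a.y;
  const double l2 = dx * dx + dy * dy;
  if (l2 == 0) return a;
  const double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / l2;
  return {a.x + t * dx, a.y + t * dy};
}

// Converts image pixels (origin top-left, y down) to PDF points (origin
// bottom-left, y up) for a page that is page_height_pts points tall.
inline PointSketch PixelsToPdfPoints(PointSketch p, double ppi,
                                     double page_height_pts) {
  return {p.x * 72.0 / ppi, page_height_pts - p.y * 72.0 / ppi};
}
```

// For example, projecting (1, 5) onto the line from (0, 0) to (4, 0) gives
// (1, 0), and pixel (300, 300) at 300 ppi on a 792 pt page maps to
// (72, 720).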

// Compute coefficients for an affine matrix describing the rotation
// of the text. If the text is right-to-left such as Arabic or Hebrew,
// we reflect over the Y-axis. This matrix will set the coordinate
// system for placing text in the PDF file.
//
//                           RTL
// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
static void AffineMatrix(int writing_direction,
                         int line_x1, int line_y1, int line_x2, int line_y2,
                         double *a, double *b, double *c, double *d) {
  double theta = atan2(static_cast<double>(line_y1 - line_y2),
                       static_cast<double>(line_x2 - line_x1));
  *a = cos(theta);
  *b = sin(theta);
  *c = -sin(theta);
  *d = cos(theta);
  switch(writing_direction) {
    case WRITING_DIRECTION_RIGHT_TO_LEFT:
      *a = -*a;
      *b = -*b;
      break;
    case WRITING_DIRECTION_TOP_TO_BOTTOM:
      // TODO(jbreiden) Consider using the vertical PDF writing mode.
      break;
    default:
      break;
  }
}

// There are some really awkward PDF viewers in the wild, such as
// 'Preview' which ships with the Mac. They do a better job with text
// selection and highlighting when given perfectly flat baseline
// instead of very slightly tilted. We clip small tilts to appease
// these viewers. I chose this threshold large enough to absorb noise,
// but small enough that lines probably won't cross each other if the
// whole page is tilted at almost exactly the clipping threshold.
static void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
                         int *line_x1, int *line_y1,
                         int *line_x2, int *line_y2) {
  *line_x1 = x1;
  *line_y1 = y1;
  *line_x2 = x2;
  *line_y2 = y2;
  int rise = abs(y2 - y1) * 72;
  int run = abs(x2 - x1) * 72;
  if (rise < 2 * ppi && 2 * ppi < run)
    *line_y1 = *line_y2 = (y1 + y2) / 2;
}
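
// Illustration only, not part of Tesseract: the clipping test above,
// restated as a sketch with hypothetical helper names. Multiplying the
// pixel deltas by 72 and comparing against 2 * ppi is the same as asking
// whether the rise is under 2 points and the run is over 2 points, without
// leaving integer arithmetic.

```cpp
#include <cstdlib>

// True when a baseline from (x1, y1) to (x2, y2), in pixels, rises less
// than 2 pt and runs more than 2 pt at the given resolution.
inline bool IsNearlyFlat(int ppi, int x1, int y1, int x2, int y2) {
  const int rise = std::abs(y2 - y1) * 72;
  const int run = std::abs(x2 - x1) * 72;
  return rise < 2 * ppi && 2 * ppi < run;
}

// The common y value given to both endpoints of a flattened baseline.
inline int FlattenedY(int y1, int y2) { return (y1 + y2) / 2; }
```

// For example, at 300 ppi a baseline from (0, 100) to (1000, 103) counts
// as nearly flat and is snapped to y = 101, while one ending at
// (1000, 120) keeps its tilt.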

static bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    tprintf("Dropping invalid codepoint %d\n", code);
    return false;
  }
  if (code < 0x10000) {
    snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
  } else {
    int a = code - 0x010000;
    int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
    int low_surrogate = (0x03FF & a) + 0xDC00;
    snprintf(utf16, kMaxBytesPerCodepoint,
             "%04X%04X", high_surrogate, low_surrogate);
  }
  return true;
}
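
// Illustration only, not part of Tesseract: a sketch of the same UTF-16BE
// hex encoding that returns a std::string, under the hypothetical name
// Utf16BeHexSketch. Unlike the function above it also rejects negative
// input, and it signals rejection with an empty string instead of a bool.

```cpp
#include <cstdio>
#include <string>

// Returns the code point as uppercase UTF-16BE hex digits, or an empty
// string for surrogates, negative values, and values above U+10FFFF.
inline std::string Utf16BeHexSketch(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return "";
  }
  char buf[16];
  if (code < 0x10000) {
    // Basic Multilingual Plane: one 16-bit unit, four hex digits.
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    // Supplementary plane: split the 20-bit offset into a surrogate pair.
    const unsigned a = static_cast<unsigned>(code) - 0x10000u;
    const unsigned high = 0xD800u + ((a >> 10) & 0x3FFu);
    const unsigned low = 0xDC00u + (a & 0x3FFu);
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  return buf;
}
```

// For example, 'c' (U+0063) becomes "0063" and U+1F600 becomes the
// surrogate pair "D83DDE00".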

char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
                                         double width, double height) {
  double ppi = api->GetSourceYResolution();

  // These initial conditions are all arbitrary and will be overwritten
  double old_x = 0.0, old_y = 0.0;
  int old_fontsize = 0;
  tesseract::WritingDirection old_writing_direction =
      WRITING_DIRECTION_LEFT_TO_RIGHT;
  bool new_block = true;
  int fontsize = 0;
  double a = 1;
  double b = 0;
  double c = 0;
  double d = 1;

  std::stringstream pdf_str;
  // Use "C" locale (needed for double values prec()).
  pdf_str.imbue(std::locale::classic());
  // Use 8 digits for double values.
  pdf_str.precision(8);

  // TODO(jbreiden) This marries the text and image together.
  // Slightly cleaner from an abstraction standpoint if this were to
  // live inside a separate text object.
  pdf_str << "q " << prec(width) << " 0 0 " << prec(height) << " 0 0 cm";
  if (!textonly_) {
    pdf_str << " /Im1 Do";
  }
  pdf_str << " Q\n";

  int line_x1 = 0;
  int line_y1 = 0;
  int line_x2 = 0;
  int line_y2 = 0;

  ResultIterator *res_it = api->GetIterator();
  while (!res_it->Empty(RIL_BLOCK)) {
    if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
      pdf_str << "BT\n3 Tr";  // Begin text object, use invisible ink
      old_fontsize = 0;       // Every block will declare its fontsize
      new_block = true;       // Every block will declare its affine matrix
    }

    if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
      int x1, y1, x2, y2;
      res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
      ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
    }

    if (res_it->Empty(RIL_WORD)) {
      res_it->Next(RIL_WORD);
      continue;
    }

    // Writing direction changes at a per-word granularity
    tesseract::WritingDirection writing_direction;
    {
      tesseract::Orientation orientation;
      tesseract::TextlineOrder textline_order;
      float deskew_angle;
      res_it->Orientation(&orientation, &writing_direction,
                          &textline_order, &deskew_angle);
      if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
        switch (res_it->WordDirection()) {
          case DIR_LEFT_TO_RIGHT:
            writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
            break;
          case DIR_RIGHT_TO_LEFT:
            writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
            break;
          default:
            writing_direction = old_writing_direction;
        }
      }
    }
399 // Where is word origin and how long is it?
400 double x, y, word_length;
401 {
402 int word_x1, word_y1, word_x2, word_y2;
403 res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
404 GetWordBaseline(writing_direction, ppi, height,
405 word_x1, word_y1, word_x2, word_y2,
406 line_x1, line_y1, line_x2, line_y2,
407 &x, &y, &word_length);
408 }
409
410 if (writing_direction != old_writing_direction || new_block) {
411 AffineMatrix(writing_direction,
412 line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
413 pdf_str << " " << prec(a) // . This affine matrix
414 << " " << prec(b) // . sets the coordinate
415 << " " << prec(c) // . system for all
416 << " " << prec(d) // . text that follows.
417 << " " << prec(x) // .
418 << " " << prec(y) // .
419 << (" Tm "); // Place cursor absolutely
420 new_block = false;
421 } else {
422 double dx = x - old_x;
423 double dy = y - old_y;
424 pdf_str << " " << prec(dx * a + dy * b)
425 << " " << prec(dx * c + dy * d)
426 << (" Td "); // Relative moveto
427 }
428 old_x = x;
429 old_y = y;
430 old_writing_direction = writing_direction;
431
432 // Adjust font size on a per word granularity. Pay attention to
433 // fontsize, old_fontsize, and pdf_str. We've found that for
434 // in Arabic, Tesseract will happily return a fontsize of zero,
435 // so we make up a default number to protect ourselves.
436 {
437 bool bold, italic, underlined, monospace, serif, smallcaps;
438 int font_id;
439 res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
440 &serif, &smallcaps, &fontsize, &font_id);
441 const int kDefaultFontsize = 8;
442 if (fontsize <= 0)
443 fontsize = kDefaultFontsize;
444 if (fontsize != old_fontsize) {
445 pdf_str << "/f-0-0 " << fontsize << " Tf ";
446 old_fontsize = fontsize;
447 }
448 }
449
450 bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
451 bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
452 std::string pdf_word;
453 int pdf_word_len = 0;
454 do {
455 const std::unique_ptr<const char[]> grapheme(
456 res_it->GetUTF8Text(RIL_SYMBOL));
457 if (grapheme && grapheme[0] != '\0') {
458 std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
459 char utf16[kMaxBytesPerCodepoint];
460 for (char32 code : unicodes) {
461 if (CodepointToUtf16be(code, utf16)) {
462 pdf_word += utf16;
463 pdf_word_len++;
464 }
465 }
466 }
467 res_it->Next(RIL_SYMBOL);
468 } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
469 if (res_it->IsAtBeginningOf(RIL_WORD)) {
470 pdf_word += "0020";
471 pdf_word_len++;
472 }
473 if (word_length > 0 && pdf_word_len > 0) {
474 double h_stretch =
475 kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
476 pdf_str << h_stretch << " Tz" // horizontal stretch
477 << " [ <" << pdf_word // UTF-16BE representation
478 << "> ] TJ"; // show the text
479 }
480 if (last_word_in_line) {
481 pdf_str << " \n";
482 }
483 if (last_word_in_block) {
484 pdf_str << "ET\n"; // end the text object
485 }
486 }
487 const std::string& text = pdf_str.str();
488 char* result = new char[text.length() + 1];
489 strcpy(result, text.c_str());
490 delete res_it;
491 return result;
492}
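
// Illustration only, not part of Tesseract: a sketch of the horizontal
// stretch (Tz) arithmetic used above, under the hypothetical name
// HorizontalStretchSketch. Every glyph is nominally fontsize / 2 points
// wide (kCharWidth is 2), so an unstretched word of n glyphs spans
// n * fontsize / 2 points. The Tz percentage that makes it span
// word_length points is therefore 2 * 100 * word_length / (fontsize * n).
// The three-decimal rounding done by prec() is left out of this sketch.

```cpp
// Percentage to pass to the PDF Tz operator so that glyph_count glyphs of
// half-em width cover word_length_pts points at the given font size.
inline double HorizontalStretchSketch(double word_length_pts, int fontsize,
                                      int glyph_count) {
  const int kCharWidthSketch = 2;  // same value as kCharWidth above
  return kCharWidthSketch *
         (100.0 * word_length_pts / (fontsize * glyph_count));
}
```

// For example, six glyphs at 10 pt are nominally 30 pt wide, so a 30 pt
// word needs 100 (no stretch) and a 60 pt word needs 200.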

bool TessPDFRenderer::BeginDocumentHandler() {
  AppendPDFObject("%PDF-1.5\n%\xDE\xAD\xBE\xEB\n");

  // CATALOG
  AppendPDFObject("1 0 obj\n"
                  "<<\n"
                  " /Type /Catalog\n"
                  " /Pages 2 0 R\n"
                  ">>\nendobj\n");

  // We are reserving object #2 for the /Pages
  // object, which I am going to create and write
  // at the end of the PDF file.
  AppendPDFObject("");

  // TYPE0 FONT
  AppendPDFObject("3 0 obj\n"
                  "<<\n"
                  " /BaseFont /GlyphLessFont\n"
                  " /DescendantFonts [ 4 0 R ]\n"  // CIDFontType2 font
                  " /Encoding /Identity-H\n"
                  " /Subtype /Type0\n"
                  " /ToUnicode 6 0 R\n"  // ToUnicode
                  " /Type /Font\n"
                  ">>\n"
                  "endobj\n");

  // CIDFONTTYPE2
  std::stringstream stream;
  // Use "C" locale (needed for int values larger than 999).
  stream.imbue(std::locale::classic());
  stream <<
      "4 0 obj\n"
      "<<\n"
      " /BaseFont /GlyphLessFont\n"
      " /CIDToGIDMap 5 0 R\n"  // CIDToGIDMap
      " /CIDSystemInfo\n"
      " <<\n"
      " /Ordering (Identity)\n"
      " /Registry (Adobe)\n"
      " /Supplement 0\n"
      " >>\n"
      " /FontDescriptor 7 0 R\n"  // Font descriptor
      " /Subtype /CIDFontType2\n"
      " /Type /Font\n"
      " /DW " << (1000 / kCharWidth) << "\n"
      ">>\n"
      "endobj\n";
  AppendPDFObject(stream.str().c_str());

  // CIDTOGIDMAP
  const int kCIDToGIDMapSize = 2 * (1 << 16);
  const std::unique_ptr<unsigned char[]> cidtogidmap(
      new unsigned char[kCIDToGIDMapSize]);
  for (int i = 0; i < kCIDToGIDMapSize; i++) {
    cidtogidmap[i] = (i % 2) ? 1 : 0;
  }
  size_t len;
  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
  stream.str("");
  stream <<
      "5 0 obj\n"
      "<<\n"
      " /Length " << len << " /Filter /FlateDecode\n"
      ">>\n"
      "stream\n";
  AppendString(stream.str().c_str());
  long objsize = stream.str().size();
  AppendData(reinterpret_cast<char *>(comp), len);
  objsize += len;
  lept_free(comp);
  const char *endstream_endobj =
      "endstream\n"
      "endobj\n";
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);

  const char stream2[] =
      "/CIDInit /ProcSet findresource begin\n"
      "12 dict begin\n"
      "begincmap\n"
      "/CIDSystemInfo\n"
      "<<\n"
      " /Registry (Adobe)\n"
      " /Ordering (UCS)\n"
      " /Supplement 0\n"
      ">> def\n"
      "/CMapName /Adobe-Identify-UCS def\n"
      "/CMapType 2 def\n"
      "1 begincodespacerange\n"
      "<0000> <FFFF>\n"
      "endcodespacerange\n"
      "1 beginbfrange\n"
      "<0000> <FFFF> <0000>\n"
      "endbfrange\n"
      "endcmap\n"
      "CMapName currentdict /CMap defineresource pop\n"
      "end\n"
      "end\n";

  // TOUNICODE
  stream.str("");
  stream <<
      "6 0 obj\n"
      "<< /Length " << (sizeof(stream2) - 1) << " >>\n"
      "stream\n" << stream2 <<
      "endstream\n"
      "endobj\n";
  AppendPDFObject(stream.str().c_str());

  // FONT DESCRIPTOR
  stream.str("");
  stream <<
      "7 0 obj\n"
      "<<\n"
      " /Ascent 1000\n"
      " /CapHeight 1000\n"
      " /Descent -1\n"  // Spec says must be negative
      " /Flags 5\n"     // FixedPitch + Symbolic
      " /FontBBox [ 0 0 " << (1000 / kCharWidth) << " 1000 ]\n"
      " /FontFile2 8 0 R\n"
      " /FontName /GlyphLessFont\n"
      " /ItalicAngle 0\n"
      " /StemV 80\n"
      " /Type /FontDescriptor\n"
      ">>\n"
      "endobj\n";
  AppendPDFObject(stream.str().c_str());

  stream.str("");
  stream << datadir_.c_str() << "/pdf.ttf";
  FILE *fp = fopen(stream.str().c_str(), "rb");
  if (!fp) {
    tprintf("Cannot open file \"%s\"!\n", stream.str().c_str());
    return false;
  }
  fseek(fp, 0, SEEK_END);
  auto size = std::ftell(fp);
  if (size < 0) {
    fclose(fp);
    return false;
  }
  fseek(fp, 0, SEEK_SET);
  const std::unique_ptr<char[]> buffer(new char[size]);
  if (!tesseract::DeSerialize(fp, buffer.get(), size)) {
    fclose(fp);
    return false;
  }
  fclose(fp);
  // FONTFILE2
  stream.str("");
  stream <<
      "8 0 obj\n"
      "<<\n"
      " /Length " << size << "\n"
      " /Length1 " << size << "\n"
      ">>\n"
      "stream\n";
  AppendString(stream.str().c_str());
  objsize = stream.str().size();
  AppendData(buffer.get(), size);
  objsize += size;
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);
  return true;
}

bool TessPDFRenderer::imageToPDFObj(Pix *pix,
                                    const char* filename,
                                    long int objnum,
                                    char **pdf_object,
                                    long int* pdf_object_size,
                                    const int jpg_quality) {
  if (!pdf_object_size || !pdf_object)
    return false;
  *pdf_object = nullptr;
  *pdf_object_size = 0;
  if (!filename && !pix)
    return false;

  L_Compressed_Data *cid = nullptr;

  int sad = 0;
  if (pixGetInputFormat(pix) == IFF_PNG)
    sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
  if (!cid) {
    sad = l_generateCIDataForPdf(filename, pix, jpg_quality, &cid);
  }

  if (sad || !cid) {
    l_CIDataDestroy(&cid);
    return false;
  }

  const char *group4 = "";
  const char *filter;
  switch(cid->type) {
    case L_FLATE_ENCODE:
      filter = "/FlateDecode";
      break;
    case L_JPEG_ENCODE:
      filter = "/DCTDecode";
      break;
    case L_G4_ENCODE:
      filter = "/CCITTFaxDecode";
      group4 = " /K -1\n";
      break;
    case L_JP2K_ENCODE:
      filter = "/JPXDecode";
      break;
    default:
      l_CIDataDestroy(&cid);
      return false;
  }

  // Maybe someday we will accept RGBA but today is not that day.
  // It requires creating an /SMask for the alpha channel.
  // http://stackoverflow.com/questions/14220221
  std::stringstream colorspace;
  // Use "C" locale (needed for int values larger than 999).
  colorspace.imbue(std::locale::classic());
  if (cid->ncolors > 0) {
    colorspace
        << " /ColorSpace [ /Indexed /DeviceRGB " << (cid->ncolors - 1)
        << " " << cid->cmapdatahex << " ]\n";
  } else {
    switch (cid->spp) {
      case 1:
        if (cid->bps == 1 && pixGetInputFormat(pix) == IFF_PNG) {
          colorspace.str(" /ColorSpace /DeviceGray\n"
                         " /Decode [1 0]\n");
        } else {
          colorspace.str(" /ColorSpace /DeviceGray\n");
        }
        break;
      case 3:
        colorspace.str(" /ColorSpace /DeviceRGB\n");
        break;
      default:
        l_CIDataDestroy(&cid);
        return false;
    }
  }

  int predictor = (cid->predictor) ? 14 : 1;

  // IMAGE
  std::stringstream b1;
  // Use "C" locale (needed for int values larger than 999).
  b1.imbue(std::locale::classic());
  b1 <<
      objnum << " 0 obj\n"
      "<<\n"
      " /Length " << cid->nbytescomp << "\n"
      " /Subtype /Image\n";

  std::stringstream b2;
  // Use "C" locale (needed for int values larger than 999).
  b2.imbue(std::locale::classic());
  b2 <<
      " /Width " << cid->w << "\n"
      " /Height " << cid->h << "\n"
      " /BitsPerComponent " << cid->bps << "\n"
      " /Filter " << filter << "\n"
      " /DecodeParms\n"
      " <<\n"
      " /Predictor " << predictor << "\n"
      " /Colors " << cid->spp << "\n" << group4 <<
      " /Columns " << cid->w << "\n"
      " /BitsPerComponent " << cid->bps << "\n"
      " >>\n"
      ">>\n"
      "stream\n";

  const char *b3 =
      "endstream\n"
      "endobj\n";

  size_t b1_len = b1.str().size();
  size_t b2_len = b2.str().size();
  size_t b3_len = strlen(b3);
  size_t colorspace_len = colorspace.str().size();

  *pdf_object_size =
      b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
  *pdf_object = new char[*pdf_object_size];

  char *p = *pdf_object;
  memcpy(p, b1.str().c_str(), b1_len);
  p += b1_len;
  memcpy(p, colorspace.str().c_str(), colorspace_len);
  p += colorspace_len;
  memcpy(p, b2.str().c_str(), b2_len);
  p += b2_len;
  memcpy(p, cid->datacomp, cid->nbytescomp);
  p += cid->nbytescomp;
  memcpy(p, b3, b3_len);
  l_CIDataDestroy(&cid);
  return true;
}

bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
  Pix *pix = api->GetInputImage();
  const char* filename = api->GetInputName();
  int ppi = api->GetSourceYResolution();
  if (!pix || ppi <= 0)
    return false;
  double width = pixGetWidth(pix) * 72.0 / ppi;
  double height = pixGetHeight(pix) * 72.0 / ppi;

  std::stringstream xobject;
  // Use "C" locale (needed for int values larger than 999).
  xobject.imbue(std::locale::classic());
  if (!textonly_) {
    xobject << "/XObject << /Im1 " << (obj_ + 2) << " 0 R >>\n";
  }

  // PAGE
  std::stringstream stream;
  // Use "C" locale (needed for double values width and height).
  stream.imbue(std::locale::classic());
  stream.precision(2);
  stream << std::fixed <<
      obj_ << " 0 obj\n"
      "<<\n"
      " /Type /Page\n"
      " /Parent 2 0 R\n"  // Pages object
      " /MediaBox [0 0 " << width << " " << height << "]\n"
      " /Contents " << (obj_ + 1) << " 0 R\n"  // Contents object
      " /Resources\n"
      " <<\n"
      " " << xobject.str() <<  // Image object
      " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
      " /Font << /f-0-0 3 0 R >>\n"  // Type0 Font
      " >>\n"
      ">>\n"
      "endobj\n";
  pages_.push_back(obj_);
  AppendPDFObject(stream.str().c_str());

  // CONTENTS
  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
  const size_t pdftext_len = strlen(pdftext.get());
  size_t len;
  unsigned char *comp_pdftext = zlibCompress(
      reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
  long comp_pdftext_len = len;
  stream.str("");
  stream <<
      obj_ << " 0 obj\n"
      "<<\n"
      " /Length " << comp_pdftext_len << " /Filter /FlateDecode\n"
      ">>\n"
      "stream\n";
  AppendString(stream.str().c_str());
  long objsize = stream.str().size();
  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
  objsize += comp_pdftext_len;
  lept_free(comp_pdftext);
  const char *b2 =
      "endstream\n"
      "endobj\n";
  AppendString(b2);
  objsize += strlen(b2);
  AppendPDFObjectDIY(objsize);

  if (!textonly_) {
    char *pdf_object = nullptr;
    int jpg_quality;
    api->GetIntVariable("jpg_quality", &jpg_quality);
    if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize,
                       jpg_quality)) {
      return false;
    }
    AppendData(pdf_object, objsize);
    AppendPDFObjectDIY(objsize);
    delete[] pdf_object;
  }
  return true;
}


bool TessPDFRenderer::EndDocumentHandler() {
  // We reserved the /Pages object number early, so that the /Page
  // objects could refer to their parent. We finally have enough
  // information to go fill it in. Using lower level calls to manipulate
  // the offset record in two spots, because we are placing objects
  // out of order in the file.

  // PAGES
  const long int kPagesObjectNumber = 2;
  offsets_[kPagesObjectNumber] = offsets_.back();  // manipulation #1
  std::stringstream stream;
  // Use "C" locale (needed for int values larger than 999).
  stream.imbue(std::locale::classic());
  stream << kPagesObjectNumber << " 0 obj\n<<\n /Type /Pages\n /Kids [ ";
  AppendString(stream.str().c_str());
  size_t pages_objsize = stream.str().size();
  for (size_t i = 0; i < pages_.unsigned_size(); i++) {
    stream.str("");
    stream << pages_[i] << " 0 R ";
    AppendString(stream.str().c_str());
    pages_objsize += stream.str().size();
  }
  stream.str("");
  stream << "]\n /Count " << pages_.size() << "\n>>\nendobj\n";
  AppendString(stream.str().c_str());
  pages_objsize += stream.str().size();
  offsets_.back() += pages_objsize;  // manipulation #2

  // INFO
  STRING utf16_title = "FEFF";  // byte_order_marker
  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
  char utf16[kMaxBytesPerCodepoint];
  for (char32 code : unicodes) {
    if (CodepointToUtf16be(code, utf16)) {
      utf16_title += utf16;
    }
  }

  char* datestr = l_getFormattedDate();
  stream.str("");
  stream
      << obj_ << " 0 obj\n"
      "<<\n"
      " /Producer (Tesseract " << tesseract::TessBaseAPI::Version() << ")\n"
      " /CreationDate (D:" << datestr << ")\n"
      " /Title <" << utf16_title.c_str() << ">\n"
      ">>\n"
      "endobj\n";
  lept_free(datestr);
  AppendPDFObject(stream.str().c_str());
  stream.str("");
  stream << "xref\n0 " << obj_ << "\n0000000000 65535 f \n";
  AppendString(stream.str().c_str());
  for (int i = 1; i < obj_; i++) {
    stream.str("");
    stream.width(10);
    stream.fill('0');
    stream << offsets_[i] << " 00000 n \n";
    AppendString(stream.str().c_str());
  }
  stream.str("");
  stream
      << "trailer\n<<\n /Size " << obj_ << "\n"
      " /Root 1 0 R\n"  // catalog
      " /Info " << (obj_ - 1) << " 0 R\n"  // info
      ">>\nstartxref\n" << offsets_.back() << "\n%%EOF\n";
  AppendString(stream.str().c_str());
  return true;
}
}  // namespace tesseract
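
// Illustration only, not part of Tesseract: a sketch of how one in-use
// cross-reference entry is formatted in EndDocumentHandler, under the
// hypothetical name XrefEntrySketch. Each entry is a 10-digit zero-padded
// byte offset, a 5-digit generation number, the keyword "n", and a
// two-character end of line (here a space plus a newline), 20 bytes in all.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats the xref line for an in-use object at the given byte offset,
// with generation number 0.
inline std::string XrefEntrySketch(long offset) {
  std::ostringstream out;
  // The classic "C" locale avoids digit grouping such as "1,234".
  out.imbue(std::locale::classic());
  out << std::setw(10) << std::setfill('0') << offset << " 00000 n \n";
  return out.str();
}
```

// For example, an object starting at byte 15 yields
// "0000000015 00000 n " followed by a newline.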