神奇的大作业:HTML转PDF的java项目
PDF格式是一种非常流行的文档格式,在简历/电子图书/参考文档等应用都十分广泛。笔者所在的公司最近也有一个项目,需要将html的图文报告转为PDF格式输出。笔者费了一番功夫,在Github上找到了谷歌的一个开源项目,阅读源码后发现孺子可教,经过修改后的源码完全能达到令产品满意的转换效果。今天咱们就来分享下这个项目。
Definition
- 词法
-
词汇通常用正则表达式表示
INTEGER :0|[1-9][0-9]*
PLUS : +
MINUS: - - 语法
-
语法通常使用一种称为 BNF 的格式来定义
expression := term operation term
operation := PLUS | MINUS
term := INTEGER | expression - iText
-
iText是一个操作PDF文档的开源项目,由java和.net编写。
iText在能够创建标准PDF的同时,还能将XML、HTML、Web表单、CSS或者其他数据库文件转换成PDF,而且保证格式的标准统一。
iText可以切割、合并文档,还对页面进行复制、导入和覆盖的操作,同时可以加入、编写结构更加丰富的多样化内容,比如条形码、水印、印章、表格和图片等。
浏览器解析HTML过程
作为超文本标记语言,HTML定义了展示网页信息一种规范。浏览器在解释HTML生成最终视图的时候,大概是这样的:
1. 解析文档;
2. 布局,为每个节点分配一个应出现在屏幕上的确切坐标;
3. 绘制,呈现引擎会遍历呈现树,由用户界面后端层将每个节点绘制出来;
4. 显示,值得注意的是这一步并不会等到文档解析完成,会将部分已解析的文档尽快显示。
图1:浏览器主要组件

图2:呈现引擎的基本流程

图3:WebKit 主流程

//假设一个HTML文档如下
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
</head>
<body>
<!-- 诗词展示 -->
<div style="font-family: msyh;text-align: center;">
<h2>早发白帝城</h2>
<div style="padding: 20px 20px; background-color: rgb(250, 192, 143);font-family: simkai;">
朝辞白帝彩云间,千里江陵一日还。<br />
两岸猿声啼不住,轻舟已过万重山。<br />
</div>
<h2>赠汪伦</h2>
<div style="padding: 20px 20px; background-color: rgb(250, 192, 143);font-family: simkai;">
李白乘舟将欲行,忽闻岸上踏歌声。<br />
桃花潭水深千尺,不及汪伦送我情。<br />
</div>
<h2>望庐山瀑布</h2>
<div style="padding: 20px 20px; background-color: rgb(250, 192, 143);font-family: simkai;">
日照香炉生紫烟,遥看瀑布挂前川。<br />
飞流直下三千尺,疑是银河落九天。<br />
</div>
</div>
<!-- 古词释义 -->
<div style="text-shadow: 5px 5px 5px #787878;color: #fff;background-color: #CDC9C9;">
<p><span style="margin-right: 10px;">☆</span>发:启程。白帝城:故址在今重庆市奉节县白帝山上。</p>
<p><span style="margin-right: 10px;">☆</span>踏歌:唐代民间流行的一种手拉手、两足踏地为节拍的歌舞形式,可以边走边唱。</p>
<p><span style="margin-right: 10px;">☆</span>桃花潭:在今安徽泾县西南一百里。《一统志》谓其深不可测。深千尺:诗人用潭水深千尺比喻汪伦与他的友情,运用了夸张的手法。</p>
</div>
</body>
</html>
需要解析这个文档,我们需要知道:文档中每个符号代表的意义不是固定的,而是配合当前文档上下文的语义来解释。这意味着,读取同样的字符,可能因为当前状态的不同,得到不同的下一个状态。
那么我们可以用数学的方式来解这道难题:
将解析HTML过程中当前状态抽象为“状态”,所有解析操作放在当前“状态”下执行。大概是这样的:
逐个读取文档字符,根据当前状态和读取到字符,进入对应“状态”解析,并触发HTML文档生命周期事件。
首先我们先抽象出HTML文档的生命周期:
/**
* 表示解析HTML文档的生命周期
*
* @author 玄葬
*
*/
public interface HTMLLifeCycle {
public HTMLLifeCycle addHTMLLifeCycleListenter(HTMLLifeCycleListener l);
public HTMLLifeCycle removeHTMLLifeCycleListenter(HTMLLifeCycleListener l);
/**
* 当在开始标签前读取到任何内容时触发
*/
public void unknownText();
/**
* 当读取到开始标签时触发
*/
public void startElement();
/**
* 当读取到结束标签时触发
*/
public void endElement();
/**
* 当读取到注释时触发
*/
public void comment();
}
然后我们抽象出HTML文档的生命周期事件:
/**
* 观察者设计模式
* 监听{@link HTMLLifeCycle}的生命周期
*
* @author 玄葬
*
*/
public interface HTMLLifeCycleListener {
/**
* 标签开始----->pipeline open方法----->生成PDF
*/
void startElement(String tag, Map attributes, String ns);
/**
* 标签结束----->pipeline close方法----->生成PDF
*/
void endElement(String tag, String ns);
/**
* 标签内容----->pipeline content方法----->生成PDF
*/
void text(String text);
/**
* 标签外内容
*/
void unknownText(String text);
/**
* 注释
*/
void comment(String comment);
/**
* 解析开始
*/
void init();
/**
* 解析结束
*/
void close();
}
接下来我们就可以具体实现对HTML文档的解析了,如下面的解析类,实现了HTMLLifeCycle接口,里面的parse方法是解析入口:
/**
* HTML符号识别算法
* 每次读取一个输入字符,并根据这些字符转移到下一个状态,当前的符号状态和当前Dom树状态共同影响结果。
* 这意味着,读取同样的字符,可能因为当前状态的不同,得到不同的下一个状态。
*
* @param r
* @throws IOException
*/
public void parse(final Reader r) throws IOException {
for (HTMLLifeCycleListener l : listeners) {
l.init();
}
char read[] = new char[1];
try {
while (-1 != (r.read(read))) {
state.process(read[0]);
}
} finally {
for (HTMLLifeCycleListener l : listeners) {
l.close();
}
r.close();
}
}
在上面的代码中,state为HTML文档的当前状态,例如读取到标签开始字符“<”,会进入下面这个状态,TagEncounteredState状态下是这样解析字符的:
public class TagEncounteredState implements State {
private final HTMLParser parser;
/**
* @param parser the HTMLParser
*/
public TagEncounteredState(final HTMLParser parser) {
this.parser = parser;
}
@Override
public void process(final char character) {
String tag = this.parser.memory().textBuffString();
if (Character.isWhitespace(character) || character == '>' || character == '/' || character == ':' || character == '?' || tag.equals("!--")) {
if (tag.length() > 0) {
if (tag.equals("!DOCTYPE")) { //<!DOCTYPE html>
this.parser.memory().resetTextBuffer();
this.parser.memory().append(character);
this.parser.stateController().doctype();
}
else if (tag.equals("!--")) { //<!-- this is a comment -->
this.parser.memory().resetTextBuffer();
this.parser.memory().resetCommentBuff();
this.parser.stateController().comment();
/**
* 避免'<!---->'这样的注释出错
* 详见CommentState和CloseCommentState
*/
if (character == '-') {
this.parser.memory().commentBuff().append(character);
} else {
this.parser.memory().append(character);
}
}
else if (Character.isWhitespace(character)) { //<p style="font-size: 14px;">
this.parser.memory().setCurrentTag(tag);
this.parser.memory().resetTextBuffer();
this.parser.stateController().tagAttributes();
}
else if (character == '>') {
this.parser.memory().setCurrentTag(tag);
this.parser.memory().resetTextBuffer();
this.parser.startElement();
this.parser.stateController().inTag();
}
else if (character == '/') {
this.parser.memory().setCurrentTag(tag);
this.parser.memory().resetTextBuffer();
this.parser.stateController().selfClosing();
}
else if (character == ':') {
this.parser.memory().setCurrentNameSpace(tag);
this.parser.memory().resetTextBuffer();
}
} else {
if (character == '/') { //</div
this.parser.stateController().closingTag();
}
else if (character == '?') { //<? xml
this.parser.memory().append(character);
this.parser.stateController().processingInstructions();
}
}
} else {
this.parser.memory().append(character); //<div
}
}
}
TagEncounteredState状态中,当读取到“>”字符时,触发了HTMLLifeCycle的startElement事件,并由HTMLLifeCycleListener监听事件。对于HTMLLifeCycle的每个事件,我们基于观察者设计设计模式观察并处理,例如text事件为写入内容:
@Override
public void text(String text) {
if (text.startsWith("")) {
return;
}
if (null != this.tag) {
Pipeline> p = rootpPipe;
ProcessObject po = new ProcessObject();
try {
while((p = p.content(context, this.tag, text, po)) != null);
} catch (PipelineException e) {
throw new RuntimeWorkerException(e);
}
}
}
我们在HTML的startElement、endElement、text事件中接收事件参数,然后进行翻译,具体实现可以参考以下代码:
@Override
public void startElement(String tag, Map attributes, String ns) {
Tag t = new Tag(tag, attributes, ns);
if (this.tag != null) {
this.tag.addChild(t);
}
this.tag = t;
ProcessObject po = new ProcessObject();
Pipeline> p = rootpPipe;
try {
while((p = p.open(context, t, po)) != null);
} catch (PipelineException e) {
throw new RuntimeWorkerException(e);
}
}
@Override
public void endElement(String tag, String ns) {
tag = tag.toLowerCase();
if (this.tag != null && !this.tag.getName().equals(tag)) { //判断标签是否闭合
throw new RuntimeWorkerException(String.format(
LocaleMessages.getInstance().getMessage(LocaleMessages.INVALID_NESTED_TAG), tag, this.tag.getName()));
}
Pipeline> p = rootpPipe;
ProcessObject po = new ProcessObject();
try {
while((p = p.close(context, this.tag, po)) != null);
} catch (PipelineException e) {
throw new RuntimeWorkerException(e);
} finally {
if (null != this.tag)
this.tag = this.tag.getParent();
}
}
@Override
public void text(String text) {
if (text.startsWith("")) {
return;
}
if (null != this.tag) {
Pipeline> p = rootpPipe;
ProcessObject po = new ProcessObject();
try {
while((p = p.content(context, this.tag, text, po)) != null);
} catch (PipelineException e) {
throw new RuntimeWorkerException(e);
}
}
}
其中rootpPipe为我们的管道开头,写入文本的过程中,我们使用责任链模式来分步骤翻译HTML,大概是这个样子的:
接收解析好的Tag标签,并读取影响该标签的CSS样式。----->将这个标签转换为itext元素,并渲染CSS样式到itext元素。----->将itext元素写入document。
于是有了以下代码:
// Pipelines
PdfWriterPipeline pdfWriterPipeline = new PdfWriterPipeline(doc, writer);
HtmlPipeline htmlPipeline = new HtmlPipeline(hpc, pdfWriterPipeline);
CssResolverPipeline cssResolverPipeline = new CssResolverPipeline(cssResolver, htmlPipeline);
其中CssResolverPipeline接收解析好的Tag标签,并读取影响该标签的CSS样式;HtmlPipeline将标签转换为文档元素,并渲染CSS样式到元素;PdfWriterPipeline将元素写入document。主要的思路大概是这样子的:
//我们定义一个Tag类表示HTML标签
public class Tag implements Iterable {
private Tag parent;
private final String tag;
private final Map attributes;
private Map css;
private final List children;
private final String ns;
private Object lastMarginBottom = null;
/**
* CssResolverPipeline接收Tags标签并渲染CSS样式
*
* @author 玄葬
*
*/
public class CssResolverPipeline extends AbstractPipeline {
private CSSResolver cssResolver;
public CssResolverPipeline(final CSSResolver cssResolver, final Pipeline> next) {
super(next);
this.cssResolver = cssResolver;
}
@Override
public String getContextKey() {
return CssResolverPipeline.class.getName();
}
@Override
public Pipeline> init(WorkerContext context) throws PipelineException {
try {
CSSResolver ctx = cssResolver.clear(); // 使用CSSResolver上下文之前先清空非持久化CSS文件
context.put(getContextKey(), ctx);
return getNext();
} catch (CssResolverException e) {
throw new PipelineException(e);
}
}
@Override
public Pipeline> open(WorkerContext context, Tag t, ProcessObject po) throws PipelineException {
CSSResolver cssResolver = getLocalContext(context);
cssResolver.resolve(t);
return getNext();
}
}
其中CssResolverPipeline读取能影响该标签CSS过程大概是这样子的(支持文本和流的形式读取CSS文件):
@Override
public void resolve(Tag t) {
Map css = t.getCSS(); //标签最终的CSS
Map tagCss = new LinkedHashMap(); //从CSS样式表和标签style属性获取的CSS
// 解析CSS文件
if (null != cssFiles && cssFiles.hasFiles()) {
tagCss = cssFiles.getCSS(t);
if (t.getName().equalsIgnoreCase(HTML.Tag.P) || t.getName().equalsIgnoreCase(HTML.Tag.TD)) {
Map listCss = cssFiles.getCSS(new Tag(HTML.Tag.UL));
if (listCss.containsKey(CSS.Property.LIST_STYLE_TYPE)) { // list-style-type的样式
css.put(CSS.Property.LIST_STYLE_TYPE, listCss.get(CSS.Property.LIST_STYLE_TYPE));
}
}
}
// 解析style属性
Map attributes = t.getAttributes();
if (null != attributes && !attributes.isEmpty()) {
if (attributes.get(HTML.Attribute.CELLPADDING) != null) {
tagCss.putAll(utils.parseBoxValues(attributes.get(HTML.Attribute.CELLPADDING), "cellpadding-", ""));
}
if (attributes.get(HTML.Attribute.CELLSPACING) != null) {
tagCss.putAll(utils.parseBoxValues(attributes.get(HTML.Attribute.CELLSPACING), "cellspacing-", ""));
}
String style = attributes.get(HTML.Attribute.STYLE);
if (null != style && style.length() > 0) {
Map styleCss = new LinkedHashMap();
String[] styles = style.split(";");
for (String s : styles) {
String[] part = s.split(":", 2);
if (part.length == 2) {
String key = utils.stripDoubleSpacesTrimAndToLowerCase(part[0]);
String value = utils.stripDoubleSpacesAndTrim(part[1]);
parseAttributeValue(styleCss, key, value);
}
}
tagCss.putAll(styleCss);
}
}
// 特殊标签处理
if (t.getName() != null) {
if(t.getName().equalsIgnoreCase(HTML.Tag.I) || t.getName().equalsIgnoreCase(HTML.Tag.CITE)
|| t.getName().equalsIgnoreCase(HTML.Tag.EM) || t.getName().equalsIgnoreCase(HTML.Tag.VAR)
|| t.getName().equalsIgnoreCase(HTML.Tag.DFN) || t.getName().equalsIgnoreCase(HTML.Tag.ADDRESS)) {
tagCss.put(CSS.Property.FONT_STYLE, CSS.Value.ITALIC);
}
else if (t.getName().equalsIgnoreCase(HTML.Tag.B) || t.getName().equalsIgnoreCase(HTML.Tag.STRONG)) {
tagCss.put(CSS.Property.FONT_WEIGHT, CSS.Value.BOLD);
}
else if (t.getName().equalsIgnoreCase(HTML.Tag.U) || t.getName().equalsIgnoreCase(HTML.Tag.INS)) {
tagCss.put(CSS.Property.TEXT_DECORATION, CSS.Value.UNDERLINE);
}
else if (t.getName().equalsIgnoreCase(HTML.Tag.S) || t.getName().equalsIgnoreCase(HTML.Tag.STRIKE)
|| t.getName().equalsIgnoreCase(HTML.Tag.DEL)) {
tagCss.put(CSS.Property.TEXT_DECORATION, CSS.Value.LINE_THROUGH);
}
else if (t.getName().equalsIgnoreCase(HTML.Tag.BIG)){
tagCss.put(CSS.Property.FONT_SIZE, CSS.Value.LARGER);
}
else if (t.getName().equalsIgnoreCase(HTML.Tag.SMALL)){
tagCss.put(CSS.Property.FONT_SIZE, CSS.Value.SMALLER);
}
else if (t.getName().equals(HTML.Tag.FONT)) {
String font_family = t.getAttributes().get(HTML.Attribute.FACE);
if (font_family != null) css.put(CSS.Property.FONT_FAMILY, font_family);
String color = t.getAttributes().get(HTML.Attribute.COLOR);
if (color != null) css.put(CSS.Property.COLOR, color);
String size = t.getAttributes().get(HTML.Attribute.SIZE);
if (size != null) {
if(size.equals("1")) css.put(CSS.Property.FONT_SIZE, CSS.Value.XX_SMALL);
else if(size.equals("2")) css.put(CSS.Property.FONT_SIZE, CSS.Value.X_SMALL);
else if(size.equals("3")) css.put(CSS.Property.FONT_SIZE, CSS.Value.SMALL);
else if(size.equals("4")) css.put(CSS.Property.FONT_SIZE, CSS.Value.MEDIUM);
else if(size.equals("5")) css.put(CSS.Property.FONT_SIZE, CSS.Value.LARGE);
else if(size.equals("6")) css.put(CSS.Property.FONT_SIZE, CSS.Value.X_LARGE);
else if(size.equals("7")) css.put(CSS.Property.FONT_SIZE, CSS.Value.XX_LARGE);
}
}
else if (t.getName().equals(HTML.Tag.A)) {
css.put(CSS.Property.TEXT_DECORATION, CSS.Value.UNDERLINE);
css.put(CSS.Property.COLOR, "blue");
}
}
// 解析父类可继承属性
if (null != t.getParent() && null != t.getParent().getCSS()) {
Map parentCss = t.getParent().getCSS();
for (Entry pc : parentCss.entrySet()) {
String key = pc.getKey();
String value = pc.getValue();
if ((tagCss.containsKey(key) && CSS.Value.INHERIT.equalsIgnoreCase(tagCss.get(key))) || (!tagCss.containsKey(key) && canInherite(t, key))) {
if (key.contains(CSS.Property.CELLPADDING) && (HTML.Tag.TD.equals(t.getName()) || HTML.Tag.TH.equals(t.getName()))) {
String paddingKey = key.replace(CSS.Property.CELLPADDING, CSS.Property.PADDING); // 将TD和TH元素cellpadding属性转为padding,PDF元素转换只支持padding属性?
tagCss.put(paddingKey, value);
}else{
css.put(key, value);
}
}
}
}
// 加到最终CSS,如果value!=inherit则覆盖
for (Entry kv : tagCss.entrySet()) {
if (!kv.getValue().equalsIgnoreCase(CSS.Value.INHERIT)) {
if (kv.getKey().equals(CSS.Property.TEXT_DECORATION)) {
String oldValue = css.get(kv.getKey());
css.put(kv.getKey(), mergeTextDecorationRules(oldValue, kv.getValue()));
}else{
css.put(kv.getKey(), kv.getValue());
}
}
}
}
/**
* HtmlPipeline将标签和文本转换为PDF Elements
*
* @author 玄葬
*
*/
public class HtmlPipeline extends AbstractPipeline {
private final HtmlPipelineContext hpc;
public HtmlPipeline(final HtmlPipelineContext hpc, final Pipeline> next) {
super(next);
this.hpc = hpc;
}
@Override
public String getContextKey() {
return HtmlPipeline.class.getName();
}
@Override
public Pipeline> init(final WorkerContext context) throws PipelineException {
context.put(getContextKey(), hpc);
return getNext();
}
@Override
public Pipeline> open(final WorkerContext context, final Tag t, final ProcessObject po) throws PipelineException {
HtmlPipelineContext hcc = getLocalContext(context);
try {
t.setLastMarginBottom(hcc.getMemory().get(HtmlPipelineContext.LAST_MARGIN_BOTTOM));
hcc.getMemory().remove(HtmlPipelineContext.LAST_MARGIN_BOTTOM);
TagProcessor tp = hcc.getProcessor(t.getName(), t.getNameSpace());
addStackKeeper(t, hcc, tp);
List content = tp.startElement(context, t);
if (content.size() > 0) {
if (tp.isStackOwner()) {
StackKeeper peek = hcc.peek();
if (peek == null)
throw new PipelineException(String.format(LocaleMessages.STACK_404, t.toString()));
for (Element elem : content) {
peek.add(elem);
}
} else {
for (Element elem : content) {
hcc.getElements().add(elem);
if (elem.type() == Element.BODY ){
WritableElement writableElement = new WritableElement();
writableElement.add(elem);
po.add(writableElement);
hcc.getElements().remove(elem);
}
}
}
}
} catch (NoTagProcessorException e) {
if (!hcc.acceptUnknown()) {
throw e;
}
}
return getNext();
}
@Override
public Pipeline> content(final WorkerContext context, final Tag t, final String text, final ProcessObject po)
throws PipelineException {
HtmlPipelineContext hcc = getLocalContext(context);
TagProcessor tp;
try {
tp = hcc.getProcessor(t.getName(), t.getNameSpace());
// String ctn = null;
// if (null != hcc.charSet()) {
// try {
// ctn = new String(b, hcc.charSet().name());
// } catch (UnsupportedEncodingException e) {
// throw new RuntimeWorkerException(LocaleMessages.getInstance().getMessage(
// LocaleMessages.UNSUPPORTED_CHARSET), e);
// }
// } else {
// ctn = new String(b);
// }
List elems = tp.content(context, t, text);
if (elems.size() > 0) {
StackKeeper peek = hcc.peek();
if (peek != null) {
for (Element e : elems) {
peek.add(e);
}
} else {
WritableElement writableElement = new WritableElement();
for (Element elem : elems) {
writableElement.add(elem);
}
po.add(writableElement);
}
}
} catch (NoTagProcessorException e) {
if (!hcc.acceptUnknown()) {
throw e;
}
}
return getNext();
}
@Override
public Pipeline> close(final WorkerContext context, final Tag t, final ProcessObject po) throws PipelineException {
HtmlPipelineContext hcc = getLocalContext(context);
TagProcessor tp;
try {
if (t.getLastMarginBottom() != null) {
hcc.getMemory().put(HtmlPipelineContext.LAST_MARGIN_BOTTOM, t.getLastMarginBottom());
} else {
hcc.getMemory().remove(HtmlPipelineContext.LAST_MARGIN_BOTTOM);
}
tp = hcc.getProcessor(t.getName(), t.getNameSpace());
List elems = null;
if (tp.isStackOwner()) {
// remove the element from the StackKeeper Queue if end tag is
// found
StackKeeper tagStack;
try {
tagStack = hcc.poll();
} catch (NoStackException e) {
throw new PipelineException(String.format(
LocaleMessages.getInstance().getMessage(LocaleMessages.STACK_404), t.toString()), e);
}
elems = tp.endElement(context, t, tagStack.getElements());
} else {
elems = tp.endElement(context, t, hcc.getElements());
hcc.getElements().clear();
}
if (elems.size() > 0) {
StackKeeper stack = hcc.peek();
if (stack != null) {
for (Element elem : elems) {
stack.add(elem);
}
} else {
WritableElement writableElement = new WritableElement();
po.add(writableElement);
writableElement.addAll(elems);
}
}
} catch (NoTagProcessorException e) {
if (!hcc.acceptUnknown()) {
throw e;
}
}
return getNext();
}
protected void addStackKeeper(Tag t, HtmlPipelineContext hcc, TagProcessor tp) {
if (tp.isStackOwner()) {
hcc.addFirst(new StackKeeper(t));
}
}
}
/**
* This pipeline writes to a Document.
* @author redlab_b
*
*/
public class PdfWriterPipeline extends AbstractPipeline {
private static final Logger LOG = LoggerFactory.getLogger(PdfWriterPipeline.class);
private Document doc;
private PdfWriter writer;
/**
*/
public PdfWriterPipeline() {
super(null);
}
/**
* @param next the next pipeline if any.
*/
public PdfWriterPipeline(final Pipeline> next) {
super(next);
}
/**
* @param doc the document
* @param writer the writer
*/
public PdfWriterPipeline(final Document doc, final PdfWriter writer) {
super(null);
this.doc = doc;
this.writer = writer;
continiously = true;
}
/**
* The key for the {@link Document} in the {@link MapContext} used as {@link CustomContext}.
*/
public static final String DOCUMENT = "DOCUMENT";
/**
* The key for the {@link PdfWriter} in the {@link MapContext} used as {@link CustomContext}.
*/
public static final String WRITER = "WRITER";
/**
* The key for the a boolean in the {@link MapContext} used as {@link CustomContext}. Setting to true enables swallowing of DocumentExceptions
*/
public static final String CONTINUOUS = "CONTINUOUS";
private Boolean continiously;
/* (non-Javadoc)
* @see com.itextpdf.tool.xml.pipeline.AbstractPipeline#init(com.itextpdf.tool.xml.WorkerContext)
*/
@Override
public Pipeline> init(final WorkerContext context) throws PipelineException {
MapContext mc = new MapContext();
continiously = Boolean.TRUE;
mc.put(CONTINUOUS, continiously);
if (null != doc) {
mc.put(DOCUMENT, doc);
}
if (null != writer) {
mc.put(WRITER, writer);
}
context.put(getContextKey(), mc);
return super.init(context);
}
/**
* @param po
* @throws PipelineException
*/
private void write(final WorkerContext context, final ProcessObject po) throws PipelineException {
MapContext mp = getLocalContext(context);
if (po.containsWritable()) {
Document doc = (Document) mp.get(DOCUMENT);
boolean continuousWrite = (Boolean) mp.get(CONTINUOUS);
Writable writable = null;
while (null != (writable = po.poll())) {
if (writable instanceof WritableElement) {
for (Element e : ((WritableElement) writable).elements()) {
try {
if (!doc.add(e)) {
LOG.trace(String.format(
LocaleMessages.getInstance().getMessage(LocaleMessages.ELEMENT_NOT_ADDED),
e.toString()));
}
} catch (DocumentException e1) {
if (!continuousWrite) {
throw new PipelineException(e1);
} else {
LOG.error(
LocaleMessages.getInstance().getMessage(LocaleMessages.ELEMENT_NOT_ADDED_EXC),
e1);
}
}
}
}
}
}
}
/*
* (non-Javadoc)
*
* @see com.itextpdf.tool.xml.pipeline.Pipeline#open(com.itextpdf.tool.
* xml.Tag, com.itextpdf.tool.xml.pipeline.ProcessObject)
*/
@Override
public Pipeline> open(final WorkerContext context, final Tag t, final ProcessObject po) throws PipelineException {
write(context, po);
return getNext();
}
/*
* (non-Javadoc)
*
* @see com.itextpdf.tool.xml.pipeline.Pipeline#content(com.itextpdf.tool
* .xml.Tag, java.lang.String, com.itextpdf.tool.xml.pipeline.ProcessObject)
*/
@Override
public Pipeline> content(final WorkerContext context, final Tag currentTag, final String text, final ProcessObject po) throws PipelineException {
write(context, po);
return getNext();
}
/*
* (non-Javadoc)
*
* @see com.itextpdf.tool.xml.pipeline.Pipeline#close(com.itextpdf.tool
* .xml.Tag, com.itextpdf.tool.xml.pipeline.ProcessObject)
*/
@Override
public Pipeline> close(final WorkerContext context, final Tag t, final ProcessObject po) throws PipelineException {
write(context ,po);
return getNext();
}
/**
* The document to write to.
* @param document the Document
*/
public void setDocument(final Document document) {
this.doc = document;
}
/**
* The writer used to write to the document.
* @param writer the writer.
*/
public void setWriter(final PdfWriter writer) {
this.writer = writer;
}
@Override
public String getContextKey() {
return PdfWriterPipeline.class.getName();
}
}
在标签----->元素的转换过程中,CSS渲染是最具有技术难度的工作,例如将一个<p>标签渲染成Chunk元素
@Override
public List content(final WorkerContext ctx, final Tag tag, final String content) {
List sanitizedChunks = HTMLUtils.sanitize(content, false);
List l = new ArrayList(1);
for (Chunk sanitized : sanitizedChunks) {
try {
HtmlPipelineContext hpc = getHtmlPipelineContext(ctx);
if ((null != tag.getCSS().get(CSS.Property.TAB_INTERVAL))) {
TabbedChunk tabbedChunk = new TabbedChunk(sanitized.getContent());
if (null != getLastChild(tag) && null != getLastChild(tag).getCSS().get(CSS.Property.XFA_TAB_COUNT)) {
tabbedChunk.setTabCount(Integer.parseInt(getLastChild(tag).getCSS().get(CSS.Property.XFA_TAB_COUNT)));
}
l.add(getCssApplyService().apply(tabbedChunk, tag, hpc));
} else if (null != getLastChild(tag) && null != getLastChild(tag).getCSS().get(CSS.Property.XFA_TAB_COUNT)) {
TabbedChunk tabbedChunk = new TabbedChunk(sanitized.getContent());
tabbedChunk.setTabCount(Integer.parseInt(getLastChild(tag).getCSS().get(CSS.Property.XFA_TAB_COUNT)));
l.add(getCssApplyService().apply(tabbedChunk, tag, hpc));
} else {
l.add(getCssApplyService().apply(sanitized, tag, hpc));
}
} catch (NoCustomContextException e) {
throw new RuntimeWorkerException(e);
}
}
return l;
}
getCssApplyService()方法获取CSS渲染接口,该接口提供统一的apply()方法,自动判断需要渲染的元素类型,找到该元素的渲染类用于渲染CSS,例如ParagraphCssApplier类渲染Paragraph元素大概是这样的:
public Paragraph apply(final Paragraph p, final Tag t, final MarginMemory configuration, final PageSizeContainable psc, final HtmlPipelineContext ctx) {
//when turning html p tag to pdf Paragraph, height should be fixed
final CssUtils utils = CssUtils.getInstance();
float fontSize = FontSizeTranslator.getInstance().getFontSize(t);
if (fontSize == Font.UNDEFINED) fontSize = 0;
float lmb = 0;
boolean hasLMB = false;
Map css = t.getCSS();
for (Entry entry : css.entrySet()) {
String key = entry.getKey();
String value = entry.getValue();
if (CSS.Property.MARGIN_TOP.equalsIgnoreCase(key)) {
p.setSpacingBefore(p.getSpacingBefore() + utils.calculateMarginTop(value, fontSize, configuration));
} else if (CSS.Property.PADDING_TOP.equalsIgnoreCase(key)) {
p.setSpacingBefore(p.getSpacingBefore() + utils.parseValueToPt(value, fontSize));
p.setPaddingTop(utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.MARGIN_BOTTOM.equalsIgnoreCase(key)) {
float after = utils.parseValueToPt(value, fontSize);
p.setSpacingAfter(p.getSpacingAfter() + after);
lmb = after;
hasLMB = true;
} else if (CSS.Property.PADDING_BOTTOM.equalsIgnoreCase(key)) {
p.setSpacingAfter(p.getSpacingAfter() + utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.MARGIN_LEFT.equalsIgnoreCase(key)) {
p.setIndentationLeft(p.getIndentationLeft() + utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.MARGIN_RIGHT.equalsIgnoreCase(key)) {
p.setIndentationRight(p.getIndentationRight() + utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.PADDING_LEFT.equalsIgnoreCase(key)) {
p.setIndentationLeft(p.getIndentationLeft() + utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.PADDING_RIGHT.equalsIgnoreCase(key)) {
p.setIndentationRight(p.getIndentationRight() + utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.TEXT_ALIGN.equalsIgnoreCase(key)) {
p.setAlignment(CSS.getElementAlignment(value));
} else if (CSS.Property.TEXT_INDENT.equalsIgnoreCase(key)) {
p.setFirstLineIndent(utils.parseValueToPt(value, fontSize));
} else if (CSS.Property.LINE_HEIGHT.equalsIgnoreCase(key)) {
if(utils.isNumericValue(value)) {
p.setLeading(Float.parseFloat(value) * fontSize);
} else if (utils.isRelativeValue(value)) {
p.setLeading(utils.parseRelativeValue(value, fontSize));
} else if (utils.isMetricValue(value)){
p.setLeading(utils.parsePxInCmMmPcToPt(value));
}
}
}
if ( t.getAttributes().containsKey(HTML.Attribute.ALIGN)) {
String value = t.getAttributes().get(HTML.Attribute.ALIGN);
if ( value != null ) {
p.setAlignment(CSS.getElementAlignment(value));
}
}
// setDefaultMargin to largestFont if no margin-bottom is set and p-tag is child of the root tag.
/*if (null != t.getParent()) {
String parent = t.getParent().getName();
if (css.get(CSS.Property.MARGIN_TOP) == null && configuration.getRootTags().contains(parent)) {
p.setSpacingBefore(p.getSpacingBefore() + utils.calculateMarginTop(fontSize + "pt", 0, configuration));
}
if (css.get(CSS.Property.MARGIN_BOTTOM) == null && configuration.getRootTags().contains(parent)) {
p.setSpacingAfter(p.getSpacingAfter() + fontSize);
css.put(CSS.Property.MARGIN_BOTTOM, fontSize + "pt");
lmb = fontSize;
hasLMB = true;
}
//p.setLeading(m.getLargestLeading()); We need possibility to detect that line-height undefined;
if (p.getAlignment() == -1) {
p.setAlignment(Element.ALIGN_LEFT);
}
}*/
if (hasLMB) {
configuration.setLastMarginBottom(lmb);
}
ChunkCssApplier chunkCssApplier = (ChunkCssApplier) cssApplyService.getCssApplier(Chunk.class);
Font font = chunkCssApplier.applyFontStyles(t);
p.setFont(font);
// TODO reactive for positioning and implement more
// if(null != configuration.getWriter() && null != css.get("position")) {
// positionNoNewLineParagraph(p, css);
// p = null;
// }
return p;
}
剩下的工作就是写入Writable元素集合到document了,详见项目代码:https://github.com/linfengda/htmlworker.git
转换的效果大概是这样子的:

使用楷体转换的效果大概是这样子的:

实际项目中的效果:
