Parse Word Document Using Apache POI

By Dhiraj, 03 June, 2017

In this article we discuss how to read Word documents in Java using the Apache POI library. A Word document may contain images, tables or plain text, and a standard Word file also has headers and footers. In the following examples we parse a Word document by reading its paragraphs, runs, images and tables, along with its headers and footers. We also look at identifying the styles associated with the text, such as font size, font family and font color.

Maven Dependencies

Following is the POI Maven dependency required to read Word documents. Check Maven Central for the latest version of the artifact.

pom.xml

	<dependencies>
		<dependency>
			<groupId>org.apache.poi</groupId>
			<artifactId>poi-ooxml</artifactId>
			<version>3.16</version>
		</dependency>
	</dependencies>

Reading Complete Text from Word Document

The class XWPFDocument is the entry point for reading the contents of a .docx file. To read all the text in a .docx Word document, wrap the document in an XWPFWordExtractor and call its getText() method. Following is an example.

TextReader.java

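import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class TextReader {

	public static void main(String[] args) {
		// try-with-resources closes the stream and the document when done
		try (FileInputStream fis = new FileInputStream("test.docx");
				XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis))) {
			// The extractor collects the text of the whole document
			XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
			System.out.println(extractor.getText());
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}

}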

Reading Headers and Footers of Word Document

Apache POI provides built-in methods to read the headers and footers of a Word document. Following is an example that reads and prints the default header and footer of a Word document. The example .docx file is available in the source, which can be downloaded at the end of this article.

HeaderFooterReader.java

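import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class HeaderFooterReader {

	public static void main(String[] args) {

		try (FileInputStream fis = new FileInputStream("test.docx");
				XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis))) {
			XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);

			// The default header/footer is null when the document defines none
			XWPFHeader header = policy.getDefaultHeader();
			if (header != null) {
				System.out.println(header.getText());
			}

			XWPFFooter footer = policy.getDefaultFooter();
			if (footer != null) {
				System.out.println(footer.getText());
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}

	}

}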

Output

This is Header
This is footer

Read Each Paragraph of a Word Document

Among the many methods defined in the XWPFDocument class, we can use getParagraphs() to read a .docx Word document paragraph by paragraph. This method returns a list of all the paragraphs (XWPFParagraph) of a Word document. XWPFParagraph in turn has many utility methods to extract information about a paragraph, such as its text alignment and the style associated with it.

To give more control over reading the text of a Word document, each paragraph is further divided into multiple runs. A run is a region of text with a common set of properties. Following is an example to read paragraphs from a .docx Word document.

ParagraphReader.java

import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ParagraphReader {

	public static void main(String[] args) {
		try (FileInputStream fis = new FileInputStream("test.docx");
				XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis))) {

			List<XWPFParagraph> paragraphList = xdoc.getParagraphs();

			for (XWPFParagraph paragraph : paragraphList) {

				System.out.println(paragraph.getText());
				System.out.println(paragraph.getAlignment());
				// Number of runs that make up this paragraph
				System.out.println(paragraph.getRuns().size());
				System.out.println(paragraph.getStyle());

				// Returns numbering format for this paragraph, eg bullet or lowerLetter.
				System.out.println(paragraph.getNumFmt());

				System.out.println(paragraph.isWordWrapped());

				System.out.println("********************************************************************");
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}
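
To make the relationship between paragraphs and runs described above more concrete, here is a small sketch that does not use POI at all. It parses a hand-written WordprocessingML fragment (the XML stored inside a .docx file) with the JDK's DOM parser and lists each run with its text and a bold flag. The class name RunStructureDemo, the SAMPLE fragment and the describeRuns helper are made up for illustration; in real code XWPFParagraph.getRuns() gives you the runs directly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RunStructureDemo {

    // WordprocessingML namespace used by the main document part of a .docx file.
    public static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical fragment: one paragraph (w:p) split into two runs (w:r).
    // The second run carries a bold property (w:b) inside its run properties (w:rPr).
    public static final String SAMPLE =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r>"
            + "<w:r><w:rPr><w:b/></w:rPr><w:t>World</w:t></w:r>"
            + "</w:p>";

    // Returns one description per run, in document order.
    public static List<String> describeRuns(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);

                // A run may hold zero or more text elements (w:t); join them.
                StringBuilder text = new StringBuilder();
                NodeList texts = run.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    text.append(texts.item(j).getTextContent());
                }

                // Simplification: any w:b element counts as bold; w:val="false" is ignored.
                boolean bold = run.getElementsByTagNameNS(W_NS, "b").getLength() > 0;
                result.add("'" + text + "' bold=" + bold);
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse paragraph XML", ex);
        }
    }

    public static void main(String[] args) {
        describeRuns(SAMPLE).forEach(System.out::println);
    }
}
```

Although the reader sees a single sentence, the paragraph consists of two runs because the bold formatting starts in the middle of it. This is why style information is read from runs rather than from the paragraph as a whole.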

Reading Tables from Word Document

Following is an example to read the tables present in a Word document. It prints the text of every cell, row by row.

TableReader.java

import java.io.FileInputStream;
import java.util.Iterator;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class TableReader {

	public static void main(String[] args) {
		try (FileInputStream fis = new FileInputStream("test.docx");
				XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis))) {
			Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
			while (bodyElementIterator.hasNext()) {
				IBodyElement element = bodyElementIterator.next();

				// The element itself is the table. Calling element.getBody().getTables()
				// here would list every table of the body again for each table found.
				if (element instanceof XWPFTable) {
					XWPFTable table = (XWPFTable) element;
					System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
					for (XWPFTableRow row : table.getRows()) {
						for (XWPFTableCell cell : row.getTableCells()) {
							System.out.println(cell.getText());
						}
					}
				}
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Reading Styles from Word Document

Styles are associated with the runs of a paragraph. The XWPFRun class has many methods to identify the styles applied to the text, for example whether it is bold, highlighted or capitalized.

StyleReader.java

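import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class StyleReader {

	public static void main(String[] args) {
		try (FileInputStream fis = new FileInputStream("test.docx");
				XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis))) {

			List<XWPFParagraph> paragraphList = xdoc.getParagraphs();

			for (XWPFParagraph paragraph : paragraphList) {

				// Style properties are read per run, not per paragraph
				for (XWPFRun rn : paragraph.getRuns()) {

					System.out.println(rn.isBold());
					System.out.println(rn.isHighlighted());
					System.out.println(rn.isCapitalized());
					System.out.println(rn.getFontSize());
				}

				System.out.println("********************************************************************");
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}

	}

}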
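
The introduction also mentions font family and font color, which the example above does not print. As far as the POI javadoc describes it, XWPFRun offers getFontFamily() and getColor() for this, with getColor() returning the color as a hex string such as "FF0000", or null when the run sets no explicit color; please verify this against the POI version you use. The following sketch, which needs only the JDK, shows one way to turn such a string into red, green and blue components. The class RunColorDemo and the helper toRgb are hypothetical names, and treating "auto" like a missing color is an assumption.

```java
import java.util.Arrays;

public class RunColorDemo {

    // Converts an "RRGGBB" hex string into {red, green, blue}.
    // Returns null when there is no explicit color (null or "auto").
    public static int[] toRgb(String hex) {
        if (hex == null || hex.equalsIgnoreCase("auto")) {
            return null;
        }
        if (!hex.matches("[0-9a-fA-F]{6}")) {
            throw new IllegalArgumentException("Not an RRGGBB color: " + hex);
        }
        int value = Integer.parseInt(hex, 16);
        return new int[] {(value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF};
    }

    public static void main(String[] args) {
        // In real code the string would come from a run, e.g. rn.getColor()
        System.out.println(Arrays.toString(toRgb("FF8000"))); // prints [255, 128, 0]
        System.out.println(Arrays.toString(toRgb(null)));     // prints null
    }
}
```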

Reading Image from Word Document

Following is an example to read the image files embedded in a Word document.

ImageReader.java

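import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ImageReader {

	public static void main(String[] args) {

		try (FileInputStream fis = new FileInputStream("test.docx");
				XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis))) {
			List<XWPFPictureData> pic = xdoc.getAllPictures();
			if (!pic.isEmpty()) {
				// Picture type is an int constant, e.g. Document.PICTURE_TYPE_PNG
				System.out.println(pic.get(0).getPictureType());
				// getData() returns the raw image bytes; print their count
				// instead of the array reference
				System.out.println(pic.get(0).getData().length + " bytes");
			}

		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}

}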
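
The bytes returned by getData() are the image file itself, so a natural next step is to write them to disk. Below is a minimal sketch of that step using only the JDK. The class PictureSaver, the method save and the "image-N" naming scheme are assumptions made for illustration; in real code the bytes would come from XWPFPictureData.getData() and the extension could be taken from its suggestFileExtension() method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PictureSaver {

    // Writes the raw picture bytes to dir/image-<index>.<extension>
    // and returns the path of the file that was written.
    public static Path save(Path dir, int index, String extension, byte[] data) {
        try {
            Files.createDirectories(dir);
            Path target = dir.resolve("image-" + index + "." + extension);
            return Files.write(target, data);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; a real caller would pass pictureData.getData()
        byte[] fakeImage = {(byte) 0x89, 'P', 'N', 'G'};
        Path dir = Files.createTempDirectory("docx-images");
        Path saved = save(dir, 0, "png", fakeImage);
        System.out.println(saved.getFileName() + " (" + Files.size(saved) + " bytes)");
    }
}
```

To export every image rather than only the first, the same helper could be called in a loop over the list returned by getAllPictures(), passing the loop counter as the index.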

Conclusion

I hope this article served what you were looking for. If you have anything to add or share, please share it in the comment section below.

Download source

About The Author

A technology savvy professional with an exceptional capacity to analyze, solve problems and multi-task. Technical expertise in highly scalable distributed systems, self-healing systems, and service-oriented architecture. Technical Skills: Java/J2EE, Spring, Hibernate, Reactive Programming, Microservices, Hystrix, Rest APIs, Java 8, Kafka, Kibana, Elasticsearch, etc.