Stanford NLP Tokenization Maven Eclipse Example

By Dhiraj Ray, 10 July,2017  

This tutorial shows how to set up Stanford NLP in the Eclipse IDE with Maven and build an example that tokenizes raw text. We will use Maven to build the project and to declare the Stanford NLP dependencies. Along the way, we will look at how DocumentPreprocessor and PTBTokenizer can be used to tokenize raw text.

What is Stanford Tokenizer

The Stanford Tokenizer divides text into a sequence of tokens, which roughly correspond to "words". Stanford also provides PTBTokenizer, which tokenizes formal English following Penn Treebank conventions.

We will create an example that uses both DocumentPreprocessor and PTBTokenizer to tokenize raw text.
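To make the idea of "a sequence of tokens" concrete, here is a rough sketch that uses only the JDK. It is not Stanford's algorithm: the class name `SimpleTokenizerSketch` and the regular expression are assumptions made for illustration, and the real tokenizers handle many more cases (contractions, abbreviations, URLs and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanford NLP.
public class SimpleTokenizerSketch {

    // A token is either a run of word characters or a single punctuation/symbol character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Words and punctuation marks come out as separate tokens.
        System.out.println(tokenize("Hello, world!"));
    }
}
```

Note that punctuation becomes its own token instead of staying attached to the neighbouring word, which is the same general behaviour you will see in the Stanford output later in this article.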

Project Structure

[Image: project structure]

Maven Dependencies

pom.xml
<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.5.0</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.5.0</version>
        <classifier>models</classifier>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Implementing StandfordTokenizer Using DocumentPreprocessor

StandfordTokenizer.java
package com.devglan;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;

import java.util.List;

public class StandfordTokenizer {

    public DocumentPreprocessor tokenize(String fileName) {
        // DocumentPreprocessor reads the file and splits it into sentences
        DocumentPreprocessor dp = new DocumentPreprocessor(fileName);
        // Each sentence is a list of tokens
        for (List<HasWord> sentence : dp) {
            System.out.println(sentence);
        }
        return dp;
    }
}
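Iterating over a DocumentPreprocessor yields one element per sentence. As a loose, JDK-only illustration of that sentence-boundary step, the sketch below uses `java.text.BreakIterator`. This is a stand-in technique, not what DocumentPreprocessor does internally, and the class name `SentenceSplitSketch` is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Stanford NLP.
public class SentenceSplitSketch {

    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);

        // Walk consecutive boundary pairs and cut the text between them.
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("Hello world. This is a test.")) {
            System.out.println(sentence);
        }
    }
}
```

The Stanford class goes one step further and returns each sentence already tokenized, as a list of words, which is why the loop variable in StandfordTokenizer is a list.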

Implementing StandfordTokenizer Using Stanford PTBTokenizer

PTBTokenizerExample.java
package com.devglan;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class PTBTokenizerExample {

    public Set<CoreLabel> tokenize(String fileName) throws FileNotFoundException {
        Set<CoreLabel> labels = new HashSet<>();
        // The empty string means no extra tokenizer options are set
        PTBTokenizer<CoreLabel> ptbt =
                new PTBTokenizer<>(new FileReader(fileName), new CoreLabelTokenFactory(), "");
        // Collect and print every token produced by the tokenizer
        while (ptbt.hasNext()) {
            CoreLabel label = ptbt.next();
            System.out.println(label);
            labels.add(label);
        }
        return labels;
    }
}

Testing the Application

The following test cases exercise both tokenizer implementations.

TokenizerTest.java
package com.devglan;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.DocumentPreprocessor;
import org.junit.Assert;
import org.junit.Test;

import java.io.IOException;
import java.util.Set;

public class TokenizerTest {

    @Test
    public void SentenceDetectorTest() throws IOException {
        StandfordTokenizer tokenizer = new StandfordTokenizer();
        DocumentPreprocessor dp = tokenizer.tokenize("standford.txt");
        Assert.assertTrue(dp != null);
    }

    @Test
    public void SentencePosDetectorTest() throws IOException {
        PTBTokenizerExample tokenizer = new PTBTokenizerExample();
        // Absolute path to the sample file; adjust it to your own workspace
        Set<CoreLabel> labels = tokenizer.tokenize("C:/D/workspaces/standfordsetupdemo/src/main/resources/standford.txt");
        Assert.assertTrue(labels != null && labels.size() > 0);
    }
}
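The second test depends on an absolute path that only exists on the author's machine. One possible way around this, sketched below with JDK classes only, is to write the sample text to a temporary file and pass the resulting path to the tokenizer. The class name `TempInputFile` and the sample sentence are assumptions made for illustration and are not part of the original project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical test helper, not part of the original project.
public class TempInputFile {

    // Writes the text to a new temporary .txt file and returns its absolute path.
    public static String write(String text) throws IOException {
        Path file = Files.createTempFile("tokenizer-input", ".txt");
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        // Ask the JVM to remove the file when it exits.
        file.toFile().deleteOnExit();
        return file.toAbsolutePath().toString();
    }

    public static void main(String[] args) throws IOException {
        String path = write("Stanford NLP splits text into tokens.");
        System.out.println(path);
    }
}
```

In a test, the returned string could be handed to `tokenizer.tokenize(...)` in place of the hardcoded path, so the test would no longer depend on a particular workspace layout.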

Output

[Image: tokenizer output]

Conclusion

I hope this article gave you what you were looking for. If you have anything to add or share, please leave it in the comment section below.


