Jsoup HTML parser - Tutorial & examples

I had heard a lot about Jsoup, and I finally got the chance to use it in one of my projects. This is an introductory tutorial on the Jsoup HTML parser.

What is Jsoup?

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

With Jsoup we are able to:

  • Scrape and parse HTML from a URL, file, or string
  • Find and extract data, using DOM traversal or CSS selectors
  • Manipulate the HTML elements, attributes and text
  • Clean user-submitted content against a safe whitelist, to prevent XSS attacks
  • Output tidy HTML

Using jsoup

To use jsoup in a Maven build, add the following dependency to your pom.xml file.

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.10.2</version>
</dependency>

To use jsoup in your Gradle build, add the following dependency to your build.gradle file (on Gradle 7 and later, the compile configuration no longer exists; use implementation instead).

 compile 'org.jsoup:jsoup:1.10.2'

Examples

1- Parsing an HTML string

In the first example, we are going to parse an HTML string.

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;

  public class JSoupFromStringEx {

      public static void main(String[] args) {
          
          String htmlString = "<html><head><title>My title</title></head>"
                  + "<body>Body content</body></html>";

          Document doc = Jsoup.parse(htmlString);
          String title = doc.title();
          String body = doc.body().text();
          
          System.out.printf("Title: %s%n", title);
          System.out.printf("Body: %s", body);
      }
  }

The example parses an HTML string and outputs its title and body content.

String htmlString = "<html><head><title>My title</title></head>"
    + "<body>Body content</body></html>";

This string contains simple HTML data.

Document doc = Jsoup.parse(htmlString);

With Jsoup's parse() method, we parse the HTML string. The method returns an HTML document.

String title = doc.title();

The document's title() method gets the string contents of the document's title element.

String body = doc.body().text();

The document's body() method returns the body element; its text() method gets the text of the element.

2- Parsing a local HTML file

In the second example, we are going to parse a local HTML file. We use the overloaded Jsoup.parse() method that takes a File object as its first parameter.

        <!-- HTML file -->
        <!DOCTYPE html>
        <html>
            <head>
                <title>My title</title>
                <meta charset="UTF-8">
            </head>
            <body>
                <div id="mydiv">Contents of a div element</div>
            </body>
        </html>

For the example, we use the above HTML file.

    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JSoupFromFileEx {
        
        public static void main(String[] args) throws IOException {
            
            String fileName = "path/to/file.html";
            
            Document doc = Jsoup.parse(new File(fileName), "utf-8"); 
            Element divTag = doc.getElementById("mydiv"); 
            
            System.out.println(divTag.text());
        }
    }

In this example:

   Document doc = Jsoup.parse(new File(fileName), "utf-8"); 

We parse the HTML file with the Jsoup.parse() method.

Element divTag = doc.getElementById("mydiv"); 

With the document's getElementById() method, we get the element by its ID.

 System.out.println(divTag.text());

The text of the tag is retrieved with the element's text() method.

3- Reading a web site's title

In the following example, we scrape and parse a web page and retrieve the content of the title element.

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JSoupTitleEx {

        public static void main(String[] args) throws IOException {
            
            String url = "http://www.aboullaite.com";
            
            Document doc = Jsoup.connect(url).get();
            String title = doc.title();
            System.out.println(title);
        }
    }

In the code example, we read the title of a specified web page.

 Document doc = Jsoup.connect(url).get();

Jsoup's connect() method creates a connection to the given URL. The get() method executes a GET request and parses the result; it returns an HTML document.

String title = doc.title();

With the document's title() method, we get the title of the HTML document.

4- Reading HTML source

The next example retrieves the HTML source of a web page.

    import java.io.IOException;
    import org.jsoup.Jsoup;

    public class JSoupHTMLSourceEx {

        public static void main(String[] args) throws IOException {
            
            String webPage = "http://www.aboullaite.com";

            String html = Jsoup.connect(webPage).get().html();

            System.out.println(html);
        }
    }

The example prints the HTML of a web page.

String html = Jsoup.connect(webPage).get().html();

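The html() method returns the HTML of an element; in our case, the HTML of the whole document. Note that this is the HTML serialized from the parsed document, so it may differ slightly from the raw source the server sent.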

5- Getting meta information

The meta elements of an HTML document provide structured metadata about a web page, such as its description and keywords.

      import java.io.IOException;
      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;

      public class JSoupMetaInfoEx {

          public static void main(String[] args) throws IOException {
              
              String url = "http://www.jsoup.org";
              
              Document document = Jsoup.connect(url).get();

              String description = document.select("meta[name=description]").first().attr("content");
              System.out.println("Description : " + description);

              String keywords = document.select("meta[name=keywords]").first().attr("content");
              System.out.println("Keywords : " + keywords);
          }
      }

The code example retrieves meta information about a specified web page.

 String keywords = document.select("meta[name=keywords]").first().attr("content");

The document's select() method finds elements that match the given query. The first() method returns the first matched element. With the attr() method, we get the value of the content attribute.
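One thing to keep in mind: first() returns null when nothing matches, so the chained attr() call fails on a page that has no such meta tag. Below is a minimal sketch of a more forgiving lookup. It assumes that calling attr() directly on the Elements result returns an empty string when there is no match; the metaContent() helper and the sample HTML are made up for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupMetaSafeEx {

    // Hypothetical helper: returns the content of <meta name="..."> or "" if the tag is absent.
    static String metaContent(Document document, String name) {
        // Elements.attr() reads the attribute from the first match that has it,
        // and returns an empty string when nothing matches.
        return document.select("meta[name=" + name + "]").attr("content");
    }

    public static void main(String[] args) {

        // Sample page with a description but no keywords.
        String html = "<html><head><title>Demo</title>"
                + "<meta name=\"description\" content=\"A sample page\">"
                + "</head><body></body></html>";

        Document document = Jsoup.parse(html);

        System.out.println("Description : " + metaContent(document, "description"));
        System.out.println("Keywords : " + metaContent(document, "keywords"));
    }
}
```

With this variant, a missing keywords tag gives an empty value instead of a NullPointerException.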

The next example parses links from an HTML page.

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JSoupLinksEx {

        public static void main(String[] args) throws IOException {
            
            String url = "http://jsoup.org";

            Document document = Jsoup.connect(url).get();
            Elements links = document.select("a[href]");
            
            for (Element link : links) {
                
                System.out.println("link : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        }
    }

In the example, we connect to a web page and parse all its link elements.

Elements links = document.select("a[href]");

To get a list of links, we use the document's select() method.
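The href values printed this way are whatever the page author wrote, so many of them are relative paths. Here is a small sketch of how they could be turned into absolute URLs with the element's absUrl() method. It parses a made-up HTML string with a base URI instead of going over the network; the absoluteLinks() helper and the sample markup are assumptions for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JSoupAbsLinksEx {

    // Hypothetical helper: collects the absolute URL of every link in the HTML.
    static List<String> absoluteLinks(String html, String baseUri) {

        // The base URI is what relative links are resolved against.
        Document document = Jsoup.parse(html, baseUri);

        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {

        String html = "<a href=\"/download\">Download</a>"
                + "<a href=\"http://example.com/x\">Elsewhere</a>";

        for (String url : absoluteLinks(html, "http://jsoup.org/")) {
            System.out.println("link : " + url);
        }
    }
}
```

When the document comes from Jsoup.connect(url).get(), the base URI is set from the fetched URL, so absUrl() can be called on the links directly.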

6- Sanitizing HTML data

Jsoup provides methods for sanitizing HTML data.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Cleaner;
    import org.jsoup.safety.Whitelist;

    public class JsoupSanitizeEx {
        
        public static void main(String[] args) {
            
            String htmlString = "<html><head><title>My title</title></head>"
                    + "<body><center>Body content</center></body></html>";

            boolean valid = Jsoup.isValid(htmlString, Whitelist.basic());
            
            if (valid) {
                
                System.out.println("The document is valid");
            } else {
                
                System.out.println("The document is not valid.");
                System.out.println("Cleaned document");
                
                Document dirtyDoc = Jsoup.parse(htmlString);
                Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);

                System.out.println(cleanDoc.html());
            }
        }
    }

In the example, we sanitize and clean HTML data.

String htmlString = "<html><head><title>My title</title></head>"
    + "<body><center>Body content</center></body></html>";

The HTML string contains the center element, which is deprecated.

boolean valid = Jsoup.isValid(htmlString, Whitelist.basic());

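The isValid() method tests whether the HTML contains only the elements and attributes permitted by the given whitelist; it does not check that the markup is well-formed. A whitelist is a list of HTML elements and attributes that can pass through the cleaner. Whitelist.basic() permits a basic set of text-formatting tags. In recent jsoup releases, this class is named Safelist.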

Document dirtyDoc = Jsoup.parse(htmlString);
Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);

With the help of the Cleaner, we clean the dirty HTML document.

7- Grabbing all images

This example shows how to use a Jsoup regex attribute selector to grab all image files (png, jpg, gif) from my company website “x-hub.io”.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    public class JsoupImageEx {

        public static void main(String[] args) {

            Document doc;
            try {

                //get all images
                doc = Jsoup.connect("http://x-hub.io").get();
                Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
                for (Element image : images) {

                    System.out.println("\nsrc : " + image.attr("src"));
                    System.out.println("height : " + image.attr("height"));
                    System.out.println("width : " + image.attr("width"));
                    System.out.println("alt : " + image.attr("alt"));

                }

            } catch (IOException e) {
                e.printStackTrace();
            }

        }

    }
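The interesting part is the selector img[src~=(?i)\\.(png|jpe?g|gif)], which keeps the img elements whose src value matches the regular expression. The pattern itself can be tried out with plain java.util.regex, as in the sketch below. It assumes that the selector looks for the pattern anywhere in the attribute value (find semantics); the class name, the helper and the sample values are made up.

```java
import java.util.regex.Pattern;

public class ImageSrcPatternDemo {

    // Same regex as in the selector: (?i) makes it case-insensitive,
    // \. is a literal dot, jpe?g covers both jpg and jpeg.
    private static final Pattern IMAGE_SRC = Pattern.compile("(?i)\\.(png|jpe?g|gif)");

    static boolean looksLikeImage(String src) {
        // find() succeeds if the pattern occurs anywhere in the value.
        return IMAGE_SRC.matcher(src).find();
    }

    public static void main(String[] args) {

        String[] samples = { "/img/logo.PNG", "photo.jpeg?v=2", "banner.svg", "/a.gif.html" };

        for (String src : samples) {
            System.out.println(src + " -> " + looksLikeImage(src));
        }
    }
}
```

The last sample matches as well, because the pattern is not anchored to the end of the value.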

8- Getting form attributes in an HTML page

Getting the input elements of a form in a web page is very simple. Find the form element by its unique id, then find all input elements inside that form.

    // Fetch the page and look up the form by its id.
    // getElementById() returns null if the page has no element with that id.
    Document doc = Jsoup.connect("http://x-hub.io").get();
    Element formElement = doc.getElementById("contactForm");

    // Print the name and value of every input inside the form.
    Elements inputElements = formElement.getElementsByTag("input");
    for (Element inputElement : inputElements) {
        String key = inputElement.attr("name");
        String value = inputElement.attr("value");
        System.out.println("Param name: " + key + " \nParam value: " + value);
    }

9- Performing a Google search

The following example performs a Google search with Jsoup.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupGoogleSearchEx {

        private static Matcher matcher;
        private static final String DOMAIN_NAME_PATTERN
                = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}";
        private static Pattern patrn = Pattern.compile(DOMAIN_NAME_PATTERN);

        public static String getDomainName(String url) {

            String domainName = "";
            matcher = patrn.matcher(url);
            
            if (matcher.find()) {
                domainName = matcher.group(0).toLowerCase().trim();
            }
            
            return domainName;
        }

        public static void main(String[] args) throws IOException {

            String query = "Devoxx Morocco";

            String url = "https://www.google.com/search?q=" + query + "&num=10";

            Document doc = Jsoup
                    .connect(url)
                    .userAgent("Jsoup client")
                    .timeout(5000).get();

            Elements links = doc.select("a[href]");

            Set<String> result = new HashSet<>();

            for (Element link : links) {

                String attr1 = link.attr("href");
                String attr2 = link.attr("class");
                
                if (!attr2.startsWith("_Zkb") && attr1.startsWith("/url?q=")) {
                
                    result.add(getDomainName(attr1));
                }
            }

            for (String el : result) {
                System.out.println(el);
            }
        }
    }

The example creates a search request for the "Devoxx Morocco" term, asking for ten results. It prints the distinct domain names found in the result links.

 private static final String DOMAIN_NAME_PATTERN
    = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}";
 private static Pattern patrn = Pattern.compile(DOMAIN_NAME_PATTERN);

A Google search returns long links from which we want to get the domain names. For this we use a regular expression pattern.

    public static String getDomainName(String url) {

        String domainName = "";
        matcher = patrn.matcher(url);
        
        if (matcher.find()) {
            domainName = matcher.group(0).toLowerCase().trim();
        }
        
        return domainName;
    }

The getDomainName() method returns a domain name from the search link using the regular expression matcher.
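To see what the pattern picks out, here is a small standalone sketch that runs the same regular expression on a few links. The hrefs are made up, in the /url?q=... shape that the example filters on. This variant creates the Matcher locally instead of keeping it in a static field, which is a design choice of the sketch and not part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainNameDemo {

    private static final Pattern DOMAIN_NAME_PATTERN = Pattern.compile(
            "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}");

    static String getDomainName(String url) {
        // A local Matcher keeps the method free of shared mutable state.
        Matcher matcher = DOMAIN_NAME_PATTERN.matcher(url);
        // The first dotted host name found in the string wins.
        return matcher.find() ? matcher.group(0).toLowerCase().trim() : "";
    }

    public static void main(String[] args) {
        System.out.println(getDomainName("/url?q=https://www.Example.com/page&sa=U")); // www.example.com
        System.out.println(getDomainName("/url?q=http://jsoup.org/cookbook/&sa=U"));   // jsoup.org
        System.out.println(getDomainName("/search?q=nothing").isEmpty());              // true
    }
}
```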

String query = "Devoxx Morocco";

This is our search term.

 String url = "https://www.google.com/search?q=" + query + "&num=10";

This is the URL to perform a Google search.
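The search term is concatenated into the URL as-is, which is fine for a simple term but changes the meaning of the URL if the term contains characters such as & or +. A minimal sketch of a safer way to build the URL, using the JDK's URLEncoder; the buildSearchUrl() helper and the second sample term are made up for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {

    // Hypothetical helper: builds the search URL with the query safely encoded.
    static String buildSearchUrl(String query) {
        try {
            // Form encoding: spaces become '+', reserved characters become %XX.
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "https://www.google.com/search?q=" + encoded + "&num=10";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // https://www.google.com/search?q=Devoxx+Morocco&num=10
        System.out.println(buildSearchUrl("Devoxx Morocco"));
        // https://www.google.com/search?q=C%2B%2B+%26+Java&num=10
        System.out.println(buildSearchUrl("C++ & Java"));
    }
}
```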

  Document doc = Jsoup
    .connect(url)
    .userAgent("Jsoup client")
    .timeout(5000).get();

We connect to the URL, set the user agent and a 5 s timeout, and send a GET request. An HTML document is returned.

Elements links = doc.select("a[href]");

From the document, we select the links.

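      Set<String> result = new HashSet<>();

      for (Element link : links) {

          String attr1 = link.attr("href");
          String attr2 = link.attr("class");

          if (!attr2.startsWith("_Zkb") && attr1.startsWith("/url?q=")) {

              result.add(getDomainName(attr1));
          }
      }

We go through the links and keep those whose href starts with /url?q=, skipping the links whose class starts with _Zkb. The domain name of each remaining link is added to a set, so every domain appears only once.

Finally, we print the domain names to the console.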

 for (String el : result) {
   System.out.println(el);
 }

That’s all for this very easy yet very powerful and useful library!