Java XMLParser

XML is a widely-used text-based format for data as well as instructions. This format arranges information in a hierarchical fashion.

An XML document is made of tags. There are two types of tags: start and end. A start tag is of the format <tagname>, and an end tag is of the format </tagname>. A valid XML document has the following characteristics:

1. Each start tag has a corresponding end tag, with the same tag name. For example:

<html> This is an HTML page</html>

 

<root> <leaf> Leaf </leaf></root>

2. As the '<' and '>' characters are used to start and end a tag, they are reserved (i.e. they cannot appear in valid XML input for any other purpose).

3. Between a start tag and its corresponding end tag, information can take the form of (a) another tag pair, (b) an arbitrary string, or both. For example:

<root>

    Some data

    <child> Some child data </child>

</root>

4. Tags must be properly nested. That is, if tag A starts before tag B, then B must end before A. Thus B is completely enclosed within A. For example:

<html>

</html>

 

<root>

    <leaf>This is a leaf </leaf>

    <leaf>This is another leaf </leaf>

</root>

5. A tag name can contain only the characters a-z, A-Z, 0-9, ':', '_', and '-'. However, a tag name cannot start with a digit or '-'.

Valid examples include:

<html>

    <body0> ... </body0>

    <_xml> ...</_xml>

</html>

Invalid examples include:

<ht<ml>

    <0body> ... </0body>

    <-xml> ...</-xml>

</html>

6. The outermost tag pair is called the root. Every valid XML document has exactly one root. For example, the following is invalid because it has two roots:

<html>

    <body>This is a body </body>

</html>

<html>

    <body>This is another body </body>

</html>

7. Whitespace characters (space, tab) are allowed anywhere except within tag names.
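The tag-name rule (characteristic 5) can be checked one character at a time. The following is a minimal sketch, not part of the provided code: the class name TagNames and its method names are hypothetical, and treating an empty name as invalid is an assumption.

```java
// Hypothetical helper illustrating characteristic 5; all names are made up for this sketch.
final class TagNames {
    private TagNames() { }

    /** True if c may begin a tag name: a letter, ':' or '_'. */
    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == ':' || c == '_';
    }

    /** True if c may appear after the first character: also digits and '-'. */
    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-';
    }

    /** Checks a whole name; an empty name is treated as invalid (assumption). */
    static boolean isValidTagName(String name) {
        if (name.isEmpty() || !isNameStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!isNameChar(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, isValidTagName accepts html, body0 and _xml, and rejects ht<ml, 0body and -xml, matching the examples above.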

An XML parser is a program that reads text, parses it according to the rules above, and produces some result. The result may be as simple as reporting whether the given text forms valid XML.

When an XML document is transmitted over a network, it is received character by character. In this assignment you will implement a parser that receives text one character at a time, validates it, and produces additional textual output.
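Checking that tags match, nest properly, and form a single root (characteristics 1, 4 and 6) comes down to remembering which tags are still open. One common way to do this is a stack. The sketch below assumes tag names have already been extracted; the class NestingTracker and its methods are hypothetical and only illustrate the idea.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: tracks open tags with a stack (innermost tag on top).
final class NestingTracker {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean rootClosed = false;

    /** Records a start tag; returns false if the root has already been closed. */
    boolean start(String name) {
        if (rootClosed) {
            return false; // this would be a second root
        }
        open.push(name);
        return true;
    }

    /** Records an end tag; returns false unless it closes the innermost open tag. */
    boolean end(String name) {
        if (open.isEmpty() || !open.peek().equals(name)) {
            return false; // unmatched or improperly nested end tag
        }
        open.pop();
        if (open.isEmpty()) {
            rootClosed = true;
        }
        return true;
    }

    /** True once the root tag pair has been completely closed. */
    boolean isComplete() {
        return rootClosed;
    }
}
```

For <root><leaf></leaf><leaf></leaf></root> every call returns true and isComplete() is true at the end. For <a><b></a> the call end("a") returns false because b is still the innermost open tag.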

Note: The actual XML specification has many other features, but for this assignment, we will consider only the above features.

2.2 The XMLParser interface

You have been provided an XMLParser interface. This interface contains two methods:

· A method that takes a single character as input, and returns an XMLParser object that is the result of parsing this character along with all characters input before it. The returned object makes it possible to chain inputs:

xmlObj.input('<').input('h').input('t')...

This method also throws a custom InvalidXMLException when the input character causes the inputs given thus far to be invalid XML.

· A method that returns the output of the parser as a String. The nature and format of this output depend on the implementation.
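The provided interface is not reproduced in this document. As a rough sketch of what the description above implies, it might look like the following. The method name input comes from the chaining example; the name output, the checked (rather than unchecked) exception, and the helper class ParserFeeder are assumptions made for illustration only.

```java
// Hypothetical reconstruction; the provided files may declare these differently.
class InvalidXMLException extends Exception {
    InvalidXMLException(String message) {
        super(message);
    }
}

interface XMLParser {
    /** Consumes one character and returns a parser reflecting all input so far. */
    XMLParser input(char c) throws InvalidXMLException;

    /** Returns the implementation-specific output (method name assumed). */
    String output();
}

// Hypothetical convenience: feeds a whole string one character at a time.
final class ParserFeeder {
    private ParserFeeder() { }

    static XMLParser feed(XMLParser parser, String text) throws InvalidXMLException {
        XMLParser current = parser;
        for (int i = 0; i < text.length(); i++) {
            // Always continue with the returned object, as in the chaining example.
            current = current.input(text.charAt(i));
        }
        return current;
    }
}
```

Continuing with the returned object, rather than the original reference, keeps the helper correct whether an implementation returns itself or a new object on each call.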

2.3 XML Validator

You must write a class XMLValidator that implements the provided XMLParser interface. This class acts as a validator of XML, reporting whether the characters given to it collectively form valid XML.

This implementation’s behavior should have the following characteristics:

1. It should check all of the characteristics above regarding valid XML.

2. The output method should return a single word that represents the current status of the input provided thus far. If no inputs have been provided yet, the method should return "Status:Empty". If the inputs provided form complete, valid XML (i.e. all tag names are valid, each start tag has a corresponding end tag, tags are properly nested, and the root tag occurs only once) then the method should return "Status:Valid". If the inputs thus far can be part of valid XML but the data is not yet complete (e.g. a prefix of any of the valid examples above) it should return "Status:Incomplete".

3. It should throw the InvalidXMLException at the input character that causes the XML to be invalid. For example, if the inputs are '<','h','t','m','<','>' then the outputs after each of the first four characters should be "Status:Incomplete" and it should throw an exception with the fifth input. The parser becomes unusable after this (i.e. its behavior if inputs are continued is undefined).
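One possible way to meet these requirements is a small state machine that remembers whether it is inside or outside a tag. The sketch below is illustrative, not a reference solution. The XMLParser and InvalidXMLException declarations are stand-ins so the code compiles on its own, and it makes several assumptions the text leaves open: only whitespace may appear outside the root, no whitespace is accepted inside a tag, and a '<' after the root has closed is rejected immediately.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-ins so this sketch compiles alone; the provided types may differ.
class InvalidXMLException extends Exception {
    InvalidXMLException(String message) { super(message); }
}

interface XMLParser {
    XMLParser input(char c) throws InvalidXMLException;
    String output(); // method name is an assumption
}

class XMLValidator implements XMLParser {
    private enum State { TEXT, TAG_OPEN, START_NAME, END_NAME }

    private final Deque<String> open = new ArrayDeque<>();  // unclosed tags, innermost first
    private final StringBuilder name = new StringBuilder(); // tag name being read
    private State state = State.TEXT;
    private boolean anyInput = false;
    private boolean rootClosed = false;

    @Override
    public XMLParser input(char c) throws InvalidXMLException {
        anyInput = true;
        switch (state) {
            case TEXT: // outside any tag
                if (c == '<') {
                    if (rootClosed) throw new InvalidXMLException("tag after the root was closed");
                    name.setLength(0);
                    state = State.TAG_OPEN;
                } else if (c == '>') {
                    throw new InvalidXMLException("'>' outside a tag");
                } else if (open.isEmpty() && !Character.isWhitespace(c)) {
                    throw new InvalidXMLException("text outside the root"); // assumption
                }
                break;
            case TAG_OPEN: // just after '<'
                if (c == '/' && !open.isEmpty()) {
                    state = State.END_NAME;
                } else if (isNameStart(c)) {
                    name.append(c);
                    state = State.START_NAME;
                } else {
                    throw new InvalidXMLException("bad start of tag: " + c);
                }
                break;
            case START_NAME: // reading the name of a start tag
                if (c == '>') {
                    open.push(name.toString());
                    state = State.TEXT;
                } else if (isNameChar(c)) {
                    name.append(c);
                } else {
                    throw new InvalidXMLException("bad character in tag name: " + c);
                }
                break;
            case END_NAME: // reading the name of an end tag
                if (c == '>') {
                    if (!open.peek().contentEquals(name)) {
                        throw new InvalidXMLException("end tag does not match " + open.peek());
                    }
                    open.pop();
                    rootClosed = open.isEmpty();
                    state = State.TEXT;
                } else {
                    name.append(c);
                    // Fail as soon as the name can no longer match the innermost open tag.
                    if (!open.peek().startsWith(name.toString())) {
                        throw new InvalidXMLException("end tag does not match " + open.peek());
                    }
                }
                break;
        }
        return this;
    }

    @Override
    public String output() {
        if (!anyInput) return "Status:Empty";
        return rootClosed ? "Status:Valid" : "Status:Incomplete";
    }

    private static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == ':' || c == '_';
    }

    private static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-';
    }
}
```

With the inputs '<','h','t','m','<' this sketch reports "Status:Incomplete" after each of the first four characters and throws on the fifth, as in the example above. Comparing an end tag name against the innermost open tag as each character arrives lets a mismatch be reported at the first offending character rather than at the closing '>'.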

2.4 XML Logger

An XML parser can be used to not only check the validity of XML input but also to parse and extract data from it.

You must write an XMLInfoLogger class that implements the XMLParser interface. This class provides a more elaborate output in the form of a log of the tags and data as it detects them.

This implementation’s behavior should have the following characteristics:

1. The output method should return a string that represents the parts of the input that have been successfully processed up to this point:

a. If a start tag <tagname> has been entered, it should add a line Started:tagname to the output.

b. If an end tag </tagname> has been entered, it should add a line Ended:tagname to the output.

c. If there are characters that are not part of a tag, it should add Characters: followed by those characters verbatim (including whitespace) to the output, all on one line unless the characters themselves include newlines. This line is added only once the characters are followed by a valid start or end tag.

2. This class should check for all the validity constraints and throw exceptions in the same manner as specified in earlier sections.

There should be a new line after the last line.
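The rule that character data is logged only once a complete tag follows it suggests holding such characters in a pending buffer. The sketch below shows only that bookkeeping, not the parsing or validation. The class TagLog and its method names are hypothetical, and each log line is assumed to end with a single '\n'.

```java
// Hypothetical helper: builds the log text; deciding when a tag is complete is left to the parser.
final class TagLog {
    private final StringBuilder log = new StringBuilder();
    private final StringBuilder pending = new StringBuilder(); // characters not yet logged

    /** Buffers one character of data found outside a tag. */
    void character(char c) {
        pending.append(c);
    }

    /** Called once a complete start tag has been read. */
    void startTag(String name) {
        flushPending();
        log.append("Started:").append(name).append('\n');
    }

    /** Called once a complete end tag has been read. */
    void endTag(String name) {
        flushPending();
        log.append("Ended:").append(name).append('\n');
    }

    /** Emits buffered characters verbatim, now that a tag has followed them. */
    private void flushPending() {
        if (pending.length() > 0) {
            log.append("Characters:").append(pending).append('\n');
            pending.setLength(0);
        }
    }

    String output() {
        return log.toString();
    }
}
```

Because the pending characters are written only from startTag and endTag, input that stops in the middle of the data, or in the middle of a tag, leaves the log unchanged.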

Sample outputs

For each of the examples below, the output represents the output after all the characters in the input have been entered. Please note that these inputs are partial (i.e. there may be more characters after them, which will change the output accordingly).

Input:

<html> This is a body</html>

Output:

Started:html

Characters: This is a body

Ended:html

Input:

<html> This is \n a body <

Output:

Started:html

Input:

<html> This is    a body</html>

Output:

Started:html

Characters: This is    a body

Ended:html

Input:

<html>_<head> This is a heading</head><p>Paragraph</p></html>

Output:

Started:html

Characters:_

Started:head

Characters: This is a heading

Ended:head

Started:p

Characters:Paragraph

Ended:p

Ended:html

You are not allowed to use any existing XML parsing classes in your implementations!