XML in C
Status
Personal thoughts on what the XML syntax should be. Compare with my earlier notes.
Abstract
XML is a base language for expressing arbitrary structured data in text form. It consists of several modules: core syntax, meta-syntax, linking, style-bindings, and perhaps more. Of these only the core syntax is common to all XML applications. Applications can choose to omit the other modules if they don't need them.
This text describes one possible core syntax, using flex/bison specifications. The most important additions relative to the XML-lang draft of 30 June 1997 include: automatically ignored newlines, attribute defaults, and boolean attributes.
Why this specification?
Split the core syntax and the meta-syntax
The core syntax of XML specifies a very general language within which all XML applications have to stay. Most applications will want to restrict this language, and XML provides an optional module with a meta-syntax for writing those restrictions. The restrictions are often referred to as a DTD, Document Type Definition, after SGML, where this term was introduced.
In the XML-lang draft of 30 June 1997 the core syntax and the meta-syntax are combined into a single draft. The linking module and the planned style-sheet binding modules are kept in separate drafts. There are good reasons for keeping the core syntax and the meta-syntax separate:
- For consistency.
- Because the current meta-syntax is not very good and there are other proposals in separate documents (e.g., XML-data).
- Because it makes the draft easier to read.
- To allow one to be changed without the other.
"RE delenda est"
This Latin phrase means "RE is to be deleted." It refers to a rule from SGML that specifies in which contexts an "RE" (Record-End, SGML-speak for a newline) is to be ignored. The precise rules in SGML are very complicated, but in general a newline is ignored after a start tag and before an end tag. This allows SGML documents to be somewhat pretty-printed, by starting tags on a new line.
XML also has start and end tags, but none of the exceptions of SGML, and so the "RE delenda rule" can be applied without any problems.
In fact, looking at how people write XML and HTML, the rule can well be generalized a bit, to say that a newline before any "<" and after any ">" is to be ignored, whether that "<" is part of a start tag or not.
There is a lot of confusion over this issue. The first applications that are based on XML seem to assume that not only one newline is ignored, but that all whitespace, even multiple lines, is to be ignored. While this allows even more "pretty printing", it also means that a lot of meaningful spaces have to be escaped (as &#32;).
The 30 June draft of XML on the other hand says that no whitespace is to be ignored, not even a single newline.
Most people meanwhile seem to agree that ignoring one newline is a good compromise. It allows tags to be put on separate lines, while not requiring meaningful whitespace to be escaped. That is therefore what the syntax below describes.
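To make the compromise concrete, here is a minimal sketch in C of the generalized rule: drop exactly one newline directly before a "<" and exactly one directly after a ">". The function name strip_tag_newlines is made up for this illustration, and running it as a separate pass over a string is an assumption; the scanner described below expresses the same idea inside its token patterns instead.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the newline starting at s: 2 for CR LF, 1 for a lone CR or LF,
 * 0 if s does not start with a newline. */
static size_t newline_len(const char *s)
{
    if (s[0] == '\r')
        return s[1] == '\n' ? 2 : 1;
    return s[0] == '\n' ? 1 : 0;
}

/* Hypothetical helper: copy `in` to `out`, dropping one newline directly
 * before each '<' and one newline directly after each '>'. All other
 * whitespace is kept. `out` needs room for strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL. */
size_t strip_tag_newlines(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        size_t nl = newline_len(in);

        if (nl > 0 && in[nl] == '<') {
            in += nl;                  /* skip the newline, keep the '<' */
            continue;
        }
        out[n++] = *in;
        if (*in == '>') {
            in++;
            in += newline_len(in);     /* skip at most one newline after '>' */
        } else {
            in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

With this rule, a paragraph written with its tags on separate lines yields the same data as one written on a single line, while a second newline or any spaces next to a tag are still treated as meaningful.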
Default attributes
Especially if an application uses the linking module, it will benefit a lot from being able to specify defaults for attributes. The 30 June draft relies on the meta-syntax to provide default attribute values. This is not a good idea, for several reasons:
- The meta-syntax is not very good and may change.
- Many applications that could benefit from attribute defaults have no use for the rest of the meta-syntax (or cannot afford the cost and complexity of parsing the meta-syntax).
- Restricting the syntax and setting defaults are logically two very different things and should not be mixed so easily.
The syntax below therefore includes an attribute defaulting mechanism that is part of the core syntax.
Boolean attributes
All attributes in XML are by default string valued, although the meta-syntax should be able to restrict that. There are different proposals for doing that. One interesting one is Tim Bray's proposal.
But there is one very simple type that is useful in almost all applications and that can be added to the core syntax without complicating it, and that is booleans. The syntax below therefore includes boolean attributes as well as string-valued ones.
The code
The code is in two parts: a flex tokenizer and a bison grammar. Also included are a test program and a makefile. Below is some documentation for each of them. To download all of them together, download this tarfile.
Flex scanner
(See the source.)
The actual scanner code is very short. After all, there are only 12 tokens to be recognized. The code relies on a few macros that keep the code clear:
- nl: A newline can be either a carriage return, a line feed, or both.
- ws: Whitespace is any sequence of one or more spaces, tabs, carriage returns or line feeds.
- open: The rule that a newline is to be ignored just before a "<" is expressed by this macro, which combines an optional newline and a "<".
- close: Same for the delimiter that signals the end of mark-up: a ">" optionally followed by a newline.
- namestart: This represents all the characters that can start a name (element name, attribute name). This code doesn't try to deal with character encodings (most 8-bit encodings, as well as UTF-8, should work fine, though), so it simply accepts all non-ASCII characters as name start characters. This is probably too lenient, but since all the delimiters in XML are from the ASCII set, it doesn't really matter.
- namechar: All the characters that are allowed in a name, after the first character. The same leniency as for namestart above.
- data: The data in an XML file, i.e., the characters between a start and end tag, are matched by this regular expression, which accepts all characters except a "<", and only accepts a newline if it is not immediately followed by a "<". There may be escaped characters in this data, of the form "&#[0-9]+;" or "&#x[0-9a-f]+;". This program doesn't expand them. To do that would require implementing the character encodings and the program currently doesn't do that.
- string: A string is something between double or single quotes. Like data, it may include escaped characters.
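The leniency of namestart and namechar can be sketched in plain C as follows. This is only an illustration, not the flex source: the function names are hypothetical, and the exact ASCII characters allowed (letters, "_" and ":" at the start; digits, "." and "-" after that) are an assumption of this sketch. The one thing taken from the description above is that every non-ASCII byte is accepted.

```c
#include <assert.h>
#include <stddef.h>

/* Any byte above 127 (an 8-bit character or part of a UTF-8 sequence) is
 * accepted without looking at the encoding, mirroring the leniency
 * described for the scanner. */
static int is_non_ascii(unsigned char c)
{
    return c >= 0x80;
}

/* ASSUMPTION: ASCII letters, '_' and ':' may start a name. */
int is_namestart(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || c == '_' || c == ':' || is_non_ascii(c);
}

/* ASSUMPTION: digits, '.' and '-' are additionally allowed after the
 * first character. */
int is_namechar(unsigned char c)
{
    return is_namestart(c) || (c >= '0' && c <= '9') || c == '.' || c == '-';
}

/* Hypothetical helper: length in bytes of the name at the start of s,
 * or 0 if s does not start with a name. */
size_t name_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    if (!is_namestart((unsigned char)s[0]))
        return 0;
    while (is_namechar((unsigned char)s[n]))
        n++;
    return n;
}
```

Because every delimiter ("<", ">", "=", "/", quotes, whitespace) is an ASCII character that both predicates reject, a name always ends at the right place even though the non-ASCII bytes are never decoded.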
The scanner works in one of two modes (start conditions). The INITIAL mode ignores white space and recognizes names, strings, and most of the other tokens. It is active as the program starts and every time the tokenizer is in between "<" and ">".
The CONTENT mode is entered after the ">" of any start or end tag. In this mode only data, "<", comments, and the start of an attribute defaults declaration are recognized.
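The switching between the two modes can be pictured with a small C sketch. It is a deliberate simplification and not the flex code: it assumes that every ">" seen in INITIAL mode closes a tag (so it ignores ">" inside quoted strings and the special handling of comments and declarations), and the names scan_mode and trace_modes are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum scan_mode { MODE_INITIAL, MODE_CONTENT };

/* Hypothetical helper: for each byte of `in`, store 'I' or 'C' in `modes`
 * to show which mode the scanner is in when it reads that byte.
 * `modes` needs room for strlen(in) + 1 bytes. */
void trace_modes(const char *in, char *modes)
{
    enum scan_mode mode = MODE_INITIAL;   /* active as the program starts */
    size_t i;

    assert(in != NULL && modes != NULL);
    for (i = 0; in[i] != '\0'; i++) {
        modes[i] = (mode == MODE_INITIAL) ? 'I' : 'C';
        if (in[i] == '<')
            mode = MODE_INITIAL;          /* now between "<" and ">" */
        else if (in[i] == '>')
            mode = MODE_CONTENT;          /* data follows the tag */
    }
    modes[i] = '\0';
}
```

For the input "<p>Hi</p>" this labels the first tag as INITIAL, then "Hi" and the "<" that ends the data as CONTENT, and the rest of the end tag as INITIAL again.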
Bison grammar
(See the source.)
The grammar contains only 13 productions, and it could have been shorter and clearer if Bison had accepted some common notations for grammars. The grammar that is actually intended is as follows:
document: prolog element misc*;
prolog: VERSION? ENCODING? misc*;
misc: COMMENT | attribute_decl;
attribute_decl: ATTDEF NAME attribute+ ENDDEF;
element: START attribute* empty_or_content;
empty_or_content: SLASH CLOSE | CLOSE content END NAME? CLOSE;
content: (DATA | misc | element)*;
attribute: NAME (EQ VALUE)?;
... or only 8 productions.
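The last production, attribute: NAME (EQ VALUE)?, is presumably where boolean attributes come from: a name with no "=" and value after it. The following C sketch shows how that one production could be matched against a token array. It is hand-written for illustration and is not the Bison-generated parser; the TOK_ names, the attribute struct and match_attribute are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds, standing in for NAME, EQ and VALUE. */
enum token { TOK_NAME, TOK_EQ, TOK_VALUE, TOK_OTHER };

struct attribute {
    int has_value;   /* 1 for name="value", 0 for a bare (boolean) name */
};

/* Match `NAME (EQ VALUE)?` at the start of toks[0..n).
 * Returns the number of tokens consumed (1 or 3), 0 if no attribute
 * starts here, or -1 if an EQ is not followed by a VALUE. */
int match_attribute(const enum token *toks, size_t n, struct attribute *attr)
{
    assert(toks != NULL && attr != NULL);
    if (n == 0 || toks[0] != TOK_NAME)
        return 0;
    if (n >= 2 && toks[1] == TOK_EQ) {
        if (n < 3 || toks[2] != TOK_VALUE)
            return -1;                 /* "name =" with nothing after it */
        attr->has_value = 1;
        return 3;
    }
    attr->has_value = 0;               /* a bare name: boolean attribute */
    return 1;
}
```

Since the optional part is decided by looking one token ahead for EQ, a bare name followed by another name is simply read as two boolean attributes in a row.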
Test application
(See the source.)
The test application just calls the parser to parse standard input.
Bert Bos
$Date: 1997/07/09 20:44:19 $
Source: https://www.w3.org/XML/9707/XML-in-C