#compsci

Extensible Markup Language (**XML**) is a [[Markup language|markup language]] and file format for storing, encoding and transmitting data. It is both human-readable and machine-readable.

![[Pasted image 20260514194923.png]]

## Function
XML's main purpose is serialization, that is, storing, transmitting, and reconstructing arbitrary data.

### Overview
XML labels, categorizes, and structurally organizes information.
XML tags represent the data structure and contain metadata; what is within the tags is data, encoded in the way the XML standard specifies.
An additional XML schema (XSD) defines the necessary metadata for interpreting and validating XML.
An XML document that adheres to basic XML rules is "well-formed"; one that adheres to its schema is "valid".

IETF [[Requests for Comments|RFC]] 7303 (superseding the older RFC 3023) defines 3 media types:
- application/xml (text/xml is an alias)
- application/xml-external-parsed-entity (text/xml-external-parsed-entity is an alias)
- application/xml-dtd

These are used for transmitting raw XML files without exposing their internal semantics. RFC 7303 further recommends that XML-based languages be given media types ending in +xml, e.g. image/svg+xml for SVG.

### Terminology
An XML document is a string of **characters** (every legal [[Unicode]] character, except Null, may appear in an XML 1.1 document).

The **processor** (aka **XML parser**) analyzes the markup and passes structured information to an application.

The characters making up an XML document are divided into **markup** and **content**. Generally, strings that constitute markup either begin with `<` and end with `>`, or begin with `&` and end with `;`.

A **tag** is a markup construct that begins with `<` and ends with `>`. Types of tags:
![[Pasted image 20260518061205.png]]

An **element** is a logical document component that either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag.
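
The processor and element definitions above can be made concrete with a small sketch. It assumes Python's standard `xml.etree.ElementTree` module as the processor; the sample document and the helper name `list_events` are invented for illustration and are not part of any specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
# It contains start-tags, end-tags and one empty-element tag (<br/>).
DOC = "<note><to>Alice</to><br/><body>Hello</body></note>"


def list_events(xml_text):
    """Return the (event, tag name) pairs the processor reports, in document order."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)   # hand the raw characters to the processor
    parser.close()          # signal end of input; pending events stay readable
    return [(event, elem.tag) for event, elem in parser.read_events()]


if __name__ == "__main__":
    for event, tag in list_events(DOC):
        print(event, tag)
```

Every element produces one `start` and one `end` event, including the one written as an empty-element tag, which is why `<br/>` counts as a complete element on its own.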
Characters between the start-tag and the end-tag are **content**, and may contain markup, including other elements, which are **child elements**.

An **attribute** is a markup construct consisting of a name-value pair that exists within a start-tag or empty-element tag. E.g.: `<img src="madonna.jpg" alt="Madonna" />` or `<step number="3">Connect A to B.</step>`.
An XML attribute can only have a single value, and each attribute can appear at most once on each element. If you need a list, you have to encode it into a single well-formed attribute value (e.g. space- or comma-separated).

XML documents may begin with an **XML declaration** that describes some information about themselves, e.g. `<?xml version="1.0" encoding="UTF-8"?>`.

### Valid characters
XML documents consist of Unicode characters, except for a small number of excluded control characters.
XML includes facilities for identifying the encoding of the Unicode characters that make up the document, and for expressing characters that, for one reason or another, cannot be used directly.

Escape facilities for including characters that are problematic to include directly:
- `&lt;` represents "<"
- `&gt;` represents ">"
- `&amp;` represents "&"
- `&apos;` represents the apostrophe '
- `&quot;` represents the quotation mark "

All permitted Unicode characters may be represented with a numeric character reference: `&#` + decimal number + `;` (or `&#x` + hexadecimal number + `;`), e.g. `&#20013;` or `&#x4E2D;` for 中.

![[Pasted image 20260518070449.png]]

### Comments
Comments may appear anywhere in a document outside other markup, but they cannot appear before the XML declaration. Comments begin with `<!--` and end with `-->`. The string "--" (double-hyphen) is not allowed inside comments.

### Well-formedness
![[Pasted image 20260518070539.png]]

### Schemas
An XML document may also be valid: it contains a reference to a Document Type Definition (DTD), and its elements and attributes are declared in that DTD and follow the grammatical rules that the DTD specifies.
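
The attribute, escaping and well-formedness rules above can be checked with a short sketch. It assumes Python's standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`); the helper names `is_well_formed` and `describe` and the sample strings are hypothetical. Note that this parser only checks well-formedness; it does not validate a document against a DTD or XSD.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML.

    This checks syntax only; it does not validate against a DTD or XSD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def describe(xml_text):
    """Return (tag name, attributes, text content) of the root element."""
    root = ET.fromstring(xml_text)
    return root.tag, dict(root.attrib), root.text


# Hypothetical sample inputs, invented for illustration.
STEP = '<step number="3">Connect A to B.</step>'   # one attribute, text content
ESCAPED = "<expr>a &lt; b &amp;&amp; c</expr>"      # predefined entity references
CHAR_REF = "<char>&#20013;</char>"                  # numeric character reference
```

With these definitions, `describe(STEP)` exposes the attribute as a name-value pair, `describe(ESCAPED)[2]` and `describe(CHAR_REF)[2]` show the parser decoding entity and character references back into plain characters, and `escape("a < b && c")` performs the reverse step before text is embedded in markup. `is_well_formed` rejects both mismatched end-tags (`<a><b></a></b>`) and an attribute repeated on one element (`<a x="1" x="2"/>`).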
![[Pasted image 20260518070755.png]]
![[Pasted image 20260518070811.png]]
![[Pasted image 20260518070833.png]]
![[Pasted image 20260518070841.png]]

## Applications
XML is used in [[RSS]], Atom, OpenDocument, [[SVG]], [[XMPP]], etc.
Standard parsing library on Linux: [[libxml2]]

## History
XML is derived from SGML (Standard Generalized Markup Language), which grew out of IBM's GML of the 1960s and became an ISO standard in 1986.

The versatility of SGML for dynamic information display was understood by early digital media publishers in the late 1980s. By the mid-1990s some practitioners of SGML had gained experience with the World Wide Web and believed that SGML offered solutions to some of the problems the Web was likely to face as it grew.

**Dan Connolly** added SGML to the list of W3C's activities when he joined the staff in 1995, and work began in mid-1996 when **Jon Bosak** developed a charter and recruited collaborators.

XML was compiled by a working group of eleven members, supported by a ~150-member Interest Group. Technical debate took place on the Interest Group mailing list, and issues were resolved by consensus or, when that failed, by majority vote of the Working Group. A record of design decisions and their rationales was compiled by **Michael Sperberg-McQueen** in late 1997.

**James Clark** served as Technical Lead of the Working Group, notably contributing the empty-element syntax and the name "XML". Other names considered: "MAGMA" (Minimal Architecture for Generalized Markup Applications), "SLIM" (Structured Language for Internet Markup) and "MGML" (Minimal Generalized Markup Language).

The co-editors of the specification were originally **Tim Bray** and **Michael Sperberg-McQueen**, but Bray's acceptance of a consulting engagement with Netscape provoked vociferous protests from Microsoft, and Bray was temporarily asked to resign the editorship. The dispute ended with the appointment of Microsoft's **Jean Paoli** as a third co-editor.

XML 1.0 became a W3C Recommendation on February 10, 1998.
There are two versions of XML. The first (XML 1.0) has undergone minor revisions since its initial publication, without being given a new version number, and is currently in its fifth edition, as published on November 26, 2008.

The second (XML 1.1) was initially published on February 4, 2004, the same day as XML 1.0 3rd Edition, and is currently in its second edition, as published on August 16, 2006. Editor of XML 1.1: **John W. Cowan**

Differences:
- Prior to the 5th edition of XML 1.0, XML 1.0 had stricter requirements than XML 1.1 for the characters available for use in element and attribute names and unique identifiers.
- XML 1.0 allows the use of fewer control characters than XML 1.1.

## Criticism
Too complex. [[JSON]] and [[YAML]] are frequently proposed as simpler alternatives.