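I receive these XML files from a vendor. Each one is a wrapper around the NITF (news) schema and the http://www.xmlnews.org/namespaces/meta# news metadata schema (from Space 1999!).
Unfortunately, they declare no namespace at all on the outer document. This is what they give us: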
<?xml version="1.0"?>
<document>
<nitf>
<head>...</head>
<body>...</body>
etc
</nitf>
<xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">...</xn:Resource>
</document>
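I was trying to see whether I could improve throughput by creating an XML schema collection and parsing against it, but the lack of any namespace declaration in the XML text has me stumped.
I tried putting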
;WITH XMLNAMESPACES (default 'http://iptc.org/std/NITF/2006-10-18/')
SELECT CAST(rawXml as XML(NitfSchemaCollection))
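in front of the cast, but SQL Server does not like it (XML Validation: Declaration not found for element 'document' exception).
I even tried parsing the raw XML into an untyped XML value under ;WITH XMLNAMESPACES and then casting that to XML(NitfSchemaCollection), but I hit the same problem.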
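A minimal sketch of why this error appears, under stated assumptions: DemoDocCollection and the urn:demo namespace are hypothetical stand-ins for NitfSchemaCollection and the NITF namespace. WITH XMLNAMESPACES only supplies bindings to XQuery expressions and FOR XML; it does not change the namespace of elements already present in the text being cast, so the root element stays in no namespace and the validator finds no matching declaration. The last step patches the string, which is exactly the kind of rewriting the question hopes to avoid; it is included only to show the contrast.

```sql
-- Hypothetical demo collection; urn:demo stands in for the NITF namespace.
CREATE XML SCHEMA COLLECTION DemoDocCollection AS N'
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:demo"
            elementFormDefault="qualified">
  <xsd:element name="document">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="nitf" type="xsd:string" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>';
GO

DECLARE @raw nvarchar(max) = N'<document><nitf>story</nitf></document>';

-- The root element is in no namespace, so this should come back empty.
SELECT CAST(@raw AS xml).value('namespace-uri((/*)[1])', 'nvarchar(200)') AS RootNamespace;

-- Expected to fail validation, because the schema declares document in
-- urn:demo while the instance has it in no namespace (left commented out):
-- ;WITH XMLNAMESPACES (DEFAULT 'urn:demo')
-- SELECT CAST(@raw AS xml(DemoDocCollection));

-- Adding the default namespace to the text itself lets the typed cast succeed.
DECLARE @patched nvarchar(max) =
    REPLACE(@raw, N'<document>', N'<document xmlns="urn:demo">');
SELECT CAST(@patched AS xml(DemoDocCollection)) AS TypedDoc;
GO

DROP XML SCHEMA COLLECTION DemoDocCollection;
```

GO is a batch separator understood by SSMS and sqlcmd, not a T-SQL statement; the collection cannot be dropped in the same batch that still references it.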
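So my questions are:
- Short of rewriting the incoming XML documents from the vendor, is there any way to apply an XML schema collection when parsing?
and
- Would typed parsing yield enough of a performance gain to make this worth pursuing further?
We are currently on SQL Server 2008 SP4, but I can try this on a newer instance if that might change anything.
Edit: Here is a sample document. The nitf and xn:Resource nodes both conform to two very old newswire serialization standards. For my schema collection I added both, and tweaked the NITF schema to add the non-standard document node. The schemas are too long for a post, but I can add them if anyone is interested.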
<?xml version="1.0"?>
<document>
<nitf>
<head>
<title>First World Problems: 'Should I cancel my Easter holiday and charter a superyacht to escape coronavirus?'</title>
</head>
<body>
<body.head>
<hedline>
<hl1>First World Problems: 'Should I cancel my Easter holiday and charter a superyacht to escape coronavirus?'</hl1>
</hedline>
<byline>
<bytag>By Caroline White</bytag>
</byline>
<distributor>Telegraph Group</distributor>
</body.head>
<body.content>
<p><em>'I am thinking of cancelling my Easter holiday and chartering a yacht to whisk my immediate family off to sea. The idea is that we can still enjoy the trip of a lifetime without risking contracting the coronavirus. How would you recommend proceeding?'</em></p>
<p>If you’ve got the wallet for it, a superyacht charter offers the most luxurious seclusion on the planet – and like the hand sanitiser aisle in Boots, you’re not the first to think of it. Some brokers anticipate an uptick in superyacht sales, as UHNWI look to create safe havens, and wealthy holidaymakers are likely to follow suit. So get moving.</p>
<p>The first step is to recruit a charter broker – try Fraser, Burgess, YPI or <org value="ACORN:3601037911" idsrc="xmltag.org" >Camper &amp; Nicholsons</org>. They will gauge your budget, preferences and read your personality (are you too formal for that laid-back Aussie captain; are you too wild for that silver-service English crew) then come back to you with a bespoke selection of options. The next step is a rather blissful journey through yacht brochures. Then there are the itineraries to flick through: beach barbeques, diving days and suppers under the stars…</p>
...blah blah blah...
<p><em><em>If you have a question for any of our Telegraph Luxury experts, on any topic, please email <a href="http://mailto:[email protected]/">[email protected]</a></em></em></p>
<p><em>Last week on First World Problems</em></p>
<p><a href="https://www.telegraph.co.uk/luxury/womens-style/first-world-problems-expensive-blonde-highlights-mayfair-salon/">First World Problems: 'Are expensive highlights at a Mayfair salon worth the price-and the journey?'</a></p>
<p><em><em>Sign up for the <a href="https://www.telegraph.co.uk/newsletters/Luxury/">Telegraph Luxury newsletter</a> for your weekly dose of exquisite taste and expert opinion.</em></em></p>
</body.content>
</body>
</nitf>
<xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
<xn:providerName>Telegraph Group</xn:providerName>
<xn:providerCode>127</xn:providerCode>
<xn:serviceName>Telegraph Online</xn:serviceName>
<xn:serviceCode>2</xn:serviceCode>
<xn:resourceID>202003100715TELEGR__ONLINE___60979152</xn:resourceID>
<xn:publicationTime>2020-03-10T07:15:00-04:00</xn:publicationTime>
<xn:receivedTime>2020-03-10T07:50:43-04:00</xn:receivedTime>
<xn:title>First World Problems: 'Should I cancel my Easter holiday and charter a superyacht to escape coronavirus?'</xn:title>
<xn:rendition>202003100715TELEGR__ONLINE___60979152.xml</xn:rendition>
<xn:vendorData>WAVO:Publish Reason=CORRECTED</xn:vendorData>
<xn:vendorData>WAVO:alert=FALSE</xn:vendorData>
<xn:vendorData>WAVO:headline_only=FALSE</xn:vendorData>
<xn:vendorData>WAVO:temporary=FALSE</xn:vendorData>
<xn:vendorData>AMX:Publish Reason=CORRECTED</xn:vendorData>
<xn:vendorData>AMX:Alert=FALSE</xn:vendorData>
<xn:vendorData>AMX:Headline Only=FALSE</xn:vendorData>
<xn:vendorData>AMX:Temporary=FALSE</xn:vendorData>
<xn:vendorData>AMX:Special Code=PS/p.TELEGR__</xn:vendorData>
<xn:vendorData>AMX:Special Code=PS/s.ONLINE__</xn:vendorData>
<xn:copyright>Copyright © 2020 Telegraph.co.ukk. All rights reserved</xn:copyright>
<!-- Entity Extractor -->
<xn:companyCode>ACORN:A.3601037911#6#60#60</xn:companyCode>
<xn:companyCode>ACORN:A.2295203068#6#60#60</xn:companyCode>
<xn:industryCode>IC/fini#6#50#60</xn:industryCode>
<xn:industryCode>IC/fini.bank#6#60#60</xn:industryCode>
<xn:industryCode>IC/fini.invs#6#60#60</xn:industryCode>
<xn:industryCode>IC/fini.secr#6#60#60</xn:industryCode>
<xn:industryCode>IC/svcs#6#50#60</xn:industryCode>
<xn:industryCode>IC/svcs.prof#6#60#60</xn:industryCode>
<xn:locationCode>LB/car#7#70#49</xn:locationCode>
<xn:locationCode>LR/car#9#70#90</xn:locationCode>
<xn:locationCode>LU/car#9#70#90</xn:locationCode>
<xn:locationCode>LU/car.any#7#49#70</xn:locationCode>
<xn:subjectCode>NZ/COID#6#50#60</xn:subjectCode>
<xn:subjectCode>NZ/COID.1475554280#6#60#60</xn:subjectCode>
<xn:subjectCode>NZ/COID.27088#6#60#60</xn:subjectCode>
<xn:subjectCode>NZ/COID.5838940#6#60#60</xn:subjectCode>
<!-- Classifier -->
<xn:subjectCode>IS/lifesoc.privair#5#50#50</xn:subjectCode>
<xn:subjectCode>MC/HOT#6</xn:subjectCode>
<xn:subjectCode>NC/67115358#9#98#50</xn:subjectCode>
<xn:subjectCode>NC/67115586#5#55#50</xn:subjectCode>
<xn:subjectCode>NC/67119129#5#58#50</xn:subjectCode>
<xn:subjectCode>NC/67119169#5#50#50</xn:subjectCode>
<xn:vendorData>AMX:Special Code=PT/updated</xn:vendorData>
<xn:subjectCode>XC/any#6#50#60</xn:subjectCode>
<xn:subjectCode>XC/any.company#6#60#50</xn:subjectCode>
<xn:subjectCode>XC/Private#6#60#50</xn:subjectCode>
<!-- Rules -->
<xn:subjectCode>MC/BIZREL#1</xn:subjectCode>
<xn:subjectCode>NE/BAYERINS#5#58#50</xn:subjectCode>
<xn:subjectCode>NE/GEOAMER#9#70#90</xn:subjectCode>
<xn:subjectCode>NE/GEOCARIB#9#70#90</xn:subjectCode>
<xn:industryCode>NI/Banks#6#60#60</xn:industryCode>
<xn:industryCode>NI/Finance#6#60#60</xn:industryCode>
<xn:industryCode>NI/Securities#6#60#60</xn:industryCode>
<xn:industryCode>NI/Services#6#60#60</xn:industryCode>
<xn:vendorData>AMX:Special Code=TL/americas#7#70#50</xn:vendorData>
<xn:vendorData>AMX:Special Code=TL/LOC#7#50#70</xn:vendorData>
<xn:vendorData>AMX:Special Code=TT/TOPIC#5#50#50</xn:vendorData>
<xn:vendorData>AMX:Special Code=TT/transport#5#50#50</xn:vendorData>
<xn:language>en</xn:language>
</xn:Resource>
</document>
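Our processing has to parse these documents, and then we try to normalize some of the metadata attributes into various tables and columns.
When it simply parses unknown XML, I assume SQL Server has to start from an empty name table for every document it parses; I assume a typed XML column starts with a known vocabulary and should therefore be faster. I am also hoping the XQuery will be faster.
Here is an example of a query we run during processing: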
;WITH XMLNAMESPACES ('http://www.xmlnews.org/namespaces/meta#' AS xn)
Insert Into dbo.NewsStory
Select NewsID,provider,service,
CASE When provider='AMSPIDER' and Service='ACBJ' and PublicationAbbrev='web.site' Then dbo.fnGetSpiderPubAbbrev(PublicationAbbrev_Spider) Else PublicationAbbrev End As PublicationAbbrev,
Title, PublishDate, AMXReceivedTime, AllowedReleaseTime,ParsedDate,DateLine, Description, [Language], PublishReason, IsAlert, IsHeadLine, IsTemporary, Copyright
From (
Select X.NewsID,
replace(RIGHT(RS.c.value('(./xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/p."]/text())[1]', 'VARCHAR(50)'),8) , '_', '') as provider,
replace(RIGHT(RS.c.value('(./xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/s."]/text())[1]', 'VARCHAR(50)'),8) , '_', '') as service,
CONVERT(NVARCHAR(max),RS.c.query('xn:vendorData')) as PublicationAbbrev,
replace(RS.c.value('(./xn:vendorData[substring((./text())[1],1,11)="AMX:Credit="]/text())[1]', 'VARCHAR(200)'),'AMX:Credit=', '') as PublicationAbbrev_Spider,
RS.c.value('(./xn:title/text())[1]', 'VARCHAR(200)') AS Title,
CONVERT(DATETIME,REPLACE(LEFT(RS.c.value('(./xn:publicationTime/text())[1]', 'VARCHAR(50)'),19),'T',' ')) AS PublishDate,
CONVERT(DATETIME,REPLACE(LEFT(RS.c.value('(./xn:receivedTime/text())[1]', 'VARCHAR(50)'),19),'T',' ')) AS AMXReceivedTime,
CONVERT(DATETIME,REPLACE(LEFT(RS.c.value('(./xn:releaseTime/text())[1]', 'VARCHAR(50)'),19),'T',' ')) AS AllowedReleaseTime, getdate() as ParsedDate,
RS.c.value('(./xn:dateline/text())[1]', 'VARCHAR(200)') AS DateLine,
RS.c.value('(./xn:description/text())[1]', 'VARCHAR(2000)') AS Description,
RS.c.value('(./xn:language/text())[1]', 'VARCHAR(10)') AS [Language],
LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((.)[1],1,19)="AMX:Publish Reason="])[1]','VARCHAR(45)'),20,25)) AS PublishReason,
CASE LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((./text())[1],1,10)="AMX:Alert="]/text())[1]','VARCHAR(45)'),11,10)) WHEN 'FALSE' THEN 0 ELSE 1 END AS IsAlert,
CASE LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((./text())[1],1,18)="AMX:Headline Only="]/text())[1]','VARCHAR(45)'),19,10)) WHEN 'FALSE' THEN 0 ELSE 1 END AS IsHeadLine,
CASE LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((./text())[1],1,14)="AMX:Temporary="]/text())[1]','VARCHAR(45)'),15,10)) WHEN 'FALSE' THEN 0 ELSE 1 END AS IsTemporary,
RS.c.value('(./xn:copyright/text())[1]', 'VARCHAR(1000)')AS Copyright
From @XmlFileTable X CROSS APPLY AMXFile.nodes('/document/xn:Resource') RS(c)
) A
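To make the string handling in that query easier to follow, here is a small sketch that runs three of its expressions unchanged against a cut-down document. The @doc variable and its contents are assumptions made for illustration, taken from the sample document above. Note that converting a 'yyyy-mm-dd hh:mi:ss' string to DATETIME depends on the session's language and DATEFORMAT settings, so the date column is only reliable under settings such as us_english.

```sql
-- Cut-down sample (assumption): only the nodes these expressions read.
DECLARE @doc xml = N'<document>
  <xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
    <xn:publicationTime>2020-03-10T07:15:00-04:00</xn:publicationTime>
    <xn:vendorData>AMX:Special Code=PS/p.TELEGR__</xn:vendorData>
    <xn:vendorData>AMX:Special Code=PS/s.ONLINE__</xn:vendorData>
  </xn:Resource>
</document>';

;WITH XMLNAMESPACES ('http://www.xmlnews.org/namespaces/meta#' AS xn)
SELECT
    -- The 22-character prefix selects the vendorData node; the last 8
    -- characters are the padded code, and the padding underscores are removed.
    REPLACE(RIGHT(RS.c.value('(./xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/p."]/text())[1]', 'VARCHAR(50)'), 8), '_', '') AS provider, -- TELEGR
    REPLACE(RIGHT(RS.c.value('(./xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/s."]/text())[1]', 'VARCHAR(50)'), 8), '_', '') AS service,  -- ONLINE
    -- Keeps the first 19 characters (drops the UTC offset) and swaps T for a space.
    CONVERT(DATETIME, REPLACE(LEFT(RS.c.value('(./xn:publicationTime/text())[1]', 'VARCHAR(50)'), 19), 'T', ' ')) AS PublishDate
FROM @doc.nodes('/document/xn:Resource') AS RS(c);
```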
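The schema collection comes from the NITF source ( https://www.iptc.org/std/NITF/3.6/specification/nitf-3-6.xsd ) and the xmlnews DTD ( http://www.xmlnews.org/dtds/xmlnews-meta-dtd.zip ).
I used Visual Studio to convert the xmlnews DTD into a schema and used that to seed NitfSchemaCollection.
Then I adjusted the NITF schema as follows:
- Removed the include (apparently a small subset of Ruby that I don't need).
- Added to the header: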
... xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
<import namespace="http://www.xmlnews.org/namespaces/meta#" />
- Added a document element above the nitf element declaration to match what the vendor sends us, for example:
<element name="document">
  <complexType>
    <sequence>
      <element ref="nitf:nitf" minOccurs="1" maxOccurs="1" />
      <element ref="xn:Resource" minOccurs="1" maxOccurs="1" />
    </sequence>
  </complexType>
</element>
Each document has exactly one nitf node and one xn:Resource node, but the child nodes under xn:Resource can occur many times.
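The part of the XML you are shredding is not governed by a schema but by a DTD, so you cannot use a schema collection to make SQL Server parse it any differently. That said, I have yet to see a case where a schema helps much when you are shredding an XML document into tables, and validating the XML against the schema adds overhead.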
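There are some things you can do in your queries to make them more efficient.
In the query below, I changed the handling of dates, moved text() in the predicates to before the predicate and used . inside the predicate, and used exist() where you check for a Boolean value. One thing I noticed in my tests is that the rewrite did not go parallel, so keep that in mind when comparing performance. You may like that it uses only one thread on a busy server, or you may want to use everything you have. If you want the query to go parallel, you can use the trace flag hint OPTION(QUERYTRACEON 8649); if you prefer the serial plan, use option (maxdop 1). In my tests on SQL Server 2008, the rewrite was about twice as fast.
Have a look at what I did here, and if you like it, use it and test it on your data.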
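As a rough sketch of the kind of rewrite described, and not necessarily the exact query from this answer, the three changes might look like this on a cut-down document. The sample data, the varchar(19) plus CONVERT style 126 date handling, and the choice of MAXDOP 1 are all assumptions made for illustration; the table variable and column names follow the question's query.

```sql
-- Cut-down sample (assumption): one row with only the nodes used below.
DECLARE @XmlFileTable TABLE (NewsID int, AMXFile xml);
INSERT INTO @XmlFileTable (NewsID, AMXFile) VALUES (1, N'<document>
  <xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
    <xn:publicationTime>2020-03-10T07:15:00-04:00</xn:publicationTime>
    <xn:vendorData>AMX:Alert=FALSE</xn:vendorData>
    <xn:vendorData>AMX:Special Code=PS/p.TELEGR__</xn:vendorData>
  </xn:Resource>
</document>');

;WITH XMLNAMESPACES ('http://www.xmlnews.org/namespaces/meta#' AS xn)
SELECT
    X.NewsID,
    -- text() comes before the predicate, and the predicate tests the context item (.)
    REPLACE(RIGHT(RS.c.value('(xn:vendorData/text()[substring(., 1, 22) = "AMX:Special Code=PS/p."])[1]', 'varchar(50)'), 8), '_', '') AS provider,
    -- varchar(19) truncates the UTC offset; style 126 parses the ISO 8601 form
    -- (with the T kept) independently of the session language settings
    CONVERT(datetime, RS.c.value('(xn:publicationTime/text())[1]', 'varchar(19)'), 126) AS PublishDate,
    -- exist() answers the Boolean question without extracting and comparing a string
    CASE WHEN RS.c.exist('xn:vendorData/text()[. = "AMX:Alert=FALSE"]') = 1 THEN 0 ELSE 1 END AS IsAlert
FROM @XmlFileTable AS X
CROSS APPLY X.AMXFile.nodes('/document/xn:Resource') AS RS(c)
OPTION (MAXDOP 1); -- serial plan; remove the hint to let the optimizer decide
```

As in the original query, IsAlert is 1 whenever no AMX:Alert=FALSE entry is present, including when the entry is missing altogether.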