Business need for Web Scraping:
Increasingly, organizations need to harvest and structure vast volumes of content posted and maintained on public websites. Almost all Fortune 500 companies already do some form of web scraping, because websites are often the only up-to-date sources of the valuable content they need.
One of the key areas of growing interest is policy, procedure, legal and regulatory content posted and maintained on global websites. For many companies, there are hundreds or thousands of entities and jurisdictions that publish this type of information – and companies need to stay current as it changes. This information exists in numerous formats and languages, using a wide range of presentation approaches.
Much original source material today appears only on the web, or the web version serves as the copy of record. The volume and complexity of this type of information mean that manual approaches would be slow, error-prone, and cost-prohibitive, particularly if the results are needed in a customized file that can be imported directly into internal business systems. The costs multiply when the information needs a daily or weekly update. This is when automation becomes important, and why organizations choose a Web Scraping service.
What is Web Scraping?
Web Scraping, also called web harvesting or web data extraction, is the process of extracting data from websites. It is often performed by web crawler software (sometimes referred to as web spiders or indexers) that navigates the links on a website to download the contents of relevant pages. Many sites also offer RSS feeds that provide an electronic means of downloading updates and changes.
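The crawling idea described above – extract the links on a page and follow the ones that stay on the same site – can be sketched as follows. This is a minimal illustration using only the Python standard library; the `fetch` callable is an assumption standing in for an actual HTTP download step.

```python
# Minimal sketch of link-following crawl logic (illustrative, not a full crawler).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_order(start_url, fetch):
    """Breadth-first walk of same-host links; fetch(url) returns HTML text."""
    host = urlparse(start_url).netloc
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order
```

A real harvester layers authentication, retry handling, and politeness controls on top of this loop, but the visit-and-enqueue core is the same.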
Limitations of typical Web Scraping:
Typical Web Scraping has significant limitations that include:
- Not being configured to handle different authentication needs for different sites
- Over-scraping – ending up with so much content that it is not useful
- Missing metadata – failure to specifically capture the needed metadata
- Limited ability to deal with complex websites – many websites are designed to not be easily crawled
- Repeatability – most crawlers are not implemented as an ongoing process to deal with content changes over time
- Lack of intelligent mining – failure to apply Natural Language Processing (NLP) and artificial intelligence techniques to extract additional context for use in downstream processing
- Lack of flexibility – different content types need to be output differently – some into a database, some as XML to be ingested by a business system, some in human-friendly formats
Beyond simple Scraping – a deeper solution:
A truly useful solution goes beyond web scraping: it is website harvesting combined with AI-based transformation of content into formats useful to the organization.
For updates, some sites provide RSS feeds. But there is often a need to go beyond RSS, because feeds are limited to what the website administrator chooses to provide, which may be incomplete. Feeds may omit metadata, and there are frequently requirements for filtering changes, normalizing content, and publishing into specific formats with accurate metadata.
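Reading an RSS feed for updates is straightforward, which also makes its limits visible: each item carries only the fields the administrator chose to publish. A minimal sketch using the standard library, with an illustrative sample feed:

```python
# Sketch of pulling update items from an RSS 2.0 feed (sample data, stdlib only).
import xml.etree.ElementTree as ET

def parse_rss(feed_xml):
    """Return title/link/pubDate for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # pubDate is optional in RSS -- a gap the harvester must handle
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items

sample = """<rss version="2.0"><channel><title>Updates</title>
<item><title>New regulation</title><link>https://example.gov/reg1</link>
<pubDate>Tue, 09 Jan 2018 10:00:00 GMT</pubDate></item>
</channel></rss>"""
```

Anything the feed omits – richer metadata, historical items, related documents – has to be recovered by crawling the site itself.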
Sites are global and multilingual and contain information in multiple formats, such as HTML, PDF, XML, RTF, and DOCX. This necessitates a deeper solution in which data is downloaded, normalized, structured, and converted into a common XML format with defined metadata, and related content is linked. It is also important that these crawling efforts not look like attacks on the system, which would trigger Distributed Denial of Service (DDoS) alarms.
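The "don't look like an attack" point above usually comes down to polite crawling: honoring robots.txt and spacing out requests to any one host. A minimal sketch with the standard library's `urllib.robotparser`; the class name, user-agent string, and in-memory robots.txt lines are illustrative assumptions.

```python
# Sketch of polite crawling: robots.txt rules plus a per-host request delay.
import time
from urllib import robotparser
from urllib.parse import urlparse

class PoliteGate:
    """Gate each request on robots.txt rules and a minimum per-host delay."""
    def __init__(self, robots_lines, min_delay=2.0):
        self.rp = robotparser.RobotFileParser()
        self.rp.parse(robots_lines)       # robots.txt content as lines
        self.min_delay = min_delay        # seconds between hits on one host
        self.last_request = {}            # host -> time of last fetch

    def allow(self, url, agent="example-harvester"):
        """True if robots.txt permits the URL; sleeps to respect min_delay."""
        if not self.rp.can_fetch(agent, url):
            return False
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()
        return True
```

Production crawlers typically add jitter, honor Crawl-delay directives, and back off on HTTP errors, but throttling and robots.txt compliance are the baseline.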
DCL has developed methods and bots to facilitate high-volume data retrieval from hundreds of websites, in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in both European and Asian languages. We produce a unified data stream which we then convert into XML for ingestion into derivative databases, data analytics platforms, and other downstream systems. This process of normalization and transformation of the content to automate the importing into a customer’s business system helps to maximize business value. A key to successful projects is the depth and quality of up-front analysis to ensure complete and accurate results.
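The normalization step – wrapping content harvested from disparate source formats into one common XML record with defined metadata – might look like the following sketch. The element and attribute names here are illustrative assumptions, not DCL's actual schema.

```python
# Sketch of normalizing harvested content into a common XML record (stdlib only).
import xml.etree.ElementTree as ET

def to_common_xml(doc_id, source_format, language, metadata, body_text):
    """Build a normalized <record> element for downstream ingestion."""
    record = ET.Element("record", id=doc_id, source=source_format, lang=language)
    meta = ET.SubElement(record, "metadata")
    for key, value in metadata.items():
        field = ET.SubElement(meta, "field", name=key)
        field.text = value
    ET.SubElement(record, "body").text = body_text
    return ET.tostring(record, encoding="unicode")
```

Because every record shares one structure regardless of whether the source was HTML, PDF, or DOCX, downstream systems can ingest the stream without per-site logic.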
DCL’s solution harnesses technology in Natural Language Processing and Machine Learning to help enable solutions powered by Artificial Intelligence. With sophisticated automated processes, DCL optimizes content to collect information, streamline compliance, facilitate migration to new systems and databases, maximize reuse potential, and ready it for delivery to all outputs.
Integral elements of DCL’s solution include:
- Filtering programs
- Downloading handler
- Metadata gatherer
- File differencing programs
- Natural Language Processing programs
- Data and content transformation programs
- Secure repository
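The file-differencing element listed above can be illustrated with a simple sketch: fingerprint each page's content per harvest run, then compare runs to classify pages as added, changed, or removed. Function and field names are illustrative.

```python
# Sketch of file differencing between two harvest runs via content hashes.
import hashlib

def content_hash(text):
    """Stable fingerprint of a page's (normalized) content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_runs(previous, current):
    """previous/current map URL -> content hash; report what changed."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(url for url in set(previous) & set(current)
                     if previous[url] != current[url])
    return {"added": added, "removed": removed, "changed": changed}
```

Only the pages flagged as added or changed need to flow through the downstream transformation pipeline on each update cycle, which keeps recurring daily or weekly runs cheap.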
Why outsource your web crawling/scraping projects?
Organizations are increasingly looking to outsource their Web Scraping needs to a trusted partner with the requisite experience. They do this for several key reasons:
- The fact that every website is different
- Dealing with a variety of site authentication needs
- The need for upfront human analysis of target websites and content
- Setting the depth of scraping at the right level
- Correctly managing all of the linked content
- Ensuring that the right metadata is captured
- Recurring sustained need for monitoring and validating changes/updates
- Maintaining an accurate and complete audit trail
- Lack of specific technical expertise
DCL presented the paper "White Hat Web Crawling: Industrial-Strength Web Crawling for Serious Content Acquisition" at Balisage: the Markup Conference 2018 in Rockville, Maryland.
View DCL's webinar "Web Scraping: Science or Art?" presented on January 9, 2018.
Contact DCL for more information.