Hamshahri Corpus

The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG)^[1] of the University of Tehran. Later, a team headed by Abolfazl AleAhmad^[2] built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.

Version 1.0

The collection contains more than 160,000 articles covering subject categories such as politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average size of 1.8 KB.

The corpus is available for download in two formats:^[2]

Tagged Text: 560 MB
SQL Server 2000 Tables: 712 MB

Version 2.0

The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:

More News: 323,616 text stories in 3,206 XML files (one file for each day)
Increased Time Span: from 22 June 1996 to 13 May 2007
Bigger in Size: 1.42 GB uncompressed
Standard Container: Unicode XML
Included Images: images have been extracted from the news stories and preserved in an additional package, which makes the corpus suitable for image retrieval tasks.
Categorized News: the news stories have been categorized semi-automatically (appropriate for text categorization and classification tasks).
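
The release figures above allow a quick back-of-the-envelope check. The following minimal sketch uses only the numbers quoted in the list (story count, file count, and date range) to estimate the average number of stories per daily file and the share of calendar days that have a file. The variable names are illustrative and not part of the corpus distribution.

```python
from datetime import date

# Figures quoted for version 2.0 of the corpus.
stories = 323_616
xml_files = 3_206
start, end = date(1996, 6, 22), date(2007, 5, 13)

# Calendar days in the covered span, counting both endpoints.
calendar_days = (end - start).days + 1

# Average number of stories in one daily XML file.
stories_per_file = stories / xml_files

# Fraction of calendar days for which a daily file exists.
coverage = xml_files / calendar_days

print(f"calendar days in span: {calendar_days}")
print(f"average stories per file: {stories_per_file:.1f}")
print(f"share of days with a file: {coverage:.1%}")
```

The file count is smaller than the number of calendar days in the span, so some dates have no file. The figures alone do not say why.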

The corpus is available for download in XML format.
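
Because each daily file is plain Unicode XML, it can be processed with a standard streaming parser. The sketch below counts stories per category in one file using Python's xml.etree.ElementTree.iterparse. The element names DOC, CAT and TEXT, the CORPUS root, and the sample data are assumptions made for illustration only. They are not taken from the corpus documentation and should be checked against a real file before use.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# ASSUMPTION: placeholder tag names; verify them against an actual corpus file.
DOC_TAG = "DOC"
CATEGORY_TAG = "CAT"


def count_categories(source, doc_tag=DOC_TAG, category_tag=CATEGORY_TAG):
    """Stream one XML file and count stories per category label."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == doc_tag:
            label = elem.findtext(category_tag, default="(none)").strip()
            counts[label] += 1
            elem.clear()  # release the story's subtree to keep memory low
    return counts


# Made-up sample standing in for one daily file (not real corpus data).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<CORPUS>
  <DOC><CAT>sports</CAT><TEXT>...</TEXT></DOC>
  <DOC><CAT>economics</CAT><TEXT>...</TEXT></DOC>
  <DOC><CAT>sports</CAT><TEXT>...</TEXT></DOC>
</CORPUS>
"""

category_counts = count_categories(io.BytesIO(SAMPLE.encode("utf-8")))
print(category_counts.most_common())
```

For a real file, an open binary file handle or a file path can be passed in place of the in-memory sample. Streaming avoids loading a whole day's stories into memory at once.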

References

[1] DBRG News. Database Research Group. Archived 2017-05-15 at the Wayback Machine.
[2] Hamshahri. Database Research Group. Archived 2017-05-14 at the Wayback Machine.

External links

Hamshahri Corpus Homepage Archived 2017-05-14 at the Wayback Machine
irBlogs Collection Homepage



