eru – Translation – Keybot Dictionary

Keybot 146 results  dannychoo.com  Page 2
  Málföng  
Háskólinn í Reykjavík og Máltæknisetur stóðu fyrir söfnun íslenskra raddsýna í samstarfi við fyritækið Google. Gögnin eru aðgengileg fyrir almenning hér á síðunni og er þetta kjörið tækifæri til að þróa ýmsan máltæknibúnað fyrir íslensku, eins og til dæmis talgreini.
Reykjavík University and The Icelandic Centre for Language Technology collected data for an Icelandic speech corpus in collaboration with Google. The data is available to everyone on this webpage, which offers a good opportunity to develop language technology tools for Icelandic such as a speech recognizer. Voice samples from 563 individuals were recorded with Android G1 smartphones, a total of 152 hours of speech. In total 127,286 voice samples were recorded. Of those, 108,568 were considered useful and 18,718 were discarded. The 108,568 good voice samples can be downloaded from this webpage.
  Málföng  
Fornmálstextarnir eru aðgengilegir til notkunar á tvenns konar hátt:
The texts of the Saga Corpus are available for use in two different ways:
  Málföng  
Þegar mörkun var lokið voru mörk í þeim hluta texta Sturlungu sem eru í þjálfunarsafninu færð í rétt horf miðað við þær skrár sem höfðu verið leiðréttar áður.
After tagging was completed, the tags in the portion of the Sturlunga texts that also belongs to the training corpus were restored to the values in the previously corrected files.
  Málföng  
Alþingisumræður er málheild með íslensku talmáli. Í málheildinni eru ræður frá Alþingi Íslendinga, alls rúmir tuttugu klukkutímar af upptökum ásamt umritun þeirra í texta. Hljóð- og textaskrár eru samstilltar.
Alþingisumræður (Parliament Speech Corpus) is an Icelandic spoken language corpus. It contains just over twenty hours of recorded speeches from the Icelandic Parliament together with their transcriptions. The sound and text files are synchronized.
  Málföng  
Íðorðabankinn sinnir þessu hlutverki. Hann getur veitt yfirsýn yfir íslenskan íðorðaforða og nýyrði úr almennu máli, sem eru efst á baugi, og stuðlað með því að auknu samræmi í notkun orðanna svo og skilgreiningunum á þeim.
One of the roles of the Term Bank is to standardize the use of terms within related and unrelated subject fields. The aim is to prevent many different terms from being used for the same concept or phenomenon. The Term Bank provides an overview of Icelandic terminology and topical neologisms and thereby makes it easier to coordinate and standardize term usage. Additionally, the Term Bank provides access to Icelandic translations of foreign terms, and to definitions of terms in Icelandic and other languages. The Term Bank thus benefits all those who write about specialized topics, such as translators, teachers, students, journalists, government agencies, businesses and any other interested people, and, last but not least, compilers of dictionaries.
  Málföng  
Textasafnið inniheldur spurningar um veðrið (meðalstór orðaforði). Heildarorðaforði fyrir þetta ákveðna svið er um 2000 orðmyndir. Hver málhafi les 20 segðir og er mismunandi eftir málhöfum hverjar þær eru.
The text collection contains questions about the weather (medium-sized vocabulary). The total vocabulary for this particular topic is about 2000 word forms. Each speaker reads 20 utterances, which differ from speaker to speaker.
  Málföng  
Í málheildinni eru 20 málhafar, 10 kvenmenn og 10 karlmenn. Hljóðskrárnar fyrir hvern málhafa eru í undirmöppum. Í möppu 'm7' eru skrár fyrir karlmann númer 7. Hver málhafi les um það bil 200 setningar úr upplýsingum um veðurfar.
The database includes 20 readers, 10 female and 10 male. The sound files for each reader are in a separate subfolder; folder 'm7', for example, holds the files for male reader number 7. Each reader reads roughly 200 sentences that correspond to weather information queries.
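The folder naming described above can be read mechanically. The following is a minimal sketch, not part of the corpus distribution: the 'm' prefix for male readers comes from the description, while the 'f' prefix for female readers and the function name are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Speaker(NamedTuple):
    gender: str
    number: int


# 'm' (male) is taken from the description of folder 'm7'.
# 'f' (female) is an assumption; the source does not name the female prefix.
PREFIXES = {"m": "male", "f": "female"}


def parse_speaker_folder(name: str) -> Speaker:
    """Split a folder name such as 'm7' into gender and reader number."""
    match = re.fullmatch(r"([mf])(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized speaker folder: {name!r}")
    return Speaker(PREFIXES[match.group(1)], int(match.group(2)))


print(parse_speaker_folder("m7"))
```

If the actual folders use a different prefix for female readers, only the PREFIXES table and the regular expression need to change.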
  Málföng  
Efnið er í tólf hlutum og er hljóðritað á tímabilinu október 2004 til maí 2005. Upptökurnar eru mislangar, allt frá nokkrum mínútum upp í fáeinar klukkustundir. Í heild eru þær meira en 20 tímar. Hljóðskrárnar eru á mp3-sniði.
The corpus consists of twelve sequences recorded between October 2004 and May 2005. The recordings vary in length, ranging from a few minutes to a few hours. In total they are more than 20 hours in length. The audio files are in MP3 format.
  Málföng  
áratug síðustu aldar. Einnig eru í listanum orð úr stafrænum textum (skáldsögum, blaðagreinum o.þ.h.) sem Baldur Jónsson prófessor (1930-2009) hafði safnað. Byrjað var á að skipta um 10.000 orðum með sérstöku forriti og orðskiptingar síðan leiðréttar handvirkt.
The word list is based on headwords from the Icelandic-Danish Dictionary by Sigfús Blöndal. The words were keyed in at the Icelandic Language Institute (which is now part of the Árni Magnússon Institute for Icelandic Studies) in the mid-1980s. The list also contains words from digital texts (novels, newspaper articles etc.) which Professor Baldur Jónsson (1930-2009) had collected. The project began by hyphenating about 10,000 words with a special program and correcting the hyphenation manually. The result of that work was used to produce rules for hyphenating the next portion, which was then corrected manually. This procedure was repeated until all the words had been hyphenated. Various people worked on the manual correction. Baldur made the final decisions on hyphenation and finally checked the whole word list.
  Málföng  
Gögn verkefnisins eru vistuð í tveimur aðgreindum gagnagrunnum. Grunnurinn Þesárus geymir allan efniviðinn. Valinn hluti gagnanna er afmarkaður og birtur á vefsíðunni ordanet.is.
The project data is stored in two separate databases. The database Þesárus contains all the material. A selected part of the data is delimited and published on the website ordanet.is as a separate entity.
  Málföng  
Fletturnar eru merkingarlega einræðar og það hefur mótandi áhrif á lýsingu merkingarvenslanna. Einræðingin hefur m.a. víðtæk áhrif á flettumyndir sagna þar sem rökliðirnir hverju sinni eru hluti af flettustrengnum og sagnasambönd af ýmsu tagi fá sjálfstæða stöðu innan flettulistans.
The lemmas are semantically unambiguous, which has a profound impact on the description of the semantic relations. For example, the arguments of verbs are taken to be part of the lemma, and verbal combinations of various kinds have independent status within the lemma list.
  Málföng  
Markamengið sem er notað er það sem var þróað fyrir gerð Íslenskrar orðtíðnibókar með nokkrum breytingum. Sérnöfn eru ekki greind í mannanöfn, staðarnöfn og önnur nöfn. Markinu v var bætt við fyrir vefföng og tölvupóstföng.
The corpus was tagged by automatic means. The texts in IGC were divided into sentences and running words and then tagged and lemmatized. IceNLP was used to divide the text into sentences and running words. Tagging was performed with IceStagger [2]. Lemmatization was performed with the lemmatizer Nefnir [3]. Tags and lemmas are not manually corrected.
  Málföng  
[1] Þegar birtar eru niðurstöður rannsókna sem gerðar eru með aðstoð RMH skal hennar getið sem heimildar: Risamálheild. Verkefnisstjórn: Eiríkur Rögnvaldsson, Sigrún Helgadóttir og Steinþór Steingrímsson.
[1] When publishing results based on the texts in the Icelandic Gigaword Corpus please refer to: The Icelandic Gigaword Corpus. Project management: Eiríkur Rögnvaldsson, Sigrún Helgadóttir and Steinþór Steingrímsson. The Árni Magnússon Institute for Icelandic Studies. Downloaded [DATE] from http://www.malfong.is. The same applies to the release of any language technology tools that have used IGC.
  Málföng  
Í Risamálheildinni (RMH) má finna um 1300 milljónir lesmálsorða af textum sem eru geymdir í stöðluðu sniði í rafrænu formi. Orð í textunum eru greind málfræðilega og hverjum texta fylgja bókfræðilegar upplýsingar um verkið sem textinn er úr.
The Icelandic Gigaword Corpus (IGC) consists of about 1300 million running words of text. It is a tagged corpus, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in Language Technology projects.
  Málföng  
Íslenska Risamálheildin er safn um 1300 milljóna lesmálsorða af texta. Hluti textanna eru opinberir textar (t.d. Alþingisræður sem ná aftur til ársins 1907, lagatexti, dómar). Einnig fengust stór textasöfn frá ýmsum fjölmiðlum og ýmsir textar úr textasafni Árnastofnunar.
The Icelandic Gigaword Corpus consists of about 1300 million running words of text. Some of the corpus texts are official texts (e.g. parliamentary speeches going back as far as 1907, legal texts, court rulings). The corpus also contains large text collections from news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies. The Gigaword Corpus is a tagged corpus as described above. The corpus was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies. Only texts available in digital form were collected.
  Málföng  
Væntanlegir notendur þurfa að samþykkja sérstakt notkunarleyfi fyrir RMH1 og CC BY leyfi fyrir RMH2. Textarnir eru aðgengilegir í sérstöku xml-sniði, TEI P5, sem er skilgreint af TEI (Text Encoding Initiative).
The corpus will be available in two ways. Firstly the corpus will be available for search where the tags (linguistic annotation) can be used to define the search more accurately. The search gives a KWIC index and information about the source of each text example. The search interface is based on the Swedish search interface Korp.
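To illustrate what a KWIC (keyword in context) index is, here is a minimal sketch over a plain token list. It is not the Korp implementation and does not use the corpus tags; the function name and the sample sentence are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, tokenized on whitespace only.
sample = "The corpus is a tagged corpus of running words".split()
for left, key, right in kwic(sample, "corpus", width=2):
    print(f"{left:>12} | {key} | {right}")
```

A real search interface would additionally filter hits by tag or lemma and attach the source information of each text example.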
  Málföng  
Merkingarvenslin sem um ræðir eru af ólíku tagi. Skýrustu og nálægustu venslin eru fólgin í samheitum og andheitum en samheitavenslin eru misjafnlega náin og sá greinarmunur er að nokkru leyti auðkenndur með því að greina á milli samheita og skyldheita.
The semantic relations in question are of various kinds. The clearest and closest relations are those of synonymy and antonymy, but synonym relations vary in closeness. This difference is partially captured by distinguishing between synonyms and near-synonyms. In estimating the relations, emphasis is placed on the evidence of the material, with the goal of obtaining numeric evidence of the semantic proximity and relatedness of the words compared. The analysis also returns semantically homologous vocabulary, which is further sorted and placed under particular concepts and semantic fields.
  Málföng  
Í safninu eru upptökur frá umræðutímum á Alþingi veturinn 2004-2005, alls tæplega 21 klukkustund, ásamt nákvæmri umritun þeirra í textaskrám. Auk þess fylgja ýmsar grunnupplýsingar um upptökurnar og þá sem taka til máls, s.s. aldur þeirra og kyn.
The corpus contains recordings from discussion periods at the Icelandic Parliament during the winter of 2004-2005. The recordings are nearly 21 hours in total and come with detailed transcriptions in text files. Information about the recordings and the speakers, such as their age and gender, is provided as well. The data is intended to reflect natural spoken Icelandic under formal conditions. The discussion periods were chosen because they primarily consist of unprepared speeches that are unlikely to have been written in advance and read out loud. In addition, the aim was to achieve diversity of topics and speakers (with respect to their origin, age and gender).
  Málföng  
Um 50% af textum í málheildinni eru fréttir af mbl.is, 10% er sjaldgæfar þrístæður hljóðbúta (tri-phones), 10% er götunöfn, 10% er mannanöfn, 10% er ýmislegt, 5% er nöfn á ríkjum og höfuðborgum og 5% er vefföng.
Google cooperated with Reykjavík University and The Icelandic Centre for Language Technology in collecting voice samples for Icelandic. During the first phase of the project a Text Corpus with sentences was generated. About 50% of the text in the corpus is news stories from the website mbl.is (website of the newspaper Morgunblaðið), 10% is rare tri-phones, 10% is names of streets, 10% is names of people, 10% is miscellaneous, 5% is names of countries and capitals and 5% is URLs. The corpus contains 55,000 sentences. A list containing numbers, dates, times of day, names of days and months, simple questions, and common greetings was also included in the corpus.
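The composition quoted in the English text can be checked with a few lines of arithmetic. This is a sketch only: the percentages and the 55,000-sentence total come from the text, while applying the percentages to sentence counts is an assumption (the source states them as shares of the text, not of sentences).

```python
# Shares of the text corpus, in percent, as listed in the English description.
shares = {
    "news stories from mbl.is": 50,
    "rare tri-phones": 10,
    "names of streets": 10,
    "names of people": 10,
    "miscellaneous": 10,
    "names of countries and capitals": 5,
    "URLs": 5,
}
TOTAL_SENTENCES = 55_000

# The listed shares should account for the whole corpus.
total_share = sum(shares.values())

# Assumption: each share is applied directly to the sentence count.
approx_counts = {name: TOTAL_SENTENCES * pct // 100 for name, pct in shares.items()}

print(total_share)
for name, count in approx_counts.items():
    print(f"{name}: about {count} sentences")
```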
  Málföng  
Með markaðri málheild (e. tagged corpus) er átt við safn fjölbreyttra texta sem eru geymdir í stöðluðu sniði í rafrænu formi. Til þess að textarnir verði sem gagnlegastir við málrannsóknir eru þeir greindir á margvíslegan hátt.
A tagged corpus is a collection of electronic texts in a standard format. The texts are analyzed in various ways to make them suitable for linguistic research and Language Technology projects. Each running word in the text is followed by a tag which shows part-of-speech and often also morphosyntactic elements like case, number and gender for nominals and person, number and tense for verbs. Each running word is also accompanied by a lemma, e.g. nominals in the nominative singular and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
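The structure described above (word, tag, lemma, plus per-text metadata) can be sketched as a small data model. This is an illustration under stated assumptions: the class names, the metadata keys and the tag strings are hypothetical and do not reproduce the tagset or file format actually used in the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    word: str   # running word as it appears in the text
    tag: str    # part of speech plus morphosyntactic features
    lemma: str  # base form, e.g. nominative singular or infinitive


@dataclass
class TaggedText:
    metadata: Dict[str, str]
    tokens: List[Token] = field(default_factory=list)


# Hypothetical example; the tag labels are invented for readability.
sample = TaggedText(
    metadata={"title": "example text", "year": "2017"},
    tokens=[
        Token("hestar", "noun.masc.pl.nom", "hestur"),
        Token("hlupu", "verb.3p.pl.past", "hlaupa"),
    ],
)

lemmas = [token.lemma for token in sample.tokens]
print(lemmas)
```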
  Málföng  
Texti Heimskringlu er úr útgáfu frá Máli og menningu árið 1991 (Bergljót Kristjánsdóttir, Bragi Halldórsson, Jón Torfason og Örnólfur Thorsson (ritstj.) 1991). Stafsetning var umrituð til nútímastafsetningar og nokkrar beygingarendingar eru færðar til nútímamáls.
The texts of the Family Sagas are taken from the edition published by Svart á hvítu (Bragi Halldórsson, Jón Torfason, Sverrir Tómasson and Örnólfur Thorsson (eds.) 1985-1986), as is the text of Sturlunga Saga (Örnólfur Thorsson, Bergljót Kristjánsdóttir, Bragi Halldórsson, Gísli Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason and Sverrir Tómasson (eds.) 1988). The text of Heimskringla is from the edition published by Mál og menning in 1991 (Bergljót Kristjánsdóttir, Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.) 1991). The spelling was normalized to Modern Icelandic spelling and some inflectional endings were changed to their Modern Icelandic form. The text of the Book of Settlements is from the edition by Jakob Benediktsson (Jakob Benediktsson 1968). The book was scanned and the text normalized to Modern Icelandic spelling in the same way as the other texts. A list of the texts can be found here. One of the texts is Íslendingaþættir, a collection of tales called þættir.
  Málföng  
Í öðrum áfanga veturinn 2016–2017 hlustuðu tveir rannsóknarmenn á óstytta útgáfu af þeim raddsýnum sem var hafnað og flokkuðu þau nánar. Af þeim reyndust 8.548 vera í lagi. Samtals er því talið að 108.568 raddsýni séu í lagi og eru þau aðgengileg á þessari síðu.
In total 127,286 voice samples were recorded. Of these, 5,401 were failed recordings, leaving 121,885 voice samples to be evaluated. Before the verification process started, new sound files were created by trimming long periods of silence at the beginning and end of the recordings. The total duration of the untrimmed files is about 152 hours; trimming reduced this to about 90 hours. During this process 2,795 files were identified as silent, so 119,090 voice samples were evaluated in the first stage of the verification process. Of these, 100,020 recordings were accepted as correct and 19,070 were rejected. During the second stage, in the winter of 2016–2017, two evaluators listened to untrimmed versions of the 19,070 recordings rejected in stage one and classified them further. Of these samples 8,548 were classified as correct. In total, 108,568 voice samples are therefore considered good and are available through this webpage.
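The counts in the verification process can be retraced step by step. The sketch below uses only figures quoted in the text; the variable names are chosen for illustration.

```python
recorded = 127_286             # all voice samples recorded
failed = 5_401                 # failed recordings
evaluated = recorded - failed  # samples passed on for evaluation

silent = 2_795                       # files found to be silent during trimming
stage_one = evaluated - silent       # samples evaluated in stage one

accepted_stage_one = 100_020
rejected_stage_one = stage_one - accepted_stage_one

recovered_stage_two = 8_548          # rejected samples reclassified as correct
good = accepted_stage_one + recovered_stage_two
discarded = recorded - good          # everything not in the final set

print(evaluated, stage_one, rejected_stage_one, good, discarded)
```

The derived values agree with the figures stated in the text: 121,885 evaluated, 119,090 in stage one, 19,070 rejected, 108,568 good and 18,718 discarded.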
  Málföng  
Í umritunarskránum er greint skýrt á milli þeirra sem tala hverju sinni og auk þess eru skráðar þagnir, framígrip, skörun (þar sem fleiri en einn talar í einu) og tiltekin umhverfishljóð (hlátur, ræskingar o.þ.h.).
The recordings were obtained directly from the Parliament. In addition to the audio files, the Parliament provided text files with a preliminary transcription, which became the basis for further processing of the material. The recordings were then listened to again and the transcriptions revised according to methods developed for transcribing spoken Icelandic. The transcriptions give a word-for-word match of the speeches in normal standardized orthography. Turns are clearly distinguished and linked to the individual speakers, and silences, interruptions, overlaps and certain background noises (laughter, clearing of the throat etc.) are registered in the transcriptions.
  Málföng  
Umritunarskrár úr Alþingisumræðunum ásamt fleira talmálsefni eru opnar til leitar í Íslensku textasafni og mynda einnig hluta af Markaðri íslenskri málheild. Þróaður hefur verið gagnagrunnur og vefviðmót fyrir málheildina og var það umhverfi lagað að þörfum talmálsefnisins.
The transcribed files from the Parliament Speech Corpus, along with other spoken language material, are open for search in Íslenskt textasafn and also form a part of the Tagged Icelandic Corpus (TIC). A database and a web interface have been developed for the TIC, and the interface has been adjusted to accommodate the needs of spoken language corpora, so that a search not only returns examples from the transcribed text but also gives access to the relevant passages in the sound files. Spoken language material differs from typical written texts in that a recording does not contain the contribution of only one "author": there are usually several participants, even in material like the Parliament discussions, where there is usually only one party speaking at a time. Links to the metadata are therefore in many ways more complicated than they are with written texts.
  Málföng  
Í pakkanum eru þrjár skrár: (1) skipti.listi hefur 203.964 orð í stafrófsröð, eitt orð í línu. Sýnt er með bandstriki hvar í orðinu mega vera skil milli lína; (2) hyphen.is er mynsturskrá sem er búin til upp úr orðalistanum.
The package contains three files: (1) skipti.listi contains 203,964 words ordered alphabetically, one word per line. A hyphen in the word shows where the word may be split between lines; (2) hyphen.is contains patterns generated from the word list. The pattern file is used in hyphenation programs for TeX, groff, OpenOffice and LibreOffice; (3) hyph_is-1.0.oxt can be downloaded from http://extensions.openoffice.org/en/project/icelandic-hyphenation-dictionary. The filename extension .oxt stands for "OpenOffice Extension". If a file with the extension .oxt is opened in OpenOffice or LibreOffice, it will be added as an add-in to the program in question.
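The word-list format described above (one word per line, hyphens marking permitted break points) is simple to read. The following is a minimal sketch: the sample entries are hypothetical rather than taken from skipti.listi, and it assumes that every hyphen in an entry marks a break point, so words that contain a literal hyphen would need separate handling.

```python
from typing import List, Tuple


def break_points(entry: str) -> Tuple[str, List[int]]:
    """Return the unhyphenated word and the character offsets where it may be split."""
    parts = entry.strip().split("-")
    word = "".join(parts)
    positions = []
    offset = 0
    # Every part except the last is followed by a permitted break.
    for part in parts[:-1]:
        offset += len(part)
        positions.append(offset)
    return word, positions


# Hypothetical entries in the one-word-per-line format.
for line in ["orða-bók", "mál-tækni"]:
    print(break_points(line))
```

Reading the real file would amount to applying break_points to each line of skipti.listi, opened with the appropriate text encoding.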