Obdobja 34, Part 1
Simpozij OBDOBJA 34

meet their communicative needs (Tagg 2012), as well as being a way of reflecting their identity and speech style in writing (Herring 2001). Studying UGC language is valuable for linguists, and it also helps improve automatic processing of UGC, which has proven difficult: consistent drops in performance on UGC have been recorded along the entire text processing chain, from part-of-speech tagging (Gimpel et al. 2011) to sentence parsing (Petrov, McDonald 2012). The nonstandard linguistic features of UGC have been analyzed both qualitatively and quantitatively (Eisenstein 2013; Hu et al. 2013), and have been taken into account in automatic text processing applications that either strive to normalize nonstandard features (Liu et al. 2011), adapt standard tools to work on nonstandard data (Gimpel et al. 2011), or use pre-processing steps to tackle UGC-specific phenomena (Foster et al. 2011). However, to the best of our knowledge, the level of (non)standardness of UGC has not been compared across languages, and the extent to which the observed phenomena are universal (versus language-specific) in this type of communication has not been established.

In this paper we present an experiment in which we manually annotate and analyze the (non)standardness level of tweets in Slovene, Croatian and Serbian, and then use the manual annotations to train a regression model which automatically predicts the level of standardness of texts in a corpus. We believe this will be useful for linguistic analyses as well as at all stages of text processing.

2 Corpus construction and sampling

The corpus used in the experiment comprises Slovene, Croatian and Serbian tweets harvested with TweetCat (Ljubešić et al. 2014), a custom-built tool for collecting tweets written in lesser-used languages.
The collection of tweets for all three languages took place from 2013 to 2015, resulting in a corpus of about 61 million tokens in Slovene, 25 million tokens in Croatian and 205 million tokens in Serbian, after deduplication and the filtering of foreign-language tweets and tweets without linguistically relevant content (i.e. those containing only photos, links, or emoticons). The corpus is linguistically annotated; for Slovene, tokenization, MSD tagging and lemmatization were performed with ToTaLe (Erjavec et al. 2005), while for Croatian and Serbian we used the tagger/lemmatizer constructed by Agić et al. (2013).

It is interesting to note the differences in size between the three sub-corpora. While the amount of data for Slovene and Serbian is roughly proportional to the number of their speakers (2 million for Slovene and 7 million for Serbian), Croatian has twice as many speakers as Slovene (4 million) but about half as many tweets.

Initial examination of the collected tweets showed that the corpus is heavily skewed towards standard language, especially in Slovene and Croatian, where Twitter is frequently used for dissemination of information by news agencies and other official accounts, which, unsurprisingly, tweet in standard language. We therefore prepared a more balanced sample for manual annotation by relying on a simple heuristic which measures the rate of out-of-vocabulary words (i.e. word forms not