Which is better - a char type or a string of length...

James Harris...
Posted: Sun Aug 30, 2009 10:52 am
On 27 Aug, 11:44, "bartc" <ba... at (no spam) freeuk.com> wrote:
....
Quote: Suppose you had a string say s="ABCDEF", and you indexed it using:
s[3]
would the result be a character, or a string of length 1?
(For years I've been using a language with the latter approach, and it's
worked well (after all why should s[3] be that different from the slice
s[3..4]), with asc(s[3]) to get the character value.)
But which is better?
I don't know which is best but I think the question deserves its own
thread.
Were you querying the existence of a separate char data type or simply
the result of a substring reference?
Perhaps the answer depends on string representation. C supports a char
data type but its strings have a terminating zero so C has to
distinguish, doesn't it? A character occupies a single cell but a
string of one character needs at least two cells.
James

Bart...
Posted: Sun Aug 30, 2009 12:13 pm
On Aug 30, 11:52 am, James Harris <james.harri... at (no spam) googlemail.com>
wrote:
Quote: On 27 Aug, 11:44, "bartc" <ba... at (no spam) freeuk.com> wrote:
Suppose you had a string say s="ABCDEF", and you indexed it using:
s[3]
would the result be a character, or a string of length 1?
...But which is better?
Were you querying the existence of a separate char data type or simply
the result of a substring reference?
No, just whether a string should be considered formally as its own
entity, or as an array of characters.
With arrays, it's expected that (10,20,30)[2] should result in the
number 20, not the array (20,). With "ABC"[2], it's not so clear
whether the result is "B", or char 66 (using 1-based indexing here).
(I'm working on a new language where strings are more formally
character arrays, but I'd quite like to keep my nice, cosy string
indexing where everything stays a string.)
Quote: Perhaps the answer depends on string representation. C supports a char
data type but its strings have a terminating zero so C has to
distinguish, doesn't it? A character occupies a single cell but a
string of one character needs at least two cells.
At the C level, the answer is easy; with a formal type system,
indexing a string produces a char element. I was thinking at the
higher level, such as Basic's MID$ that was mentioned, which also
produces 1-character strings. (Did I just describe Basic as higher
level than C? I suppose it is.)
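
For comparison, Python's built-in str takes the string-of-length-1 route described here. A minimal sketch, with Python only as a stand-in for the language under discussion; its indexing is 0-based, so index 1 below corresponds to the 1-based "ABC"[2]:

```python
s = "ABC"

elem = s[1]        # scalar index: still a str, of length 1
piece = s[1:2]     # one-element slice: the same value as elem
code = ord(elem)   # ord() plays the role of asc(): 66 for "B"

print(repr(elem), repr(piece), code)
```

Here indexing and one-element slicing are indistinguishable, and the character value needs an explicit conversion, much like asc(s[3]).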
--
Bartc

Bart...
Posted: Sun Aug 30, 2009 11:45 pm
On Aug 30, 9:32 pm, "Dmitry A. Kazakov" <mail... at (no spam) dmitry-kazakov.de>
wrote:
Quote: On Sun, 30 Aug 2009 03:52:29 -0700 (PDT), James Harris wrote:
On 27 Aug, 11:44, "bartc" <ba... at (no spam) freeuk.com> wrote:
...
Suppose you had a string say s="ABCDEF", and you indexed it using:
s[3]
would the result be a character, or a string of length 1?
(For years I've been using a language with the latter approach, and it's
worked well (after all why should s[3] be that different from the slice
s[3..4]), with asc(s[3]) to get the character value.)
But which is better?
I don't know which is best but I think the question deserves its own
thread.
Were you querying the existence of a separate char data type or simply
the result of a substring reference?
Depends on the semantics of []. There are two different operations,
substring extraction and element extraction. [] can mean either. Yet both
are required.
The two operations can be distinguished by the type (or format) of the
index: if it's a number, then it's usually element extraction; if a
range, then a substring or slice (and the index could also be a set or
list for more elaborate selections).
The question was, given a scalar index, whether the result should be
an element, or a short substring, when applied to a string.
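
The other answer to that question (a scalar index yields an element, a range yields a substring) can be sketched by dispatching on the index type. This is only an illustration in Python; the class name CharArray and the choice to represent elements as integer character codes are assumptions, not anything proposed in the thread:

```python
class CharArray:
    """Hypothetical string type that behaves as an array of character codes."""

    def __init__(self, text):
        self._codes = [ord(c) for c in text]

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A range index yields another CharArray (a substring).
            return CharArray("".join(chr(c) for c in self._codes[index]))
        # A scalar index yields a single element (a character code).
        return self._codes[index]

    def __str__(self):
        return "".join(chr(c) for c in self._codes)


s = CharArray("ABCDEF")
element = s[2]        # an int, the code of "C"
substring = s[2:4]    # a CharArray holding "CD"
print(element, str(substring))
```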
--
Bartc

Dmitry A. Kazakov...
Posted: Mon Aug 31, 2009 12:32 am
On Sun, 30 Aug 2009 03:52:29 -0700 (PDT), James Harris wrote:
Quote: On 27 Aug, 11:44, "bartc" <ba... at (no spam) freeuk.com> wrote:
...
Suppose you had a string say s="ABCDEF", and you indexed it using:
s[3]
would the result be a character, or a string of length 1?
(For years I've been using a language with the latter approach, and it's
worked well (after all why should s[3] be that different from the slice
s[3..4]), with asc(s[3]) to get the character value.)
But which is better?
I don't know which is best but I think the question deserves its own
thread.
Were you querying the existence of a separate char data type or simply
the result of a substring reference?
Depends on the semantics of []. There are two different operations,
substring extraction and element extraction. [] can mean either. Yet both
are required.
Quote: Perhaps the answer depends on string representation. C supports a char
data type but its strings have a terminating zero so C has to
distinguish, doesn't it? A character occupies a single cell but a
string of one character needs at least two cells.
Representation is of no matter. A substring type statically constrained to
one character may have exactly same representation as one string element.
The difference is that a set cannot be a member of itself. Hence in a typed
language the type of string elements shall differ from the string type
itself and the types of string slices. I.e. member extraction and slicing
are necessarily different.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Rod Pemberton...
Posted: Mon Aug 31, 2009 5:13 am
"James Harris" <james.harris.1 at (no spam) googlemail.com> wrote in message
news:2c5332fa-ef53-4f17-bed9-aa3c1180591a at (no spam) p23g2000vbl.googlegroups.com...
Quote: On 27 Aug, 11:44, "bartc" <ba... at (no spam) freeuk.com> wrote:
...
Suppose you had a string say s="ABCDEF", and you indexed it using:
For C, s could be declared a few ways, e.g.:
char *s="ABCDEF";
char s[]="ABCDEF";
char s[6]="ABCDEF";
You could also declare the string to be larger or smaller than the
initializing string:
char s[5]="ABCDEF";
char s[7]="ABCDEF";
Some compilers will truncate s[5] and append a null '\0'. Some will only
truncate. I'm not sure what's legally required by the C spec.'s. You'll
have to look it up.
Quote: s[3]
would the result be a character, or a string of length 1?
For C, it will be a character. It will not be a string. The subscript
operator, [], always dereferences the pointer arithmetic. I.e., because of
the dereference, it can't be a pointer and therefore can't be a reference to
a string. It must return data at subscript 3, in this case, a char. a[b]
is equivalent to *((a)+(b)) for characters (i.e., scaling of 1) where one is
a pointer and the other is an offset/subscript. If you want a string at
s[3], you must take the address of it, i.e., &s[3]. If declared as
s[]="ABCDEF", the nul is appended to s after the F char, i.e., s[6]. &s[3]
will not be a string of length one. It could be a string of length 3,
length 2, etc. length, or unknown, depending on how it is declared and the
compiler used. The length 2 comes from the s[5] if truncated and terminated
by '\0'. The unknown is if the compiler doesn't terminate the truncated
declaration with '\0'.
C's strings are terminated by a C byte, not a character, of all bits zero.
It's a C byte, not a char, for platforms with differently sized C bytes and
chars, e.g., 16 vs. 8. This nul byte in C is represented by '\0' and is
usually automatically appended to the end of strings. (As noted above, some
compilers don't do this. I'm not sure of the exact spec. legal issues
here.)
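
The element-versus-tail distinction can be imitated outside C. The sketch below uses a Python bytes object as a rough model of a nul-terminated buffer; it is an analogy only (Python has no pointers, and the explicit terminator is added by hand), not a statement about what any C compiler does:

```python
buf = b"ABCDEF\x00"              # models char s[] = "ABCDEF" plus its terminator

ch = buf[3]                      # like s[3]: one character code, not a string
end = buf.index(0, 3)            # position of the terminator at or after index 3
tail = buf[3:end]                # like &s[3]: everything up to the terminator

print(ch, tail, len(tail))
```

In this model the single element is an integer, while the "string at index 3" has length 3, matching the point that &s[3] is not a string of length one.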
RP

Dmitry A. Kazakov...
Posted: Tue Sep 01, 2009 12:25 pm
On Mon, 31 Aug 2009 15:01:41 -0700 (PDT), James Harris wrote:
Quote: Unfortunately I don't understand what you are saying. Is this an
ideological objection or a pragmatic one?
A pragmatic one, I thought about the consequences of making string an
element of itself.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

James Harris...
Posted: Wed Sep 02, 2009 5:20 pm
On 1 Sep, 09:25, "Dmitry A. Kazakov" <mail... at (no spam) dmitry-kazakov.de>
wrote:
Quote: On Mon, 31 Aug 2009 15:01:41 -0700 (PDT), James Harris wrote:
Unfortunately I don't understand what you are saying. Is this an
ideological objection or a pragmatic one?
A pragmatic one, I thought about the consequences of making string an
element of itself.
Well, in that case, what are you saying about it? Are you saying it is
impossible? If so, why? Or is your point that it should not be used in
any language, or that it's fine for some languages but not others?
Or does it depend on the operation? For example, slicing is OK but
concatenation and comparison aren't?
To make something concrete to discuss, here's a possible code sequence
using slicing (slicing and indexing were Bart's initial questions):
s = "john smith"
pos = s.find(" ", ":") #To find a space or a colon delimiter
first = s[.. pos - 1] #The string up to the char before pos
delim = s[pos]
last = s[pos + 1 ..]
Now, given that we have three result strings (one of which is only one
character long) we could, perhaps, get the first element of each as
first[0]
delim[0]
last[0]
I /think/ you are saying we shouldn't be able to take the first
element of delim as it is already just one character long. That's fine
if such a restriction is needed. If the restriction is not necessary,
however, permitting such an operation allows a certain consistency -
the first element of a string of length one is also a string of length
one.
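
For what it's worth, the sequence above can be tried in Python, whose strings already behave this way. This is a sketch under assumptions: Python's str.find takes a single substring, so re.search stands in for the two-delimiter find, and the slice syntax differs from the notation used above:

```python
import re

s = "john smith"

match = re.search(r"[ :]", s)    # first space or colon delimiter
if match is None:
    raise ValueError("no delimiter found")
pos = match.start()

first = s[:pos]                  # the string up to the char before pos
delim = s[pos]                   # a string of length one
last = s[pos + 1:]               # everything after the delimiter

heads = (first[0], delim[0], last[0])
print(first, repr(delim), last, heads)
```

Indexing delim again simply returns an equal one-character string, so delim[0][0][0] == delim holds.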
To implement that we would index offset (0 * ElementSize) i.e. offset
0. I guess that could be applied to any scalar if the language
designer thought it useful - i.e. treat any scalar as the first
element of an array of elements of its type. That's way beyond what I
had in mind. I was just offering a suggestion to Bart in case it would
help the design choice in his new language. Unless you say this is
impossible it's up to him to use it or not as he prefers.
The alternative action, disallowing indexing on a single-character
string such as delim, seems fine too if it is appropriate.
Do you see a problem with slicing, as above? Or is your concern with
string construction? So:
first = "john"
last = "smith"
name = first + " " + last
Or are you talking about having an array of strings being
automatically flattened such as
a[0] = "john"
a[1] = " "
a[2] = "smith"
if (a == "john smith") ...
I'm not sure about this latter example which I think is closest to the
ones you wrote. Perhaps it depends on the rest of the language but
most naturally a[0] seems to equal "john" and a[0][0] equals "j". I
can't see any need to run them together.
Perhaps a third way is to recognise that a string can always be
subsliced to a shorter or equal length version. A substring of length
4 is extractable from a string of length 7 but not vice-versa. Perhaps
there are some dynamic typing issues here to deal with. I don't know.
It seems to depend on the designer's needs and the rest of his
language - which is why I left the question open. Having no personal
use for this I was not interested enough to work through all the
ramifications.
James

Bart...
Posted: Wed Sep 02, 2009 8:19 pm
On Sep 2, 6:20 pm, James Harris <james.harri... at (no spam) googlemail.com> wrote:
Quote: On 1 Sep, 09:25, "Dmitry A. Kazakov" <mail... at (no spam) dmitry-kazakov.de>
wrote:
On Mon, 31 Aug 2009 15:01:41 -0700 (PDT), James Harris wrote:
Unfortunately I don't understand what you are saying. Is this an
ideological objection or a pragmatic one?
A pragmatic one, I thought about the consequences of making string an
element of itself.
Well, in that case, what are you saying about it? Are you saying it is
impossible? If so, why? Or is your point that it should not be used in
any language, or that it's fine for some languages but not others?
Or does it depend on the operation? For example, slicing is OK but
concatenation and comparison aren't?
To make something concrete to discuss, here's a possible code sequence
using slicing (slicing and indexing were Bart's initial questions):
I've realised that there are a few other issues in my design, and
perhaps it's best to treat the various indexable objects separately.
Aside from arrays and strings, I can also index integers (trust me,
it's useful..):
a := (100,200,300,400)   # array of 4 ints
b := a[1]                # b is a single int 100 (not an array of 1 int)
c := a[1..3]             # c is a slice/array of 3 ints (100,200,300)
d := b[1]                # (d is an int too, but has value 0 now; bit 1 of 100)

a := "ABCDEF"            # a is a string of 6 characters
b := a[1]                # b is a string "A" of 1 character (1-based)
c := a[1..3]             # c is a string of 3 characters
d := b[1]                # d is a string "A" of 1 character (repeatedly..)

a := 123                 # a is an integer (111_1011B)
b := a[1]                # b is an integer 1 (bit 1 of 123, 0-based)
c := a[1..3]             # c is an integer 5 (bits 1 to 3 of 123)
d := b[1]                # d is an integer 0 (bit 1 of 1)
So strings and integers share the same property that index and slice
operations on them still yield a string or integer.
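
The integer case can be checked with ordinary shifts and masks. A minimal sketch in Python, assuming bit 0 is the least significant bit (as the "0-based" comments suggest); the helper names bit and bits are made up for illustration:

```python
def bit(value: int, index: int) -> int:
    """Return bit `index` of `value`, counting from 0 at the least significant bit."""
    return (value >> index) & 1


def bits(value: int, lo: int, hi: int) -> int:
    """Return bits lo..hi (inclusive) of `value`, packed into an integer."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


a = 123              # 111_1011 in binary
b = bit(a, 1)        # bit 1 of 123
c = bits(a, 1, 3)    # bits 1 to 3 of 123
d = bit(b, 1)        # bit 1 of the previous result
print(b, c, d)
```

Every result is again an int, which is the property being described: indexing or slicing an integer never leaves the integer type.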
On arrays of type T (although the language uses variants), indexing
and slicing yield respectively objects of T, and array of T.
A single-element slice of an array can be achieved with a single index
using (a[i],), (although a[i..i] is more efficient, it looks slightly
more naff).
And the int-char value of a 1-element string is obtained with a
special function (asc(a[i])).
Strings of course can therefore be indexed any number of times:
"ABC"[1,1,1,1,1,....] but this is not harmful. Which brings me to
another concern:
Should the construction A[I,J,K] be multiple dimension array indexing,
or should I insist on A[I][J][K] as C does? Then A[I,J,K] is useful
new syntax.
Thanks,
--
Bartc

James Harris...
Posted: Wed Sep 02, 2009 9:35 pm
On 2 Sep, 21:19, Bart <b... at (no spam) freeuk.com> wrote:
Quote: On Sep 2, 6:20 pm, James Harris <james.harri... at (no spam) googlemail.com> wrote:
On 1 Sep, 09:25, "Dmitry A. Kazakov" <mail... at (no spam) dmitry-kazakov.de>
wrote:
On Mon, 31 Aug 2009 15:01:41 -0700 (PDT), James Harris wrote:
Unfortunately I don't understand what you are saying. Is this an
ideological objection or a pragmatic one?
A pragmatic one, I thought about the consequences of making string an
element of itself.
Well, in that case, what are you saying about it? Are you saying it is
impossible? If so, why? Or is your point that it should not be used in
any language, or that it's fine for some languages but not others?
Or does it depend on the operation? For example, slicing is OK but
concatenation and comparison aren't?
To make something concrete to discuss, here's a possible code sequence
using slicing (slicing and indexing were Bart's initial questions):
I've realised that there are a few other issues in my design, and
perhaps it's best to treat the various indexable objects separately.
Aside from arrays and strings, I can also index integers (trust me,
it's useful..):
a := (100,200,300,400)   # array of 4 ints
b := a[1]                # b is a single int 100 (not an array of 1 int)
c := a[1..3]             # c is a slice/array of 3 ints (100,200,300)
d := b[1]                # (d is an int too, but has value 0 now; bit 1 of 100)

a := "ABCDEF"            # a is a string of 6 characters
b := a[1]                # b is a string "A" of 1 character (1-based)
c := a[1..3]             # c is a string of 3 characters
d := b[1]                # d is a string "A" of 1 character (repeatedly..)

a := 123                 # a is an integer (111_1011B)
b := a[1]                # b is an integer 1 (bit 1 of 123, 0-based)
c := a[1..3]             # c is an integer 5 (bits 1 to 3 of 123)
d := b[1]                # d is an integer 0 (bit 1 of 1)
I've tried to make sense of some of these but not all of them. Are you
indexing bits from the left or from the right and are you using these
0-based or 1-based?
Should the bit index operation be c := a[3..1]? What if it is?
Quote:
So strings and integers share the same property that index and slice
operations on them still yield a string or integer.
On arrays of type T (although the language uses variants), indexing
and slicing yield respectively objects of T, and array of T.
A single-element slice of an array can be achieved with a single index
using (a[i],), (although a[i..i] is more efficient, it looks slightly
more naff).
And the int-char value of a 1-element string is obtained with a
special function (asc(a[i])).
Strings of course can therefore be indexed any number of times:
"ABC"[1,1,1,1,1,....] but this is not harmful. Which brings me to
another concern:
Should the construction A[I,J,K] be multiple dimension array indexing,
or should I insist on A[I][J][K] as C does? Then A[I,J,K] is useful
new syntax.
Not sure. A[i][j][k] looks good to me though it's closer to my syntax
which would sway my opinion.
James

Bart...
Posted: Wed Sep 02, 2009 10:18 pm
On Sep 2, 10:35 pm, James Harris <james.harri... at (no spam) googlemail.com>
wrote:
Quote: On 2 Sep, 21:19, Bart <b... at (no spam) freeuk.com> wrote:
a := 123 # a is an integer (111_1011B)
b := a[1] # b is an integer 1 (bit 1 of 123, 0-based)
c := a[1..3] # c is an integer 5 (bits 1 to 3 of 123)
d := b[1] # d is an integer 0 (bit 1 of 1)
I've tried to make sense of some of these but not all of them. Are you
indexing bits from the left or from the right and are you using these
0-based or 1-based?
Should the bit index operation be c := a[3..1]? What if it is?
Integers are indexed from bit 0 (least significant bit). The index
order doesn't matter (they are normalised in the runtime).
The point though is that no matter how often you index/slice an int,
the result is always an int. So I need to keep this indexing distinct
from normal array indexing.
(Although I'm also hoping to have bit arrays (I already use sets),
just to muddy things a little; indexing a bit array would need to
return an int of value 0 or 1, but a slice would need to stay a bit-
array.)
Quote: Should the construction A[I,J,K] be multiple dimension array indexing,
or should I insist on A[I][J][K] as C does? Then A[I,J,K] is useful
new syntax.
Not sure. A[i][j][k] looks good to me though it's closer to my syntax
which would sway my opinion.
OK, that's good. This will of course only save some punctuation:
A[(i,j,k)] will do the same thing, but A[i,j,k] looks a more natural
way of selecting a subset of 3 items, and will I believe be more
efficient to implement, as there is no intermediate list to create.
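
As a point of comparison, Python parses a[i, j, k] as a[(i, j, k)] and hands the tuple to the indexing method, so the "select a subset" reading can be sketched there. The class name Selectable is hypothetical, and unlike the design described above this version does build the intermediate tuple:

```python
class Selectable:
    """Hypothetical array where a comma-separated index selects several items."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        if isinstance(index, tuple):
            # a[i, j, k] arrives here as the tuple (i, j, k).
            return [self._items[i] for i in index]
        # A plain scalar index returns a single element.
        return self._items[index]


a = Selectable([10, 20, 30, 40])
single = a[1]            # one element
subset = a[0, 2, 3]      # three selected elements
same = a[(0, 2, 3)]      # identical to the line above in Python
print(single, subset, same)
```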
--
Bartc

Dmitry A. Kazakov...
Posted: Wed Sep 02, 2009 10:24 pm
On Wed, 2 Sep 2009 10:20:18 -0700 (PDT), James Harris wrote:
Quote: On 1 Sep, 09:25, "Dmitry A. Kazakov" <mail... at (no spam) dmitry-kazakov.de>
wrote:
On Mon, 31 Aug 2009 15:01:41 -0700 (PDT), James Harris wrote:
Unfortunately I don't understand what you are saying. Is this an
ideological objection or a pragmatic one?
A pragmatic one, I thought about the consequences of making string an
element of itself.
Well, in that case, what are you saying about it? Are you saying it is
impossible? If so, why? Or is your point that it should not be used in
any language, or that it's fine for some languages but not others?
Or does it depend on the operation? For example, slicing is OK but
concatenation and comparison aren't?
To make something concrete to discuss, here's a possible code sequence
using slicing (slicing and indexing were Bart's initial questions):
s = "john smith"
pos = s.find(" ", ":") #To find a space or a colon delimiter
first = s[.. pos - 1] #The string up to the char before pos
delim = s[pos]
last = s[pos + 1 ..]
Now, given that we have three result strings (one of which is only one
character long) we could, perhaps, get the first element of each as
first[0]
delim[0]
last[0]
What are the types of pos and 0 in this context?
Quote: I /think/ you are saying we shouldn't be able to take the first
element of delim as it is already just one character long.
No. In a properly designed language all strings should be treated equally.
[...]
Quote: Or are you talking about having an array of strings being
automatically flattened such as
a[0] = "john"
a[1] = " "
a[2] = "smith"
if (a == "john smith") ...
I'm not sure about this latter example which I think is closest to the
ones you wrote. Perhaps it depends on the rest of the language but
most naturally a[0] seems to equal "john" and a[0][0] equals "j". I
can't see any need to run them together.
If {"a", "b"} = "ab" = {"ab"} then trivially:
"ab"[0] = {"ab"}[0] = "ab"
"ab"[0] = {"a", "b"}[0] = "a"
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Mike Austin...
Posted: Thu Nov 05, 2009 2:32 pm
I realize this is an old thread, but I found the messages insightful. (post
is below)
Torben Ægidius Mogensen wrote:
Quote: James Harris <james.harris.1 at (no spam) googlemail.com> writes:
On 27 Aug, 11:44, "bartc" <ba... at (no spam) freeuk.com> wrote:
would the result be a character, or a string of length 1?
(For years I've been using a language with the latter approach, and it's
worked well (after all why should s[3] be that different from the slice
s[3..4]), with asc(s[3]) to get the character value.)
But which is better?
Were you querying the existence of a separate char data type or simply
the result of a substring reference?
Perhaps the answer depends on string representation. C supports a char
data type but its strings have a terminating zero so C has to
distinguish, doesn't it? A character occupies a single cell but a
string of one character needs at least two cells.
I'm in favour of having separate string and character types, primarily
because by specifying a character type you specify the expectation that
you get exactly one character in a way that can be checked statically.
If you just specify a string type, you would need runtime checks to
verify that there is exactly one character.
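
The runtime check mentioned here might look like the following in a language with no separate character type. This is a sketch in Python; the function name char_code is invented for illustration:

```python
def char_code(ch: str) -> int:
    """Return the code of a one-character string, checking the length at run time."""
    if len(ch) != 1:
        # With a distinct character type this case could be rejected statically.
        raise ValueError(f"expected exactly one character, got {len(ch)}")
    return ord(ch)


code = char_code("a")
try:
    char_code("ab")
    rejected = False
except ValueError:
    rejected = True
print(code, rejected)
```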
As I was reading this thread, I was going back and forth between the two
options. Ultimately, "abc"[0] => 'a', "abc"[0..0] => "a" feels right for me.
Ironically, from Ruby 1.8 to 1.9 they changed "abc"[0] to return a 1 character
string instead of an int.
Mike
Quote: Additionally, a string is a sequence of characters, so a good solution
is to have the language support a generic sequence type and let strings
be one instance of this. For example, in Pascal, strings were (at least
originally) just packed arrays of characters, and in C they are arrays
of characters with 0 termination. An array is, however, not a very good
sequence representation. Haskell uses lists of characters as strings,
which is better, but random access to string elements becomes slow. I
would rather see a generic sequence type where random access,
concatenation and substring are all at most O(log(n)). This can be
achieved by using a tree structure.
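
One common tree-based representation is a rope. The sketch below, in Python, shows only concatenation and indexing; the class names Leaf and Concat are assumptions, and it does no rebalancing, so the O(log(n)) bound would hold only if the tree is kept balanced by some further mechanism:

```python
class Leaf:
    """A run of characters stored contiguously."""

    def __init__(self, text):
        self.text = text
        self.length = len(text)

    def char_at(self, i):
        return self.text[i]


class Concat:
    """An inner node: the concatenation of two subtrees, built in constant time."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length

    def char_at(self, i):
        # Descend left or right depending on where index i falls.
        if i < self.left.length:
            return self.left.char_at(i)
        return self.right.char_at(i - self.left.length)


rope = Concat(Concat(Leaf("john"), Leaf(" ")), Leaf("smith"))
flattened = "".join(rope.char_at(i) for i in range(rope.length))
print(rope.length, repr(rope.char_at(4)), flattened)
```

Each index lookup costs one step per tree level, which is where the logarithmic bound comes from when the tree is balanced.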
Torben

All times are GMT